This is a tutorial that guides you, step by step, how to write syntax extension using Camlp4 in Objective CAML. The example we present in this tutorial should give you a practical idea how syntax extension works, but this is not meant to be comprehensive. There are many resources on the web for Camlp4 prior to version 3.10, but Camlp4 3.10 introduces incompatible changes. This tutorial targets Camlp4 3.10 only.
Throughout the tutorial, we intentionally introduce subtle bugs in the interest of simplicity, but these bugs will be later explained.
The reader is expected to be familiar with context-free parsing. Some experience with yacc or ocamlyacc will be helpful. Knowing the difference between LL, LR and LALR parsers is a plus.
One of the greatest strengths of Objective CAML is Camlp4, a modular pre-processor pretty-printer that lets you add syntactic sugar to the language without rewriting the entire grammar from scratch. A syntactic sugar is a language construct that can be decomposed to an existing construct. For example, O'Caml doesn't have a "for-each" syntax, which can be illustrated by the following Python code:
a_list = ["hello", "world"] for s in a_list: print s
This code outputs:
hello world
However, that doesn't mean O'Caml can't support iterators! As it turns out, many modules in O'Caml that define a data structure also provide an "iter" higher-order function that works similarly:
let a_list = ["hello"; "world"] in List.iter (fun s -> print_endline s) a_list
Besides List
, other modules that provide "iter" are Array
, Hashtbl
, Map.Make(_)
, Queue
, Set.Make(_)
, Stack
, and even String
. Wouldn't it be nice if one can write:
let a_list = ["hello"; "world"] in for s in a_list do print_endline s done
instead? Imagine a long for-each body; this style of writing puts the "each" variable and the list next to each other, which results in cleaner code.
The idea is to transform the latter code into former, leveraging the "iter" function provided by these modules.
Notice that we can't generally look up a module specific "iter" function by the type of the expression we want to iterate over, so we have to specify a module like this:
let a_list = ["hello"; "world"] in for s in List a_list do print_endline s done
More generally, the syntactic sugar takes the following form:
for v in M e do seq done
where module M implements an iterator over the elements in collection e; most of the time, e also has the type M.t
, though this is not strictly required.
Without going into too much detail, we can write a simple extension as the following:
open Camlp4.PreCast open Syntax EXTEND Gram expr: LEVEL "top" [ [ "for"; v = a_LIDENT; "in"; m = a_UIDENT; e = expr; "do"; seq = sequence; "done" -> <:expr< $uid:m$.iter (fun $lid:v$ -> $seq$) $e$ >> ] ] ; END
All the prerequisites are provided by the module Camlp4.PreCast
, including the Syntax
module that provides the terminals a_LIDENT
and a_UIDENT
(for lowercase and uppercase identifiers respectively), and the non-terminals expr
(expressions) and sequence
(semicolon separated expressions). The Gram
module provides functions to manipulate non-terminal rules.
Suppose we save this syntax extension in the file called pa_foreach.ml
. Notice that a syntax extension is just a regular O'Caml file with some fancy EXTEND
syntax. It can be compiled using the command:
ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach.ml
camlp4orf
. See Using Camlp4 for the explanation and additional options.We can verify that it works:
$ ocaml Objective Caml version 3.10.0 # #load "camlp4o.cma";; Camlp4 Parsing version 3.10.0 # #load "pa_foreach.cmo";; # for s in List ["hello"; "world"] do print_endline s done;; hello world - : unit = () #
Now we have a rudimentary syntax extension. However, it is quite restrictive. For example, it would be nice to do pattern matching like this:
for (k, v) in List [(0, "hello"); (1, "world")] do Printf.printf "%d: %s\n" k v done
And it doesn't support Hashtbl.iter
yet because, so far, we can only generate functions with one argument. Hashtbl.iter
passes two arguments to the function, the key and the value. We will address these issues in the next section.
We're now beginning to use Camlp4 features that are less documented. It is useful to keep in mind that syntax extension is modifying existing parsing rules, and we can't do without an understanding of how existing rules work. It is recommended that you take a quick look at two files in the O'Caml source distribution under ocaml-3.10.0/camlp4/Camlp4Parsers
:
Camlp4OCamlRevisedParser.ml
; this provides the bulk of the syntax rules.Camlp4OCamlParser.ml
implicitly borrows most of its rules from Camlp4OCamlRevisedParser
, but clears some of the rules that are particular to the revised syntax and replaces them with the original O'Caml syntax.From these files, we can see that ipatt
rule supplies what we want for matching patterns, and LIST1
modifier allows us to take one or more patterns. There is also a LIST0
modifier for "zero or more" matching.
open Camlp4.PreCast open Syntax let rec mkfun _loc patts e = match patts with | p :: patts -> <:expr< fun $p$ -> $mkfun _loc patts e$ >> | [] -> e EXTEND Gram expr: LEVEL "top" [ [ "for"; patts = LIST1 ipatt; "in"; m = a_UIDENT; e = expr; "do"; seq = sequence; "done" -> let f = mkfun _loc patts seq in <:expr< $uid:m$.iter $f$ $e$ >> ] ] ; END
In order to generate a higher-order function according to a list of patterns, we use the helper mkfun
that essentially generates, for patterns patt1
... pattN
,
fun patt1 -> ... fun pattN -> seq
which is the desugared form for fun patt1 ... pattN -> seq
.
We can verify that pattern works:
$ ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach.ml $ ocaml Objective Caml version 3.10.0 # #load "camlp4o.cma";; Camlp4 Parsing version 3.10.0 # #load "pa_foreach.cmo";; # for (k, v) in List [(0, "hello"); (1, "world")] do Printf.printf "%d: %s\n" k v done;; 0: hello 1: world - : unit = ()
and that multiple arguments work:
# let tbl = Hashtbl.create 3;; val tbl : ('_a, '_b) Hashtbl.t = <abstr> # Hashtbl.add tbl 0 "hello";; - : unit = () # Hashtbl.add tbl 1 "world";; - : unit = () # Hashtbl.add tbl 2 "foobar";; - : unit = () # for k v in Hashtbl tbl do Printf.printf "%d: %s\n" k v done;; 0: hello 1: world 2: foobar - : unit = ()
There are still some subtle problems with this approach. In the next section, we'll discuss solutions to these problems.
The first problem we find out is that foreach body is supposed to be a sequence, but that doesn't work:
# for s in List ["hello"; "world"] do print_endline s; () done;; ;; Failure: "expr; expr: not allowed here, use do {...} or [|...|] to surround them"
The problem is due to the desugared form, which becomes:
List.iter (fun s -> print_endline s; ()) ["hello"; "world"]
and this is ambiguous. Do we interpret the expression fun s -> print_endline s; ()
as (fun s -> print_endline s); ()
or fun s -> (print_endline (); ())
? In order to avoid ambiguity, we wrap the sequence using do {$seq$}
when it is a sequence. The function mksequence'
in Camlp4OCamlRevisedParser.ml
provides some inspiration how to handle this.
We also notice that the original for-loop syntax ceases working:
# for i = 0 to 5 do print_int i done;; Parse error: [ipatt] or [a_LIDENT] expected after "for" (in [expr])
It turns out that the new rule we add for "for" doesn't play nice with the original rule. The solution is to rewrite the original rule so it becomes compatible with the new rule.
The code listing that addresses these issues can be found below:
open Camlp4.PreCast open Syntax let mksequence _loc e = match e with | <:expr< $_$; $_$ >> as e -> <:expr< do {$e$} >> | e -> e let rec mkfun _loc patts e = match patts with | p :: patts -> <:expr< fun $p$ -> $mkfun _loc patts e$ >> | [] -> mksequence _loc e let lident_of_patt p = match p with | <:patt< $lid:i$ >> -> i | _ -> invalid_arg "lident_of_patt" DELETE_RULE Gram expr: "for"; a_LIDENT; "="; sequence; direction_flag; sequence; "do"; do_sequence END EXTEND Gram expr: LEVEL "top" [ [ "for"; i = ipatt; "="; e1 = sequence; df = direction_flag; e2 = sequence; "do"; seq = do_sequence -> <:expr< for $lident_of_patt i$ = $mksequence _loc e1$ $to:df$ $mksequence _loc e2$ do { $seq$ } >> | "for"; p = ipatt; patts = LIST0 ipatt; "in"; m = a_UIDENT; e = expr; "do"; seq = do_sequence -> let patts = p :: patts in let f = mkfun _loc patts seq in <:expr< $uid:m$.iter $f$ $e$ >> ] ] ; END
In order for alternative rules to "fall through" correctly, rules must share a common prefix, and the rest of the non-terminals must be distinct enough. The way we structured the "for" syntax before, we essentially have the following two rules:
EXTEND Gram expr: LEVEL "top" [ [ "for"; patts = LIST1 ipatt; "in"; m = a_UIDENT; e = expr; "do"; seq = do_sequence -> (* ... *) | "for"; i = a_LIDENT; "="; e1 = sequence; df = direction_flag; e2 = sequence; "do"; seq = do_sequence -> (* ... *) ] ] ; END
Once the "for" keyword is matched, the parser proceeds to match the next token, but ipatt
also matches a_LIDENT
(a lowercase identifier is also a legal pattern). At this point, the parser fixes on a rule (whichever comes first) and no longer backtracks to the alternative option.
The solution is to make the original for loop always match ipatt
, but refuse anything except a lowercase identifier.
We can now verify that everything works as expected.
$ ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach.ml $ ocaml Objective Caml version 3.10.0 # #load "camlp4o.cma";; Camlp4 Parsing version 3.10.0 # #load "pa_foreach.cmo";; # let tbl = Hashtbl.create 3;; val tbl : ('_a, '_b) Hashtbl.t = <abstr> # Hashtbl.add tbl 0 "hello";; - : unit = () # Hashtbl.add tbl 1 "world";; - : unit = () # Hashtbl.add tbl 2 "foo";; - : unit = () # for k v in Hashtbl tbl do Printf.printf "%d: %s\n" k v done;; 0: hello 1: world 2: foo - : unit = () # for i = 0 to 5 do print_int i done;; 012345- : unit = () # for k, v in List [0, "hello"; 1, "world"; 2, "foo"] do Printf.printf "%d: %s\n" k v; () done;; 0: hello 1: world 2: foo - : unit = ()
We begin with a simple Camlp4 syntax extension for the for-each iterator syntax, then explored further refinements to that idea that works around parsing issues both in embedded code and in the rules.
This tutorial is written by Likai Liu. The text has not been peer reviewed, so use this at your own risk. Questions and comments are welcome.
Major revisions: