Camlp4 3.10/Foreach Tutorial

Camlp4 3.10 Foreach Tutorial

This is a tutorial that guides you, step by step, how to write syntax extension using Camlp4 in Objective CAML. The example we present in this tutorial should give you a practical idea how syntax extension works, but this is not meant to be comprehensive. There are many resources on the web for Camlp4 prior to version 3.10, but Camlp4 3.10 introduces incompatible changes. This tutorial targets Camlp4 3.10 only.

Throughout the tutorial, we intentionally introduce subtle bugs in the interest of simplicity, but these bugs will be later explained.

The reader is expected to be familiar with context-free parsing. Some experience with yacc or ocamlyacc will be helpful. Knowing the difference between LL, LR and LALR parsers is a plus.

Motivation

One of the greatest strengths of Objective CAML is Camlp4, a modular pre-processor pretty-printer that lets you add syntactic sugar to the language without rewriting the entire grammar from scratch. A syntactic sugar is a language construct that can be decomposed to an existing construct. For example, O'Caml doesn't have a "for-each" syntax, which can be illustrated by the following Python code:

a_list = ["hello", "world"]
for s in a_list:
  print s

This code outputs:

hello
world

However, that doesn't mean O'Caml can't support iterators! As it turns out, many modules in O'Caml that define a data structure also provide an "iter" higher-order function that works similarly:

let a_list = ["hello"; "world"] in
List.iter (fun s -> print_endline s) a_list

Besides List, other modules that provide "iter" are Array, Hashtbl, Map.Make(_), Queue, Set.Make(_), Stack, and even String. Wouldn't it be nice if one can write:

let a_list = ["hello"; "world"] in
for s in a_list do
  print_endline s
done

instead? Imagine a long for-each body; this style of writing puts the "each" variable and the list next to each other, which results in cleaner code.

The idea is to transform the latter code into former, leveraging the "iter" function provided by these modules.

Notice that we can't generally look up a module specific "iter" function by the type of the expression we want to iterate over, so we have to specify a module like this:

let a_list = ["hello"; "world"] in
for s in List a_list do
  print_endline s
done

More generally, the syntactic sugar takes the following form:

for v in M e do
  seq
done

where module M implements an iterator over the elements in collection e; most of the time, e also has the type M.t, though this is not strictly required.

Towards the First Try

Without going into too much detail, we can write a simple extension as the following:

open Camlp4.PreCast
open Syntax

EXTEND Gram
  expr: LEVEL "top"
    [ [ "for"; v = a_LIDENT; "in"; m = a_UIDENT; e = expr; "do";
        seq = sequence; "done" ->
          <:expr< $uid:m$.iter (fun $lid:v$ -> $seq$) $e$ >>
      ] ]
  ;
END

All the prerequisites are provided by the module Camlp4.PreCast, including the Syntax module that provides the terminals a_LIDENT and a_UIDENT (for lowercase and uppercase identifiers respectively), and the non-terminals expr (expressions) and sequence (semicolon separated expressions). The Gram module provides functions to manipulate non-terminal rules.

Suppose we save this syntax extension in the file called pa_foreach.ml. Notice that a syntax extension is just a regular O'Caml file with some fancy EXTEND syntax. It can be compiled using the command:

ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach.ml

FYI: Camlp4 is complicated to learn because a syntax extension consists of two domain-specific languages that are foreign to O'Caml: parsing and code embedding. To add insult to injury, Camlp4 has a "revised syntax" dialect of O'Caml that can be used, in addition to the original O'Caml dialect, both as the host (top-level) language and as the embedded language. This tutorial uses original O'Caml as the host language and revised dialect as the embedded language. This choice is reflected in the camlp4 executable that we use here, camlp4orf. See Using Camlp4 for the explanation and additional options.

We can verify that it works:

$ ocaml
        Objective Caml version 3.10.0

# #load "camlp4o.cma";;
        Camlp4 Parsing version 3.10.0

# #load "pa_foreach.cmo";;
# for s in List ["hello"; "world"] do print_endline s done;;
hello
world
- : unit = ()
#

Now we have a rudimentary syntax extension. However, it is quite restrictive. For example, it would be nice to do pattern matching like this:

for (k, v) in List [(0, "hello"); (1, "world")] do
  Printf.printf "%d: %s\n" k v
done

And it doesn't support Hashtbl.iter yet because, so far, we can only generate functions with one argument. Hashtbl.iter passes two arguments to the function, the key and the value. We will address these issues in the next section.

Supporting Multiple Patterns

We're now beginning to use Camlp4 features that are less documented. It is useful to keep in mind that syntax extension is modifying existing parsing rules, and we can't do without an understanding of how existing rules work. It is recommended that you take a quick look at two files in the O'Caml source distribution under ocaml-3.10.0/camlp4/Camlp4Parsers:

Camlp4OCamlRevisedParser.ml; this provides the bulk of the syntax rules.
Camlp4OCamlParser.ml implicitly borrows most of its rules from Camlp4OCamlRevisedParser, but clears some of the rules that are particular to the revised syntax and replaces them with the original O'Caml syntax.

From these files, we can see that ipatt rule supplies what we want for matching patterns, and LIST1 modifier allows us to take one or more patterns. There is also a LIST0 modifier for "zero or more" matching.

open Camlp4.PreCast
open Syntax

let rec mkfun _loc patts e =
  match patts with
  | p :: patts ->
      <:expr< fun $p$ -> $mkfun _loc patts e$ >>
  | [] ->
      e

EXTEND Gram
  expr: LEVEL "top"
    [ [ "for"; patts = LIST1 ipatt; "in"; m = a_UIDENT; e = expr; "do";
        seq = sequence; "done" ->
          let f = mkfun _loc patts seq in
          <:expr< $uid:m$.iter $f$ $e$ >>
      ] ]
  ;
END

In order to generate a higher-order function according to a list of patterns, we use the helper mkfun that essentially generates, for patterns patt1 ... pattN,

fun patt1 ->
  ...
    fun pattN ->
      seq

which is the desugared form for fun patt1 ... pattN -> seq.

We can verify that pattern works:

$ ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach.ml
$ ocaml
        Objective Caml version 3.10.0

# #load "camlp4o.cma";;
        Camlp4 Parsing version 3.10.0

# #load "pa_foreach.cmo";;
# for (k, v) in List [(0, "hello"); (1, "world")] do
    Printf.printf "%d: %s\n" k v
  done;;
0: hello
1: world
- : unit = ()

and that multiple arguments work:

# let tbl = Hashtbl.create 3;;
val tbl : ('_a, '_b) Hashtbl.t = <abstr>
# Hashtbl.add tbl 0 "hello";;
- : unit = ()
# Hashtbl.add tbl 1 "world";;
- : unit = ()
# Hashtbl.add tbl 2 "foobar";;
- : unit = ()
# for k v in Hashtbl tbl do
    Printf.printf "%d: %s\n" k v
  done;;
0: hello
1: world
2: foobar
- : unit = ()

There are still some subtle problems with this approach. In the next section, we'll discuss solutions to these problems.

Final Touch-up

The first problem we find out is that foreach body is supposed to be a sequence, but that doesn't work:

# for s in List ["hello"; "world"] do print_endline s; () done;;
  ;;
Failure: "expr; expr: not allowed here, use do {...} or [|...|] to surround them"

The problem is due to the desugared form, which becomes:

List.iter (fun s -> print_endline s; ()) ["hello"; "world"]

and this is ambiguous. Do we interpret the expression fun s -> print_endline s; () as (fun s -> print_endline s); () or fun s -> (print_endline (); ())? In order to avoid ambiguity, we wrap the sequence using do {$seq$} when it is a sequence. The function mksequence' in Camlp4OCamlRevisedParser.ml provides some inspiration how to handle this.

FYI: see this post by Nicolas Pouillard for a discussion of a similar issue.

We also notice that the original for-loop syntax ceases working:

# for i = 0 to 5 do print_int i done;;
Parse error: [ipatt] or [a_LIDENT] expected after "for" (in [expr])

It turns out that the new rule we add for "for" doesn't play nice with the original rule. The solution is to rewrite the original rule so it becomes compatible with the new rule.

The code listing that addresses these issues can be found below:

open Camlp4.PreCast
open Syntax

let mksequence _loc e =
  match e with
  | <:expr< $_$; $_$ >> as e -> <:expr< do {$e$} >>
  | e -> e

let rec mkfun _loc patts e =
  match patts with
  | p :: patts ->
      <:expr< fun $p$ -> $mkfun _loc patts e$ >>
  | [] -> mksequence _loc e

let lident_of_patt p =
  match p with
  | <:patt< $lid:i$ >> -> i
  | _ -> invalid_arg "lident_of_patt"

DELETE_RULE Gram
  expr:
    "for"; a_LIDENT; "="; sequence; direction_flag; sequence; "do"; do_sequence
END

EXTEND Gram
  expr: LEVEL "top"
    [ [ "for"; i = ipatt; "="; e1 = sequence; df = direction_flag;
       e2 = sequence; "do"; seq = do_sequence ->
         <:expr<
           for $lident_of_patt i$ =
             $mksequence _loc e1$ $to:df$ $mksequence _loc e2$
             do { $seq$ }
         >>
      | "for"; p = ipatt; patts = LIST0 ipatt; "in"; m = a_UIDENT; e = expr;
       "do"; seq = do_sequence ->
         let patts = p :: patts in
         let f = mkfun _loc patts seq in
         <:expr< $uid:m$.iter $f$ $e$ >>
      ] ]
  ;
END

In order for alternative rules to "fall through" correctly, rules must share a common prefix, and the rest of the non-terminals must be distinct enough. The way we structured the "for" syntax before, we essentially have the following two rules:

EXTEND Gram
  expr: LEVEL "top"
    [ [ "for"; patts = LIST1 ipatt; "in"; m = a_UIDENT; e = expr;
       "do"; seq = do_sequence ->
         (* ... *)
      | "for"; i = a_LIDENT; "="; e1 = sequence; df = direction_flag;
       e2 = sequence; "do"; seq = do_sequence ->
         (* ... *)
      ] ]
  ;
END

Once the "for" keyword is matched, the parser proceeds to match the next token, but ipatt also matches a_LIDENT (a lowercase identifier is also a legal pattern). At this point, the parser fixes on a rule (whichever comes first) and no longer backtracks to the alternative option.

The solution is to make the original for loop always match ipatt, but refuse anything except a lowercase identifier.

We can now verify that everything works as expected.

$ ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach.ml
$ ocaml
        Objective Caml version 3.10.0

# #load "camlp4o.cma";;
        Camlp4 Parsing version 3.10.0

# #load "pa_foreach.cmo";;
# let tbl = Hashtbl.create 3;;
val tbl : ('_a, '_b) Hashtbl.t = <abstr>
# Hashtbl.add tbl 0 "hello";;
- : unit = ()
# Hashtbl.add tbl 1 "world";;
- : unit = ()
# Hashtbl.add tbl 2 "foo";;
- : unit = ()
# for k v in Hashtbl tbl do Printf.printf "%d: %s\n" k v done;;
0: hello
1: world
2: foo
- : unit = ()
# for i = 0 to 5 do print_int i done;;
012345- : unit = ()
# for k, v in List [0, "hello"; 1, "world"; 2, "foo"] do
    Printf.printf "%d: %s\n" k v;
    ()
  done;;
0: hello
1: world
2: foo
- : unit = ()

Conclusion

We begin with a simple Camlp4 syntax extension for the for-each iterator syntax, then explored further refinements to that idea that works around parsing issues both in embedded code and in the rules.

This tutorial is written by Likai Liu. The text has not been peer reviewed, so use this at your own risk. Questions and comments are welcome.

Major revisions:

2007-08-20: initial draft.
2007-08-21: explained more about "final touch-up" + minor edits throughout.

Camlp4 3.10 » Foreach Tutorial