2011年5月25日水曜日

Planck: A Small Parser Combinator Library for OCaml

I have released Planck, a small monadic parser combinator library for OCaml. It includes a proof-of-concept implementation: an OCaml syntax lexer and parser which create a 100% identical AST to the original OCaml lex+yacc. It is available at https://bitbucket.org/camlspotter/planck .

Planck is yet another Parsec clone. I heard of Parsec as "a monad for parsing" last year, and started writing Planck from that phrase alone, deliberately avoiding learning what Parsec really is. Sometimes I need this kind of puzzle to enjoy solving. :-) Interestingly, in retrospect the implementation of Planck became quite similar to Parsec. Therefore, once the basic functionality was implemented, I imported some of Parsec's combinator names, such as <|>, into Planck.

At first this was aimed to be my April Fools' project for this year. Unfortunately, the earthquake hit Japan on 3/11. I, my family and my relatives are all OK, and Tokyo is safe. But the event surely slowed down the development.

Planck internals

Planck consists of two layers: one for streams (lazy lists), whose module names start with `S', and the other for parser combinators, whose module names start with `P'.

The implementation is quite straightforward. Streams define lazy lists with positions and other attributes, and parser combinators provide lazy list manipulation over them. The combinators are monadic and easily composable using bind (>>=). For example, the following is an extract from the rule for hex literals like 0xdead_beef:

(* Using pa_monad's perform, aka Haskell's do *)

let zero = token '0'

let hex_char : char t =
  tokenp (function
    | '0' .. '9' | 'A' .. 'F' | 'a' .. 'f' -> true
    | _ -> false)
     "hex"

let char_or_underscores : 'a t -> string t = fun f ->
  matched (perform
             ignore f;
             ?* (ignore f <|> underscore))

let hex_literal : string t = perform
  matched (perform
    zero;
    ignore (one_of ['x'; 'X']);
    ignore (char_or_underscores hex_char))
For efficiency, the Sbuffer and Pbuffer modules provide (hopefully) fast handling of char streams and their parsing. The Smemo module provides memoization over stream elements, which is indispensable for an efficient parser with backtracking.

Planck also provides a module Op_prec for operator associativity and precedence. Therefore, it is ready for defining parsers of programming languages with rather complex syntax.
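To give an idea of what such a module is for (this is a generic illustration, not Op_prec's actual interface), here is a minimal, self-contained sketch: given an already-parsed first operand and a flat list of (infix operator, operand) pairs, a precedence/associativity table determines how to fold them into a tree, by the usual precedence climbing.

(* Hypothetical table: operator name -> (precedence, associativity). *)
type assoc = Left | Right
type expr = Int of int | Bin of string * expr * expr

let table = [ "+", (1, Left); "-", (1, Left); "*", (2, Left); "**", (3, Right) ]

(* Precedence climbing: e.g. 1 + 2 * 3 folds to
   Bin ("+", Int 1, Bin ("*", Int 2, Int 3)). *)
let resolve (first : expr) (rest : (string * expr) list) : expr =
  let rec climb min_prec lhs rest =
    match rest with
    | (op, operand) :: rest' ->
        let prec, assoc = List.assoc op table in
        if prec < min_prec then (lhs, rest)
        else
          let next_min = (match assoc with Left -> prec + 1 | Right -> prec) in
          let rhs, rest'' = climb next_min operand rest' in
          climb min_prec (Bin (op, lhs, rhs)) rest''
    | [] -> (lhs, [])
  in
  fst (climb 0 first rest)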

Experience report: Parsing OCaml source code with Planck

As a working example of Planck, I have written a lexer and a parser for the OCaml syntax in Planck, and managed to produce an AST identical to that of the original ocamllex+ocamlyacc! Here "identical" means exactly the same tree, with exactly the same position information, which is impossible for the current CamlP4 (it has some bugs around position handling...).

BTW, if you already have a complex lex+yacc syntax definition for your language, just stick to it. LL(n) parser combinators are great if you write a syntax from scratch; porting lex+yacc to LL(n) is just error prone and a waste of time.

Porting lex rules

The lexer part reads a char stream and produces a stream of lex tokens.

Porting the lexer definition lexer.mll to Planck was done by hand, since lexer.mll is not very big (just 302 lines). It was not a complicated task to port its pairs of a regexp and an action into Planck; the only thing I had to be careful about was the fact that Lex takes the longest match. You can see the source code in planck/ocaml/lex.ml .
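For instance, where Lex automatically prefers the longest match, combinator alternatives sharing a prefix must be ordered by hand so that the longer ones are tried first. A hedged sketch: string here is a hypothetical combinator matching a literal string, MINUSGREATER and MINUS are token names from OCaml's parser.mly, and <!> is Planck's backtracking choice, explained later in this post.

(* Longest match first: try "->" before "-".  <!> backtracks when "->"
   fails after having already consumed the '-'. *)
let minus_or_arrow =
      (string "->" >>= fun _ -> return MINUSGREATER)
  <!> (token '-'   >>= fun _ -> return MINUS)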

Porting yacc rules

The parser part mimics yacc's behavior: it reads the stream of lex tokens created by the lexer above and returns a parsed AST.

Porting the yacc rules in parser.mly was hard, since Yacc is bottom-up LALR(1) while Planck (like Parsec) is top-down LL(n).

I am not a parsing researcher, nor do I plan to get another PhD in parsing, so I just took a stupid, simple method with a bit of brute force. If you know a nicer way to convert Yacc to LL(n) automagically, please tell me.

Eliminating left recursions

Unlike lexer.mll (302 lines), parser.mly is huge (1669 lines), and hand-porting those rules one by one had to be avoided. Therefore, I first wrote a small parser of .mly files in Planck, planck/ocaml/ocamlyacc.ml, and obtained the rules and actions. Then I analyzed the dependency graph of those rules to find the left recursions. Left recursion is trouble for Planck (and other LL parsers) and must be removed: http://en.wikipedia.org/wiki/Left_recursion .
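For the direct case, detecting left recursion in the extracted rules is nearly a one-liner; here is a sketch over a hypothetical grammar representation (finding the mutual cases is what the dependency graph analysis is for):

(* A rule is directly left recursive if one of its cases starts with
   the rule itself. *)
type symbol = Term of string | NonTerm of string
type rule = { name : string; cases : symbol list list }

let directly_left_recursive (r : rule) : bool =
  List.exists
    (function NonTerm n :: _ -> n = r.name | _ -> false)
    r.cases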

Luckily enough, I found that there are only 2 groups of mutual left recursion in the OCaml syntax rules; all the other left recursions are direct. Eliminating direct left recursion is simple, and the resulting code is still understandable for human beings. (It is also possible to eliminate mutual left recursion automatically... but the resulting code is usually incomprehensible.) I wrote a small function to eliminate direct left recursions automatically, then hand-converted the 2 mutually left recursive groups. One was expr and expr_comma_list, and the other was pattern and pattern_comma_list. As the names suggest, they are almost the same and were easy to translate. As a result I got planck/ocaml/plparser.ml, a source file of nearly 5000 lines (ugh).

Here is an extract from plparser.ml, the translated rule for core_type2:

method core_type2 = leftrec "core_type2" self#core_type2_nonleftrec self#core_type2_leftrec

method core_type2_nonleftrec = (
      case "core_type2_nonleftrec_0" (perform

         token QUESTION;
         v_2 <-- get_LIDENT;
         token COLON;
         (* v_4 <-- self#core_type2 ; *)
         v_4 <-- self#simple_core_type_or_tuple ; (* ??? *)
         token MINUSGREATER;
         v_6 <-- self#core_type2 ;

         return (fun () ->  mktyp(Ptyp_arrow("?" ^ v_2 ,
             {ptyp_desc = Ptyp_constr(Ldot (Lident "*predef*", "option"), [v_4]);
              ptyp_loc = v_4.ptyp_loc}, v_6)) ))

  <|> case "core_type2_nonleftrec_2" (perform

         v_1 <-- get_OPTLABEL;
         (* v_2 <-- self#core_type2 ; *) (* eats too much *)
         v_2 <-- self#simple_core_type_or_tuple ; (* ??? *)
         token MINUSGREATER;
         v_4 <-- self#core_type2 ;

         return (fun () ->  mktyp(Ptyp_arrow("?" ^ v_1 ,
             {ptyp_desc = Ptyp_constr(Ldot (Lident "*predef*", "option"), [v_2]);
              ptyp_loc = v_2.ptyp_loc}, v_4)) ))

  <|> case "core_type2_nonleftrec_1" (perform

         v_1 <-- get_LIDENT;
         token COLON;
         (* v_3 <-- self#core_type2 ; *)
         v_3 <-- self#simple_core_type_or_tuple ; (* ??? *)
         token MINUSGREATER;
         v_5 <-- self#core_type2 ;

         return (fun () ->  mktyp(Ptyp_arrow(v_1, v_3, v_5)) ))

  <!> case "core_type2_nonleftrec_3" (perform

         v_1 <-- self#simple_core_type_or_tuple ; (* may start with LIDENT *)

         return (fun () ->  v_1 ))

    )

method core_type2_leftrec v_1 = (
      case "core_type2_leftrec_0" (perform

         token MINUSGREATER;
         v_3 <-- self#core_type2 ;

         return (fun () ->  mktyp(Ptyp_arrow("", v_1, v_3)) ))

    )
It is hard to read at a glance, but the basic structure is simple. The rule core_type2 is left recursive in Yacc, so it is disassembled into two sub-rules, core_type2_nonleftrec and core_type2_leftrec, for left recursion elimination. The function leftrec at core_type2 is responsible for assembling the results of these two rules back together. Rules are defined as methods and form a huge class, so you can extend the parsing rules by class inheritance, if you dare...
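To show the idea of the driver, here is a minimal sketch of a leftrec-style combinator, written only with the monadic interface appearing in this post (return, >>=, <|>); the real leftrec in plparser.ml also has to deal with memoization and the thunked actions seen above:

(* Parse one non-left-recursive head, then repeatedly extend the result
   to the left while a left-recursive step still applies. *)
let leftrec_sketch (nonleftrec : 'a t) (step : 'a -> 'a t) : 'a t =
  let rec loop v =
    (step v >>= fun v' -> loop v')   (* one more left-recursive extension *)
    <|> return v                     (* no further extension: done *)
  in
  nonleftrec >>= loop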

A sub-rule consists of cases, which correspond exactly to the left-recursion-eliminated Yacc cases. These cases are listed by the <|> or <!> operators. <|> is the same as Parsec's <|>. <!> is <|> with backtracking: x <!> y == try x <|> y in Parsec.
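As a rough illustration only, not Planck's actual definitions, the two operators can be sketched over the monad type shown later (pbase.ml), under the assumptions that Str.position exists and that an error carries the position where it was raised:

(* x <!> y: if x fails, always backtrack to the saved stream st and try y. *)
let ( <!> ) (x : 'a t) (y : 'a t) : 'a t = fun st ->
  match x st with
  | Ok _ as res -> res
  | Error _ -> y st

(* x <|> y: try y only when x failed without consuming any input. *)
let ( <|> ) (x : 'a t) (y : 'a t) : 'a t = fun st ->
  match x st with
  | Ok _ as res -> res
  | Error (pos, _) when pos = Str.position st -> y st
  | Error _ as err -> err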

A case is a sequence of symbol parsers combined by monadic binds. Here I use the pa_monad CamlP4 syntax sugar (perform and <--). For example, you can see that the first case of core_type2_nonleftrec is equivalent to this ocamlyacc case description:

QUESTION LIDENT COLON simple_core_type_or_tuple MINUSGREATER core_type2
The action part is in the monadic return. The code is just copy-and-pasted automatically by the conversion program. The functions used in the action parts are wrapped to provide the correct position information. That was... the hardest part; I had to use some bad-mannered side effects. But once it was done, the parser started to create ASTs nicely identical to the original's.
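To make the trick concrete, here is a hedged sketch of what such wrapping can look like, mimicking ocamlyacc's implicit symbol location; this is an illustration of the idea, not the actual code in plparser.ml:

(* Bad-mannered side effect: the parsing engine stores the start/end
   positions of the tokens consumed by the current case in a ref... *)
let symbol_rloc : (Lexing.position * Lexing.position) ref =
  ref (Lexing.dummy_pos, Lexing.dummy_pos)

(* ...and the wrapped constructors read it, like Parsing.symbol_start_pos
   and friends do inside ocamlyacc actions. *)
let mktyp desc =
  let (loc_start, loc_end) = !symbol_rloc in
  { ptyp_desc = desc;
    ptyp_loc = { Location.loc_start = loc_start;
                 Location.loc_end   = loc_end;
                 Location.loc_ghost = false } }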

Backtracking and memoization

(Maybe Parsec does the same thing for backtracking, but I am not sure.)

Usually in LL parser combinators, backtracking (<!> in Planck, try in Parsec) must be avoided as much as possible, since it may perform tons of unnecessary computations many times and drag down the performance of your parser. Therefore you are advised to use <|>, the or-combinator without backtracking: if the left argument of <|> fails after consuming some tokens, <|> immediately fails without trying its right argument. And to use <|> you have to modify your rules... by hand, normally.

Writing rules without backtracking is not very hard when you write the syntax from scratch. It is horribly boring, however, when you are porting an existing huge set of Yacc rules to parser combinators without backtracking. So I simply gave up on backtrack elimination.

Instead, I introduced memoization. The results of parsing method calls are recorded in memos attached to each stream element. If the same method is called at the same stream element, the saved result is returned immediately instead of being recomputed. Of course, memoization requires your methods to be more or less pure, but that is normally true for parsing. The memoization works pretty nicely, and I could get the results in "realistic" time. Without memoization... things often go exponential and you never get the results.
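For illustration, here is a minimal sketch of memoizing a single rule, keyed by stream position instead of being attached to the stream elements as Smemo actually does; it assumes the monad type shown later (pbase.ml) and a Str.position accessor:

(* Remember, per stream position, the whole outcome of the rule (value
   and rest of the stream, or error) and replay it on re-entry. *)
let memoize (rule : 'a t) : 'a t =
  let table : (int, ('a * Str.t, error) Result.t) Hashtbl.t =
    Hashtbl.create 1023
  in
  fun st ->
    let pos = Str.position st in
    try Hashtbl.find table pos        (* replay the recorded result *)
    with Not_found ->
      let res = rule st in
      Hashtbl.add table pos res;
      res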

Debugging: trial and error

Once the basic translation was done, testing began. You can see lots of my test files at planck/ocaml/test*.ml . Unfortunately, left recursion elimination is not the only conversion required from Yacc to Planck (or Parsec): both frameworks have a notion of rule precedence, and they seem to be incompatible. I had to tweak the ordering of cases a lot on a trial-and-error basis. I also had to move a small number of cases (fewer than 10, as far as I remember) from one rule to another. It was straightforward if you remember how yacc works. It was also fun to see the coverage rate of my parser gradually go up to 100%.

Finally, my parser can now parse all the vanilla OCaml .ml and .mli source code (not flavored by CamlP4 or other preprocessors) in the OCaml compiler itself (one of the biggest OCaml applications publicly available) and in LablGtk (one of the biggest OCaml OO libraries), and it produces ASTs identical to those of ocamllex+ocamlyacc. Of course this is not a proof of correctness, but I believe it covers a pretty good part of the OCaml syntax. If you are lucky enough to compile Planck (which is hard, since it depends on lots of things), you can try planck/ocaml/ocamltest.sh and planck/ocaml/ocamltest-lablgtk.sh to see the test results.

Performance of Planck

Planck parses all the OCaml compiler source code in a realistic time. Great. Nice. Period. What? How fast is it compared to ocamllex+ocamlyacc? You dare ask? Sigh...

OK, it is... x100 slower. In my environment, Planck parses all the OCaml compiler source code in around 100 secs, while ocamllex+ocamlyacc parse it in less than 1 sec.

I can say, however, that x100 slowness is not a bottleneck. For example, compiling the whole OCaml compiler takes minutes. You might not know that ocamllex and ocamlyacc are not 100% written in OCaml but use some C for efficiency. Still you say Planck is slow? Hmmm. OK.

I have quickly considered why Planck is so slow. All of the following are my guesses:

Char streams.
The speed ratios of the lexer (planck/ocaml/lex.ml) and the parser (planck/ocaml/plparser.ml) are almost the same: x100. I had expected better performance from the lexer, since I wrote specialized stream and parser modules for chars (Sbuffer and Pbuffer). Probably I have not improved them enough yet.
Backtracking and memoization.
The backtracking rate in the lexer is 0.2%; that is, 0.2% of monadic bind calls are wasted and their results thrown away. So the lexer's backtracking is not a big issue, and I actually put memoization only into the parser, not the lexer. The parser's backtracking rate is now 57%. You may think that is huge. However, I had almost the same parsing speed when I had tons of backtracks (<!>) and a rate of 90%. What I saw was: removing more backtracking by replacing <!> with <|>, the total number of bind calls became larger. So in my case the golden rule of LL(n) parser combinators, "avoid backtracking as much as possible", was not very true. It is also possible that too many cache queries slow down the parser; for now, all the parsing rule methods query the cache. Maybe some cleverer cache storage/query strategy would improve the performance.
No pattern matching, but sequences of if-then-elses.
The parsing rules are huge lists of cases connected by <|> operators; simply speaking, sequences of if-then-elses. It would be great if these sequences of if-then-elses could be simplified into one pattern match, or at least one hash table query... (see the sketch after this list).
Too many monadic bind calls.
If you execute ocamltest.sh, you can see how many times the monadic bind is called to parse the entire OCaml compiler source: 1 billion!! (50% in the lexer and 50% in the parser.) OCaml is very great, since it performs 1 billion bind calls (and closure creations) in 1 minute! However, the results of subtests indicate that the parsing time is almost proportional to the number of bind calls.
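On the if-then-else point above, one direction is to dispatch on the first token with a single table lookup instead of trying the alternatives one by one. A hedged sketch, where token is the lexer's token type and peek is a hypothetical combinator returning the next token without consuming it:

(* Look at the next token once and jump directly to the case registered
   for it, falling back to [default] otherwise. *)
let dispatch (cases : (token * 'a t) list) (default : 'a t) : 'a t =
  let table = Hashtbl.create 101 in
  List.iter (fun (tok, p) -> Hashtbl.add table tok p) cases;
  peek >>= fun tok ->
  (try Hashtbl.find table tok with Not_found -> default)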

Is OCaml not appropriate for heavy monadic programming?

Elimination of binds by inlining + partial evaluation

We should get better results simply by reducing the number of bind calls. Let's discuss this issue further.

Planck's monad is a pretty simple functional one (pbase.ml):

type 'a t = Str.t -> ('a * Str.t, error) Result.t

let return v = fun st -> Ok (v, st)

let bind t f = fun st ->
  match t st with
  | Ok (r, st') -> f r st'
  | Error s -> Error s
The monad type 'a t is a function which takes a stream and does something; if successful, it returns a result value of type 'a together with the rest of the stream after the job. If it fails, it returns an error. Ok and Error come from a simple result monad defined in result.ml (aka Haskell's Either).
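For reference, a result.ml in the spirit described can be as small as this sketch (the Error payload type is left abstract):

(* A minimal Either-like result monad, as used by the parser monad above. *)
module Result = struct
  type ('a, 'err) t = Ok of 'a | Error of 'err

  let bind t f = match t with
    | Ok v -> f v
    | Error e -> Error e
end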

Look at the definition of bind: one application of bind, bind t f, creates a closure. OCaml is basically a functional language, so closure construction is cheap, but its cost is not negligible. If we could somehow eliminate this closure construction, the performance should get better.

Actually, in Planck, and probably in other functional monadic libraries too, there are lots of opportunities for this kind of closure construction elimination:

t >>= fun r st -> e[r,st]
Here, e[r,st] is an expression with free occurrences of r and st in it. This form of expression is found everywhere in the parsing rules. If we expand the bind (>>=), it is equal to the following:
fun st' ->
  match t st' with
  | Ok (r, st) -> (fun r st -> e[r,st]) r st
  | Error s -> Error s
The expression of the Ok case is actually a redex (the inner function is immediately applied to r and st), and reducing it we get:
fun st' ->
  match t st' with
  | Ok (r, st) -> e[r,st]
  | Error s -> Error s
Thus we now have one less closure creation! If this expression is preceded by another bind (t' >>= fun r' -> ...), we can do the same conversion again and again.

Inlining + partial evaluation by CamlP4

If this inlining and partial evaluation could be done automatically by the OCaml compiler, it would be pretty nice. But unfortunately, the current version of the compiler does not seem to perform this kind of optimization eagerly.

If the compiler does not do the job, I can do it myself, and I did. I wrote a small CamlP4 syntax extension module, planck/pa_bind_inline/pa_bind_inline.ml, which performs the inlining and automatic reductions described above around (>>=), (<|>) and (<!>):

(* Extract from planck/pa_bind_inline/pa_bind_inline.ml *)

module Make (AstFilters : Sig.AstFilters) = struct
  open AstFilters

  let bind _loc t f =
    match f with
    | <:expr< fun $lid:x$ $lid:st$ -> $e$ >> ->
        (* rewrite [bind t (fun x st -> e)] *)
        <:expr< fun st__ ->
          Profile.incr ();
          match $t$ st__ with
          | Result.Ok ($lid:x$, $lid:st$) -> $e$
          | Result.Error e -> Result.Error e >>,
        true

    | ... (* rules for <|> and <!> follow *)

  let _ =
    let simplify = object (self)
      inherit Ast.map as super

      method! expr e =
        match super#expr e with
        | <:expr@_loc< $lid:x$ $t$ $f$ >> when x = ">>=" || x = "bind" ->
            (* use [bind] to rewrite [bind t f] *)
            let e, x = bind _loc t f in
            if x then self#expr e else e

        | ... (* rules for <|> and <!> follow *)

        | x -> x
    end
    in AstFilters.register_str_item_filter simplify#str_item

end
With this hack, the speed ratio to ocamllex/ocamlyacc went down to x70, which means a x1.4 speedup. If you are not satisfied with this: I still see lots of calls that are not inlined, so you can inline more of them.

Current OCaml is too simple, not ready for the Monad boom from the Haskell world. But how about the future?

As we have seen, I unfortunately have the feeling that the current version of the OCaml compiler (3.12.0) is too simple to write efficient programs that make heavy use of functional monads like Planck. Compositions of functional monads by bind normally offer lots of opportunities for inlining and partial evaluation (the reductions above), but OCaml cannot take advantage of them. OCaml has proven that strongly typed functional programming languages can provide scalable and fast executables even with a rather simple compiler implementation without fancy optimizations. But now OCaml's simplicity seems to be becoming a weakness compared to its cousins like F# and Haskell...

There is hope, however. Recently OCamlPro http://www.ocamlpro.com/ was founded, and these guys are working hard on various compiler enhancements, such as inline-more ( https://github.com/OCamlPro/OCamlPro-OCaml-Branch ). It does not work well on my parser example for the moment, but I am really looking forward to seeing the branch boost it one day!