Chapter 12
Lexer and parser generators (ocamllex, ocamlyacc)
This chapter describes two program generators: ocamllex, which
produces a lexical analyzer from a set of regular expressions with
associated semantic actions, and ocamlyacc, which produces a parser
from a grammar with associated semantic actions.
These program generators are very close to the well-known lex and
yacc commands that can be found in most C programming environments.
This chapter assumes a working knowledge of lex and yacc: while
it describes the input syntax for ocamllex and ocamlyacc and their
main differences from lex and yacc, it does not explain the basics
of writing a lexer or parser description in lex and yacc. Readers
unfamiliar with lex and yacc are referred to "Compilers:
Principles, Techniques, and Tools" by Aho, Sethi and Ullman
(Addison-Wesley, 1986), or "Lex & Yacc" by Levine, Mason and
Brown (O'Reilly, 1992).
12.1
Overview of ocamllex
The ocamllex command produces a lexical analyzer from a set of regular
expressions with attached semantic actions, in the style of
lex. Assuming the input file is lexer.mll, executing
        ocamllex lexer.mll
produces Caml code for a lexical analyzer in file lexer.ml.
This file defines one lexing function per entry point in the lexer
definition. Each function has the same name as its entry
point. A lexing function takes a lexer buffer as argument and returns
the semantic attribute of the corresponding entry point.
Lexer buffers are an abstract data type implemented in the standard
library module Lexing. The functions Lexing.from_channel,
Lexing.from_string and Lexing.from_function create
lexer buffers that read from an input channel, a character string, or
any reading function, respectively. (See the description of module
Lexing in chapter 18.)
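As an illustration of the three constructors just mentioned, here is a minimal sketch. It assumes OCaml 4.02 or later, where the reading function passed to Lexing.from_function receives a bytes buffer. The names buf_of_string, buf_of_channel, buf_of_function and make_reader are hypothetical helpers introduced only for this example; they are not part of the Lexing module.

```ocaml
(* Build a lexer buffer from a string held in memory. *)
let buf_of_string (s : string) : Lexing.lexbuf = Lexing.from_string s

(* Build a lexer buffer from an input channel, for example stdin or a
   channel returned by open_in. *)
let buf_of_channel (ic : in_channel) : Lexing.lexbuf = Lexing.from_channel ic

(* make_reader (hypothetical helper) returns a reading function of the
   shape expected by Lexing.from_function: given a buffer and a maximum
   count n, it copies at most n characters into the buffer and returns
   how many were copied. Returning 0 signals end of input. *)
let make_reader (source : string) : bytes -> int -> int =
  let pos = ref 0 in
  fun buf n ->
    let len = min n (String.length source - !pos) in
    Bytes.blit_string source !pos buf 0 len;
    pos := !pos + len;
    len

(* Build a lexer buffer from an arbitrary reading function. *)
let buf_of_function (source : string) : Lexing.lexbuf =
  Lexing.from_function (make_reader source)
```

Any of the resulting buffers can be passed to a lexing function generated by ocamllex. The reading-function form is the most general one: the same approach would work for input arriving from a socket or an interactive prompt, as long as the function reports 0 at end of input.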
When used in conjunction with a parser generated by ocamlyacc, the
semantic actions compute a value belonging to the type token defined
by the generated parsing module. (See the description of ocamlyacc
below.)
12.2
Syntax of lexer definitions
The format of lexer definitions is as follows:
{ header }
let ident = regexp ...
rule entrypoint =
  parse regexp { action }
      | ...
      | regexp { action }
and entrypoint =
p