Language processors frequently have to manipulate the language in the source level. Instead of depending on external tools such as parser generators and lexer generators, parsing and lexical analysis constructs are integrated with the Prop language in the form of syntax classes and lexer classes.
In addition, higher level syntactic formalisms are also available. Some of these are:
Lexical analysis
The scanner class mechanism in Prop provides functionalities similar to what is offered in modern lexical analyzer generators such as lex[Les75], flex[Pax90], and dlg[PDC91] etc. Instances of scanner classes are simply lexical scanners (with possibly additional encapsulated states) taking input from a I/O stream and generating a stream of lexemes.
Unit datatypes (i.e. algebraic datatypes with only non-argument taking variants) can be used to stand for lexemes generated by a lexical scanner. These lexemes are also used as terminals inside a syntax definition. Optional string and/or regular expression patterns may be attached to a lexeme, which can used to specialize a lexical scanner inside a scanner class. For example, the following datatype definition defines a set of lexemes for the C language.
datatype C_Token :: lexeme = IF "if" | ELSE "else" | FOR "for" | WHILE "while" | GOTO "goto" | CONTINUE "continue" | BREAK "break" | RETURN "return" | ID /[a-zA-Z_][a-zA-Z_0-9]*/ | INTEGER /[0-9]+/ | STRING /"([^"\n]|\\ .)"/ | CHARACTER /'([^'\n]|\\ .)'/ | ... // etc. ;
In the above, we use the constructor IF
stands for the
lexeme "if"
, ELSE
to stand for "else"
, etc. However,
it is frequently more convenient to simply use the same name for
the abstract and the printed representation for keywords.
We allow the user to abbreviate the lexeme constructors definitions
by specifying only the printed representations. For example, we may
rewrite the previous datatype definition as follows:
datatype C_Token :: lexeme = "if" | "else" | "for" | "while" | "goto" | "continue" | "break" | "switch" | "case" | "return" | "default" | "struct" | "union" | "register" | "volatile" | "extern" | "static" | ID /[a-zA-Z_][a-zA-Z_0-9]*/ | INTEGER /[0-9]+/ | STRING /"([^"\n]|\\ .)"/ | CHARACTER /'([^'\n]|\\ .)'/ | ... // etc. ;
Now, to refer to the constructor with the printing representation "if"
,
T"if"
can be used. In general, T"???"
refers to
the token with a printed representation of "???"
.
Lexeme abbreviations
Complex regular expressions can be built from simple ones using lexeme aliases. These aliases are defined within the lexeme alias definitions. For example, complex regular expressions standing for integers, reals and friends can be defined by composition as follows. Aliases appearing in regular expressions are quoted by braces, a syntactic convention inherited from yacc-like tools:
lexeme digits = /[0-9]+/ | sign = /[\-\+]/ | integer = /{sign}?{digits}/ | exponent = /[eE]{integer}/ | fraction = /\.{digits}/ | mantissa = /{sign}?({fraction}|{digits}{fraction}?)/ | real = /{mantissa}{exponent}?/ ;
Lexeme aliases can be reused inside any regular expressions, including patterns inside lexeme datatype definitions.
Lexeme classes
Parser construction
The syntax class mechanism in Prop is the functional equivalent of popular parser generators such as yacc[], or bison[]. A syntax class is used to encapsulate the state of a parser for one grammar. Grammar rules are specified in a BNF-like form similar to that of yacc. Prop currently generates table driven LALR(1) parsers. Shift/reduce conflicts can be resolved using optional operator precedence information, or with optional semantics predicates similar to that in PCCTS[PDC91]. Actions may be attached to a grammar, and during parsing, these actions have access to the member data and functions, and inherited and synthesized attributes.
Consider the following partial parser specification for a simple language.
syntax class SmallLang { public: SmallLang(istream&) : Super(istream&) {} ~SmallLang() {} };
The production rules of the language are encapsulated in a syntax definition, which also contains the precedence and associativity rules for operators. Notice that terminals may be a single character or a predefined lexeme datatype.
syntax SmallLang { left: 1 '*' '/'; left: 2 '+' '-'; stmt(Stmt): IF expr THEN stmt ELSE stmt ';' { $$ = IFstmt($2,$4,$6); } | WHILE expr DO stmt DONE ';' { $$ = WHILEstmt($2,$4); } | expr '=' expr ';' { $$ = ASSIGNstmt($1,$3); } | BEGIN stmt_list END ';' { $$ = BLOCKstmt($2); } ; expr(Exp): integer { $$ = INT($1); } | ident { $$ = ID($1); } | expr '+' expr { $$ = ADD($1,$3); } | expr '-' expr { $$ = SUB($1,$3); } | expr '*' expr { $$ = MUL($1,$3); } | expr '/' expr { $$ = DIV($1,$3); } | '-' expr { $$ = UMINUS($2); } ; integer(int): INTEGER { $$ = atol(lexer.text()); }; id(int): IDENTIFIER { $$ = strdup(lexer.text()); }; };
It is often more convenient to specify the terminals (lexemes) in their printed form rather than in their encoded form. As a short hand, the printed form of a lexeme can be used in place of the lexeme itself inside a syntax specification. For example, the statement productions can be specified alternatively as follows:
stmt(Stmt): "if" expr "then" stmt "else" stmt ';' { $$ = IFstmt($2,$4,$6); } | "while" expr "do" stmt "done" ';' { $$ = WHILEstmt($2,$4); } | expr '=' expr ';' { $$ = ASSIGNstmt($1,$3);} | "begin" stmt_list "end" ';' { $$ = BLOCKstmt($2); } ;
The structure of the abstract syntax tree can be specified as follows:
datatype Stmt = IFstmt(Exp,Stmt,Stmt) | WHILEstmt(Exp,Stmt) | ASSIGNstmt(Exp,Exp) | BLOCKstmt(List<Stmt>) and Exp = INT (int) | ID (const char *) | UMINUS (Exp) | ADD (Exp, Exp) | SUB (Exp, Exp) | MUL (Exp, Exp) | DIV (Exp, Exp) and List<T> = #[] | #[ T ... List<T> ] ;
Inherited and synthesized attributes
Semantic predicates
Real life programming languages commonly have non-context free fragments that make it difficult to fit into a parser generator's framework. For example in C , the code fragment
T * U;
can be either a variable declaration if T
is a type identifier, or
an expression if T
is a variable. To deal with these types of
irregularities, parsers, lexers and the semantic analysis routines frequently
have to cooperate and provide feedback to each other during the parsing
process. Frequently this involves the insertion of non-declarative actions
inside the parser or lexer specifications.
We borrow a new construct, the semantic predicate, which allows a clean separation of these context sensitive feedback, from the parser generator tool PCCTS[PDC91] Aside from terminals, non-terminals and semantic actions, a production can also contain semantic predicates. During parsing, these semantic predicates are evaluated, and productions which contain unsatisfied predicates will be rejected.
For example, to deal with the type/variable identifier ambiguity in a language like C , the following productions are needed.
Stmt: id (is_type_id($1)) '*' id ';' // variable definition | id '*' id ';' // otherwise, an expression
Pretty printing
Pretty printing of datatypes in an external format
is a frequently needed operation
and Prop also provide a set of formalism to allow rapid specification
of pretty printers. By default, if the printable
qualifier is
used within the definition of a datatype, a pretty printer will be
automatically generated to print a datatype value in Prop syntax.
For example, if the expression type Exp
is defined as:
datatype Exp :: printable = ...
the pretty printer method
ostream& operator << (ostream&, Exp);
will be automatically generated. This pretty printer can be used as a debugging aid, for example, during rapid prototyping.
To get more refined output, the user can attach datatypes with
pretty printing formats, which are simply specifications of how a datatype
can be linearized for printing. Printing format strings are attached
to each constructor after the separator =>
within a datatype
definition. For example, in the following we specify that integers
and identifiers in the datatype Exp
are to be printed as-is,
while the binary arithimetic expressions are to be printed with surrounding
parentheses:
datatype Exp = INT (int) => _ | ID (const char *) => _ | ADD (Exp, Exp) => "(" _ " + " _ ")" | SUB (Exp, Exp) => "(" _ " - " _ ")" | MUL (Exp, Exp) => "(" _ " * " _ ")" | DIV (Exp, Exp) => "(" _ " / " _ ")" ;
The underscore character _
above is an example of a
meta format character. Its meaning is to print the next available
argument of the constructor using whatever format appropriate for its
type. Other meta format characters are also available, which deal with
issues such as argument ordering, indention and parenthesization, etc.
Please see figure 1 for a detailed listing.
Figure 1: Meta pretty printing formats.
As another example, consider the following datatype definition
of an AST for statements. The meta-characters {
and }
are used to indent statements occurring in nested scopes:
datatype Stmt = IF { cond : Exp, yes : Stmt, no : Stmt } => "if" cond "then" { yes } "else" { no } "endif;" | WHILE (Exp, Stmt) => "while" _ "do" { _ } "end do;" | ASSIGN (Id, Exp) => _ ":=" _ ";" | BLOCK (List<Stmt>) => "begin" { _ } "end;"
Thus using this definition, the datatype expression
WHILE(ID("x"), #[ ASSIGN("x",SUB(ID("x"),INT(1))), ASSIGN("y",MUL(ID("y"),ID("x"))) ]);
is pretty printed as
while x do begin x := x - 1; y := y * x; end; end do;
Generally speaking, the pretty printing generation capability of Prop is meant to be used for simple data structures. Complex printing mechanisms can be handled using the pattern matching or rewriting constructs.