Syntactic formalisms

Next: Meta syntax Up: The Language Previous: Inference

Syntactic formalisms

Language processors frequently have to manipulate the language in the source level. Instead of depending on external tools such as parser generators and lexer generators, parsing and lexical analysis constructs are integrated with the Prop language in the form of syntax classes and lexer classes.

In addition, higher level syntactic formalisms are also available. Some of these are:

Lisp-like meta quoting. Using meta-quoting, programs can manipulate a language in its source level syntax. Translation into the abstract syntax is done automatically by the translator. This makes it possible to express complex program transformations using a succinct syntax.
Pretty printing. Algebraic datatypes may specify their pretty printed format in a declarative manner.

Lexical analysis

The scanner class mechanism in Prop provides functionalities similar to what is offered in modern lexical analyzer generators such as lex[Les75], flex[Pax90], and dlg[PDC91] etc. Instances of scanner classes are simply lexical scanners (with possibly additional encapsulated states) taking input from a I/O stream and generating a stream of lexemes.

Unit datatypes (i.e. algebraic datatypes with only non-argument taking variants) can be used to stand for lexemes generated by a lexical scanner. These lexemes are also used as terminals inside a syntax definition. Optional string and/or regular expression patterns may be attached to a lexeme, which can used to specialize a lexical scanner inside a scanner class. For example, the following datatype definition defines a set of lexemes for the C language.

  datatype C_Token :: lexeme
     = IF        "if"
     | ELSE      "else"
     | FOR       "for"
     | WHILE     "while"
     | GOTO      "goto"
     | CONTINUE  "continue"
     | BREAK     "break"
     | RETURN    "return"
     | ID        /[a-zA-Z_][a-zA-Z_0-9]*/
     | INTEGER   /[0-9]+/
     | STRING    /"([^"\n]|\\ .)"/
     | CHARACTER /'([^'\n]|\\ .)'/
     | ...  // etc.
     ;

In the above, we use the constructor IF stands for the lexeme "if", ELSE to stand for "else", etc. However, it is frequently more convenient to simply use the same name for the abstract and the printed representation for keywords. We allow the user to abbreviate the lexeme constructors definitions by specifying only the printed representations. For example, we may rewrite the previous datatype definition as follows:

  datatype C_Token :: lexeme
     = "if" | "else" | "for" | "while" | "goto"
     | "continue" | "break" | "switch" | "case"
     | "return" | "default" | "struct" | "union"
     | "register" | "volatile" | "extern" | "static"
     | ID        /[a-zA-Z_][a-zA-Z_0-9]*/
     | INTEGER   /[0-9]+/
     | STRING    /"([^"\n]|\\ .)"/
     | CHARACTER /'([^'\n]|\\ .)'/
     | ...  // etc.
     ;

Now, to refer to the constructor with the printing representation "if", T"if" can be used. In general, T"???" refers to the token with a printed representation of "???".

Lexeme abbreviations

Complex regular expressions can be built from simple ones using lexeme aliases. These aliases are defined within the lexeme alias definitions. For example, complex regular expressions standing for integers, reals and friends can be defined by composition as follows. Aliases appearing in regular expressions are quoted by braces, a syntactic convention inherited from yacc-like tools:

   lexeme digits   = /[0-9]+/
        | sign     = /[\-\+]/
        | integer  = /{sign}?{digits}/
        | exponent = /[eE]{integer}/
        | fraction = /\.{digits}/
        | mantissa = /{sign}?({fraction}|{digits}{fraction}?)/
        | real     = /{mantissa}{exponent}?/
        ;

Lexeme aliases can be reused inside any regular expressions, including patterns inside lexeme datatype definitions.

Lexeme classes

Parser construction

The syntax class mechanism in Prop is the functional equivalent of popular parser generators such as yacc[], or bison[]. A syntax class is used to encapsulate the state of a parser for one grammar. Grammar rules are specified in a BNF-like form similar to that of yacc. Prop currently generates table driven LALR(1) parsers. Shift/reduce conflicts can be resolved using optional operator precedence information, or with optional semantics predicates similar to that in PCCTS[PDC91]. Actions may be attached to a grammar, and during parsing, these actions have access to the member data and functions, and inherited and synthesized attributes.

Consider the following partial parser specification for a simple language.

syntax class SmallLang
{
public:
   SmallLang(istream&) : Super(istream&) {}
  ~SmallLang() {}
};

The production rules of the language are encapsulated in a syntax definition, which also contains the precedence and associativity rules for operators. Notice that terminals may be a single character or a predefined lexeme datatype.

syntax SmallLang
{  left: 1 '*' '/';
   left: 2 '+' '-';

   stmt(Stmt): IF expr THEN stmt ELSE stmt ';' { $$ = IFstmt($2,$4,$6); }
             | WHILE expr DO stmt DONE ';'     { $$ = WHILEstmt($2,$4); }
             | expr '=' expr ';'               { $$ = ASSIGNstmt($1,$3); }
             | BEGIN stmt_list END ';'         { $$ = BLOCKstmt($2); }
             ;
   expr(Exp): integer       { $$ = INT($1);  }
            | ident         { $$ = ID($1);    }
            | expr '+' expr { $$ = ADD($1,$3); }
            | expr '-' expr { $$ = SUB($1,$3); }
            | expr '*' expr { $$ = MUL($1,$3); }
            | expr '/' expr { $$ = DIV($1,$3); }
            | '-' expr      { $$ = UMINUS($2); }
            ;
   integer(int): INTEGER    { $$ = atol(lexer.text()); };
   id(int):      IDENTIFIER { $$ = strdup(lexer.text()); };
};

It is often more convenient to specify the terminals (lexemes) in their printed form rather than in their encoded form. As a short hand, the printed form of a lexeme can be used in place of the lexeme itself inside a syntax specification. For example, the statement productions can be specified alternatively as follows:

   stmt(Stmt): "if" expr "then" stmt "else" stmt ';' { $$ = IFstmt($2,$4,$6); }
             | "while" expr "do" stmt "done" ';'     { $$ = WHILEstmt($2,$4); }
             | expr '=' expr ';'                     { $$ = ASSIGNstmt($1,$3);}
             | "begin" stmt_list "end" ';'           { $$ = BLOCKstmt($2); }
             ;

The structure of the abstract syntax tree can be specified as follows:

   datatype Stmt = IFstmt(Exp,Stmt,Stmt)
                 | WHILEstmt(Exp,Stmt)
                 | ASSIGNstmt(Exp,Exp)
                 | BLOCKstmt(List<Stmt>)
   and      Exp = INT (int)
                | ID  (const char *)
                | UMINUS (Exp)
                | ADD (Exp, Exp)
                | SUB (Exp, Exp)
                | MUL (Exp, Exp)
                | DIV (Exp, Exp)
   and List<T>  = #[] | #[ T ... List<T> ]
                ;

Inherited and synthesized attributes

Semantic predicates

Real life programming languages commonly have non-context free fragments that make it difficult to fit into a parser generator's framework. For example in C , the code fragment

   T * U;

can be either a variable declaration if T is a type identifier, or an expression if T is a variable. To deal with these types of irregularities, parsers, lexers and the semantic analysis routines frequently have to cooperate and provide feedback to each other during the parsing process. Frequently this involves the insertion of non-declarative actions inside the parser or lexer specifications.

We borrow a new construct, the semantic predicate, which allows a clean separation of these context sensitive feedback, from the parser generator tool PCCTS[PDC91] Aside from terminals, non-terminals and semantic actions, a production can also contain semantic predicates. During parsing, these semantic predicates are evaluated, and productions which contain unsatisfied predicates will be rejected.

For example, to deal with the type/variable identifier ambiguity in a language like C , the following productions are needed.

   Stmt:  id (is_type_id($1)) '*' id ';' // variable definition
       |  id '*' id ';'                  // otherwise, an expression

Pretty printing

Pretty printing of datatypes in an external format is a frequently needed operation and Prop also provide a set of formalism to allow rapid specification of pretty printers. By default, if the printable qualifier is used within the definition of a datatype, a pretty printer will be automatically generated to print a datatype value in Prop syntax. For example, if the expression type Exp is defined as:

   datatype Exp :: printable = ...

the pretty printer method

  ostream& operator << (ostream&, Exp);

will be automatically generated. This pretty printer can be used as a debugging aid, for example, during rapid prototyping.

To get more refined output, the user can attach datatypes with pretty printing formats, which are simply specifications of how a datatype can be linearized for printing. Printing format strings are attached to each constructor after the separator => within a datatype definition. For example, in the following we specify that integers and identifiers in the datatype Exp are to be printed as-is, while the binary arithimetic expressions are to be printed with surrounding parentheses:

  datatype Exp = INT (int)          => _
               | ID  (const char *) => _
               | ADD (Exp, Exp)     => "(" _ " + " _ ")"
               | SUB (Exp, Exp)     => "(" _ " - " _ ")"
               | MUL (Exp, Exp)     => "(" _ " * " _ ")"
               | DIV (Exp, Exp)     => "(" _ " / " _ ")"
               ;

The underscore character _ above is an example of a meta format character. Its meaning is to print the next available argument of the constructor using whatever format appropriate for its type. Other meta format characters are also available, which deal with issues such as argument ordering, indention and parenthesization, etc. Please see figure 1 for a detailed listing.

Figure 1: Meta pretty printing formats.

As another example, consider the following datatype definition of an AST for statements. The meta-characters { and } are used to indent statements occurring in nested scopes:

   datatype Stmt =
      IF { cond : Exp,
           yes  : Stmt,
           no   : Stmt
         }               => "if" cond "then" { yes } "else" { no } "endif;"
   |  WHILE (Exp, Stmt)  => "while" _ "do" { _ } "end do;"
   |  ASSIGN (Id, Exp)   => _ ":=" _ ";"
   |  BLOCK (List<Stmt>) => "begin" { _ } "end;"

Thus using this definition, the datatype expression

   WHILE(ID("x"), #[ ASSIGN("x",SUB(ID("x"),INT(1))),
                     ASSIGN("y",MUL(ID("y"),ID("x"))) ]);

is pretty printed as

   while x do
      begin
         x := x - 1;
         y := y * x;
      end;
   end do;

Generally speaking, the pretty printing generation capability of Prop is meant to be used for simple data structures. Complex printing mechanisms can be handled using the pattern matching or rewriting constructs.

Next: Meta syntax Up: The Language Previous: Inference

Allen Leung
Wed Mar 6 20:55:43 EST 1996