
The matchscan statement


The matchscan statement is used to perform tokenization. The user specifies a set of string pattern-matching rules within a matchscan construct. Given an object of class LexerBuffer, the matchscan statement looks for the rule that matches the longest prefix of the input stream and executes the action associated with that rule. Ties are broken by the lexical ordering of the rules, i.e. the rule listed first takes precedence. (A short example follows the grammar below.)

The general syntax is as follows:


   Matchscan        ::= Matchscan_Mode [ Context_Spec ] ( Exp )          (variant 1)
                          { case Matchscan_Rule ... case Matchscan_Rule }
                     |  Matchscan_Mode [ Context_Spec ] ( Exp )          (variant 2)
                          { Matchscan_Rule | ... | Matchscan_Rule }
                     |  Matchscan_Mode [ Context_Spec ] ( Exp ) of       (variant 3)
                          Matchscan_Rule | ... | Matchscan_Rule
                        end matchscan ;
   Matchscan_Mode   ::= matchscan [ while ]                              (case sensitive)
                     |  matchscan* [ while ]                             (case insensitive)
   Context_Spec     ::= [ Id , ... , Id ]
   Matchscan_Rule   ::= [ << Context , ... , Context >> ]
                          lexeme class Id : Matchscan_Action
                     |  [ << Context , ... , Context >> ]
                          Regexp : Matchscan_Action
   Matchscan_Action ::= { Code }
                     |  Code                                             (variant 1 only)
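For instance, variant 3 dispenses with the braces and closes the statement with end matchscan. The following sketch (the token names LT_TOK and LE_TOK are hypothetical) also illustrates the longest-match rule: on the input <=, the second rule is chosen because it matches a longer prefix than the first.

   int MyParser::get_token()
   {
      matchscan (lexbuf) of
         /</:   { return LT_TOK; }     // matches "<"
      |  /<=/:  { return LE_TOK; }     // matches "<=": the longer prefix wins
      |  /./:   { return lexbuf[0]; }  // any other single character
      end matchscan;
   }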

The two modes of operation are matchscan and matchscan*, which match strings case-sensitively and case-insensitively, respectively. The optional modifier while specifies that the matching process should repeat until no rule applies or the end of the stream is reached.
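For example, the following sketch (the counting function is invented for illustration) combines the case-insensitive mode with the while modifier to scan an entire stream in a single statement:

   int count_begins(LexerBuffer& lexbuf)
   {
      int n = 0;
      matchscan* while (lexbuf)        // repeat until the stream is exhausted
      {
         /begin/: { n++; }             // also matches "BEGIN", "Begin", ...
      |  /\n/:    { /* skip newlines */ }
      |  /./:     { /* skip everything else */ }
      }
      return n;
   }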

By default, if the input stream is non-empty but no rule applies, an error has occurred and the matchscan statement invokes the error() method of the LexerBuffer object. A catch-all rule such as /./ (see the last rule of the example below) can be used to handle this case explicitly.

For example, the following is part of the Prop lexer specification.

   datatype LexicalContext = NONE | C | PROP | COMMENT | ...;

int PropParser::get_token()
{
   matchscan[LexicalContext] while (lexbuf)
   {
      ...
   |  <<C>> /[ \t\\\014]/:            { emit(); }
   |  <<C>> /(\/\/.*)?\n/:            { emit(); line++; }
   |  <<C>> /^#.*/:                   { emit(); }
   |  <<PROP>> lexeme class MainKeywords: { return ?lexeme; }
   |  <<PROP>> lexeme class SepKeywords: { return ?lexeme; }
   |  <<PROP>> QUARK_TOK:             { return QUARK_TOK; }
   |  <<PROP>> BIGINT_TOK:            { return BIGINT_TOK; }
   |  <<PROP>> REGEXP_TOK:            { return REGEXP_TOK; }
   |  <<PROP>> PUNCTUATIONS:          { return lexbuf[0]; }
   |  <<PROP>> /[ \t\014]/:           { /* skip */ }
   |  <<PROP>> /(\/\/.*)?\n/:         { line++; }
   |  /\/\*/:                         { emit(); set_context(COMMENT); }
   |  <<COMMENT>> /\*\//:             { emit(); set_context(PROP); }
   |  <<COMMENT>> /\n/:               { emit(); line++; }
   |  <<COMMENT>> /./:                { emit(); }
   |  /./: { error("%Lillegal character %c\n", lexbuf[0]); }
   }
}

Here, the lexer is partitioned into multiple lexical contexts: context C deals with C++ code, while context PROP deals with Prop extensions. The special context COMMENT is used to parse /* */-delimited comments. Contexts are changed with the set_context method defined in class LexerBuffer.
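As a further illustration of context switching, the following sketch scans double-quoted string literals with a hypothetical STR context (the context names, the scanner class, and the token STRING_TOK are invented for illustration):

   datatype LexicalContext = NONE | STR;

   int StrParser::get_token()
   {
      matchscan[LexicalContext] while (lexbuf)
      {
         <<NONE>> /"/:    { set_context(STR); }   // enter string context
      |  <<STR>>  /\\./:  { /* escaped character */ }
      |  <<STR>>  /"/:    { set_context(NONE);    // leave string context
                            return STRING_TOK; }
      |  <<STR>>  /./:    { /* character inside the literal */ }
      |  /\n/:            { /* skip */ }
      |  /./:             { /* ordinary input */ }
      }
   }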

The special variable ?lexeme can be used within a rule that matches a lexeme class. For example, within the rule

   |  <<PROP>> lexeme class MainKeywords:    { return ?lexeme; }
the variable ?lexeme is bound to the token ."rewrite" if the string "rewrite" is matched, to the token ."inference" if the string "inference" is matched, and so on.


