The matchscan statement is used to perform tokenization. The user can
specify a set of string pattern matching rules within a matchscan
construct. Given an object of class LexerBuffer, the matchscan
statement looks for the rule that matches the longest prefix of the
input stream and executes the action associated with that rule.
Ties are broken by the lexical ordering of the rules; that is, rules
specified earlier take precedence over rules specified later.
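For instance, consider the following sketch (this is not taken from the
manual's example: it assumes a LexerBuffer object named lexbuf, the
context-free form of the statement, and hypothetical token codes IF_TOK
and ID_TOK):

    matchscan while (lexbuf)
    {
    |  /if/:       { /* ties such as the input "if" go to this rule, listed first */
                     return IF_TOK; }
    |  /[a-z]+/:   { /* the input "iffy" lands here, since it is the longer prefix */
                     return ID_TOK; }
    |  /[ \t\n]/:  { /* skip whitespace */ }
    }

Both rules match the prefix "if", so the earlier rule wins; on "iffy" the
identifier rule matches a longer prefix and is chosen instead.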
The two modes of operation are matchscan and matchscan*, which match
strings case-sensitively and case-insensitively, respectively. The
modifier while may optionally specify that the matching process should
repeat until no rules apply or the end-of-stream condition is reached.
The general syntax is as follows:
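(The form below is only a rough sketch, reconstructed from the example
later in this section rather than from a formal grammar; ContextType,
CONTEXT, pattern, SomeLexemeClass, SOME_TOKEN, and the actions are
placeholders. The bracketed context specifier is needed only when
lexical contexts are used, and while is the optional repetition modifier
just described.)

    matchscan[ContextType] while ( lexbuf )
    {
    |  <<CONTEXT>> /pattern/:                    { action }
    |  <<CONTEXT>> lexeme class SomeLexemeClass: { action }
    |  SOME_TOKEN:                               { action }
    |  /./:                                      { action }
    }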
By default, if no rules apply and the input stream is non-empty, an
error has occurred; the matchscan statement then invokes the error()
method of the LexerBuffer object.
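A rule set can also end with an explicit wildcard rule so that illegal
characters are handled by the user's own action instead of falling
through to the default behavior. The example below does exactly this;
its last rule is:

    |  /./:  { error("%Lillegal character %c\n", lexbuf[0]); }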
For example, the following is part of the Prop lexer specification.
    datatype LexicalContext = NONE | C | PROP | COMMENT | ...;

    int PropParser::get_token()
    {  matchscan[LexicalContext] while (lexbuf)
       {  ...
       |  <<C>> /[ \t\\\014]/:                 { emit(); }
       |  <<C>> /(\/\/.*)?\n/:                 { emit(); line++; }
       |  <<C>> /^#.*/:                        { emit(); }
       |  <<PROP>> lexeme class MainKeywords:  { return ?lexeme; }
       |  <<PROP>> lexeme class SepKeywords:   { return ?lexeme; }
       |  <<PROP>> QUARK_TOK:                  { return QUARK_TOK; }
       |  <<PROP>> BIGINT_TOK:                 { return BIGINT_TOK; }
       |  <<PROP>> REGEXP_TOK:                 { return REGEXP_TOK; }
       |  <<PROP>> PUNCTUATIONS:               { return lexbuf[0]; }
       |  <<PROP>> /[ \t\014]/:                { /* skip */ }
       |  <<PROP>> /(\/\/.*)?\n/:              { line++; }
       |  /\/\*/:                              { emit(); set_context(COMMENT); }
       |  <<COMMENT>> /\*\//:                  { emit(); set_context(PROP); }
       |  <<COMMENT>> /\n/:                    { emit(); line++; }
       |  <<COMMENT>> /./:                     { emit(); }
       |  /./:                                 { error("%Lillegal character %c\n", lexbuf[0]); }
       }
    }
Here, the lexer is partitioned into multiple lexical contexts: context
C deals with C++ code, while context PROP deals with Prop extensions.
The special context COMMENT is used to parse /* */ delimited comments.
Contexts are changed using the set_context method defined in class
LexerBuffer.
The special variable ?lexeme can be used within a rule that matches a
lexeme class. For example, within the rule

    |  <<PROP>> lexeme class MainKeywords:  { return ?lexeme; }

the variable ?lexeme is bound to the token ."rewrite" if the string
"rewrite" is matched; it is bound to the token ."inference" if the
string "inference" is matched, and so on.
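That is, a lexeme class rule matches any member of the class, and
?lexeme identifies which member was matched. The fragment below merely
restates the two keyword rules from the example with explanatory
comments added:

    |  <<PROP>> lexeme class MainKeywords:
          { // ?lexeme is the token of whichever MainKeywords member
            // was matched, e.g. ."rewrite" or ."inference"
            return ?lexeme;
          }
    |  <<PROP>> lexeme class SepKeywords:
          { // likewise for the members of SepKeywords
            return ?lexeme;
          }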