c | matches character c if it is not a meta character |
e1e2 | matches e1 then e2 |
. | matches any character except \n |
\c | matches escape sequence c |
^e | matches e at the start of the line |
e1|e2 | matches e1 or e2 |
e* | matches zero or more e |
e+ | matches one or more e |
e? | matches zero or one e |
(e) | grouping |
<<C1, C2, ..., Cn>>e | matches e only if we are in a context from one of C1, C2, ..., Cn |
{lexeme} | matches lexeme |
The symbols \ [ ] ( ) { } << >> * + . - ? | are meta characters and are interpreted non-literally. The escape character \ can be prepended to a meta character when it must appear as a literal.
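As a small illustration of escaping, the sketch below uses Python's `re` module as a stand-in. It is not the lexer generator described here, but it treats `.` and `\` the same way for this purpose. The sample strings are made up for the example.

```python
import re

# '.' is a meta character: it matches any character except a newline.
unescaped = re.compile(r"1.5")
# '\.' is an escaped meta character: it matches a literal period only.
escaped = re.compile(r"1\.5")

print(unescaped.fullmatch("1x5") is not None)  # True
print(escaped.fullmatch("1x5") is not None)    # False
print(escaped.fullmatch("1.5") is not None)    # True
```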
Precedence-wise, the meta characters *, + and ? bind tighter than juxtaposition. Thus the regular expression ab* means a(b*). Parentheses can be used to override the default precedence.
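The precedence rule can be checked with a short sketch. Python's `re` module is used here only because it follows the same convention for `*` and juxtaposition; the helper `matches_whole` is a hypothetical name introduced for the example.

```python
import re

# ab* parses as a(b*): one 'a' followed by zero or more 'b'.
star_binds_tighter = re.compile(r"ab*")
# (ab)* overrides the default precedence: zero or more copies of "ab".
grouped = re.compile(r"(ab)*")

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None

print(matches_whole(star_binds_tighter, "abbb"))  # True
print(matches_whole(star_binds_tighter, "abab"))  # False
print(matches_whole(grouped, "abab"))             # True
print(matches_whole(grouped, "abbb"))             # False
```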
Character classes take the same form as in lex: (i) c1-c2 denotes the range of characters from c1 to c2; (ii) single (non-meta) characters denote themselves; (iii) the meta character ^ can be used to negate the set. For example, the regular expression [a-zA-Z][a-zA-Z0-9]* specifies an alphanumeric identifier that must start with a letter. Similarly, the regular expression [^ \t\n] matches any character except a space, a tab or a newline.
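The two character class examples above can be tried out as follows. This is a sketch in Python's `re` module, which accepts the same bracket syntax; the helper `is_identifier` is a hypothetical name used only for illustration.

```python
import re

# An identifier: a letter followed by any number of letters or digits.
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
# A negated set: any single character except space, tab or newline.
non_blank = re.compile(r"[^ \t\n]")

def is_identifier(text):
    """Return True if the whole text is an identifier."""
    return identifier.fullmatch(text) is not None

print(is_identifier("x1"))                    # True
print(is_identifier("1x"))                    # False
print(non_blank.fullmatch("a") is not None)   # True
print(non_blank.fullmatch("\t") is not None)  # False
```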
A lexeme is simply an abbreviated name given to a regular expression pattern. Lexemes act like macros in lex.
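One way to picture the macro-like behavior is textual substitution of each {name} reference by its definition. The sketch below is only an illustration in Python, not the tool's actual implementation: the table `LEXEMES`, the names `digit`, `alpha` and `ident`, and the function `expand_lexemes` are all hypothetical. It also assumes that each expansion is parenthesized so that operators applied to a lexeme bind to the whole definition.

```python
import re

# Hypothetical lexeme table: abbreviated names for regular expression patterns.
LEXEMES = {
    "digit": r"[0-9]",
    "alpha": r"[a-zA-Z]",
    "ident": r"{alpha}({alpha}|{digit})*",
}

# A reference looks like {name}, where name is an identifier.
_REFERENCE = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_lexemes(pattern, lexemes=LEXEMES):
    """Textually replace each {name} with its parenthesized definition."""
    def substitute(match):
        name = match.group(1)
        # Expand recursively so that lexemes may refer to other lexemes.
        # An unknown name raises KeyError; cyclic definitions are not handled.
        return "(?:" + expand_lexemes(lexemes[name], lexemes) + ")"
    return _REFERENCE.sub(substitute, pattern)

print(expand_lexemes("{digit}+"))  # (?:[0-9])+
print(re.fullmatch(expand_lexemes("{ident}"), "x42") is not None)  # True
print(re.fullmatch(expand_lexemes("{ident}"), "42x") is not None)  # False
```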
While a lexer is scanning, it may be in one of many contexts. Contexts can be used to group a set of related lexical rules; a rule is applicable only when one of its contexts is active. This makes the lexer behave like a set of DFAs, with the ability to switch between DFAs under programmer control.
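The idea of contexts can be sketched as a small hand-written scanner. This is an illustration under stated assumptions, not the generated lexer: the context names `code` and `comment`, the rule table `RULES` and the function `scan` are hypothetical, and the sketch picks the first applicable rule that matches rather than building DFAs or choosing the longest match.

```python
import re

# Each rule: (contexts where it applies, pattern, token kind or None,
#             context to switch to or None).
RULES = [
    ({"code"},    re.compile(r"/\*"),                  None,     "comment"),
    ({"comment"}, re.compile(r"\*/"),                  None,     "code"),
    # Inside a comment, skip any character, including newlines.
    ({"comment"}, re.compile(r".", re.DOTALL),         None,     None),
    ({"code"},    re.compile(r"[a-zA-Z][a-zA-Z0-9]*"), "IDENT",  None),
    ({"code"},    re.compile(r"[0-9]+"),               "NUMBER", None),
    ({"code"},    re.compile(r"[ \t\n]+"),             None,     None),
]

def scan(text, context="code"):
    """Tokenize text, trying only the rules of the active context."""
    tokens = []
    pos = 0
    while pos < len(text):
        for contexts, pattern, kind, next_context in RULES:
            if context not in contexts:
                continue  # rule belongs to an inactive context
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches at position {pos} in context {context!r}")
        if kind is not None:
            tokens.append((kind, m.group()))
        if next_context is not None:
            context = next_context  # switch contexts under program control
        pos = m.end()
    return tokens

print(scan("x1 /* 42 y */ 7"))  # [('IDENT', 'x1'), ('NUMBER', '7')]
```

The identifier and number rules are never tried while the `comment` context is active, so `42` and `y` inside the comment produce no tokens.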