| c | matches character c if it is not a meta character |
| e1e2 | matches e1 then e2 |
| . | matches any character except \n |
| \c | matches escape sequence c |
| ^e | matches e at the start of the line |
| e1\|e2 | matches e1 or e2 |
| e* | matches zero or more e |
| e+ | matches one or more e |
| e? | matches zero or one e |
| (e) | grouping |
| <<C1, C2, ..., Cn>>e | matches e only if we are in a context from one of C1, C2, ..., Cn |
| {lexeme} | matches the pattern abbreviated by lexeme |
The symbols \ [ ] ( ) { } << >> * + . - ? | are meta characters and are
interpreted non-literally. To match a meta character literally, prefix
it with the escape character \.
The postfix operators *, + and ? bind tighter than juxtaposition, so
the regular expression ab* means a(b*). Parentheses can be used to
override the default precedence.
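The precedence and escaping rules above can be checked with a small sketch. It uses Python's `re` module, which is an assumption made for illustration: `re` is not the notation described here, but it follows the same rules for `*`, grouping, and `\` escapes. The helper name `matches` is hypothetical.

```python
import re

# Postfix operators bind tighter than juxtaposition: ab* is a(b*), not (ab)*.
tight = re.compile(r"ab*")
grouped = re.compile(r"(ab)*")
literal_dot = re.compile(r"a\.b")  # \. is a literal dot, not "any character"


def matches(pattern, text):
    """Return True if the pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


print(matches(tight, "abbb"))       # a followed by repeated b
print(matches(tight, "abab"))       # b* repeats only the b
print(matches(grouped, "abab"))     # parentheses make the whole ab repeat
print(matches(literal_dot, "a.b"))  # escaped dot matches a dot
print(matches(literal_dot, "axb"))  # escaped dot does not match x
```

Only the first, third, and fourth calls succeed, which is the behavior the text describes: without parentheses the `*` applies to `b` alone.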
Character classes take the same form as in lex: (i) c1-c2 denotes the range of characters from c1 to c2; (ii) single (non-meta) characters denote themselves; (iii) the meta character ^ can be used to negate the set. For example, the regular expression [a-zA-Z][a-zA-Z0-9]* specifies an alphanumeric identifier that must start with a letter. Similarly, the regular expression [^ \t\n] matches any character except a space, a tab, or a newline.
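The two character-class examples can be tried directly. This is a minimal sketch using Python's `re` module, assumed here only because it accepts the same bracket syntax for these two patterns; the names `identifier`, `non_blank`, and `is_identifier` are hypothetical.

```python
import re

# The two patterns quoted in the text, unchanged.
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
non_blank = re.compile(r"[^ \t\n]")


def is_identifier(text):
    """Return True if text as a whole is a letter followed by letters or digits."""
    return identifier.fullmatch(text) is not None


print(is_identifier("x1"))       # starts with a letter
print(is_identifier("1x"))       # starts with a digit, so it is rejected
print(is_identifier("foo_bar"))  # underscore is outside both classes

# The negated class accepts any single character other than space, tab, newline.
print(non_blank.fullmatch("a") is not None)
print(non_blank.fullmatch("\t") is not None)
```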
A lexeme is simply an abbreviated name given to a regular expression pattern. Lexemes act like macros in lex.
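The macro-like behavior can be pictured as textual substitution of each `{name}` reference. The sketch below is an illustration in Python, not the tool's actual mechanism: the `LEXEMES` table, the lexeme names `digit` and `letter`, and the `expand` function are all hypothetical.

```python
import re

# Hypothetical lexeme table; names and patterns are illustrative only.
LEXEMES = {
    "digit": r"[0-9]",
    "letter": r"[a-zA-Z]",
}

# A reference looks like {name}; this deliberately ignores {m,n} style counts.
REFERENCE = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(pattern, lexemes):
    """Replace each {name} with the grouped pattern that the name abbreviates."""
    def substitute(match):
        body = expand(lexemes[match.group(1)], lexemes)
        # Grouping keeps a following *, + or ? applied to the whole lexeme.
        return "(?:" + body + ")"
    return REFERENCE.sub(substitute, pattern)


integer = expand(r"{digit}+", LEXEMES)
ident = expand(r"{letter}({letter}|{digit})*", LEXEMES)

print(integer)
print(re.fullmatch(integer, "2024") is not None)
print(re.fullmatch(ident, "x42") is not None)
print(re.fullmatch(ident, "4x") is not None)
```

Wrapping each expansion in a group is a design choice of this sketch: it avoids the precedence surprise that plain text substitution would cause when a lexeme containing `|` is followed by a postfix operator.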
While a lexer is scanning, it may be in one of many contexts. Contexts can be used to group a set of related lexical rules; such rules apply only when one of their contexts is active. This makes the lexer behave like a set of DFAs, with the ability to switch between DFAs under programmer control.
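The idea of context-dependent rules can be sketched as one rule list per context, with some rules switching the active context. This is a simplified illustration in Python under stated assumptions: the context names `code` and `comment`, the `RULES` table, and `tokenize` are hypothetical, and the sketch takes the first matching rule rather than building DFAs.

```python
import re

# Hypothetical rule table: context -> list of (pattern, token kind, next context).
# A token kind of None means the matched text is skipped.
RULES = {
    "code": [
        (re.compile(r"/\*"), None, "comment"),        # switch to the comment context
        (re.compile(r"[a-zA-Z][a-zA-Z0-9]*"), "IDENT", "code"),
        (re.compile(r"[0-9]+"), "NUMBER", "code"),
        (re.compile(r"[ \t\n]+"), None, "code"),      # skip whitespace
    ],
    "comment": [
        (re.compile(r"\*/"), None, "code"),           # switch back to the code context
        (re.compile(r"[^*]+|\*"), None, "comment"),   # skip comment text
    ],
}


def tokenize(text, context="code"):
    """Scan text, applying only the rules of the currently active context."""
    tokens = []
    pos = 0
    while pos < len(text):
        for pattern, kind, next_context in RULES[context]:
            found = pattern.match(text, pos)
            if found:
                if kind is not None:
                    tokens.append((kind, found.group()))
                pos = found.end()
                context = next_context
                break
        else:
            raise ValueError(f"no rule matches at position {pos} in context {context!r}")
    return tokens


print(tokenize("abc /* x1 9 */ 42"))
```

Inside the comment, `x1` and `9` produce no tokens because the `IDENT` and `NUMBER` rules belong to the `code` context only; the call returns the `IDENT` token for `abc` and the `NUMBER` token for `42`.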