
Regular expressions

Figure 1 lists the regular expressions currently supported in Prop. The syntax is similar to that of lex or egrep.

 

c                     matches character c if it is not a meta character
e1e2                  matches e1 then e2
.                     matches any character except \n
\c                    matches escape sequence c
^e                    matches e at the start of the line
e1|e2                 matches e1 or e2
e*                    matches zero or more e
e+                    matches one or more e
e?                    matches zero or one e
(e)                   grouping
<<C1, C2, ... Cn>>e   matches e only if we are in a context from one of C1, C2, ... Cn
{lexeme}              matches lexeme

Figure 1:   Regular expressions.

The symbols \ [ ] ( ) { } << >> * + . - ? | are meta characters and are interpreted non-literally. To match a meta character literally, prefix it with the escape character \.
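The effect of escaping can be tried out with a short sketch. It uses Python's re module as a stand-in, on the assumption that it treats . and + and the escape character the same way for these simple cases; it is not Prop's own matcher, and the helper name matches_exactly is made up for illustration.

```python
import re

def matches_exactly(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Unescaped, '.' is a meta character and matches any character except \n.
print(matches_exactly(r"a.b", "axb"))

# Escaped, '\.' matches only a literal dot.
print(matches_exactly(r"a\.b", "axb"))
print(matches_exactly(r"a\.b", "a.b"))

# The same applies to other meta characters such as '+'.
print(matches_exactly(r"1\+1", "1+1"))
```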

The meta characters *, + and ? bind tighter than juxtaposition. Thus the regular expression ab* means a(b*). Parentheses can be used to override the default precedence.
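A small sketch of the precedence rule, again using Python's re module on the assumption that it applies the same precedence to * and juxtaposition. The helper whole_match is hypothetical and exists only for this illustration.

```python
import re

def whole_match(pattern, text):
    """Return True if pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# ab* is read as a(b*): one 'a' followed by zero or more 'b'.
print(whole_match(r"ab*", "a"))
print(whole_match(r"ab*", "abbb"))
print(whole_match(r"ab*", "abab"))   # not matched: the 'a' is not repeated

# Parentheses override the default: (ab)* repeats the pair 'ab'.
print(whole_match(r"(ab)*", "abab"))
print(whole_match(r"(ab)*", "abbb"))  # not matched
```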

Character classes are written between [ and ], as in lex: (i) c1-c2 denotes the range of characters from c1 to c2; (ii) single (non-meta) characters denote themselves; (iii) the meta character ^ at the start of the class negates the set. For example, the regular expression [a-zA-Z][a-zA-Z0-9]* specifies an alphanumeric identifier that must start with a letter. Similarly, the regular expression [^ \t\n] matches any character except a space, a tab or a newline.
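The two character class examples can be checked with a brief sketch. It assumes Python's re module interprets these particular classes the same way as lex; the names IDENT, NON_BLANK and is_identifier are illustrative only.

```python
import re

# The two patterns quoted in the text, compiled with Python's re module.
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
NON_BLANK = re.compile(r"[^ \t\n]")

def is_identifier(text):
    """Return True if text is a letter followed by letters or digits."""
    return IDENT.fullmatch(text) is not None

def is_non_blank(char):
    """Return True if char is one character other than space, tab or newline."""
    return NON_BLANK.fullmatch(char) is not None

print(is_identifier("x1"))    # starts with a letter
print(is_identifier("1x"))    # starts with a digit, so rejected
print(is_non_blank("a"))
print(is_non_blank("\t"))     # a tab is excluded by the negated class
```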

A lexeme is simply a name that abbreviates a regular expression pattern. Lexemes act like macros in lex.
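The macro-like behaviour can be pictured as textual expansion. The sketch below is only a model of that idea in Python, not Prop's implementation: the LEXEMES table, the expand function, the choice to parenthesize each expansion, and the assumption that lexemes are not self-referential are all made up for illustration. The (?:...) grouping is Python syntax, not Prop syntax.

```python
import re

# Hypothetical lexeme table: name -> the pattern it abbreviates.
LEXEMES = {
    "digit": "[0-9]",
    "letter": "[a-zA-Z]",
}

def expand(pattern, lexemes):
    """Replace every {name} in pattern with the pattern that name abbreviates.

    Each expansion is wrapped in a non-capturing group so that operators
    following {name} apply to the whole expansion. Assumes no lexeme
    refers to itself, directly or indirectly.
    """
    def substitute(match):
        name = match.group(1)
        return "(?:" + expand(lexemes[name], lexemes) + ")"
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)

identifier = expand("{letter}({letter}|{digit})*", LEXEMES)
print(identifier)
print(re.fullmatch(identifier, "x42") is not None)
print(re.fullmatch(identifier, "4x") is not None)
```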

While a lexer is scanning, it may be in one of many contexts. Contexts can be used to group a set of related lexical rules; such rules are only applicable when the contexts are active. This makes the lexer behave like a set of DFAs, with the ability to switch between DFAs under programmer control.
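The idea of switching between rule sets can be sketched as a toy scanner. This is an illustrative model in Python, not Prop syntax or Prop's generated code: the context names "code" and "comment", the RULES table and the tokenize function are all assumptions. As a simplification it takes the first rule that matches rather than the longest match.

```python
import re

# Each context owns its own rule list.
# A rule is (pattern, token kind or None to discard, context to switch to or None).
RULES = {
    "code": [
        (r"/\*", None, "comment"),                 # enter the comment context
        (r"[a-zA-Z][a-zA-Z0-9]*", "IDENT", None),
        (r"[ \t\n]+", None, None),                 # skip whitespace
    ],
    "comment": [
        (r"\*/", None, "code"),                    # return to the code context
        (r".|\n", None, None),                     # skip everything else
    ],
}

def tokenize(text, start="code"):
    """Scan text, applying only the rules of the currently active context."""
    context, pos, tokens = start, 0, []
    while pos < len(text):
        for pattern, kind, next_context in RULES[context]:
            match = re.compile(pattern).match(text, pos)
            if match:
                if kind is not None:
                    tokens.append((kind, match.group()))
                if next_context is not None:
                    context = next_context
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos} in context {context!r}")
    return tokens

print(tokenize("foo /* bar */ baz"))
```

The identifier rule never fires inside the comment because it belongs only to the "code" context, which is the grouping effect described above.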



Allen Leung
Mon Apr 7 14:33:55 EDT 1997