Regular expressions

Figure 1 describes the set of regular expressions currently supported in Prop. The syntax is similar to that of lex or egrep.

c	matches character c if it is not a meta character
e₁e₂	matches e₁ then e₂
`.`	matches any character except `\n`
`\`c	matches escape sequence c
`^`e	matches e at the start of the line
e₁`|`e₂	matches e₁ or e₂
e`*`	matches zero or more e
e`+`	matches one or more e
e`?`	matches zero or one e
`(`e`)`	grouping
`<<`C₁, C₂, …, Cₙ`>>`e	matches e only if the lexer is in one of the contexts C₁, C₂, …, Cₙ
`{`lexeme`}`	matches lexeme

Figure 1: Regular expressions.

The symbols \ [ ] ( ) { } << >> * + . - ? | are meta characters and are interpreted non-literally. To match a meta character literally, precede it with the escape character \.
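
The effect of escaping can be sketched with Python's re module, which follows the same convention for . and \ as described here. This is only an illustration: Python is not Prop, and the helper name full_match is an assumption made for this example.

```python
import re

def full_match(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Unescaped, . is a meta character: it matches any character except a newline.
any_char_matches_letter = full_match(r"a.c", "abc")
any_char_matches_newline = full_match(r"a.c", "a\nc")

# Escaped, \. stands only for the literal character '.'.
literal_dot_matches_dot = full_match(r"a\.c", "a.c")
literal_dot_matches_letter = full_match(r"a\.c", "abc")
```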

The postfix meta characters *, + and ? bind tighter than juxtaposition. Thus the regular expression ab* means a(b*). Parentheses can be used to override the default precedence.
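
The precedence rule can be checked with Python's re module, which parses ab* the same way. This is a sketch in Python rather than Prop, and the helper name whole is an assumption made for this example.

```python
import re

def whole(pattern, text):
    """Return True if pattern matches all of text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# ab* is a(b*): one a followed by zero or more b.
star_binds_to_b = whole(r"ab*", "abbb")
star_repeats_ab = whole(r"ab*", "abab")

# Parentheses make * apply to the whole group ab.
group_repeats_ab = whole(r"(ab)*", "abab")
group_accepts_abbb = whole(r"(ab)*", "abbb")
```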

Character classes take the same form as in lex: (i) c₁-c₂ denotes the range of characters from c₁ to c₂; (ii) single (non-meta) characters denote themselves; (iii) the meta character ^ can be used to negate the set. For example, the regular expression [a-zA-Z][a-zA-Z0-9]* specifies an alphanumeric identifier that must start with a letter. Similarly, the regular expression [^ \t\n] matches any character except a space, a tab or a newline.
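
The two character class examples above can be tried out in Python, whose re module uses the same bracket notation for ranges and negation. The names IDENT, NON_BLANK and is_identifier are assumptions made for this sketch, not part of Prop.

```python
import re

# The two patterns from the text, written as Python raw strings.
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
NON_BLANK = re.compile(r"[^ \t\n]")

def is_identifier(text):
    """True if text is a letter followed by letters or digits."""
    return IDENT.fullmatch(text) is not None

def is_non_blank(ch):
    """True if ch is a single character other than space, tab or newline."""
    return NON_BLANK.fullmatch(ch) is not None
```

For instance, is_identifier accepts x1 but rejects 1x, and is_non_blank rejects a tab.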

A lexeme is simply an abbreviated name given to a regular expression pattern. Lexemes act like macros in lex.
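
The macro-like behaviour can be pictured as textual substitution of {name} by the named pattern. The following Python sketch is one possible way to model that; the table LEXEMES, the function expand and the lexeme names digit and letter are all hypothetical, and nothing here reflects how Prop actually implements lexemes.

```python
import re

# Hypothetical lexeme table: name -> regular expression text.
LEXEMES = {
    "digit": r"[0-9]",
    "letter": r"[a-zA-Z]",
}

def expand(pattern, lexemes):
    """Replace each {name} in pattern with the grouped definition of that lexeme.

    Definitions may themselves refer to other lexemes; cyclic definitions
    are not detected in this sketch.
    """
    def substitute(match):
        name = match.group(1)
        # Group the expansion so that a following *, + or ? applies to all of it.
        return "(?:" + expand(lexemes[name], lexemes) + ")"
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)

# An identifier written in terms of the two lexemes above.
IDENT_PATTERN = expand(r"{letter}({letter}|{digit})*", LEXEMES)
```

Grouping each expansion keeps the substituted text a single unit, which is the usual pitfall with purely textual macros.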

While a lexer is scanning, it may be in one of many contexts. Contexts can be used to group a set of related lexical rules; such rules are only applicable when the contexts are active. This makes the lexer behave like a set of DFAs, with the ability to switch between DFAs under programmer control.
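
The idea of context-dependent rules can be sketched as one rule table per context, with some rules switching the active context. The Python below is an illustrative model only: the context names code and comment, the table RULES and the function tokenize are assumptions, and for brevity it takes the first matching rule, whereas lex-style tools typically prefer the longest match.

```python
import re

# Hypothetical rule tables: context -> list of (regex, token kind, next context).
# A token kind of None means the matched text is skipped.
RULES = {
    "code": [
        (re.compile(r"/\*"), None, "comment"),                    # enter comment context
        (re.compile(r"[a-zA-Z][a-zA-Z0-9]*"), "IDENT", "code"),
        (re.compile(r"[ \t\n]+"), None, "code"),
    ],
    "comment": [
        (re.compile(r"\*/"), None, "code"),                       # leave comment context
        (re.compile(r".", re.DOTALL), None, "comment"),           # skip one character
    ],
}

def tokenize(text, context="code"):
    """Scan text, applying only the rules of the currently active context."""
    tokens = []
    pos = 0
    while pos < len(text):
        for regex, kind, next_context in RULES[context]:
            match = regex.match(text, pos)
            if match and match.end() > pos:
                if kind is not None:
                    tokens.append((kind, match.group()))
                pos = match.end()
                context = next_context
                break
        else:
            raise ValueError(f"no rule matches at position {pos} in context {context!r}")
    return tokens
```

In this model, the word bar in the input foo /* bar */ baz is never reported as an identifier, because the identifier rule is not active while the scanner is in the comment context.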

Allen Leung
Mon Apr 7 14:33:55 EDT 1997