Algebraic datatypes and pattern matching

Next: Tree rewriting Up: The Language Previous: The Language

Algebraic datatypes and pattern matching

Algebraic datatypes and pattern matching in Prop are borrowed from modern typed functional languages such as ML, Haskell, and Hope. In these languages, user defined datatypes are tree-like tagged variants that may be structurally decomposed using pattern matching constructs. In Prop , we can also view algebraic datatypes in the same way. However, there are a few important departures:

Datatype values can be tested for object identity using ==, which in turn can be used to determine structural sharing. Furthermore, destructive assignments can be performed on datatypes. Thus in general datatypes in Prop are graphs(or regular trees).
Datatypes assignment uses a reference semantics instead of the value semantics adopted in C++ .
The same datatype can be viewed in many different ways, depending on the context. For example, for general program manipulation, datatypes can be seen as simply labeled attributed trees. For rewriting, they can be viewed as ground terms. And during inference, datatypes can be seen as tuples.

In the following we shall give a brief overview of the pattern matching features of Prop . For most users of modern declarative languages many of these features are already familiar constructs.

A brief tour on pattern matching

Algebraic datatypes are specified using datatype definitions, which define the inductive structure of one of more types using a tree-grammar like syntax. In addition, pretty printers, lexical scanners, parsers, persistence I/O methods and garbage collection inferences can also be specified with additional options in the same construct. When a datatype is declared, the following operations are implicitly defined by the datatype compiler: (1) the constructors for all the variants of a type; (2) the identity test operator ==, and the assignment operator = for this type; and (3) the member functions needed to decompose a datatype value during pattern matching.

We'll select the internals of a compiler for a simplified imperative language as the running example in this paper. Suppose that in this language an expression is composed of identifiers, integer constants and the four arithmetic operators. Then the structure of the abstract syntax tree can be specified as follows:

   datatype Exp = INT (int)
                | ID  (const char *)
                | ADD (Exp, Exp)
                | SUB (Exp, Exp)
                | MUL (Exp, Exp)
                | DIV (Exp, Exp)
                ;

The abstract syntax of an expression such as a * b - 17 can be constructed directly in a prefix syntax, directly mirroring that of the definition. The Prop datatype compiler will automatically generate a C++ class hierarchy to represent the variants of type Exp. Datatype constructor functions(not to be mistaken with C++ 's class constructors) will also be automatically generated using the same names as the variants.

   Exp formula = ADD(MUL(ID("a"),ID("b")),INT(17));

Datatype values can be decomposed using the match statement, which can be seen as a generalization of C 's switch construct. Pattern matching is a combination of conditional branching and value binding. For example, a typical evaluation function for the type Exp can be written as in the following example. Notice that each arm of a case is in fact a pattern(with optional variables) mirroring the syntax of a datatype. The pattern variables(written with the prefix ? in the sequel) of a matching arm is bound to the value of the matching value, which can be subsequently referenced and modified in the action of an arm.

int eval (Exp e, const map<const char *, int>& env)
{  match (e)
   {  case INT ?i:        return ?i;
      case ID  ?id:       return env[?id];
      case ADD (?e1,?e2): return eval(?e1,env) + eval(?e2,env);
      case SUB (?e1,?e2): return eval(?e1,env) - eval(?e2,env);
      case MUL (?e1,?e2): return eval(?e1,env) * eval(?e2,env);
      case DIV (?e1,?e2): return eval(?e1,env) / eval(?e2,env);
   }
}

Why object-orientedness is insufficient

Although a comparable evaluation function can be written in object oriented style using late binding, as in below, in general pattern matching is much more powerful than late binding in C++ , which only allows dispatching based on the type of one receiver.

  // Class definitions
  class Exp {
  public:
     virtual int eval(const map<const char *, int>& env) const = 0;
  };
  class INT : Exp {
     int i;
  public:
     int eval(const map<const char *, int>& env);
  };
  class ID : Exp {
     const char * id
  public:
     int eval(const map<const char *, int>& env);
  };
  ...

  // Member functions
  int INT::eval(const map<const char *, int>& env) const { return i; }
  int ID ::eval(const map<const char *, int>& env) const { return id; }
  int ADD::eval(const map<const char *, int>& env) const
     { return e1->eval(env) + e2->eval(env); }
  int SUB::eval(const map<const char *, int>& env) const
     { return e1->eval(env) - e2->eval(env); }
  int MUL::eval(const map<const char *, int>& env) const
     { return e1->eval(env) * e2->eval(env); }
  int DIV::eval(const map<const char *, int>& env) const
     { return e1->eval(env) / e2->eval(env); }

For example, in the following function we use nested patterns, non-linear patterns (i.e. patterns with multiple occurrences of a pattern variable), and guards to perform algebraic simplification of an expression. Although the patterns are relative simple in this example, in general arbitrarily complex patterns may be used.

Exp simplify (Exp redex)
{  // recursive traversal code omitted ...

   // match while repeats the matching process
   // until no more matches are found.
   match while (redex)
   {  ADD(INT 0,  ?x):     { redex = ?x; }
   |  ADD(INT ?x, INT ?y): { redex = INT(?x+?y); }
   |  ADD(?x,     INT 0)   { redex = ?x; }
   |  SUB(?x,     INT 0):  { redex = ?x; }
   |  SUB(?x,     ?x):     { redex = INT(0); }
   |  SUB(INT ?x, INT ?y): { redex = INT(?x-?y); }
   |  MUL(INT 0,  ?x):     { redex = INT(0); }
   |  MUL(?x,     INT 0):  { redex = INT(0); }
   |  DIV(?x,     ?x):     { redex = INT(1); }
        // don't divide by zero.
   |  DIV(INT ?x, INT ?y) | ?y != 0: { redex = INT(?x/?y); }
   |  ...
   }
   return redex;
}

Pattern matching in Prop is also not restricted to one datatype at a time. In the following example, we use matching on multiple values to define equality on expressions inductively. For variety, we'll use the fun variant of match, which defines a function in rule form. Notice that the last case of the match set uses wild cards _ to catch all the other non-equal combinations. Since C++ does not provide multiple dispatching, implementing binary (or n-ary) operations on variant datatypes are in general cumbersome and verbose in object-oriented style. In contrast, using an applicative pattern matching style many manipulations and transformations on variant datatypes with tree-like or graph-like structure can be expressed succinctly.

fun equal INT ?i,     INT ?j: bool: { return ?i == ?j; }
  | equal ID  ?a,     ID  ?b:       { return strcmp(a,b) == 0; }
  | equal ADD(?a,?b), ADD(?c,?d):   { return equal(?a,?c) && equal(?b,?d); }
  | equal SUB(?a,?b), SUB(?c,?d):   { return equal(?a,?c) && equal(?b,?d); }
  | equal MUL(?a,?b), MUL(?c,?d):   { return equal(?a,?c) && equal(?b,?d); }
  | equal DIV(?a,?b), DIV(?c,?d):   { return equal(?a,?c) && equal(?b,?d); }
  | equal _,          _:            { return false; }
  ;

More examples

As another example, we can specify the term structure of well-formed formulas in proposition calculus as follows. Notice that the constructors F and T are nullary.

   datatype Wff = F
                | T
                | Var     (const char *)
                | And     (Wff, Wff)
                | Or      (Wff, Wff)
                | Not     (Wff)
                | Implies (Wff, Wff)
                ;

Datatypes that are parametrically polymorphic, such as lists and trees, can be defined by parameterizing them with respect to one of more types. For example, both lists and tree below are parametric on one type argument T.

   datatype List<T> = nil
                    | cons(T, List<T>);
   datatype Tree<T> = empty
                    | leaf(T)
                    | node(Tree<T>, T, Tree<T>);

   List<int> primes = cons(2,cons(3,cons(5,cons(7,nil))));
   List<int> more_primes = cons(11,cons(13,primes));
   Tree<char *> names = node(leaf("Church"),"Godel",empty);

As a programming convenience, Prop has a set of built-in list-like constructors syntactic forms. Unlike languages such as ML, however, these forms are not bound to any specific list types. Instead, it is possible for the user to use these forms on any datatypes with a natural binary cons operator and a nullary nil constructor. For instance, the previous list datatype can be redefined as follows:

   datatype List<T> = #[] | #[ T ... List<T> ];
   List<int> primes = #[ 2, 3, 5, 7 ];
   List<int> more_primes = #[ 11, 13 ... primes ];
   List<char *> names = #[ "Church", "Godel", "Turing", "Curry" ];

   template <class T>
      List<T> append (List<T> a, List<T> b)
      {  match (a)
         {  case #[]:          return b;
            case #[hd ... tl]: return #[hd ... append(tl,b)];
         }
      }

Notice that the empty list is written as #[], while cons(a,b) is written as #[ a ... b ]. An expression of the special form #[a, b, c], for instance, is simple syntactic sugar for repeated application of the cons operator, i.e.

    #[a, b, c] == #[a ... #[ b ... #[ c ... #[] ] ] ].

List-like special forms are not limited to datatypes with only two variants. For example, we can define a datatype similar in structure to S-expressions in Lisp or Scheme. Here's how such as datatype may be defined(for simplicity, we'll use a string representation for atoms instead of a more efficient method):

   datatype Sexpr = INT    (int)
                  | REAL   (double)
                  | STRING (char *)
                  | ATOM   (const char *)
                  | #()
                  | #( Sexpr ... Sexpr )
   where type Atom = Sexpr  // synonym for Sexpr
   ;

With this datatype specification in place, we can construct values of type Sexpr in a syntax close to that of Lisp. For example, we can define lambda expressions corresponding to the combinators I, K and S as follows:

   Atom LAMBDA = ATOM("LAMBDA");
   Atom f      = ATOM("f");
   Atom x      = ATOM("x");
   Atom y      = ATOM("y");
   Atom NIL    = #();

   Sexpr I = #(LAMBDA, #(x), x);
   Sepxr K = #(LAMBDA, #(x), #(LAMBDA, #(y), x));
   Sepxr S = #(LAMBDA, #(f),
                #(LAMBDA, #(x),
                   #(LAMBDA, #(y), #(#(f,x), #(g,x)))));

Similar to list-like forms, vector-like forms are also available. This addresses one of the flaws of the C++ language, which lacks first class arrays. Vectors are simply homogeneous arrays whose sizes are fixed and are determined at creation time. Random access within vectors can be done in constant time. Unlike lists, however, the prepending operation is not supported. Vectors literals are delimited with the composite brackets (| ... |), [| ... |], or {| ... |}. In the following example the datatype Exp uses vectors to represent the coefficients of the polynomials:

   datatype Vector<T> = (| T |);
   datatype Exp = Polynomial (Var, Vector<int>)
                | Functor (Exp, Vector<Exp>)
                | Atom (Var)
                | ...
   where type Var = const char *;
   Exp formula = Polynomial("X", (| 1, 2, 3 |));

Pattern laws

Commonly used patterns can be given synonyms so that they can be readily reused without undue repetition. This can be accomplished by defining pseudo datatype constructors to stand for common patterns using datatype law definitions. For example, the following set of laws define some commonly used special forms for a Lisp-like language using the previously defined Sexpr datatype.

   datatype law Lambda(x,e)  = #(ATOM "LAMBDA", #(x), e)
              | Quote(x)     = #(ATOM "QUOTE", x)
              | If(a,b,c)    = #(ATOM "IF", a, b, c)
              | Nil          = #()
              | ProgN(exprs) = #(ATOM "PROGN" ... exprs)
              | SpecialForm  = #(ATOM ("LAMBDA" || "IF" ||
                                       "PROGN" || "QUOTE") ... _)
              | Call(f,args) = ! SpecialForm && #(f ... args)
              ;

Notice that the pattern SpecialForm is meant to match all special forms in our toy language: i.e. lambdas, ifs, progn's and quotes. The pattern disjunction connective || is used to link these forms together. Since we'd like the Call pattern to match only if the S-expression is not a special form, we use the pattern negation and conjunction operators, ! and && are used to screen out special forms. With these definitions in place, an interpreter for our language can be written thus:

   Sexpr eval (Sexpr e)
   {  match (e)
      {  Call(?f,?args):     { /* perform function call */ }
      |  Lambda(?x,?e):      { /* construct closure */ }
      |  If(?e,?then,?else): { /* branching */ }
      |  Quote(?x):          { return ?x; }
      |  ...:                { /* others */ }
      }
   }

As an interesting note, the special form pattern can also be rewritten using regular expression string matching, as in the following:

 datatype law SpecialForm = #(ATOM /LAMBDA|IF|PROGN|QUOTE/ ... _)

Variants of match

Besides the usual plain pattern matching, a few variants of the match construct are offered. We'll briefly enumerate a few of these:

The matchall construct is a variant of match that executes all matching rules(instead of just the first one) in sequence.
Each rule of a match statement can have associated cost expressions. Instead of selecting the first matching rule to execute as in the default, all matching rules are considered and the rule with the least cost is executed. Ties are broken by choosing the rule that comes first lexically. For example:
```
   match (ir)
   {  ADD(LOAD(?r0),?r1) \ e1: { ... }
   |  ADD(?r0,LOAD(?r0)) \ e2: { ... }
   |  ADD(?r0, ?r1)      \ e3: { ... }
   |  ...
   }
```
A match construct can be modified with the while modifier, as in the following example. A match modified thus is repeatedly matched until none of the patterns are applicable. For example, the following routine uses a match while statement to traverse to the leftmost leaf.
```
   template <class T>
   Tree<T> left_most_leaf(Tree<T> tree)
   {  match while (tree)
      {  case node(t,_,_):  tree = t;
      }
      return tree;
   }
```

Finally, the matchscan variant of match can be used to perform string matching on a stream. For example, a simple lexical scanner can be written as follows.

   int lexer (istream& in)
   {  matchscan while (in)
      {  /if/:                     { return IF; }
      |  /else/:                   { return ELSE; }
      |  /while/:                  { return WHILE; }
      |  /for/:                    { return FOR; }
      |  /break/:                  { return BREAK; }
      |  /[0-9]+/:                 { return INTEGER; }
      |  /[a-zA-Z_][a-zA-Z_0-9]*/: { return IDENTIFIER; }
      |  /[ \t]+/:                 { /* skip spaces */ }
      |  ... // other rules
      }
   }

Next: Tree rewriting Up: The Language Previous: The Language

Allen Leung
Wed Mar 6 20:55:43 EST 1996