Published on 2009-10-29

Longest-Token Matching in Perl 6

Longest-Token Matching, short LTM, is a concept from Perl 6 regexes which is essential to keep the grammar extensible. It basically states that in an alternation <rule1> | <rule2> | <rule3> | ... the longest alternative wins, if multiple of these rules actually match.

But it's not quite as easy. Matching many rules to the bitter end might be quite expensive, so there's some limit as to what participates in LTM.

Perl 6 regex features are split into two categories: declarative and procedural. The former includes | alternation, & conjunction, <subrule> calls, literals, character classes and quantifiers, the latter includes non-declarative forms of alternation and conjunction (|| and &&), code assertions {...}, backreferences $0 or $<subrule> and other things that can't be represented by traditional regular expressions.

Only the declarative prefixes of each rule participate in LTM. By declarative prefix we mean that each branch of the alternation may consist of declarative and procedural elements, and we start from the left of each branch, and take all declarative atoms until we find the first procedural one. All the declarative atoms at the start are the declarative prefix.

 # declarative prefix
 #   +-----+
 #  /       \
 | 'foo' \d+ [ <bar> || <baz> ]
 | \s* [ \w & b ] [ c | d]
 # \                   /
 #  +----------------+
 #  declarative prefix

Subrule calls are notionally inlined for LTM, recursive calls (also by mutual recursion) end the declarative prefix (and are thus considered procedural for this matter).

 token a { 'A' <a>? }            # recursion limits LTM
 token b { 'B' { say "hi" } }    # a closure limits LTM

 token c {
     | <a>   # effective declarative prefix: 'A'
     | <b>   # effective declarative prefix: 'B'

So all the features allowed in declarative prefixes (after inlining, that is) are present in the traditional Regular Expressions (in the sense that computer scientist use it; aka Type-3 languages in the Chomsky hierarchy).

So after all this preamble, we know that

  • branches of alternations are analyzed for declarative prefixes
  • these declarative prefixes are regular
  • the longest declarative prefix wins

When the procedural part of the branch with the longest declarative prefix match fails, and backtracking is allowed, the second-longest is tried, so there might still be some backtracking going on.

There's more to LTM - for example if two declarative prefixes match a string of the same length, the more specific one wins - string literals count as more specific than character classes, and so on.

But this should describe the most important concepts.