Published on 2010-09-01

How To Debug a Perl 6 Grammar

When a programmer starts to learn his craft, he spends a lot of time making small, stupid mistakes that prevent his programs from running. With a bit of practice, he learns how to make fewer errors, and write more runnable code at once.

With grammars, it's the same all over again. In the author's experience, even expert programmers start with silly mistakes when they begin to write grammars. It's just vastly different from writing ordinary code, and requires a similar learning experience.

Here are some instructions that help you to write and debug grammars.

Start with small steps

Start with small steps, and test along the way.

Start with a simple, single parsing rule, and test cases for it. Keep expanding the test cases and the grammar simultaneously. Only add more features when all tests that you expect to pass actually do.

Test rules individually

If you can't understand certain behavior, test rules individually. That way you can figure out if a rule is wrong, wrongly (or never) called, or interacts badly with other rules.

grammar MyGrammar {
    token TOP {
        ^ [ <comment> | <chunk> ]* $
    }

    token comment {
        '#' \N* $$
    }
    token chunk {
        ^^(\S+) \= (\S+) $$
    }
}

# try to parse the whole thing
say ?MyGrammar.parse("#a comment\nfoo = bar");          # 0
# and now one by one
say ?MyGrammar.parse("#a comment\n", :rule<comment>);   # 1
say ?MyGrammar.parse("foo = bar", :rule<chunk>);        # 0

The example above shows a simple grammar that doesn't match a test string, due to a stupid thinko. The last two lines test the rules individually, identifying token chunk as the faulty one.

Debug with print or say

Just like ordinary code, you can sprinkle your grammar rules with calls to say(). You just need to embed them in curly braces, so that they get executed as ordinary code.

grammar MyGrammar {
    token chunk {
        { say "chunk: called" }
        ^^
        { say "chunk: found start of line" }
        (\S+) 
        { say "chunk: found first identifier: $0" }
        \= 
        { say "chunk: found =" }
        (\S+) $$
    }
}

say ?MyGrammar.parse("foo = bar", :rule<chunk>);

# output:
#
# chunk: called
# chunk: found start of line
# chunk: found fist identifer: foo
# 0

You can see that the rule matched the start of the line, and foo, but not the equals sign. What's between the two? A space. For which there is no rule to match it. Making chunk a rule instead of a token fixes this problem.

Remember that backtracking can cause a single block to be executed multiple times, even if not part of a quantified construct.

$ perl6 -e '"aabcd" ~~ /^ (.*) { say $0 } b /'
aabcd
aabc
aab
aa

Be careful with backtracking control

Programmers who are familiar with Perl 5 regexes or similar regex engines are used to backtracking: If the "most obvious" way to match a string does not work out, the regex engine tries all possible other ways.

This is what many expect for small regexes, but when writing a grammar that has several nesting levels, it can be deeply confusing.

Most day-to-day parsing problems can be formulated in a way that requires little or no backtracking, and it should be done that way, both for efficiency and programmer sanity.

Some constructs are easier with backtracking, but if you use them, embed them in a non-backtracking rule (ie token or rule, which have the :ratchet modifier implicitly set):

rule verbatim {
    '[%' ~ '%]' verbatim
    # switch on backtracking from here on,
    # to the bottom of the rule
    :!ratchet
    .*? '[%' endverbatim '%]'
}

This uses backtracking inside the regex, but once it found a possible match, it will never try another, because here verbatim is a rule, which (like token) suppresses backtracking into itself.

Rakudo's DEBUG rule

Rakudo has a DEBUG rule (which is not specified, so you shouldn't count on its availability in other implementations). If you call with a true argument, it enables tracing of subrule calls


grammar MyGrammar {
    token TOP {
        <?DEBUG(1)>
        ^ [ <comment> | <chunk> ]* $
    }
    token comment { '#' \N* $$ }
    rule chunk { ^^ (\S+) \= (\S+) $$ }
}

MyGrammar.parse("#foobar\nfoo = bar");

Output (line numbers added for later reference):

 1 1283276995.943950 0/0 START    comment
 2 1283276995.944047 0/0 PASS     comment at pos=7
 3 1283276995.944158 7/0 START    comment
 4 1283276995.944201 7/0 FAIL     comment
 5 1283276995.944308 7/0 START    chunk
 6 1283276995.944392 8/1 START    
 7 1283276995.944433 8/1 PASS      at pos=11
 8 1283276995.944523 14/1 START    
 9 1283276995.944555 14/1 PASS      at pos=17
10 1283276995.944640 7/0 PASS     chunk at pos=17
11 1283276995.944721 17/1 START    comment
12 1283276995.944757 17/1 FAIL     comment
13 1283276995.944822 17/1 START    chunk
14 1283276995.944876 17/1 FAIL     chunk

The first column (excluding the line number, which isn't part of the actual output) is a UNIX timestamp, the difference to the previous one tells the time between two regex events. (If you use an release of Rakudo older than 2010.08, you'll not get the timestamps).

The first number in the second column shows the starting position of the rule, the second the line number. (Where position means number of characters from start of the string; both numbers start at 0). The third column is one of START, PASS or FAIL. START is printed when a rule is entered, PASS or FAIL when it left either successfully, or without a match.

The fourth column, if any, shows the name of the rule. The empty rule names in the example output stems from the two (\S+) groups.

PASSing rules have and additional at pos=$number marker, which shows the end position of the matching rule.