Published on 2010-09-01
How To Debug a Perl 6 Grammar
When a programmer starts to learn his craft, he spends a lot of time making small, stupid mistakes that prevent his programs from running. With a bit of practice, he learns how to make fewer errors, and write more runnable code at once.
With grammars, it's the same all over again. In the author's experience, even expert programmers start with silly mistakes when they begin to write grammars. It's just vastly different from writing ordinary code, and requires a similar learning experience.
Here are some instructions that help you to write and debug grammars.
Start with small steps
Start with small steps, and test along the way.
Start with a simple, single parsing rule, and test cases for it. Keep expanding the test cases and the grammar simultaneously. Only add more features when all tests that you expect to pass actually do.
Test rules individually
If you can't understand certain behavior, test rules individually. That way you can figure out if a rule is wrong, wrongly (or never) called, or interacts badly with other rules.
grammar MyGrammar { token TOP { ^ [ <comment> | <chunk> ]* $ } token comment { '#' \N* $$ } token chunk { ^^(\S+) \= (\S+) $$ } } # try to parse the whole thing say ?MyGrammar.parse("#a comment\nfoo = bar"); # 0 # and now one by one say ?MyGrammar.parse("#a comment\n", :rule<comment>); # 1 say ?MyGrammar.parse("foo = bar", :rule<chunk>); # 0
The example above shows a simple grammar that doesn't match a test string,
due to a stupid thinko. The last two lines test the rules individually,
identifying token chunk as the faulty one.
Debug with print or say
Just like ordinary code, you can sprinkle your grammar rules with calls to
say(). You just need to embed them in curly braces, so that they
get executed as ordinary code.
grammar MyGrammar { token chunk { { say "chunk: called" } ^^ { say "chunk: found start of line" } (\S+) { say "chunk: found first identifier: $0" } \= { say "chunk: found =" } (\S+) $$ } } say ?MyGrammar.parse("foo = bar", :rule<chunk>); # output: # # chunk: called # chunk: found start of line # chunk: found fist identifer: foo # 0
You can see that the rule matched the start of the line, and
foo, but not the equals sign. What's between the two? A space.
For which there is no rule to match it. Making chunk a rule
instead of a token fixes this problem.
Remember that backtracking can cause a single block to be executed multiple times, even if not part of a quantified construct.
$ perl6 -e '"aabcd" ~~ /^ (.*) { say $0 } b /'
aabcd
aabc
aab
aa
Be careful with backtracking control
Programmers who are familiar with Perl 5 regexes or similar regex engines are used to backtracking: If the "most obvious" way to match a string does not work out, the regex engine tries all possible other ways.
This is what many expect for small regexes, but when writing a grammar that has several nesting levels, it can be deeply confusing.
Most day-to-day parsing problems can be formulated in a way that requires little or no backtracking, and it should be done that way, both for efficiency and programmer sanity.
Some constructs are easier with backtracking, but if you use them, embed
them in a non-backtracking rule (ie token or rule,
which have the :ratchet modifier implicitly set):
rule verbatim { '[%' ~ '%]' verbatim # switch on backtracking from here on, # to the bottom of the rule :!ratchet .*? '[%' endverbatim '%]' }
This uses backtracking inside the regex, but once it found a possible
match, it will never try another, because here verbatim is a
rule, which (like token) suppresses backtracking into itself.
Rakudo's DEBUG rule
Rakudo has a DEBUG rule (which is not specified, so you shouldn't count on its availability in other implementations). If you call with a true argument, it enables tracing of subrule calls
grammar MyGrammar { token TOP { <?DEBUG(1)> ^ [ <comment> | <chunk> ]* $ } token comment { '#' \N* $$ } rule chunk { ^^ (\S+) \= (\S+) $$ } } MyGrammar.parse("#foobar\nfoo = bar");
Output (line numbers added for later reference):
1 1283276995.943950 0/0 START comment 2 1283276995.944047 0/0 PASS comment at pos=7 3 1283276995.944158 7/0 START comment 4 1283276995.944201 7/0 FAIL comment 5 1283276995.944308 7/0 START chunk 6 1283276995.944392 8/1 START 7 1283276995.944433 8/1 PASS at pos=11 8 1283276995.944523 14/1 START 9 1283276995.944555 14/1 PASS at pos=17 10 1283276995.944640 7/0 PASS chunk at pos=17 11 1283276995.944721 17/1 START comment 12 1283276995.944757 17/1 FAIL comment 13 1283276995.944822 17/1 START chunk 14 1283276995.944876 17/1 FAIL chunk
The first column (excluding the line number, which isn't part of the actual output) is a UNIX timestamp, the difference to the previous one tells the time between two regex events. (If you use an release of Rakudo older than 2010.08, you'll not get the timestamps).
The first number in the second column shows the starting position of the rule, the second the line number. (Where position means number of characters from start of the string; both numbers start at 0). The third column is one of START, PASS or FAIL. START is printed when a rule is entered, PASS or FAIL when it left either successfully, or without a match.
The fourth column, if any, shows the name of the rule. The empty rule names
in the example output stems from the two (\S+) groups.
PASSing rules have and additional at pos=$number marker, which
shows the end position of the matching rule.