Sat, 06 Dec 2008

A grammar for (pseudo) XML

NAME

"Perl 5 to 6" Lesson 20 - A grammar for (pseudo) XML

SYNOPSIS

    grammar XML {
        token TOP   { ^ <xml> $ };
        token xml   { <text> [ <tag> <text> ]* };
        token text {  <-[<>&]>* };
        rule tag   {
            '<'(\w+) <attributes>*
            [
                | '/>'                 # a single tag
                | '>'<xml>'</' $0 '>'  # an opening and a closing tag
            ]
        };
        token attributes { \w+ '="' <-["<>]>* '"' };
    };

So far the focus of these articles has been the Perl 6 language, independently of what has been implemented so far. To show you that it's not a purely fantasy language, and to demonstrate the power of grammars, this lesson shows the development of a grammar that parses basic XML, and that runs with Rakudo.

Please follow the instructions on http://rakudo.org/how-to-get-rakudo/ to obtain and build Rakudo, and try it out yourself.

Our idea of XML

For our purposes XML is quite simple: it consists of plain text and nested tags that can optionally have attributes. So here are few tests for what we want to parse as valid "XML", and what not:

    my @tests = (
        [1, 'abc'                       ],      # 1
        [1, '<a></a>'                   ],      # 2
        [1, '..<ab>foo</ab>dd'          ],      # 3
        [1, '<a><b>c</b></a>'           ],      # 4
        [1, '<a href="foo"><b>c</b></a>'],      # 5
        [1, '<a empty="" ><b>c</b></a>' ],      # 6
        [1, '<a><b>c</b><c></c></a>'    ],      # 7
        [0, '<'                         ],      # 8
        [0, '<a>b</b>'                  ],      # 9
        [0, '<a>b</a'                   ],      # 10
        [0, '<a>b</a href="">'          ],      # 11
        [1, '<a/>'                      ],      # 12
        [1, '<a />'                     ],      # 13
    );

    my $count = 1;
    for @tests -> $t {
        my $s = $t[1];
        my $M = XML.parse($s);
        if !($M  xor $t[0]) {
            say "ok $count - '$s'";
        } else {
            say "not ok $count - '$s'";
        }
        $count++;
    }

This is a list of both "good" and "bad" XML, and a small test script that runs these tests by calling XML.parse($string). By convention the rule that matches what the grammar should match is named TOP.

(As you can see from test 1 we don't require a single root tag, but it would be trivial to add this restriction).

Developing the grammar

The essence of XML is surely the nesting of tags, so we'll focus on the second test first. Place this at the top of the test script:

    grammar XML {
        token TOP   { ^ <tag> $ }
        token tag   {
            '<' (\w+) '>'
            '</' $0   '>'
        }
    };

Now run the script:

    $ ./perl6 xml-01.pl
    not ok 1 - 'abc'
    ok 2 - '<a></a>'
    not ok 3 - '..<ab>foo</ab>dd'
    not ok 4 - '<a><b>c</b></a>'
    not ok 5 - '<a href="foo"><b>c</b></a>'
    not ok 6 - '<a empty="" ><b>c</b></a>'
    not ok 7 - '<a><b>c</b><c></c></a>'
    ok 8 - '<'
    ok 9 - '<a>b</b>'
    ok 10 - '<a>b</a'
    ok 11 - '<a>b</a href="">'
    not ok 12 - '<a/>'
    not ok 13 - '<a />'

So this simple rule parses one pair of start tag and end tag, and correctly rejects all four examples of invalid XML.

The first test should be easy to pass as well, so let's try this:

   grammar XML {
       token TOP   { ^ <xml> $ };
       token xml   { <text> | <tag> };
       token text  { <-[<>&]>*  };
       token tag   {
           '<' (\w+) '>'
           '</' $0   '>'
       }
    };

(Remember, <-[...]> is a negated character class.)

And run it:

    $ ./perl6 xml-03.pl
    ok 1 - 'abc'
    not ok 2 - '<a></a>'
    (rest unchanged)

Why in the seven hells did the second test stop working? The answer is that Rakudo doesn't do longest token matching yet (update 2013-01: it does now), but matches sequentially. <text> matches the empty string (and thus always), so <text> | <tag> never even tries to match <tag>. Reversing the order of the two alternations would help.

But we don't just want to match either plain text or a tag anyway, but random combinations of both of them:

    token xml   { <text> [ <tag> <text> ]*  };

([...] are non-capturing groups, like (?: ... ) is in Perl 5).

And low and behold, the first two tests both pass.

The third test, ..<ab>foo</ab>dd, has text between opening and closing tag, so we have to allow that next. But not only text is allowed between tags, but arbitrary XML, so let's just call <xml> there:

    token tag   {
        '<' (\w+) '>'
        <xml>
        '</' $0   '>'
    }

    ./perl6 xml-05.pl
    ok 1 - 'abc'
    ok 2 - '<a></a>'
    ok 3 - '..<ab>foo</ab>dd'
    ok 4 - '<a><b>c</b></a>'
    not ok 5 - '<a href="foo"><b>c</b></a>'
    (rest unchanged)

We can now focus on attributes (the href="foo" stuff):

    token tag   {
        '<' (\w+) <attribute>* '>'
        <xml>
        '</' $0   '>'
    };
    token attribute {
        \w+ '="' <-["<>]>* \"
    };

But this doesn't make any new tests pass. The reason is the blank between the tag name and the attribute. Instead of adding \s+ or \s* in many places we'll switch from token to rule, which implies the :sigspace modifier:

    rule tag   {
        '<'(\w+) <attribute>* '>'
        <xml>
        '</'$0'>'
    };
    token attribute {
        \w+ '="' <-["<>]>* \"
    };

Now all tests pass, except the last two:

    ok 1 - 'abc'
    ok 2 - '<a></a>'
    ok 3 - '..<ab>foo</ab>dd'
    ok 4 - '<a><b>c</b></a>'
    ok 5 - '<a href="foo"><b>c</b></a>'
    ok 6 - '<a empty="" ><b>c</b></a>'
    ok 7 - '<a><b>c</b><c></c></a>'
    ok 8 - '<'
    ok 9 - '<a>b</b>'
    ok 10 - '<a>b</a'
    ok 11 - '<a>b</a href="">'
    not ok 12 - '<a/>'
    not ok 13 - '<a />'

These contain un-nested tags that are closed with a single slash /. No problem to add that to rule tag:

    rule tag   {
        '<'(\w+) <attribute>* [
            | '/>'
            | '>' <xml> '</'$0'>'
        ]
    };

All tests pass, we're happy, our first grammar works well.

More hacking

Playing with grammars is much more fun that reading about playing, so here's what you could implement:

plain text can contain entities like &
I don't know if XML tag names are allowed to begin with a number, but the current grammar allows that. You might look it up in the XML specification, and adapt the grammar if needed.
plain text can contain <![CDATA[ ... ]]> blocks, in which xml-like tags are ignored and < and the like don't need to be escaped
Real XML allows a preamble like <?xml version="0.9" encoding="utf-8"?> and requires one root tag which contains the rest (You'd have to change some of the existing test cases)
You could try to implement a pretty-printer for XML by recursively walking through the match object $/. (This is non-trivial; you might have to work around a few Rakudo bugs, and maybe also introduce some new captures).

(Please don't post solutions to this as comments in this blog; let others have the same fun as you had ;-).

Have fun hacking.

MOTIVATION

It's powerful and fun

Categories

Posts in this category