Grammar tutorial
An introduction to grammars
Before we start
Why grammars?
Grammars parse strings and return data structures from those strings. Grammars can be used to prepare a program for execution, to determine if a program can run at all (if it's a valid program), to break down a web page into constituent parts, or to identify the different parts of a sentence, among other things.
When would I use grammars?
If you have strings to tame or interpret, grammars provide the tools to do the job.
The string could be a file that you're looking to break into sections; perhaps a protocol, like SMTP, where you need to specify which "commands" come after what user-supplied data; maybe you're designing your own domain specific language. Grammars can help.
The broad concept of grammars
Regular expressions (regexes) work well for finding patterns in strings. However, for some tasks, like finding multiple patterns at once, combining patterns, or testing for patterns that may surround strings, regular expressions alone are not enough.
When working with HTML, you could define a grammar to recognize HTML tags, both the opening and closing elements, and the text in between. You could then organize these elements into data structures, such as arrays or hashes.
Getting more technical
The conceptual overview
Grammars are a special kind of class. You declare and define a grammar exactly as you would any other class, except that you use the grammar keyword instead of class.
Like other classes, grammars are made up of methods that define a regex, a token, or a rule. These are all varieties of match methods. Once you have a grammar defined, you call it and pass in a string for parsing.
my $match = G.parse($string);
Now, you may be wondering: if I have all these regexes defined that just return their results, how does that help with parsing text that may appear earlier or later in the string, or with results that need to be combined from many of those regexes? That's where grammar actions come in.
For every "method" you match in your grammar, you get an action you can use to act on that match. You also get an overarching action that you can use to tie together all your matches and to build a data structure. This overarching method is called TOP
by default.
The technical overview
As already mentioned, grammars are declared using the grammar keyword and its "methods" are declared with regex, or token, or rule.
Regex methods are slow but thorough: they will backtrack within the string and really try to match.
Token methods are faster than regex methods; they don't backtrack, and whitespace in the pattern is not treated as significant.
Rule methods are the same as token methods except that whitespace in the pattern is significant: it matches whitespace in the target string (via the implicit <.ws> rule).
When a method (regex, token or rule) matches in the grammar, the string matched is put into a match object and keyed with the same name as the method.
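For example, a minimal grammar along these lines (a sketch; the token name thingy and the keyword it matches are taken from the discussion that follows) could be:

grammar G {
    token TOP    { <thingy> .* }
    token thingy { 'clever_text_keyword' }
}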
If you were to use my $match = G.parse($string) and your string started with 'clever_text_keyword', you would get a match object back that contained 'clever_text_keyword' keyed by the name of <thingy> in your match object. For instance:
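For this example, assume a variant of that grammar in which thingy instead matches the literal 'Þor', so that the parse below succeeds (again, a sketch):

grammar G {
    token TOP    { <thingy> .* }
    token thingy { 'Þor' }
}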
my = G.parse("Þor is mighty");say .perl; # OUTPUT: «Match.new(made => Any, pos => 13, orig => "Þor is mighty",...»say $/.perl; # OUTPUT: «Match.new(made => Any, pos => 13, orig => "Þor is mighty",...»say $/<thingy>.perl;# OUTPUT: «Match.new(made => Any, pos => 3, orig => "Þor is mighty", hash => Map.new(()), list => (), from => 0)»
The first two output lines show that $match contains a Match object with the results of the parsing; but those results are also assigned to the match variable $/. Either match object can be keyed, as indicated above, by thingy to return the match for that particular token.
The TOP method (whether regex, token, or rule) is the overarching pattern that must match everything (by default). If the parsed string doesn't match the TOP regex, your returned match object will be empty (Nil).
As you can see above, in TOP, the <thingy> token is mentioned. The <thingy> is defined on the next line. That means that 'clever_text_keyword' must be the first thing in the string, or the grammar parse will fail and we'll get an empty match. This is great for recognizing a malformed string that should be discarded.
Learning by example - a REST contrivance
Let's suppose we'd like to parse a URI into the component parts that make up a RESTful request. We want the URIs to work like this:
The first part of the URI will be the "subject", like a part, or a product, or a person.
The second part of the URI will be the "command", the standard CRUD functions (create, retrieve, update, or delete).
The third part of the URI will be arbitrary data, perhaps the specific ID we'll be working with or a long list of data separated by "/"'s.
When we get a URI, we'll want 1-3 above to be placed into a data structure that we can easily work with (and later enhance).
So, if we have "/product/update/7/notify", we would want our grammar to give us a match object that has a subject of "product", a command of "update", and data of "7/notify".
We'll start by defining a grammar class and some match methods for the subject, command, and data. We'll use the token declarator since we don't care about whitespace.
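A first sketch of such a grammar (the exact token bodies here are an assumption based on the description below) might be:

grammar REST {
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}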
So far, this REST grammar says we want a subject that will be just word characters, a command that will be just word characters, and data that will be everything else left in the string.
Next, we'll want to arrange these matching tokens within the larger context of the URI. That's what the TOP method allows us to do. We'll add the TOP method and place the names of our tokens within it, together with the rest of the patterns that make up the overall pattern. Note how we're building a larger regex from our named regexes.
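One way to write that TOP token, assuming the URI parts are separated by literal '/' characters, is the following sketch:

grammar REST {
    token TOP     { '/' <subject> '/' <command> '/' <data> }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}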
With this code, we can already get the three parts of our RESTful request:
my $match = REST.parse('/product/update/7/notify');
say $match;
# OUTPUT: «「/product/update/7/notify」
#          subject => 「product」
#          command => 「update」
#          data => 「7/notify」»
The data can be accessed directly by using $match<subject> or $match<command> or $match<data> to return the values parsed. They each contain match objects that you can work with further, such as coercing into a string ($match<command>.Str).
Adding some flexibility
So far, the grammar will handle retrieves, deletes and updates. However, a create command doesn't have the third part (the data portion). This means the grammar will fail to match if we try to parse a create URI. To avoid this, we need to make that last data position match optional, along with the '/' preceding it. This is accomplished by adding a question mark to the grouped '/' and data components of the TOP token, to indicate their optional nature, just like a normal regex.
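The TOP token might then become something like this (a sketch, with the trailing '/' and <data> grouped and marked optional):

token TOP { '/' <subject> '/' <command> [ '/' <data> ]? }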
So, now we have:
my $match = REST.parse('/product/create');
say $match<subject>, $match<command>;
# OUTPUT: «「product」「create」»
Next, assume that the URIs will be entered manually by a user and that the user might accidentally put spaces between the '/'s. If we wanted to accommodate this, we could replace the '/'s in TOP with a token that allowed for spaces.
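One way to sketch this is to factor the separator out into its own slash token that also allows surrounding whitespace (the exact pattern is an assumption):

grammar REST {
    token TOP     { <slash><subject><slash><command>[<slash><data>]? }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }

    token slash   { \s* '/' \s* }
}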
my $match = REST.parse('/ product / update /7 /notify');
say $match;
# OUTPUT: «「/ product / update /7 /notify」
#          slash => 「/ 」
#          subject => 「product」
#          slash => 「 / 」
#          command => 「update」
#          slash => 「 /」
#          data => 「7 /notify」»
We're getting some extra junk in the match object now, with those slashes. There are techniques to clean that up that we'll get to later.
Inheriting from a grammar
Since grammars are classes, they behave, OOP-wise, in the same way as any other class; specifically, they can inherit from base classes that include some tokens or rules, this way:
# the base grammars' token bodies are inferred from the output shown below
grammar Letters {
    token letters { \w+ }
}

grammar Quote-Quotes {
    token quote { '"' }
}

grammar Quote-Other {
    token quote { '|' }
}

grammar Quoted-Quotes is Letters is Quote-Quotes {
    token TOP { <quote> <letters> <quote> }
}

grammar Quoted-Other is Letters is Quote-Other {
    token TOP { <quote> <letters> <quote> }
}

my $str = q{"enhanced"};
my $match = Quoted-Quotes.parse($str);
say $match;
# OUTPUT:
# 「"enhanced"」
#  quote => 「"」
#  letters => 「enhanced」
#  quote => 「"」

$str = "|barred|";
$match = Quoted-Other.parse($str);
say $match;
# OUTPUT:
# 「|barred|」
#  quote => 「|」
#  letters => 「barred」
#  quote => 「|」
This example uses multiple inheritance to compose two different grammars by varying the rules that correspond to quotes. In this case, since we are really composing behavior rather than inheriting it, we could use roles instead of inheritance.
grammar Quoted-Quotes does Letters does Quote-Quotes {
    token TOP { <quote> <letters> <quote> }
}

grammar Quoted-Other does Letters does Quote-Other {
    token TOP { <quote> <letters> <quote> }
}

# Note: for `does` composition to work, Letters, Quote-Quotes, and Quote-Other
# would be declared as roles (e.g. role Quote-Quotes { token quote { '"' } })
# rather than as grammars.
This will output exactly the same as the code above. Symptomatic of the difference between classes and roles, a conflict such as defining the token quote twice via role composition will result in an error:
grammar Quoted-Both does Letters does Quote-Quotes does Quote-Other {
    # the grammar name here is illustrative
    token TOP { <quote> <letters> <quote> }
}
# OUTPUT: ... Error while compiling ... Method 'quote' must be resolved ...
Adding some constraints
We want our RESTful grammar to allow for CRUD operations only. Anything else we want to fail to parse. That means our "command" above should have one of four values: create, retrieve, update or delete.
There are several ways to accomplish this. For example, you could change the command method:
token command { \w+ }

# …becomes…

token command { 'create' | 'retrieve' | 'update' | 'delete' }
For a URI to parse successfully, the second part of the string between '/'s must be one of those CRUD values, otherwise the parsing fails. Exactly what we want.
There's another technique that provides greater flexibility and improved readability when options grow large: proto-regexes.
To utilize these proto-regexes (multimethods, in fact) to limit ourselves to the valid CRUD options, we'll replace token command with the following:
proto token command {*}
token command:sym<create>   { <sym> }
token command:sym<retrieve> { <sym> }
token command:sym<update>   { <sym> }
token command:sym<delete>   { <sym> }
The sym keyword is used to create the various proto-regex options. Each option is named (e.g., sym<update>), and for that option's use, a special <sym> token is auto-generated with the same name.
The <sym> token, as well as other user-defined tokens, may be used in the proto-regex option block to define the specific match condition. Regex tokens are compiled forms and, once defined, cannot subsequently be modified by adverb actions (e.g., :i). Therefore, as it's auto-generated, the special <sym> token is useful only where an exact match of the option name is required.
If, for one of the proto-regex options, a match condition occurs, then the whole proto's search terminates. The matching data, in the form of a match object, is assigned to the parent proto token. If the special <sym> token was employed and formed all or part of the actual match, then it's preserved as a sub-level in the match object; otherwise, it's absent.
Using proto-regexes like this gives us a lot of flexibility. For example, instead of returning <sym>, which in this case is the entire string that was matched, we could instead enter our own string, or do other funny stuff. We could do the same with the token subject method and limit it also to only parsing correctly on valid subjects (like 'part' or 'people', etc.).
Putting our RESTful grammar together
This is what we have for processing our RESTful URIs, so far:
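Assembled from the sketches in the previous sections (the slash token plus the proto-regex version of command), it might look like this:

grammar REST {
    token TOP { <slash><subject><slash><command>[<slash><data>]? }

    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }

    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}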
Let's look at various URIs and see how they work with our grammar.
my @uris = ['/product/update/7/notify',
            '/product/create',
            '/item/delete/4'];

for @uris -> $uri {
    my $m = REST.parse($uri);
    say "Sub: $m<subject> Cmd: $m<command> Dat: $m<data>";
}
# OUTPUT: «Sub: product Cmd: update Dat: 7/notify
#          Sub: product Cmd: create Dat:
#          Sub: item Cmd: delete Dat: 4»
Note that since <data> matches nothing on the second string, $m<data> will be Nil, so using it in string context in the say function warns.
With just this part of a grammar, we're getting almost everything we're looking for. The URIs get parsed and we get a data structure with the data.
The data token returns the entire end of the URI as one string. The 4 is fine. However, from '7/notify', we only want the 7. To get just the 7, we'll use another feature of grammar classes: actions.
Grammar actions
Grammar actions are used within grammar classes to do things with matches. Actions are defined in their own classes, distinct from grammar classes.
You can think of grammar actions as a kind of plug-in expansion module for grammars. A lot of the time you'll be happy using grammars all on their own. But when you need to further process some of those strings, you can plug in the Actions expansion module.
To work with actions, you use a named parameter called actions, which should contain an instance of your actions class. With the code above, if our actions class is called REST-actions, we would parse the URI string like this:
my $match = REST.parse($uri, actions => REST-actions.new);
# …or if you prefer…
my $match = REST.parse($uri, :actions(REST-actions.new));
If you name your action methods with the same name as your grammar methods (tokens, regexes, rules), then when your grammar methods match, your action method with the same name will get called automatically. The method will also be passed the corresponding match object (represented by the $/ variable).
Let's turn to an example.
Grammars by example with actions
Here we are back to our grammar.
Recall that we want to further process the data token "7/notify", to get the 7. To do this, we'll create an action class that has a method with the same name as the named token. In this case, our token is named data, so our method is also named data.
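A first sketch of that action class might be:

class REST-actions {
    method data($/) { $/.split('/') }
}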
Now when we pass the URI string through the grammar, the data token match will be passed to the REST-actions' data method. The action method will split the string by the '/' character and the first element of the returned list will be the ID number (7 in the case of "7/notify").
But not really; there's a little more.
Keeping grammars with actions tidy with make and made
If the grammar calls the action above on data, the data method will be called, but nothing will show up in the big TOP grammar match result returned to our program. In order to make the action results show up, we need to call make on that result. The result can be many things, including strings, array or hash structures.
You can imagine that make places the result in a special contained area for a grammar. Everything that we make can be accessed later by made.
So instead of the REST-actions class above, we should write:
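One way to write it (a sketch):

class REST-actions {
    method data($/) { make $/.split('/') }
}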
When we add make to the match split (which returns a list), the action will return a data structure to the grammar that will be stored separately from the data token of the original grammar. This way, we can work with both if we need to.
If we want to access just the ID of 7 from that long URI, we access the first element of the list returned from the data action that we made:
my $uri = '/product/update/7/notify';
my $match = REST.parse($uri, actions => REST-actions.new);

say $match<data>.made[0];  # OUTPUT: «7»
say $match<command>.Str;   # OUTPUT: «update»
Here we call made on data, because we want the result of the action that we made (with make) to get the split array. That's lovely! But, wouldn't it be lovelier if we could make a friendlier data structure that contained all of the stuff we want, rather than having to coerce types and remember arrays?
Just like the grammar's TOP, which matches the entire string, actions have a TOP method as well. We can make all of the individual match components, like data or subject or command, and then we can place them in a data structure that we will make in TOP. When we return the final match object, we can then access this data structure.
To do this, we add the method TOP to the action class and make whatever data structure we like from the component pieces.
So, our action class becomes:
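Here is one way it might look (a sketch; the hash keys simply mirror the token names, and the matches are coerced with .Str as described below):

class REST-actions {
    method TOP($/) {
        make { subject => $<subject>.Str,
               command => $<command>.Str,
               data    => $<data>.made }
    }

    method data($/) { make $/.split('/') }
}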
Here in the TOP method, the subject remains the same as the subject we matched in the grammar. Also, command returns the valid <sym> that was matched (create, update, retrieve, or delete). We coerce each into .Str, as well, since we don't need the full match object.
We want to make sure to use the made method on the $<data> object, since we want to access the split one that we made with make in our action, rather than the proper $<data> object.
After we make something in the TOP method of a grammar action, we can then access all the custom values by calling the made method on the grammar result object. The code now becomes:
my $uri = '/product/update/7/notify';
my $match = REST.parse($uri, actions => REST-actions.new);
my $rest = $match.made;

say $rest<data>[0];  # OUTPUT: «7»
say $rest<command>;  # OUTPUT: «update»
say $rest<subject>;  # OUTPUT: «product»
If the complete return match object is not needed, you could return only the made data from your action's TOP.
my $uri = '/product/update/7/notify';
my $rest = REST.parse($uri, actions => REST-actions.new).made;

say $rest<data>[0];  # OUTPUT: «7»
say $rest<command>;  # OUTPUT: «update»
say $rest<subject>;  # OUTPUT: «product»
Oh, did we forget to get rid of that ugly array element number? Hmm. Let's make something new in the grammar's custom return in TOP... how about we call it subject-id and have it set to element 0 of <data>.
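The TOP action method might then be extended along these lines (a sketch showing just the changed method):

method TOP($/) {
    make { subject    => $<subject>.Str,
           command    => $<command>.Str,
           data       => $<data>.made,
           subject-id => $<data>.made[0] }
}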
Now we can do this instead:
my $uri = '/product/update/7/notify';
my $rest = REST.parse($uri, actions => REST-actions.new).made;

say $rest<command>;     # OUTPUT: «update»
say $rest<subject>;     # OUTPUT: «product»
say $rest<subject-id>;  # OUTPUT: «7»
Here's the final code:
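Assembled from the sketches above, the complete grammar and action class would look something like this:

grammar REST {
    token TOP { <slash><subject><slash><command>[<slash><data>]? }

    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }

    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}

class REST-actions {
    method TOP($/) {
        make { subject    => $<subject>.Str,
               command    => $<command>.Str,
               data       => $<data>.made,
               subject-id => $<data>.made[0] }
    }

    method data($/) { make $/.split('/') }
}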
Add actions directly
Above we see how to associate grammars with action objects and perform actions on the match object. However, that isn't the only way to deal with the match object. See the example below:
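For instance, consider a small grammar along the following lines (a sketch; the token names and the inline code blocks are illustrative, chosen to produce the output shown below):

grammar G {
    rule TOP { <function-define> }
    rule function-define {
        'sub' <identifier>
        {
            say "func " ~ $<identifier>.made;
            make $<identifier>.made;
        }
        '(' <parameter> ')' '{' '}'
        {
            say "end " ~ $/.made;
        }
    }
    token identifier { \w+ { make ~$/ } }
    token parameter  { \w+ { say "param " ~ $/ } }
}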
G.parse('sub f ( a ) { }');
# OUTPUT: «func f␤param a␤end f␤»
This example is a reduced portion of a parser; let's focus on the features it shows. First, we can add actions inside the grammar itself, and such actions are performed as soon as the control flow of the regex arrives at them, whereas an action object's method is only performed after the whole regex item has matched. Second, it shows what make really does, which is no more than sugar for $/.made = .... This trick introduces a way to pass messages from within a regex item.
Hopefully this has helped introduce you to grammars in Perl 6 and shown you how grammars and grammar action classes work together. For more information, check out the more advanced Perl Grammar Guide.
For more grammar debugging, see Grammar::Debugger. This provides breakpoints and color-coded MATCH and FAIL output for each of your grammar tokens.