Recursive Descent Parsers and pyparsing

Yesterday, while browsing the table of contents of the May 2008 issue of Python Magazine, I came across a reference to the pyparsing module, a Python module for writing recursive descent parsers using familiar Python syntax.

O'Reilly's Python DevCenter has an excellent introduction to using this module entitled Building Recursive Descent Parsers with Python. Well worth a read.

It just so happens that I have a number of projects which are stalled because writing code to parse complexly structured data is not my strong point. I enjoy parsing text line by line as much as the next guy, but I find this recursive stuff tedious.

The ISC DHCP configuration file is, in my opinion, a good example of parsing complexity. Its configuration directives can contain many optional directives, can be nested, and can be all on a single line or broken up over multiple lines. Writing the parser using pyparsing makes this much simpler.

Here is a simple example that uses pyparsing to parse a few host definitions. Simple as it is, the grammar is quite flexible.


#!/usr/bin/python
from pyparsing import *

# Define the grammar we will use
digits = "0123456789"
colon = Literal(':')
semi = Literal(';')
period = Literal('.')
comma = Literal(',')
lparen = Literal('{')
rparen = Literal('}')
number = Word(digits)
hexint = Word(hexnums,exact=2)
text = Word(alphas)

# Define host configuration specific grammar
host_keyword = Literal('host')
hardware_keyword = Literal('hardware')
ethernet_keyword = Literal('ethernet')
address_keyword = Literal('fixed-address')
mac = Combine(hexint + colon + hexint + colon + hexint + colon + hexint + colon + hexint + colon + hexint).setResultsName("mac_address")
ip = Combine(number + period + number + period + number + period + number)
ips = delimitedList(ip).setResultsName("ip_addresses")
hostname = Combine(text + period + text + period + text).setResultsName("hostname")
ethernet_statement = hardware_keyword + ethernet_keyword + mac + semi

ipaddress_statement = address_keyword + ips + semi
x = host_keyword + hostname + lparen + Optional(ethernet_statement) + Optional(ipaddress_statement) + rparen

# Here is some sample data for us to parse

host_declaration = """
host a.foo.bar {
    hardware ethernet 00:11:22:33:44:55;
    fixed-address 192.168.100.10, 192.168.200.50;
}

host b.foo.bar {
    hardware ethernet 00:0f:12:34:56:78;
    fixed-address 192.168.100.20;
}

host c.foo.bar { hardware ethernet 00:0e:12:34:50:70; fixed-address 192.168.100.40; }
"""

# Do the parsing

results = x.scanString(host_declaration)

# Print out the stuff we're interested in

for result in results:
    print(result[0].hostname, result[0].mac_address, result[0].ip_addresses)

The output looks like this:


a.foo.bar 00:11:22:33:44:55 ['192.168.100.10', '192.168.200.50']
b.foo.bar 00:0f:12:34:56:78 ['192.168.100.20']
c.foo.bar 00:0e:12:34:50:70 ['192.168.100.40']
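
One detail in the script above that is easy to miss: scanString does not return the parsed tokens directly. It yields one (tokens, start, end) tuple per match, which is why the loop reads result[0]. Here is a minimal sketch of that behaviour on its own. The ip expression and the sample text are cut down from the script above, and the names matches, value, start and end are just for illustration.

```python
from pyparsing import Combine, Word, nums

# A loose dotted-quad matcher (it does not range-check the octets)
octet = Word(nums)
ip = Combine(octet + "." + octet + "." + octet + "." + octet)

text = "fixed-address 192.168.100.10, 192.168.200.50;"

# scanString yields a (tokens, start, end) tuple for every match it finds;
# start and end are character offsets into the scanned string
matches = [(tokens[0], start, end) for tokens, start, end in ip.scanString(text)]

for value, start, end in matches:
    # Slicing the input with the offsets gives back the matched text
    print(value, text[start:end])
```

Unpacking the tuple in the for statement (for tokens, start, end in ...) reads a little more clearly than indexing with result[0], and the offsets come in handy if you want to report where in the file a host block was found.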

Comments

ptmcg said…

Welcome to the world of pyparsing! Very nice simple example program, and it shows the pyparsing basics pretty well.

It has been quite a while since I wrote that article, and some of my pyparsing idioms have evolved a bit (as well as new features in pyparsing, too). Here are some that I especially like in writing my parsers:

1. Define many punctuation tokens using map. Instead of:

colon = Literal(':')
semi = Literal(';')
period = Literal('.')
comma = Literal(',')
lparen = Literal('{')
rparen = Literal('}')

I've generally settled on:

colon,semi,period,comma,lparen,rparen = map(Literal,':;.,{}')

(BTW, lparen and rparen usually map to '(' and ')', while '{' and '}' are more typically called lbrace and rbrace.)

2. Short form of setResultsName. Pyparsing 1.4.7 introduced this shortened form - instead of:

ips = delimitedList(ip).setResultsName("ip_addresses")

you can simply write:

ips = delimitedList(ip)("ip_addresses")

I also tend to leave my results name assignments to higher-level expressions, so that I can define a base level token like 'ip_addr', and use it multiple times in a larger expression like:

msg = timestamp("time") + ip_addr("sender") + ip_addr("receiver") + integer("msg_length")

I think that the short form of setResultsName makes this higher-level naming a little easier to follow.

3. Since 1.4.10, repetitive expressions like your mac address can be built up using a multiplication operator:

_hex2 = Word(hexnums,exact=2)
macAddr = Combine( _hex2 + (":" + _hex2) * 5 )

A number of common expressions like this one can be found at the publicly-editable wiki page http://pyparsing-public.wikispaces.com/Helpful+Expressions.

4. In your example, don't forget that hostnames permit characters in addition to alphas; if you extend this example for other parsers, you might do better with something like:

hostnamepart = Word(alphas, alphanums+"_")
hostname = Combine( hostnamepart + ZeroOrMore("." + hostnamepart) )

or even more compact:

hostname = delimitedList( Word(alphas, alphanums+"_"), delim='.', combine=True )
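
To see how the four suggestions fit together, here is a rough sketch of the host grammar from the post rewritten with them. This is only an illustration, not the updated script mentioned further down: the names hex2, lbrace, rbrace, host, sample and hosts are my own choices, and it assumes pyparsing 1.4.10 or later for the * operator.

```python
from pyparsing import (Combine, Literal, Optional, Word, ZeroOrMore,
                       alphanums, alphas, delimitedList, hexnums, nums)

# Tip 1: build the punctuation tokens in one go (braces named lbrace/rbrace)
semi, lbrace, rbrace = map(Literal, ";{}")

# Tip 3: repeat a sub-expression with the * operator
hex2 = Word(hexnums, exact=2)
mac = Combine(hex2 + (":" + hex2) * 5)
number = Word(nums)
ip = Combine(number + ("." + number) * 3)

# Tip 4: hostname parts may contain more than letters, and any number of parts
hostnamepart = Word(alphas, alphanums + "_")
hostname = Combine(hostnamepart + ZeroOrMore("." + hostnamepart))

# Tip 2: short-form results names, attached in the higher-level expressions
ethernet_statement = Literal("hardware") + "ethernet" + mac("mac_address") + semi
ipaddress_statement = Literal("fixed-address") + delimitedList(ip)("ip_addresses") + semi
host = (Literal("host") + hostname("hostname") + lbrace
        + Optional(ethernet_statement) + Optional(ipaddress_statement) + rbrace)

sample = """
host web01.foo.bar { hardware ethernet 00:0e:12:34:50:70; fixed-address 192.168.100.40, 192.168.200.50; }
"""

# Collect (hostname, mac, [ips]) for every host block found in the sample
hosts = [(tokens.hostname, tokens.mac_address, list(tokens.ip_addresses))
         for tokens, start, end in host.scanString(sample)]
print(hosts)
```

Note that the hostname web01.foo.bar would not have matched the original hostname expression, which accepted exactly three parts made of letters only.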

Of course, all of these items are mostly a matter of taste and personal style, and I've tried to maintain as much backward compatibility as possible so that older examples still work with the latest pyparsing versions. There are many other examples online at the wiki, too.

Cheers, and thanks for putting together your blog post on pyparsing!
-- Paul

June 5, 2008 at 3:40 PM

Craig said…

Hi Paul,

Thanks very much for the comments! I really appreciate the feedback.

I have done some extra work on the script since I wrote the original post and I've now also incorporated your suggestions.

Thanks for the repetitive expressions tip - they make things look much neater. I had originally tried to use them but they didn't work because the Ubuntu Hardy package was still 1.4.7. Upgrading to 1.4.11 via source package fixed that problem.

I'll post an updated version of the script in the next day or so.

Craig

June 5, 2008 at 5:54 PM

Anonymous said…

I made a useful tool using your pyparsing of dhcpd.conf for vyatta linux router users.

http://ben.nerp.net/dhcpd-to-vyatta.py

This script takes the dhcpd.conf format and converts it into something vyatta's command syntax can use. Unfortunately the command syntax for dhcp static entries is limited and there's no way to translate filename entries at this time.

November 10, 2011 at 7:58 AM

Craig said…

Hi Ben,

Glad you found it useful.

Craig

November 10, 2011 at 11:23 AM

Craig Balfour's Blog
