
Recursive Descent Parsers and pyparsing

Yesterday, while browsing the table of contents of the May 2008 issue of Python Magazine, I came across a reference to the pyparsing module - a Python module for writing recursive descent parsers using familiar Python syntax.

O'Reilly's Python DevCenter has an excellent introduction to using this module, entitled "Building Recursive Descent Parsers with Python". Well worth a read.

It just so happens that I have a number of projects which are stalled because writing code to parse complexly structured data is not my strong point. I enjoy parsing up text line by line as much as the next guy but this recursive stuff I find tedious.

The ISC DHCP configuration file is, in my opinion, a good example of parsing complexity. Its configuration directives can contain many optional directives, can be nested, and can be all on a single line or broken up over multiple lines. Writing the parser using pyparsing makes this much simpler.

Here is an example that uses pyparsing to parse a few host definitions. While simple, it is quite flexible.


#!/usr/bin/python
from pyparsing import *

# Define the grammar we will use
digits = "0123456789"
colon = Literal(':')
semi = Literal(';')
period = Literal('.')
comma = Literal(',')
lparen = Literal('{')
rparen = Literal('}')
number = Word(digits)
hexint = Word(hexnums,exact=2)
text = Word(alphas)

# Define host configuration specific grammar
host_keyword = Literal('host')
hardware_keyword = Literal('hardware')
ethernet_keyword = Literal('ethernet')
address_keyword = Literal('fixed-address')
mac = Combine(hexint + colon + hexint + colon + hexint + colon + hexint + colon + hexint + colon + hexint).setResultsName("mac_address")
ip = Combine(number + period + number + period + number + period + number)
ips = delimitedList(ip).setResultsName("ip_addresses")
hostname = Combine(text + period + text + period + text).setResultsName("hostname")
ethernet_statement = hardware_keyword + ethernet_keyword + mac + semi

ipaddress_statement = address_keyword + ips + semi
x = host_keyword + hostname + lparen + Optional(ethernet_statement) + Optional(ipaddress_statement) + rparen

# Here is some sample data for us to parse

host_declaration = """
host a.foo.bar {
hardware ethernet 00:11:22:33:44:55;
fixed-address 192.168.100.10, 192.168.200.50;
}

host b.foo.bar {
hardware ethernet 00:0f:12:34:56:78;
fixed-address 192.168.100.20;
}

host c.foo.bar { hardware ethernet 00:0e:12:34:50:70; fixed-address 192.168.100.40; }
"""

# Do the parsing

results = x.scanString(host_declaration)

# Print out the stuff we're interested in.
# scanString yields (tokens, start, end) tuples, so result[0] holds the parsed tokens.

for result in results:
    print(result[0].hostname, result[0].mac_address, result[0].ip_addresses)


The output looks like this:


a.foo.bar 00:11:22:33:44:55 ['192.168.100.10', '192.168.200.50']
b.foo.bar 00:0f:12:34:56:78 ['192.168.100.20']
c.foo.bar 00:0e:12:34:50:70 ['192.168.100.40']
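
The loop indexes result[0] because scanString yields a (tokens, start, end) tuple for each match rather than the tokens alone. Here is a minimal sketch of that behaviour using a cut-down grammar; the names host_header, sample and matches are made up for illustration and are not part of the script above.

```python
from pyparsing import Combine, Literal, Word, alphas

# Cut-down grammar: only the "host <name>" header, not the full stanza.
name = Word(alphas)
hostname = Combine(name + "." + name + "." + name)("hostname")
host_header = Literal("host") + hostname

sample = "host a.foo.bar { } host b.foo.bar { }"

# scanString skips text that does not match and yields one
# (tokens, start, end) tuple per match.
matches = [
    (tokens.hostname, sample[start:end])
    for tokens, start, end in host_header.scanString(sample)
]

for found_name, matched_text in matches:
    print(found_name, "<-", repr(matched_text))
```

Unpacking the tuple in the for statement avoids the repeated result[0] lookups and gives you the match offsets if you need them.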

Comments

ptmcg said…
Welcome to the world of pyparsing! Very nice simple example program, and it shows the pyparsing basics pretty well.

It has been quite a while since I wrote that article, and some of my pyparsing idioms have evolved a bit (as well as new features in pyparsing, too). Here are some that I especially like in writing my parsers:

1. Define many punctuation tokens using map. Instead of:

colon = Literal(':')
semi = Literal(';')
period = Literal('.')
comma = Literal(',')
lparen = Literal('{')
rparen = Literal('}')

I've generally settled on:

colon,semi,period,comma,lparen,rparen = map(Literal,':;.,{}')

(BTW, lparen and rparen usually map to '(' and ')', while '{' and '}' are more typically called lbrace and rbrace.)
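
As a rough sketch of tip 1 with the brace naming applied (the block expression and the sample string are assumptions made up for this illustration):

```python
from pyparsing import Literal, Word, alphas

# One Literal per character of the string, unpacked into six names.
colon, semi, period, comma, lbrace, rbrace = map(Literal, ':;.,{}')

# Hypothetical expression, just to show the tokens in use.
block = lbrace + Word(alphas) + semi + rbrace

tokens = block.parseString("{ example; }").asList()
print(tokens)
```

The unpacking only works if the number of names on the left matches the number of characters in the string, so a mismatch fails loudly at import time.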

2. Short form of setResultsName. Pyparsing 1.4.7 introduced this shortened form - instead of:

ips = delimitedList(ip).setResultsName("ip_addresses")

you can simply write:

ips = delimitedList(ip)("ip_addresses")

I also tend to leave my results name assignments to higher-level expressions, so that I can define a base level token like 'ip_addr', and use it multiple times in a larger expression like:

msg = timestamp("time") + ip_addr("sender") + ip_addr("receiver") + integer("msg_length")

I think that the short form of setResultsName makes this higher-level naming a little easier to follow.
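
A minimal sketch of this higher-level naming, assuming simple definitions for integer and ip_addr (the comment does not show them) and leaving out the timestamp field to keep it short:

```python
from pyparsing import Combine, Word, nums

# Assumed base-level tokens; they carry no results names themselves.
integer = Word(nums)
ip_addr = Combine(integer + ("." + integer) * 3)

# Names are attached where the tokens are used. Calling an expression
# with a string returns a named copy, so ip_addr can appear twice.
msg = ip_addr("sender") + ip_addr("receiver") + integer("msg_length")

result = msg.parseString("192.168.100.10 192.168.100.20 512")
print(result.sender, result.receiver, result.msg_length)
```

Note that msg_length comes back as the string "512"; converting it to an int would need a parse action or a manual int() call.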

3. Since 1.4.10, repetitive expressions like your mac address can be built up using a multiplication operator:

_hex2 = Word(hexnums,exact=2)
macAddr = Combine( _hex2 + (":" + _hex2) * 5 )

A number of common expressions like this one can be found at the publicly-editable wiki page http://pyparsing-public.wikispaces.com/Helpful+Expressions.
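
The multiplication form from tip 3 can be tried out like this. It is only a sketch: the sample addresses and the too_short_accepted flag are made up for illustration.

```python
from pyparsing import Combine, ParseException, Word, hexnums

_hex2 = Word(hexnums, exact=2)
# One hex pair followed by exactly five ":" + hex pair groups.
mac_addr = Combine(_hex2 + (":" + _hex2) * 5)

parsed = mac_addr.parseString("00:0f:12:34:56:78")[0]
print(parsed)

# An address with too few groups should be rejected.
try:
    mac_addr.parseString("00:0f:12")
    too_short_accepted = True
except ParseException:
    too_short_accepted = False
print("short address accepted:", too_short_accepted)
```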


4. In your example, don't forget that hostnames permit characters in addition to alphas; if you extend this example for other parsers, you might do better with something like:

hostnamepart = Word(alphas, alphanums+"_")
hostname = Combine( hostnamepart + ZeroOrMore("." + hostnamepart) )

or even more compact:

hostname = delimitedList( Word(alphas, alphanums+"_"), delim='.', combine=True )
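
To see the difference this makes, here is a small sketch comparing the original three-part, letters-only hostname with the delimitedList version. The names strict, flexible and sample_name are assumptions for this illustration.

```python
from pyparsing import (Combine, ParseException, Word, alphas, alphanums,
                       delimitedList)

# Original style: exactly three dot-separated parts, letters only.
strict = Combine(Word(alphas) + "." + Word(alphas) + "." + Word(alphas))

# Suggested style: any number of parts, digits and "_" allowed after
# the first character of each part.
flexible = delimitedList(Word(alphas, alphanums + "_"), delim=".",
                         combine=True)

sample_name = "web01.dmz.foo.bar"

try:
    strict.parseString(sample_name)
    strict_matched = True
except ParseException:
    strict_matched = False

flexible_result = flexible.parseString(sample_name)[0]
print(strict_matched, flexible_result)
```

Hyphens are not in this character set, so a name such as "web-01" would need "-" added to the second argument of Word.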


Of course, all of these items are mostly a matter of taste and personal style, and I've tried to maintain as much backward compatibility as possible so that older examples still work with the latest pyparsing versions. There are many other examples online at the wiki, too.

Cheers, and thanks for putting together your blog post on pyparsing!
-- Paul
Craig said…
Hi Paul,

Thanks very much for the comments! I really appreciate the feedback.

I have done some extra work on the script since I wrote the original post and I've now also incorporated your suggestions.

Thanks for the repetitive expressions tip - they make things look much neater. I had originally tried to use them but they didn't work because the Ubuntu Hardy package was still 1.4.7. Upgrading to 1.4.11 via source package fixed that problem.

I'll post an updated version of the script in the next day or so.

Craig
Anonymous said…
I made a useful tool using your pyparsing of dhcpd.conf for vyatta linux router users.

http://ben.nerp.net/dhcpd-to-vyatta.py

This script takes the dhcpd.conf format and converts it into something vyatta's command syntax can use. Unfortunately the command syntax for dhcp static entries is limited and there's no way to translate filename entries at this time.
Craig said…
Hi Ben,

Glad you found it useful.

Craig
