Generating Parsers

In this part of the tutorial we will generate a parser for simple mathematical expressions as defined by the following BNF grammar:

<expression> ::= "\d+"
               | <expression> "+" <expression>
               | <expression> "-" <expression>
               | <expression> "*" <expression>
               | <expression> "/" <expression>
               | "(" <expression> ")"

Furthermore we use the following lexer:

from rply import LexerGenerator

lg = LexerGenerator()

lg.add('NUMBER', r'\d+')
lg.add('PLUS', r'\+')
lg.add('MINUS', r'-')
lg.add('MUL', r'\*')
lg.add('DIV', r'/')
lg.add('OPEN_PARENS', r'\(')
lg.add('CLOSE_PARENS', r'\)')

lg.ignore('\s+')

lexer = lg.build()

Before we begin working on the parser, we define ourselves an abstract syntax tree:

from rply.token import BaseBox

class Number(BaseBox):
    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value

class BinaryOp(BaseBox):
    def __init__(self, left, right):
        self.left = left
        self.right = right

class Add(BinaryOp):
    def eval(self):
        return self.left.eval() + self.right.eval()

class Sub(BinaryOp):
    def eval(self):
        return self.left.eval() - self.right.eval()

class Mul(BinaryOp):
    def eval(self):
        return self.left.eval() * self.right.eval()

class Div(BinaryOp):
    def eval(self):
        return self.left.eval() / self.right.eval()

In RPython variables must have a specific type, so we use polymorphism with BaseBox to ensure that. In your own code you can achieve the same by inheriting from BaseBox as we did here. If you are not writing RPython code, you can ignore this completely.

Having covered all that we actually start working on the parser itself:

from rply import ParserGenerator

pg = ParserGenerator(
    # A list of all token names, accepted by the parser.
    ['NUMBER', 'OPEN_PARENS', 'CLOSE_PARENS',
     'PLUS', 'MINUS', 'MUL', 'DIV'
    ],
    # A list of precedence rules with ascending precedence, to
    # disambiguate ambiguous production rules.
    precedence=[
        ('left', ['PLUS', 'MINUS']),
        ('left', ['MUL', 'DIV'])
    ]
)

@pg.production('expression : NUMBER')
def expression_number(p):
    # p is a list of the pieces matched by the right hand side of the
    # rule
    return Number(int(p[0].getstr()))

@pg.production('expression : OPEN_PARENS expression CLOSE_PARENS')
def expression_parens(p):
    return p[1]

@pg.production('expression : expression PLUS expression')
@pg.production('expression : expression MINUS expression')
@pg.production('expression : expression MUL expression')
@pg.production('expression : expression DIV expression')
def expression_binop(p):
    left = p[0]
    right = p[2]
    if p[1].gettokentype() == 'PLUS':
        return Add(left, right)
    elif p[1].gettokentype() == 'MINUS':
        return Sub(left, right)
    elif p[1].gettokentype() == 'MUL':
        return Mul(left, right)
    elif p[1].gettokentype() == 'DIV':
        return Div(left, right)
    else:
        raise AssertionError('Oops, this should not be possible!')

parser = pg.build()

As you can see production rules define a sequence of terminals (tokens) and non-terminals (intermediate values, in this case only expression) using the production() decorator. The function receives a list of the tokens and non-terminals and returns a non-terminal.

In this case we create an abstract syntax tree. We can now use this parser in combination with the lexer given to parse and evaluate mathematical expressions as defined by our grammar:

>>> parser.parse(lexer.lex('1 + 1')).eval()
2
>>> parser.parse(lexer.lex('1 + 2 * 3')).eval()
7

Error Handling

As long as we parse code that is well formed according to our grammar, all is fine but one of the more difficult problems in writing a parser is handling errors.

Per default in case of an error you get a rply.ParsingError:

>>> parser.parse(lexer.lex('1 1'))

This error will not provide any information apart from the position at which it occurred accessible through getsourcepos().

While it is not possible to recover from an error, you can define your own error handler:

@pg.error
def error_handler(token):
    raise ValueError("Ran into a %s where it wasn't expected" % token.gettokentype())

The token passed to the error handler will be the token the parser errored on.

Maintaining State

Sometimes it can be useful to have additional state within the parser, for example as a way to pass information to the parser about the name of the file currently being parsed.

In order to do this we simply define a state object to pass around:

class ParserState(object):
    def __init__(self, filename):
        self.filename = filename

We can pass ParserState objects to the parser simply like this:

parser.parse(lexer.lex(source), state=ParserState('foo.py'))

This will call every production rule and the error handler with the ParserState instance as first argument.

Precedence on rules

Sometimes it is useful to give a rule a manual precedence. For this pass the precedence argument to production. For example, if we wanted to add an implicit multiplication rule to the above language (so that e.g. 16 32 is parsed the same as 16 * 32) we use the following:

@pg.production('expression : expression expression', precedence='MUL')
def implicit_multiplication(p):
    return Mul(p[0], p[1])