Generating Parsers¶
In this part of the tutorial we will generate a parser for simple mathematical expressions as defined by the following BNF grammar:
<expression> ::= "\d+"
| <expression> "+" <expression>
| <expression> "-" <expression>
| <expression> "*" <expression>
| <expression> "/" <expression>
| "(" <expression> ")"
Furthermore we use the following lexer:
from rply import LexerGenerator
lg = LexerGenerator()
lg.add('NUMBER', r'\d+')
lg.add('PLUS', r'\+')
lg.add('MINUS', r'-')
lg.add('MUL', r'\*')
lg.add('DIV', r'/')
lg.add('OPEN_PARENS', r'\(')
lg.add('CLOSE_PARENS', r'\)')
lg.ignore('\s+')
lexer = lg.build()
Before we begin working on the parser, we define ourselves an abstract syntax tree:
from rply.token import BaseBox
class Number(BaseBox):
def __init__(self, value):
self.value = value
def eval(self):
return self.value
class BinaryOp(BaseBox):
def __init__(self, left, right):
self.left = left
self.right = right
class Add(BinaryOp):
def eval(self):
return self.left.eval() + self.right.eval()
class Sub(BinaryOp):
def eval(self):
return self.left.eval() - self.right.eval()
class Mul(BinaryOp):
def eval(self):
return self.left.eval() * self.right.eval()
class Div(BinaryOp):
def eval(self):
return self.left.eval() / self.right.eval()
In RPython variables must have a specific type, so we use polymorphism with
BaseBox
to ensure that. In your own code you can achieve
the same by inheriting from BaseBox
as we did here. If
you are not writing RPython code, you can ignore this completely.
Having covered all that we actually start working on the parser itself:
from rply import ParserGenerator
pg = ParserGenerator(
# A list of all token names, accepted by the parser.
['NUMBER', 'OPEN_PARENS', 'CLOSE_PARENS',
'PLUS', 'MINUS', 'MUL', 'DIV'
],
# A list of precedence rules with ascending precedence, to
# disambiguate ambiguous production rules.
precedence=[
('left', ['PLUS', 'MINUS']),
('left', ['MUL', 'DIV'])
]
)
@pg.production('expression : NUMBER')
def expression_number(p):
# p is a list of the pieces matched by the right hand side of the
# rule
return Number(int(p[0].getstr()))
@pg.production('expression : OPEN_PARENS expression CLOSE_PARENS')
def expression_parens(p):
return p[1]
@pg.production('expression : expression PLUS expression')
@pg.production('expression : expression MINUS expression')
@pg.production('expression : expression MUL expression')
@pg.production('expression : expression DIV expression')
def expression_binop(p):
left = p[0]
right = p[2]
if p[1].gettokentype() == 'PLUS':
return Add(left, right)
elif p[1].gettokentype() == 'MINUS':
return Sub(left, right)
elif p[1].gettokentype() == 'MUL':
return Mul(left, right)
elif p[1].gettokentype() == 'DIV':
return Div(left, right)
else:
raise AssertionError('Oops, this should not be possible!')
parser = pg.build()
As you can see production rules define a sequence of terminals (tokens) and
non-terminals (intermediate values, in this case only expression) using
the production()
decorator. The function
receives a list of the tokens and non-terminals and returns a non-terminal.
In this case we create an abstract syntax tree. We can now use this parser in combination with the lexer given to parse and evaluate mathematical expressions as defined by our grammar:
>>> parser.parse(lexer.lex('1 + 1')).eval()
2
>>> parser.parse(lexer.lex('1 + 2 * 3')).eval()
7
Error Handling¶
As long as we parse code that is well formed according to our grammar, all is fine but one of the more difficult problems in writing a parser is handling errors.
Per default in case of an error you get a rply.ParsingError
:
>>> parser.parse(lexer.lex('1 1'))
This error will not provide any information apart from the position at which
it occurred accessible through getsourcepos()
.
While it is not possible to recover from an error, you can define your own error handler:
@pg.error
def error_handler(token):
raise ValueError("Ran into a %s where it wasn't expected" % token.gettokentype())
The token passed to the error handler will be the token the parser errored on.
Maintaining State¶
Sometimes it can be useful to have additional state within the parser, for example as a way to pass information to the parser about the name of the file currently being parsed.
In order to do this we simply define a state object to pass around:
class ParserState(object):
def __init__(self, filename):
self.filename = filename
We can pass ParserState objects to the parser simply like this:
parser.parse(lexer.lex(source), state=ParserState('foo.py'))
This will call every production rule and the error handler with the ParserState instance as first argument.
Precedence on rules¶
Sometimes it is useful to give a rule a manual precedence. For this pass the precedence argument to production. For example, if we wanted to add an implicit multiplication rule to the above language (so that e.g. 16 32 is parsed the same as 16 * 32) we use the following:
@pg.production('expression : expression expression', precedence='MUL')
def implicit_multiplication(p):
return Mul(p[0], p[1])