Generating Lexers

In order to parse text, you first have to turn that text into individual tokens with a lexer. Such a lexer can be generated with rply.LexerGenerator.

Lexers are generated by adding rules to a LexerGenerator instance. Such a rule consists of a name, which will be used as the type of the token generated with that rule, and a regular expression defining the piece of text to be matched.

As an example we will attempt to generate a lexer for simple mathematical expressions:

from rply import LexerGenerator

lg = LexerGenerator()

lg.add('NUMBER', r'\d+')

lg.add('PLUS', r'\+')
lg.add('MINUS', r'-')

We have now defined rules for numbers and the addition and subtraction operators. We can now build a lexer and use it:

>>> l = lg.build()
>>> for token in l.lex('1+1-1'):
...     print(token)
...
Token('NUMBER', '1')
Token('PLUS', '+')
Token('NUMBER', '1')
Token('MINUS', '-')
Token('NUMBER', '1')
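Each token records its type and the matched text; rply's Token exposes these through the gettokentype() and getstr() methods (and getsourcepos() for position information). A quick sketch:

>>> token = next(l.lex('42'))
>>> token.gettokentype()
'NUMBER'
>>> token.getstr()
'42'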

This works quite nicely. However, there is a small problem:

>>> for token in l.lex('1 + 1'):
...     print(token)
...
Token('NUMBER', '1')
Traceback (most recent call last):
...
LexingError

What happened is that the lexer matched the '1' at the beginning of the string and yielded the correct token for it, but afterwards the string ' + 1' was left and no rule matched it, so a LexingError was raised.
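If you want to handle such failures instead of letting them propagate, you can catch the error. A small sketch, assuming the lexer built above; LexingError is importable from rply, and its getsourcepos() method reports where lexing failed:

from rply import LexingError

try:
    for token in l.lex('1 + 1'):
        print(token)
except LexingError as err:
    # The source position records where no rule matched.
    print('Lexing failed at index', err.getsourcepos().idx)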

While we do want lexing to continue at that stage, we do not care about whitespace and would like to ignore it. This can be done using ignore():

lg.ignore(r'\s+')

This adds a rule whose matches the lexer will skip without producing any tokens:

>>> l = lg.build()
>>> for token in l.lex('1 + 1'):
...     print(token)
...
Token('NUMBER', '1')
Token('PLUS', '+')
Token('NUMBER', '1')
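
Putting everything together, the complete lexer from this section looks like this:

from rply import LexerGenerator

lg = LexerGenerator()

lg.add('NUMBER', r'\d+')
lg.add('PLUS', r'\+')
lg.add('MINUS', r'-')

lg.ignore(r'\s+')

l = lg.build()

for token in l.lex('1 + 1 - 1'):
    print(token)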

With this you know everything there is to know about generating lexers.