Generating Lexers
In order to parse text, you first have to turn that text into individual tokens
with a lexer. Such a lexer can be generated with rply.LexerGenerator.
Lexers are generated by adding rules to a LexerGenerator instance. Such a
rule consists of a name, which will be used as the type of the token generated
with that rule, and a regular expression defining the piece of text to be
matched.
As an example we will attempt to generate a lexer for simple mathematical expressions:
from rply import LexerGenerator

lg = LexerGenerator()
lg.add('NUMBER', r'\d+')
lg.add('PLUS', r'\+')
lg.add('MINUS', r'-')
We have now defined rules for numbers and for the addition and subtraction operators. We can now build a lexer and use it:
>>> l = lg.build()
>>> for token in l.lex('1+1-1'):
...     print(token)
...
Token('NUMBER', '1')
Token('PLUS', '+')
Token('NUMBER', '1')
Token('MINUS', '-')
Token('NUMBER', '1')
This works quite nicely; however, there is one small problem:
>>> for token in l.lex('1 + 1'):
...     print(token)
...
Token('NUMBER', '1')
Traceback (most recent call last):
...
LexingError
What happened is that the lexer is able to match the '1' at the beginning of
the string and yields the correct token for it, but afterwards the string
' + 1' is left and no rule matches.
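To make the failure concrete, the matching loop can be sketched in plain Python with the standard re module. This is a simplified illustration under stated assumptions, not rply's implementation: the RULES table and the first_unmatched_index helper are hypothetical names, and rules are assumed to be tried in the order they were added.

```python
import re

# Hypothetical rule table mirroring the three rules added above.
RULES = [
    ("NUMBER", re.compile(r"\d+")),
    ("PLUS", re.compile(r"\+")),
    ("MINUS", re.compile(r"-")),
]


def first_unmatched_index(text):
    """Return the index at which no rule matches, or None if all text is consumed."""
    pos = 0
    while pos < len(text):
        for _name, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                # A rule matched here, so continue after the matched text.
                pos = match.end()
                break
        else:
            # No rule matched at this position; a lexer has to give up here.
            return pos
    return None


print(first_unmatched_index("1+1-1"))  # None
print(first_unmatched_index("1 + 1"))  # 1
```

For '1 + 1' the sketch stops at index 1, the space directly after the first number, which is where the remaining string ' + 1' begins.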
While we do want lexing to continue at that stage, we do not care about
whitespace and would like to ignore it. This can be done using ignore():
lg.ignore(r'\s+')
This adds a rule which will be ignored by the lexer and not produce any tokens:
>>> l = lg.build()
>>> for token in l.lex('1 + 1'):
...     print(token)
...
Token('NUMBER', '1')
Token('PLUS', '+')
Token('NUMBER', '1')
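The effect of an ignore rule can be sketched with the standard re module as well. The lex function below is a hypothetical stand-in written for illustration, not rply's code. It assumes that ignore patterns are checked before token rules at each position and that text matched by an ignore pattern is skipped without yielding anything.

```python
import re


def lex(rules, ignore, text):
    """Yield (name, value) pairs, skipping text matched by an ignore pattern."""
    pos = 0
    while pos < len(text):
        for pattern in ignore:
            match = pattern.match(text, pos)
            if match:
                # Ignored text is consumed but produces no token.
                break
        else:
            for name, pattern in rules:
                match = pattern.match(text, pos)
                if match:
                    yield name, match.group()
                    break
            else:
                raise ValueError(f"no rule matches at index {pos}")
        pos = match.end()


rules = [
    ("NUMBER", re.compile(r"\d+")),
    ("PLUS", re.compile(r"\+")),
    ("MINUS", re.compile(r"-")),
]
ignore = [re.compile(r"\s+")]

print(list(lex(rules, ignore, "1 + 1")))
# [('NUMBER', '1'), ('PLUS', '+'), ('NUMBER', '1')]
```

With an empty ignore list the same input raises ValueError at index 1, which corresponds to the LexingError shown earlier.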
With this you know the essentials of generating lexers.