Lexical analysis

@token(regex, types=None)

Decorator for token definitions in classes derived from LexerBase.

Parameters:
  • regex (str) – A regular expression defining the possible token values
  • types (list of str) – A list of token types that this method can recognize. If omitted, the token type is the method’s name.

Basic usage:

from ptk.lexer import ReLexer, token

class MyLexer(ReLexer):
    @token(r'[a-zA-Z_][a-zA-Z0-9_]*')
    def identifier(self, tok):
        pass

This will define an identifier token type, whose value is the recognized string. The tok parameter has two attributes, type and value. You can modify the value in place:

from ptk.lexer import ReLexer, token

class MyLexer(ReLexer):
    @token(r'[1-9][0-9]*')
    def number(self, tok):
        tok.value = int(tok.value)

In some cases it may be necessary to change the token’s type as well, for instance to disambiguate between identifiers that are builtins and other identifiers. So that the lexer knows which token types this method can generate, pass a list of strings as the types parameter:

from ptk.lexer import ReLexer, token

class MyLexer(ReLexer):
    @token(r'[a-zA-Z_][a-zA-Z0-9_]*', types=['builtin', 'identifier'])
    def identifier_or_builtin(self, tok):
        tok.type = 'builtin' if tok.value in ['len', 'id'] else 'identifier'

In this case the type attribute defaults to None and you must set it. Leaving the type as None (or explicitly setting it to None) causes the token to be ignored.
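The ignore-on-None rule can be sketched outside ptk as a tiny dispatcher that runs the token callback and then drops the token if its type is still None. The names below (Tok, run_callback, builtin_only) are illustrative stand-ins, not part of ptk's API:

```python
class Tok:
    """Minimal stand-in for ptk's token object, with type and value."""
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

def run_callback(callback, value, default_type):
    """Apply a token callback; return the token, or None if it was ignored."""
    tok = Tok(default_type, value)
    callback(tok)
    return tok if tok.type is not None else None

# A callback in the spirit of identifier_or_builtin above, except that it
# keeps only builtins and ignores everything else by leaving type as None:
def builtin_only(tok):
    tok.type = 'builtin' if tok.value in ('len', 'id') else None

print(run_callback(builtin_only, 'len', None).type)   # builtin
print(run_callback(builtin_only, 'foo', None))        # None: token ignored
```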

Note

The type of token values depends on the type of the strings used to define the regular expressions. Unicode expressions will hold Unicode values, and bytes expressions will hold bytes values.

Note

Disambiguation works the usual way: if several regular expressions match the input, the longest match is chosen. If the matches are of equal length, the first declaration (in source code order) wins.
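The rule can be illustrated with a small standalone sketch built on Python's re module; this is not ptk's actual matching engine, just the longest-match, first-declaration-wins policy in isolation:

```python
import re

PATTERNS = [                      # declaration order matters for ties
    ('keyword', re.compile(r'if|else')),
    ('identifier', re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')),
    ('number', re.compile(r'[0-9]+')),
]

def best_match(text, pos=0):
    """Return the (type, value) pair chosen by the disambiguation rule."""
    best = None
    for name, rx in PATTERNS:
        m = rx.match(text, pos)
        # strict '>' makes the earliest declaration win on equal lengths
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(best_match('ifx'))   # ('identifier', 'ifx') - longest match wins
print(best_match('if'))    # ('keyword', 'if') - tie, first declaration wins
```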

exception ptk.lexer.SkipToken[source]

Raise this from your consumer to ignore the token.

exception ptk.lexer.LexerError(char, pos)[source]

Unrecognized token in input

Variables:
  • lineno – Line in input
  • colno – Column in input

class ptk.lexer.LexerBase[source]

This defines the interface for lexer classes. For concrete implementations, see ProgressiveLexer and ReLexer.

class Token(type, value, position)

type

Alias for field number 0

value

Alias for field number 1

position

Alias for field number 2

position()[source]

Returns: The current position in the stream as a 2-tuple (column, line).
advanceColumn(count=1)[source]

Advances the current position by count columns.

advanceLine(count=1)[source]

Advances the current position by count lines.

static ignore(char)[source]

Override this to ignore characters in input stream. The default is to ignore spaces and tabs.

Parameters: char – The character to test
Returns: True if char should be ignored
setConsumer(consumer)[source]

Sets the current consumer. A consumer is an object with a feed method; all characters seen on the input stream after the consumer is set are passed directly to it. When the feed method returns a 2-tuple (type, value), the corresponding token is generated and the consumer is reset to None. This can be handy for tokens that are not easily recognized by a regular expression but are easy to recognize in code; for instance, the following lexer recognizes C strings without having to use negative lookahead:

import io
from ptk.lexer import ReLexer, token

class MyLexer(ReLexer):
    @token('"')
    def cstring(self, tok):
        class CString(object):
            def __init__(self):
                self.state = 0
                self.value = io.StringIO()
            def feed(self, char):
                if self.state == 0:
                    if char == '"':
                        return 'cstring', self.value.getvalue()
                    if char == '\\':
                        self.state = 1
                    else:
                        self.value.write(char)
                elif self.state == 1:
                    self.value.write(char)
                    self.state = 0
        self.setConsumer(CString())

You can also raise SkipToken instead of returning a token if it should be ignored (e.g. comments).
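A consumer that skips line comments can be sketched in isolation; SkipToken below is a local stand-in for ptk.lexer.SkipToken, and the feed protocol mirrors the one described above:

```python
class SkipToken(Exception):
    """Local stand-in for ptk.lexer.SkipToken."""
    pass

class LineComment:
    """Consumes characters up to the end of line, then skips the token."""
    def feed(self, char):
        if char == '\n':
            raise SkipToken()      # comment finished: produce no token
        # any other character: keep consuming, return nothing

consumer = LineComment()
for char in ' this is a comment':
    assert consumer.feed(char) is None    # still consuming
try:
    consumer.feed('\n')
except SkipToken:
    print('comment skipped')
```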

parse(string)[source]

Parses the whole string

newToken(tok)[source]

This method will be invoked as soon as a token is recognized on input.

Parameters: tok – The token. This is a named tuple with type and value attributes.
classmethod tokenTypes()[source]

Returns: The set of all token names, as strings.
class ptk.lexer.ReLexer[source]

Concrete lexer based on Python regular expressions. This is way faster than ProgressiveLexer, but it can only tokenize whole strings.

parse(string)[source]

Parses the whole string

class ptk.lexer.ProgressiveLexer[source]

Concrete lexer based on a simple pure-Python regular expression engine. This lexer is able to tokenize an input stream in a progressive fashion; just call the ProgressiveLexer.feed() method with whatever bytes are available when they’re available. Useful for asynchronous contexts. Starting with Python 3.5 there is also an asynchronous version, see AsyncLexer.

This is slow as hell.

parse(string)[source]

Parses the whole string

feed(char)[source]

Handle a single input character. When you’re finished, call this with EOF as argument.
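The progressive calling pattern (push characters as they arrive, terminate with EOF) can be sketched with a stand-in lexer; the toy class below splits on spaces and is not ptk's ProgressiveLexer, and EOF here is a local sentinel mirroring ptk.lexer.EOF:

```python
EOF = object()   # stand-in sentinel, mirroring ptk.lexer.EOF

class ToyProgressiveLexer:
    """Splits input into space-separated words, one character at a time."""
    def __init__(self):
        self.buffer = []
        self.tokens = []
    def feed(self, char):
        if char is EOF or char == ' ':
            if self.buffer:                 # flush the pending token
                self.tokens.append(''.join(self.buffer))
                self.buffer = []
        else:
            self.buffer.append(char)

lexer = ToyProgressiveLexer()
for chunk in ('hel', 'lo wo', 'rld'):   # data arriving in arbitrary pieces
    for char in chunk:
        lexer.feed(char)
lexer.feed(EOF)                          # always terminate with EOF
print(lexer.tokens)                      # ['hello', 'world']
```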

ptk.lexer.EOF[source]

This is a singleton used to indicate end of stream. It may be used as a token, a token type and a token value. In the first case it is its own type and value.