PLY (Python Lex-Yacc) — ply 4.0 documentation (2023)

This document provides an overview of lexing and parsing with PLY. Given the intrinsic complexity of parsing, I strongly advise that you read (or at least skim) this entire document before jumping into a big development project with PLY.

PLY-4.0 requires Python 3.6 or newer. If you’re using an older version of Python, you’re out of luck. Sorry.

Introduction

PLY is a pure-Python implementation of the compiler construction tools lex and yacc. The main goal of PLY is to stay fairly faithful to the way in which traditional lex/yacc tools work. This includes supporting LALR(1) parsing as well as providing extensive input validation, error reporting, and diagnostics. Thus, if you’ve used yacc in another programming language, it should be relatively straightforward to use PLY.

Early versions of PLY were developed to support an Introduction to Compilers Course I taught in 2001 at the University of Chicago. Since PLY was primarily developed as an instructional tool, you will find it to be fairly picky about token and grammar rule specification. In part, this added formality is meant to catch common programming mistakes made by novice users. However, advanced users will also find such features to be useful when building complicated grammars for real programming languages. It should also be noted that PLY does not provide much in the way of bells and whistles (e.g., automatic construction of abstract syntax trees, tree traversal, etc.). Nor would I consider it to be a parsing framework. Instead, you will find a bare-bones, yet fully capable lex/yacc implementation written entirely in Python.

The rest of this document assumes that you are somewhat familiar with parsing theory, syntax directed translation, and the use of compiler construction tools such as lex and yacc in other programming languages. If you are unfamiliar with these topics, you will probably want to consult an introductory text such as “Compilers: Principles, Techniques, and Tools”, by Aho, Sethi, and Ullman. O’Reilly’s “Lex and Yacc” by John Levine may also be handy. In fact, the O’Reilly book can be used as a reference for PLY as the concepts are virtually identical.

PLY Overview

PLY consists of two separate modules: lex.py and yacc.py, both of which are found in a Python package called ply. The lex.py module is used to break input text into a collection of tokens specified by a collection of regular expression rules. yacc.py is used to recognize language syntax that has been specified in the form of a context free grammar.

The two tools are meant to work together. Specifically, lex.py provides an interface to produce tokens. yacc.py uses this to retrieve tokens and invoke grammar rules. The output of yacc.py is often an Abstract Syntax Tree (AST). However, this is entirely up to the user. If desired, yacc.py can also be used to implement simple one-pass compilers.

Like its Unix counterpart, yacc.py provides most of the features you expect including extensive error checking, grammar validation, support for empty productions, error tokens, and ambiguity resolution via precedence rules. In fact, almost everything that is possible in traditional yacc should be supported in PLY.

The primary difference between yacc.py and Unix yacc is that yacc.py doesn’t involve a separate code-generation process. Instead, PLY relies on reflection (introspection) to build its lexers and parsers. Unlike traditional lex/yacc which require a special input file that is converted into a separate source file, the specifications given to PLY are valid Python programs. This means that there are no extra source files nor is there a special compiler construction step (e.g., running yacc to generate Python code for the compiler).

Lex

lex.py is used to tokenize an input string. For example, suppose you’re writing a programming language and a user supplied the following input string:

x = 3 + 42 * (s - t)

A tokenizer splits the string into individual tokens:

'x','=', '3', '+', '42', '*', '(', 's', '-', 't', ')'

Tokens are usually given names to indicate what they are. For example:

'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES','LPAREN','ID','MINUS','ID','RPAREN'

More specifically, the input is broken into pairs of token types and values. For example:

('ID','x'), ('EQUALS','='), ('NUMBER','3'), ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'), ('LPAREN','('), ('ID','s'), ('MINUS','-'), ('ID','t'), ('RPAREN',')')

The specification of tokens is done by writing a series of regular expression rules. The next section shows how this is done using lex.py.

Lex Example

The following example shows how lex.py is used to write a simple tokenizer:

# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------
import ply.lex as lex

# List of token names.   This is always required
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

To use the lexer, you first need to feed it some input text using its input() method. After that, repeated calls to token() produce tokens. The following code shows how this works:

# Test it out
data = '''
3 + 4 * 10
  + -20 *2
'''

# Give the lexer some input
lexer.input(data)

# Tokenize
while True:
    tok = lexer.token()
    if not tok:
        break      # No more input
    print(tok)

When executed, the example will produce the following output:

$ python example.py
LexToken(NUMBER,3,2,1)
LexToken(PLUS,'+',2,3)
LexToken(NUMBER,4,2,5)
LexToken(TIMES,'*',2,7)
LexToken(NUMBER,10,2,10)
LexToken(PLUS,'+',3,14)
LexToken(MINUS,'-',3,16)
LexToken(NUMBER,20,3,18)
LexToken(TIMES,'*',3,20)
LexToken(NUMBER,2,3,21)

Lexers also support the iteration protocol. So, you can write the above loop as follows:

for tok in lexer:
    print(tok)

The tokens returned by lexer.token() are instances of LexToken. This object has attributes type, value, lineno, and lexpos. The following code shows an example of accessing these attributes:

# Tokenize
while True:
    tok = lexer.token()
    if not tok:
        break      # No more input
    print(tok.type, tok.value, tok.lineno, tok.lexpos)

The type and value attributes contain the type and value of the token itself. lineno and lexpos contain information about the location of the token. lexpos is the index of the token relative to the start of the input text.

The tokens list

All lexers must provide a list tokens that defines all of the possible token names that can be produced by the lexer. This list is always required and is used to perform a variety of validation checks. The tokens list is also used by the yacc.py module to identify terminals.

In the example, the following code specified the token names:

tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

Specification of tokens

Each token is specified by writing a regular expression rule compatible with Python’s re module. Each of these rules is defined by making declarations with a special prefix t_ to indicate that it defines a token. For simple tokens, the regular expression can be specified as strings such as this (note: Python raw strings are used since they are the most convenient way to write regular expression strings):

t_PLUS = r'\+'

In this case, the name following the t_ must exactly match one of the names supplied in tokens. If some kind of action needs to be performed, a token rule can be specified as a function. For example, this rule matches numbers and converts the string into a Python integer:

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

When a function is used, the regular expression rule is specified in the function documentation string. The function always takes a single argument which is an instance of LexToken. This object has attributes of type which is the token type (as a string), value which is the lexeme (the actual text matched), lineno which is the current line number, and lexpos which is the position of the token relative to the beginning of the input text. By default, type is set to the name following the t_ prefix. The action function can modify the contents of the LexToken object as appropriate. However, when it is done, the resulting token should be returned. If no value is returned by the action function, the token is discarded and the next token read.

Internally, lex.py uses the re module to do its pattern matching. Patterns are compiled using the re.VERBOSE flag which can be used to help readability. However, be aware that unescaped whitespace is ignored and comments are allowed in this mode. If your pattern involves whitespace, make sure you use \s. If you need to match the # character, use [#].
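For instance, here is a small, hypothetical illustration of these two points (the SHELLCOMMENT and RANGE token names are not part of the earlier example and are assumed to be listed in tokens):

t_SHELLCOMMENT = r'[#].*'          # '#' written as [#] so re.VERBOSE doesn't treat it as a comment
t_RANGE        = r'\d+\s-\s\d+'    # \s instead of a literal space, e.g. "1 - 10"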

When building the master regular expression, rules are added in the following order:

  1. All tokens defined by functions are added in the same order as they appear in the lexer file.
  2. Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).

Without this ordering, it can be difficult to correctly match certain types of tokens. For example, if you wanted to have separate tokens for “=” and “==”, you need to make sure that “==” is checked first. By sorting regular expressions in order of decreasing length, this problem is solved for rules defined as strings. For functions, the order can be explicitly controlled since rules appearing first are checked first.
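As a hedged sketch (assuming ASSIGN and EQ are listed in tokens), both of the following rules can safely be written as strings because the longer pattern for “==” is added to the master expression first:

t_ASSIGN = r'='
t_EQ     = r'=='      # Two characters, so it is tried before t_ASSIGN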

To handle reserved words, you should write a single rule to match an identifier and do a special name lookup in a function like this:

reserved = {
   'if' : 'IF',
   'then' : 'THEN',
   'else' : 'ELSE',
   'while' : 'WHILE',
   ...
}

tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t

This approach greatly reduces the number of regular expression rules and is likely to make things a little faster.

Note: You should avoid writing individual rules for reserved words. For example, if you write rules like this:

t_FOR   = r'for'
t_PRINT = r'print'

those rules will be triggered for identifiers that include those words as a prefix such as “forget” or “printed”. This is probably not what you want.

Token values

When tokens are returned by lex, they have a value that is stored in the value attribute. Normally, the value is the text that was matched. However, the value can be assigned to any Python object. For instance, when lexing identifiers, you may want to return both the identifier name and information from some sort of symbol table. To do this, you might write a rule like this:

def t_ID(t):
    ...
    # Look up symbol table information and return a tuple
    t.value = (t.value, symbol_lookup(t.value))
    ...
    return t

It is important to note that storing data in other attribute names is not recommended. The yacc.py module only exposes the contents of the value attribute. Thus, accessing other attributes may be unnecessarily awkward. If you need to store multiple values on a token, assign a tuple, dictionary, or instance to value.

Discarded tokens

To discard a token, such as a comment, define a token rule that returns no value. For example:

def t_COMMENT(t):
    r'\#.*'
    pass
    # No return value. Token discarded

Alternatively, you can include the prefix ignore_ in the token declaration to force a token to be ignored. For example:

t_ignore_COMMENT = r'\#.*'

Be advised that if you are ignoring many different kinds of text, you may still want to use functions since these provide more precise control over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are sorted by regular expression length).
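For example, here is a hedged sketch (the C-style comment syntax and the BLOCKCOMMENT/LINECOMMENT names are illustrative only and assumed to be listed in tokens) in which two discarded patterns are written as functions so that the block-comment rule is always tried before the line-comment rule:

def t_BLOCKCOMMENT(t):
    r'/\*(.|\n)*?\*/'
    pass                  # Checked first because it is defined first

def t_LINECOMMENT(t):
    r'//.*'
    pass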

Line numbers and positional information

By default, lex.py knows nothing about line numbers. This is because lex.py doesn’t know anything about what constitutes a “line” of input (e.g., the newline character or even if the input is textual data). To update this information, you need to write a special rule. In the example, the t_newline() rule shows how to do this:

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

Within the rule, the lineno attribute of the underlying lexer t.lexer is updated. After the line number is updated, the token is discarded since nothing is returned.

lex.py does not perform any kind of automatic column tracking. However, it does record positional information related to each token in the lexpos attribute. Using this, it is usually possible to compute column information as a separate step. For instance, just count backwards until you reach a newline:

# Compute column.
#     input is the input text string
#     token is a token instance
def find_column(input, token):
    line_start = input.rfind('\n', 0, token.lexpos) + 1
    return (token.lexpos - line_start) + 1

Since column information is often only useful in the context of error handling, calculating the column position can be performed when needed as opposed to doing it for each token. Note: If you’re parsing a language where whitespace matters (i.e., Python), it’s probably better to match whitespace as a token instead of ignoring it.
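For instance, a minimal sketch of matching whitespace explicitly (assuming a WS token name has been added to tokens) might look like this:

def t_WS(t):
    r'[ \t]+'
    return t       # Returned instead of ignored, so the parser sees it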

Ignored characters

The special t_ignore rule is reserved by lex.py for characters that should be completely ignored in the input stream. Usually this is used to skip over whitespace and other non-essential characters. Although it is possible to define a regular expression rule for whitespace in a manner similar to t_newline(), the use of t_ignore provides substantially better lexing performance because it is handled as a special case and is checked in a much more efficient manner than the normal regular expression rules.

The characters given in t_ignore are not ignored when such characters are part of other regular expression patterns. For example, if you had a rule to capture quoted text, that pattern can include the ignored characters (which will be captured in the normal way). The main purpose of t_ignore is to ignore whitespace and other padding between the tokens that you actually want to parse.

Literal characters

Literal characters can be specified by defining a variable literals in your lexing module. For example:

literals = [ '+','-','*','/' ]

or alternatively:

literals = "+-*/"

A literal character is a single character that is returned “as is” when encountered by the lexer. Literals are checked after all of the defined regular expression rules. Thus, if a rule starts with one of the literal characters, it will always take precedence.

When a literal token is returned, both its type and value attributes are set to the character itself. For example, '+'.

It’s possible to write token functions that perform additional actions when literals are matched. However, you’ll need to set the token type appropriately. For example:

literals = [ '{', '}' ]

def t_lbrace(t):
    r'\{'
    t.type = '{'      # Set token type to the expected literal
    return t

def t_rbrace(t):
    r'\}'
    t.type = '}'      # Set token type to the expected literal
    return t

Error handling

The t_error() function is used to handle lexing errors that occur when illegal characters are detected. In this case, the t.value attribute contains the rest of the input string that has not been tokenized. In the example, the error function was defined as follows:

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

In this case, we print the offending character and skip ahead one character by calling t.lexer.skip(1).

EOF Handling

The t_eof() function is used to handle an end-of-file (EOF) condition in the input. As input, it receives a token type 'eof' with the lineno and lexpos attributes set appropriately. The main use of this function is to provide more input to the lexer so that it can continue to parse. Here is an example of how this works:

# EOF handling rule
def t_eof(t):
    # Get more input (Example)
    more = input('... ')
    if more:
        t.lexer.input(more)
        return t.lexer.token()
    return None

The EOF function should return the next available token (by calling t.lexer.token()) or None to indicate no more data. Be aware that setting more input with the t.lexer.input() method does NOT reset the lexer state or the lineno attribute used for position tracking. The lexpos attribute is reset so be aware of that if you’re using it in error reporting.

Building and using the lexer

To build the lexer, the function lex.lex() is used. For example:

lexer = lex.lex()

This function uses Python reflection (or introspection) to read the regular expression rules out of the calling context and build the lexer. Once the lexer has been built, two methods can be used to control the lexer.

lexer.input(data). Reset the lexer and store a new input string.

lexer.token(). Return the next token. Returns a special LexToken instance on success or None if the end of the input text has been reached.

The @TOKEN decorator

In some applications, you may want to define tokens as a series of more complex regular expression rules. For example:

digit            = r'([0-9])'
nondigit         = r'([_A-Za-z])'
identifier       = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'

def t_ID(t):
    # want docstring to be identifier above. ?????
    ...

In this case, we want the regular expression rule for ID to be one of the variables above. However, there is no way to directly specify this using a normal documentation string. To solve this problem, you can use the @TOKEN decorator. For example:

from ply.lex import TOKEN

@TOKEN(identifier)
def t_ID(t):
    ...

This will attach identifier to the docstring for t_ID() allowing lex.py to work normally. Naturally, you could use @TOKEN on all functions as an alternative to using docstrings.
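For instance, the earlier NUMBER rule could also be written with @TOKEN and a literal pattern instead of a docstring (a small sketch, equivalent to the docstring version):

from ply.lex import TOKEN

@TOKEN(r'\d+')
def t_NUMBER(t):
    t.value = int(t.value)
    return t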

Debugging

For the purpose of debugging, you can run lex() in a debugging mode as follows:

lexer = lex.lex(debug=True)

This will produce various sorts of debugging information including all of the added rules, the master regular expressions used by the lexer, and tokens generated during lexing.

In addition, lex.py comes with a simple main function which will either tokenize input read from standard input or from a file specified on the command line. To use it, put this in your lexer:

if __name__ == '__main__':
    lex.runmain()

Please refer to the “Debugging” section near the end for some more advanced details of debugging.

Alternative specification of lexers

As shown in the example, lexers are specified all within one Python module. If you want to put token rules in a different module from the one in which you invoke lex(), use the module keyword argument.

For example, you might have a dedicated module that just contains the token rules:

# module: tokrules.py
# This module just contains the lexing rules

# List of token names.   This is always required
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

Now, if you wanted to build a tokenizer from these rules from within a different module, you would do the following (shown for Python interactive mode):

>>> import tokrules
>>> lexer = lex.lex(module=tokrules)
>>> lexer.input("3 + 4")
>>> lexer.token()
LexToken(NUMBER,3,1,0)
>>> lexer.token()
LexToken(PLUS,'+',1,2)
>>> lexer.token()
LexToken(NUMBER,4,1,4)
>>> lexer.token()
None
>>>

The module option can also be used to define lexers from instances of a class. For example:

import ply.lex as lex

class MyLexer(object):
    # List of token names.   This is always required
    tokens = (
       'NUMBER',
       'PLUS',
       'MINUS',
       'TIMES',
       'DIVIDE',
       'LPAREN',
       'RPAREN',
    )

    # Regular expression rules for simple tokens
    t_PLUS    = r'\+'
    t_MINUS   = r'-'
    t_TIMES   = r'\*'
    t_DIVIDE  = r'/'
    t_LPAREN  = r'\('
    t_RPAREN  = r'\)'

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_NUMBER(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore  = ' \t'

    # Error handling rule
    def t_error(self,t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(module=self, **kwargs)

    # Test it out
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print(tok)

# Build the lexer and try it out
m = MyLexer()
m.build()           # Build the lexer
m.test("3 + 4")     # Test it

When building a lexer from a class, you should construct the lexer from an instance of the class, not the class object itself. This is because PLY only works properly if the lexer actions are defined by bound methods.

When using the module option to lex(), PLY collects symbols from the underlying object using the dir() function. There is no direct access to the __dict__ attribute of the object supplied as a module value.

Finally, if you want to keep things nicely encapsulated, but don’t want to use a full-fledged class definition, lexers can be defined using closures. For example:

import ply.lex as lex

# List of token names.   This is always required
tokens = (
  'NUMBER',
  'PLUS',
  'MINUS',
  'TIMES',
  'DIVIDE',
  'LPAREN',
  'RPAREN',
)

def MyLexer():
    # Regular expression rules for simple tokens
    t_PLUS    = r'\+'
    t_MINUS   = r'-'
    t_TIMES   = r'\*'
    t_DIVIDE  = r'/'
    t_LPAREN  = r'\('
    t_RPAREN  = r'\)'

    # A regular expression rule with some action code
    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(t):
        r'\n+'
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore  = ' \t'

    # Error handling rule
    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    # Build the lexer from my environment and return it
    return lex.lex()

Important note: If you are defining a lexer using a class or closure, be aware that PLY still requires you to only define a single lexer per module (source file). There are extensive validation/error-checking parts of PLY that may falsely report error messages if you don’t follow this rule.

Maintaining state

In your lexer, you may want to maintain a variety of state information. This might include mode settings, symbol tables, and other details. As an example, suppose that you wanted to keep track of how many NUMBER tokens had been encountered.

One way to do this is to keep a set of global variables in the module where you created the lexer. For example:

num_count = 0

def t_NUMBER(t):
    r'\d+'
    global num_count
    num_count += 1
    t.value = int(t.value)
    return t

If you don’t like the use of a global variable, another place to store information is inside the Lexer object created by lex(). To do this, you can use the lexer attribute of tokens passed to the various rules. For example:

def t_NUMBER(t):
    r'\d+'
    t.lexer.num_count += 1     # Note use of lexer attribute
    t.value = int(t.value)
    return t

lexer = lex.lex()
lexer.num_count = 0            # Set the initial count

This latter approach has the advantage of being simple and working correctly in applications where multiple instantiations of a given lexer exist in the same application. However, this might also feel like a gross violation of encapsulation to OO purists. Just to put your mind at some ease, all internal attributes of the lexer (with the exception of lineno) have names that are prefixed by lex (e.g., lexdata, lexpos, etc.). Thus, it is perfectly safe to store attributes in the lexer that don’t have names starting with that prefix or a name that conflicts with one of the predefined methods (e.g., input(), token(), etc.).

If you don’t like assigning values on the lexer object, you can define your lexer as a class as shown in the previous section:

class MyLexer:
    ...
    def t_NUMBER(self,t):
        r'\d+'
        self.num_count += 1
        t.value = int(t.value)
        return t

    def build(self, **kwargs):
        self.lexer = lex.lex(object=self,**kwargs)

    def __init__(self):
        self.num_count = 0

The class approach may be the easiest to manage if your application is going to be creating multiple instances of the same lexer and you need to manage a lot of state.

State can also be managed through closures. For example:

def MyLexer():
    num_count = 0
    ...
    def t_NUMBER(t):
        r'\d+'
        nonlocal num_count
        num_count += 1
        t.value = int(t.value)
        return t
    ...

Lexer cloning

If necessary, a lexer object can be duplicated by invoking its clone() method. For example:

lexer = lex.lex()
...
newlexer = lexer.clone()

When a lexer is cloned, the copy is exactly identical to the original lexer including any input text and internal state. However, the clone allows a different set of input text to be supplied which may be processed separately. This may be useful in situations when you are writing a parser/compiler that involves recursive or reentrant processing. For instance, if you needed to scan ahead in the input for some reason, you could create a clone and use it to look ahead. Or, if you were implementing some kind of preprocessor, cloned lexers could be used to handle different input files.

Creating a clone is different than calling lex.lex() in that PLY doesn’t regenerate any of the internal tables or regular expressions.
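As a rough sketch of the look-ahead idea mentioned above (using the calclex rules from earlier), a clone can be advanced without disturbing the original lexer:

lexer = lex.lex()
lexer.input("3 + 4 * (10 - 20)")

peek = lexer.clone()                            # Same rules, same position
upcoming = [peek.token() for _ in range(3)]     # Look at the next three tokens
# 'lexer' has not consumed anything and continues from its original position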

Special considerations need to be made when cloning lexers that also maintain their own internal state using classes or closures. Namely, you need to be aware that the newly created lexers will share all of this state with the original lexer. For example, if you defined a lexer as a class and did this:

m = MyLexer()
a = lex.lex(object=m)      # Create a lexer
b = a.clone()              # Clone the lexer

Then both a and b are going to be bound to the same object m and any changes to m will be reflected in both lexers. It’s important to emphasize that clone() is only meant to create a new lexer that reuses the regular expressions and environment of another lexer. If you need to make a totally new copy of a lexer, then call lex() again.

Internal lexer state

A Lexer object lexer has a number of internal attributes that may be useful in certain situations.

lexer.lexpos
This attribute is an integer that contains the current position within the input text. If you modify the value, it will change the result of the next call to token(). Within token rule functions, this points to the first character after the matched text. If the value is modified within a rule, the next returned token will be matched at the new position.
lexer.lineno
The current value of the line number attribute stored in the lexer. PLY only specifies that the attribute exists—it never sets, updates, or performs any processing with it. If you want to track line numbers, you will need to add code yourself (see the section on line numbers and positional information).
lexer.lexdata
The current input text stored in the lexer. This is the string passed with the input() method. It would probably be a bad idea to modify this unless you really know what you’re doing.
lexer.lexmatch
This is the raw Match object returned by the Python re.match() function (used internally by PLY) for the current token. If you have written a regular expression that contains named groups, you can use this to retrieve those values. Note: This attribute is only updated when tokens are defined and processed by functions.
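For example, a hypothetical rule with a named group could fish the group back out through t.lexer.lexmatch (the ASSIGNED token name and its pattern are illustrative only and assumed to be listed in tokens):

def t_ASSIGNED(t):
    r'(?P<name>[a-zA-Z_]\w*)\s*='
    # lexmatch is the underlying re match object for this token
    t.value = t.lexer.lexmatch.group('name')
    return t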

Conditional lexing and start conditions

In advanced parsing applications, it may be useful to have different lexing states. For instance, you may want the occurrence of a certain token or syntactic construct to trigger a different kind of lexing. PLY supports a feature that allows the underlying lexer to be put into a series of different states. Each state can have its own tokens, lexing rules, and so forth. The implementation is based largely on the “start condition” feature of GNU flex. Details of this can be found at http://flex.sourceforge.net/manual/Start-Conditions.html

To define a new lexing state, it must first be declared. This is done by including a “states” declaration in your lex file. For example:

states = (
   ('foo','exclusive'),
   ('bar','inclusive'),
)

This declaration declares two states, 'foo' and 'bar'. States may be of two types: 'exclusive' and 'inclusive'. An exclusive state completely overrides the default behavior of the lexer. That is, lex will only return tokens and apply rules defined specifically for that state. An inclusive state adds additional tokens and rules to the default set of rules. Thus, lex will return both the tokens defined by default in addition to those defined for the inclusive state.

Once a state has been declared, tokens and rules are declared by including the state name in the token/rule declaration. For example:

t_foo_NUMBER = r'\d+'                      # Token 'NUMBER' in state 'foo'
t_bar_ID     = r'[a-zA-Z_][a-zA-Z0-9_]*'   # Token 'ID' in state 'bar'

def t_foo_newline(t):
    r'\n'
    t.lexer.lineno += 1

A token can be declared in multiple states by including multiple state names in the declaration. For example:

t_foo_bar_NUMBER = r'\d+' # Defines token 'NUMBER' in both state 'foo' and 'bar'

Alternatively, a token can be declared in all states by using 'ANY' in the name:

t_ANY_NUMBER = r'\d+' # Defines a token 'NUMBER' in all states

If no state name is supplied, as is normally the case, the token is associated with a special state 'INITIAL'. For example, these two declarations are identical:

t_NUMBER = r'\d+'
t_INITIAL_NUMBER = r'\d+'

States are also associated with the special t_ignore, t_error(), and t_eof() declarations. For example, if a state treats these differently, you can declare:

t_foo_ignore = " \t\n" # Ignored characters for state 'foo'def t_bar_error(t): # Special error handler for state 'bar' pass

By default, lexing operates in the 'INITIAL' state. This state includes all of the normally defined tokens. For users who aren’t using different states, this fact is completely transparent. If, during lexing or parsing, you want to change the lexing state, use the begin() method. For example:

def t_begin_foo(t):
    r'start_foo'
    t.lexer.begin('foo')             # Starts 'foo' state

To get out of a state, you use begin() to switch back to the initial state. For example:

def t_foo_end(t):
    r'end_foo'
    t.lexer.begin('INITIAL')         # Back to the initial state

The management of states can also be done with a stack. For example:

def t_begin_foo(t):
    r'start_foo'
    t.lexer.push_state('foo')        # Starts 'foo' state

def t_foo_end(t):
    r'end_foo'
    t.lexer.pop_state()              # Back to the previous state

The use of a stack would be useful in situations where there are many ways of entering a new lexing state and you merely want to go back to the previous state afterwards.

An example might help clarify. Suppose you were writing a parser and you wanted to grab sections of arbitrary C code enclosed by curly braces. That is, whenever you encounter a starting brace '{', you want to read all of the enclosed code up to the ending brace '}' and return it as a string. Doing this with a normal regular expression rule is nearly (if not actually) impossible. This is because braces can be nested and can be included in comments and strings. Thus, matching up to the first matching '}' character isn’t good enough. Here is how you might use lexer states to do this:

# Declare the state
states = (
  ('ccode','exclusive'),
)

# Match the first {. Enter ccode state.
def t_ccode(t):
    r'\{'
    t.lexer.code_start = t.lexer.lexpos        # Record the starting position
    t.lexer.level = 1                          # Initial brace level
    t.lexer.begin('ccode')                     # Enter 'ccode' state

# Rules for the ccode state
def t_ccode_lbrace(t):
    r'\{'
    t.lexer.level +=1

def t_ccode_rbrace(t):
    r'\}'
    t.lexer.level -=1

    # If closing brace, return the code fragment
    if t.lexer.level == 0:
         t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
         t.type = "CCODE"
         t.lexer.lineno += t.value.count('\n')
         t.lexer.begin('INITIAL')
         return t

# C or C++ comment (ignore)
def t_ccode_comment(t):
    r'(/\*(.|\n)*?\*/)|(//.*)'
    pass

# C string
def t_ccode_string(t):
    r'\"([^\\\n]|(\\.))*?\"'

# C character literal
def t_ccode_char(t):
    r'\'([^\\\n]|(\\.))*?\''

# Any sequence of non-whitespace characters (not braces, strings)
def t_ccode_nonspace(t):
    r'[^\s\{\}\'\"]+'

# Ignored characters (whitespace)
t_ccode_ignore = " \t\n"

# For bad characters, we just skip over it
def t_ccode_error(t):
    t.lexer.skip(1)

In this example, the occurrence of the first ‘{’ causes the lexer to record the starting position and enter a new state 'ccode'. A collection of rules then match various parts of the input that follow (comments, strings, etc.). All of these rules merely discard the token (by not returning a value). However, if the closing right brace is encountered, the rule t_ccode_rbrace collects all of the code (using the earlier recorded starting position), stores it, and returns a token ‘CCODE’ containing all of that text. When returning the token, the lexing state is restored back to its initial state.

Miscellaneous Issues

  • The lexer requires input to be supplied as a single input string. Since most machines have more than enough memory, this rarely presents a performance concern. However, it means that the lexer currently can’t be used with streaming data such as open files or sockets. This limitation is primarily a side-effect of using the re module. You might be able to work around this by implementing an appropriate def t_eof() end-of-file handling rule. The main complication here is that you’ll probably need to ensure that data is fed to the lexer in a way so that it doesn’t split in the middle of a token.

  • If you need to supply optional flags to the re.compile() function, use the reflags option to lex. For example:

    lex.lex(reflags=re.UNICODE | re.VERBOSE)

    Note: by default, reflags is set to re.VERBOSE. If you provide your own flags, you may need to include this for PLY to preserve its normal behavior.

  • If you are going to create a hand-written lexer and you plan to use it with yacc.py, it only needs to conform to the following requirements (a minimal sketch follows the list):

    1. It must provide a token() method that returns the next token or None if no more tokens are available.
    2. The token() method must return an object tok that has type and value attributes. If line number tracking is being used, then the token should also define a lineno attribute.
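Here is a minimal, hypothetical sketch of such a hand-written lexer; the token names and the whitespace-splitting “scanner” are purely illustrative:

class Tok:
    def __init__(self, type, value, lineno=1):
        self.type = type
        self.value = value
        self.lineno = lineno

class HandLexer:
    def __init__(self, data):
        self.words = iter(data.split())   # A real lexer would scan properly

    def token(self):
        word = next(self.words, None)
        if word is None:
            return None                   # No more tokens
        if word.isdigit():
            return Tok('NUMBER', int(word))
        return Tok('NAME', word)

An instance of such a lexer could then be handed to the parser with something like parser.parse(data, lexer=HandLexer(data)).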

Parsing basics

yacc.py is used to parse language syntax. Before showing an example, there are a few important bits of background that must be mentioned. First, syntax is usually specified in terms of a BNF grammar. For example, if you wanted to parse simple arithmetic expressions, you might first write an unambiguous grammar specification like this:

expression : expression + term
           | expression - term
           | term

term       : term * factor
           | term / factor
           | factor

factor     : NUMBER
           | ( expression )

In the grammar, symbols such as NUMBER, +, -, *, and / are known as terminals and correspond to input tokens. Identifiers such as term and factor refer to grammar rules comprised of a collection of terminals and other rules. These identifiers are known as non-terminals.

The semantic behavior of a language is often specified using a technique known as syntax directed translation. In syntax directed translation, attributes are attached to each symbol in a given grammar rule along with an action. Whenever a particular grammar rule is recognized, the action describes what to do. For example, given the expression grammar above, you might write the specification for a simple calculator like this:

Grammar                             Action
--------------------------------    --------------------------------------------
expression0 : expression1 + term    expression0.val = expression1.val + term.val
            | expression1 - term    expression0.val = expression1.val - term.val
            | term                  expression0.val = term.val

term0       : term1 * factor        term0.val = term1.val * factor.val
            | term1 / factor        term0.val = term1.val / factor.val
            | factor                term0.val = factor.val

factor      : NUMBER                factor.val = int(NUMBER.lexval)
            | ( expression )        factor.val = expression.val

A good way to think about syntax directed translation is to view each symbol in the grammar as a kind of object. Associated with each symbol is a value representing its “state” (for example, the val attribute above). Semantic actions are then expressed as a collection of functions or methods that operate on the symbols and associated values.

Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a bottom up technique that tries to recognize the right-hand-side of various grammar rules. Whenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the grammar symbols are replaced by the grammar symbol on the left-hand-side.

LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next input token for patterns that match one of the grammar rules. The details of the algorithm can be found in a compiler textbook, but the following example illustrates the steps that are performed if you wanted to parse the expression 3 + 5 * (10 - 20) using the grammar defined above. In the example, the special symbol $ represents the end of input:

Step Symbol Stack                   Input Tokens             Action
---- ------------------------------ -----------------------  -------------------------------
1                                   3 + 5 * ( 10 - 20 )$     Shift 3
2    3                              + 5 * ( 10 - 20 )$       Reduce factor : NUMBER
3    factor                         + 5 * ( 10 - 20 )$       Reduce term : factor
4    term                           + 5 * ( 10 - 20 )$       Reduce expr : term
5    expr                           + 5 * ( 10 - 20 )$       Shift +
6    expr +                         5 * ( 10 - 20 )$         Shift 5
7    expr + 5                       * ( 10 - 20 )$           Reduce factor : NUMBER
8    expr + factor                  * ( 10 - 20 )$           Reduce term : factor
9    expr + term                    * ( 10 - 20 )$           Shift *
10   expr + term *                  ( 10 - 20 )$             Shift (
11   expr + term * (                10 - 20 )$               Shift 10
12   expr + term * ( 10             - 20 )$                  Reduce factor : NUMBER
13   expr + term * ( factor         - 20 )$                  Reduce term : factor
14   expr + term * ( term           - 20 )$                  Reduce expr : term
15   expr + term * ( expr           - 20 )$                  Shift -
16   expr + term * ( expr -         20 )$                    Shift 20
17   expr + term * ( expr - 20      )$                       Reduce factor : NUMBER
18   expr + term * ( expr - factor  )$                       Reduce term : factor
19   expr + term * ( expr - term    )$                       Reduce expr : expr - term
20   expr + term * ( expr           )$                       Shift )
21   expr + term * ( expr )         $                        Reduce factor : (expr)
22   expr + term * factor           $                        Reduce term : term * factor
23   expr + term                    $                        Reduce expr : expr + term
24   expr                           $                        Reduce expr
25                                  $                        Success!

When parsing the expression, an underlying state machine and the current input token determine what happens next. If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted onto the stack. If the top of the stack contains a valid right-hand-side of a grammar rule, it is usually “reduced” and the symbols replaced with the symbol on the left-hand-side. When this reduction occurs, the appropriate action is triggered (if defined). If the input token can’t be shifted and the top of stack doesn’t match any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out). A parse is only successful if the parser reaches a state where the symbol stack is empty and there are no more input tokens.

It is important to note that the underlying implementation is built around a large finite-state machine that is encoded in a collection of tables. The construction of these tables is non-trivial and beyond the scope of this discussion. However, subtle details of this process explain why, in the example above, the parser chooses to shift a token onto the stack in step 9 rather than reducing the rule expr : expr + term.

Yacc

The ply.yacc module implements the parsing component of PLY. The name “yacc” stands for “Yet Another Compiler Compiler” and is borrowed from the Unix tool of the same name.

An example

Suppose you wanted to make a grammar for simple arithmetic expressions as previously described. Here is how you would do it with yacc.py:

# Yacc example

import ply.yacc as yacc

# Get the token map from the lexer.  This is required.
from calclex import tokens

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]

def p_term_factor(p):
    'term : factor'
    p[0] = p[1]

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]

def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]

# Error rule for syntax errors
def p_error(p):
    print("Syntax error in input!")

# Build the parser
parser = yacc.yacc()

while True:
   try:
       s = input('calc > ')
   except EOFError:
       break
   if not s:
       continue
   result = parser.parse(s)
   print(result)

In this example, each grammar rule is defined by a Python function where the docstring to that function contains the appropriate context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument p that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of p[i] are mapped to grammar symbols as shown here:

def p_expression_plus(p):
    'expression : expression PLUS term'
    #   ^            ^        ^    ^
    #  p[0]         p[1]     p[2] p[3]
    p[0] = p[1] + p[3]

For tokens, the “value” of the corresponding p[i] is the same as the value attribute assigned to the token in the lexer module. For non-terminals, the value is determined by whatever is placed in p[0] when rules are reduced. This value can be anything at all. However, it is probably most common for the value to be a simple Python type, a tuple, or an instance. In this example, we are relying on the fact that the NUMBER token stores an integer value in its value field. All of the other rules perform various types of integer operations and propagate the result.

Note: The use of negative indices has a special meaning in yacc. Specifically, p[-1] does not have the same value as p[3] in this example. Please see the section on “Embedded Actions” for further details.

The first rule defined in the yacc specification determines the starting grammar symbol (in this case, a rule for expression appears first). Whenever the starting rule is reduced by the parser and no more input is available, parsing stops and the final value is returned (this value will be whatever the top-most rule placed in p[0]). Note: an alternative starting symbol can be specified using the start keyword argument to yacc().

The p_error(p) rule is defined to catch syntax errors. See the error handling section below for more detail.

To build the parser, call the yacc.yacc() function. This function looks at the module and attempts to construct all of the LR parsing tables for the grammar you have specified.

If any errors are detected in your grammar specification, yacc.py will produce diagnostic messages and possibly raise an exception. Some of the errors that can be detected include:

  • Duplicated function names (if more than one rule function has the same name in the grammar file).
  • Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
  • Badly specified grammar rules.
  • Infinite recursion (rules that can never terminate).
  • Unused rules and tokens
  • Undefined rules and tokens

The next few sections discuss grammar specification in more detail.

The final part of the example shows how to actually run the parser created by yacc(). To run the parser, you have to call the parse() method with a string of input text. This will run all of the grammar rules and return the result of the entire parse. This result is the value assigned to p[0] in the starting grammar rule.

Combining Grammar Rule Functions

When grammar rules are similar, they can be combined into a single function. For example, consider the two rules in our earlier example:

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

Instead of writing two functions, you might write a single function like this:

def p_expression(p):
    '''expression : expression PLUS term
                  | expression MINUS term'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]

In general, the docstring for any given function can contain multiple grammar rules. So, it would have also been legal (although possibly confusing) to write this:

def p_binary_operators(p):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

When combining grammar rules into a single function, it is usually a good idea for all of the rules to have a similar structure (e.g., the same number of terms). Otherwise, the corresponding action code may be more complicated than necessary. However, it is possible to handle simple cases using len(). For example:

def p_expressions(p):
    '''expression : expression MINUS expression
                  | MINUS expression'''
    if (len(p) == 4):
        p[0] = p[1] - p[3]
    elif (len(p) == 3):
        p[0] = -p[2]

If parsing performance is a concern, you should resist the urge to put too much conditional processing into a single grammar rule as shown in these examples. When you add checks to see which grammar rule is being handled, you are actually duplicating the work that the parser has already performed (i.e., the parser already knows exactly what rule it matched). You can eliminate this overhead by using a separate p_rule() function for each grammar rule.
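For instance, the combined term rules above can be split back into one function per production, exactly as in the first yacc example, so that no runtime check of p[2] is needed:

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]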

Character Literals

If desired, a grammar may contain tokens defined as single character literals. For example:

def p_binary_operators(p):
    '''expression : expression '+' term
                  | expression '-' term
       term       : term '*' factor
                  | term '/' factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

A character literal must be enclosed in quotes such as '+'. In addition, if literals are used, they must be declared in the corresponding lex file through the use of a special literals declaration:

# Literals.  Should be placed in module given to lex()
literals = ['+','-','*','/' ]

Character literals are limited to a single character. Thus, it is not legal to specify literals such as '<=' or '=='. For this, use the normal lexing rules (e.g., define a rule such as t_EQ = r'==').
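For instance, multi-character operators are declared as ordinary tokens in the lexer (the EQ and LE names are assumed to be listed in tokens):

t_EQ = r'=='
t_LE = r'<='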

Empty Productions

yacc.py can handle empty productions by defining a rule like this:

def p_empty(p):
    'empty :'
    pass

Now to use the empty production, use ‘empty’ as a symbol. For example:

def p_optitem(p):
    '''optitem : item
               | empty'''
    ...

Note: You can write empty rules anywhere by specifying an empty right hand side. However, I personally find that writing an “empty” rule and using “empty” to denote an empty production is easier to read and more clearly states your intentions.

Changing the starting symbol

Normally, the first rule found in a yacc specification defines the starting grammar rule (top level rule). To change this, supply a start specifier in your file. For example:

start = 'foo'

def p_bar(p):
    'bar : A B'

# This is the starting rule due to the start specifier above
def p_foo(p):
    'foo : bar X'
...

The use of a start specifier may be useful during debugging since you can use it to have yacc build a subset of a larger grammar. For this purpose, it is also possible to specify a starting symbol as an argument to yacc(). For example:

parser = yacc.yacc(start='foo')

Dealing With Ambiguous Grammars

The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. However, in many situations, it is extremely difficult or awkward to write grammars in this format. A much more natural way to express the grammar is in a more compact form like this:

expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER

Unfortunately, this grammar specification is ambiguous. For example, if you are parsing the string “3 * 4 + 5”, there is no way to tell how the operators are supposed to be grouped. For example, does the expression mean “(3 * 4) + 5” or is it “3 * (4+5)”?

When an ambiguous grammar is given to yacc.py it will print messages about “shift/reduce conflicts” or “reduce/reduce conflicts”. A shift/reduce conflict is caused when the parser generator can’t decide whether or not to reduce a rule or shift a symbol on the parsing stack. For example, consider the string “3 * 4 + 5” and the internal parsing stack:

Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1    $                      3 * 4 + 5$              Shift 3
2    $ 3                    * 4 + 5$                Reduce : expression : NUMBER
3    $ expr                 * 4 + 5$                Shift *
4    $ expr *               4 + 5$                  Shift 4
5    $ expr * 4             + 5$                    Reduce: expression : NUMBER
6    $ expr * expr          + 5$                    SHIFT/REDUCE CONFLICT ????

In this case, when the parser reaches step 6, it has two options. One is to reduce the rule expr : expr * expr on the stack. The other option is to shift the token + on the stack. Both options are perfectly legal from the rules of the context-free grammar.

By default, all shift/reduce conflicts are resolved in favor of shifting. Therefore, in the above example, the parser will always shift the + instead of reducing. Although this strategy works in many cases (for example, the case of “if-then” versus “if-then-else”), it is not enough for arithmetic expressions. In fact, in the above example, the decision to shift + is completely wrong—we should have reduced expr * expr since multiplication has higher mathematical precedence than addition.

To resolve ambiguity, especially in expression grammars, yacc.py allows individual tokens to be assigned a precedence level and associativity. This is done by adding a variable precedence to the grammar file like this:

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)

This declaration specifies that PLUS/MINUS have the same precedence level and are left-associative and that TIMES/DIVIDE have the same precedence and are left-associative. Within the precedence declaration, tokens are ordered from lowest to highest precedence. Thus, this declaration specifies that TIMES/DIVIDE have higher precedence than PLUS/MINUS (since they appear later in the precedence specification).

The precedence specification works by associating a numerical precedence level value and associativity direction to the listed tokens. For example, in the above example you get:

PLUS      : level = 1,  assoc = 'left'
MINUS     : level = 1,  assoc = 'left'
TIMES     : level = 2,  assoc = 'left'
DIVIDE    : level = 2,  assoc = 'left'

These values are then used to attach a numerical precedence value and associativity direction to each grammar rule. This is always determined by looking at the precedence of the right-most terminal symbol. For example:

expression : expression PLUS expression        # level = 1, left
           | expression MINUS expression       # level = 1, left
           | expression TIMES expression       # level = 2, left
           | expression DIVIDE expression      # level = 2, left
           | LPAREN expression RPAREN          # level = None (not specified)
           | NUMBER                            # level = None (not specified)

When shift/reduce conflicts are encountered, the parser generator resolves the conflict by looking at the precedence rules and associativity specifiers.

  1. If the current token has higher precedence than the rule on the stack, it is shifted.
  2. If the grammar rule on the stack has higher precedence, the rule is reduced.
  3. If the current token and the grammar rule have the same precedence, the rule is reduced for left associativity, whereas the token is shifted for right associativity.
  4. If nothing is known about the precedence, shift/reduce conflicts are resolved in favor of shifting (the default).

For example, if “expression PLUS expression” has been parsed and the next token is “TIMES”, the action is going to be a shift because “TIMES” has a higher precedence level than “PLUS”. On the other hand, if “expression TIMES expression” has been parsed and the next token is “PLUS”, the action is going to be reduce because “PLUS” has a lower precedence than “TIMES.”

When shift/reduce conflicts are resolved using the first three techniques (with the help of precedence rules), yacc.py will report no errors or conflicts in the grammar (although it will print some information in the parser.out debugging file).

One problem with the precedence specifier technique is that it is sometimes necessary to change the precedence of an operator in certain contexts. For example, consider a unary-minus operator in “3 + 4 * -5”. Mathematically, the unary minus is normally given a very high precedence, being evaluated before the multiply. However, in our precedence specifier, MINUS has a lower precedence than TIMES. To deal with this, precedence rules can be given for so-called “fictitious tokens” like this:

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)

Now, in the grammar file, we can write our unary minus rule like this:

def p_expr_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]

In this case, %prec UMINUS overrides the default rule precedence, setting it to that of UMINUS in the precedence specifier.

At first, the use of UMINUS in this example may appear very confusing. UMINUS is not an input token or a grammar rule. Instead, you should think of it as the name of a special marker in the precedence table. When you use the %prec qualifier, you’re telling yacc that you want the precedence of the expression to be the same as for this special marker instead of the usual precedence.

It is also possible to specify non-associativity in the precedence table. This would be used when you don’t want operations to chain together. For example, suppose you wanted to support comparison operators like < and > but you didn’t want to allow combinations like a < b < c. To do this, specify a rule like this:

precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)

If you do this, the occurrence of input text such as a < b < c will result in a syntax error. However, simple expressions such as a < b will still be fine.

Reduce/reduce conflicts are caused when there are multiple grammar rules that can be applied to a given set of symbols. This kind of conflict is almost always bad and is always resolved by picking the rule that appears first in the grammar file. Reduce/reduce conflicts are almost always caused when different sets of grammar rules somehow generate the same set of symbols. For example:

assignment : ID EQUALS NUMBER
           | ID EQUALS expression

expression : expression PLUS expression
           | expression MINUS expression
           | expression TIMES expression
           | expression DIVIDE expression
           | LPAREN expression RPAREN
           | NUMBER

In this case, a reduce/reduce conflict exists between these two rules:

assignment : ID EQUALS NUMBER
expression : NUMBER

For example, if you wrote “a = 5”, the parser can’t figure out if this is supposed to be reduced as assignment : ID EQUALS NUMBER or whether it’s supposed to reduce the 5 as an expression and then reduce the rule assignment : ID EQUALS expression.
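One common way to eliminate this particular conflict (a suggestion, not something the original grammar does) is to drop the specialized production so that a number on the right-hand side is always reduced through expression. In rule-function form that is simply:

def p_assignment(p):
    'assignment : ID EQUALS expression'
    p[0] = ('assign', p[1], p[3])     # Illustrative action only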

It should be noted that reduce/reduce conflicts are notoriously difficult to spot looking at the input grammar. When a reduce/reduce conflict occurs, yacc() will try to help by printing a warning message such as this:

WARNING: 1 reduce/reduce conflict
WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER)
WARNING: rejected rule (expression -> NUMBER)

This message identifies the two rules that are in conflict. However, it may not tell you how the parser arrived at such a state. To try and figure it out, you’ll probably have to look at your grammar and the contents of the parser.out debugging file with an appropriately high level of caffeination.

The parser.out file

Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR parsing algorithm. To assist in debugging, yacc.py can create a debugging file called ‘parser.out’. To create this file, use yacc.yacc(debug=True). The contents of this file look like the following:

Unused terminals:

Grammar

Rule 1     expression -> expression PLUS expression
Rule 2     expression -> expression MINUS expression
Rule 3     expression -> expression TIMES expression
Rule 4     expression -> expression DIVIDE expression
Rule 5     expression -> NUMBER
Rule 6     expression -> LPAREN expression RPAREN

Terminals, with rules where they appear

TIMES      : 3
error      :
MINUS      : 2
RPAREN     : 6
LPAREN     : 6
DIVIDE     : 4
PLUS       : 1
NUMBER     : 5

Nonterminals, with rules where they appear

expression : 1 1 2 2 3 3 4 4 6 0

Parsing method: LALR

state 0
    S' -> . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER     shift and go to state 3
    LPAREN     shift and go to state 2

state 1
    S' -> expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    PLUS       shift and go to state 6
    MINUS      shift and go to state 5
    TIMES      shift and go to state 4
    DIVIDE     shift and go to state 7

state 2
    expression -> LPAREN . expression RPAREN
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER     shift and go to state 3
    LPAREN     shift and go to state 2

state 3
    expression -> NUMBER .

    $          reduce using rule 5
    PLUS       reduce using rule 5
    MINUS      reduce using rule 5
    TIMES      reduce using rule 5
    DIVIDE     reduce using rule 5
    RPAREN     reduce using rule 5

state 4
    expression -> expression TIMES . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER     shift and go to state 3
    LPAREN     shift and go to state 2

state 5
    expression -> expression MINUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER     shift and go to state 3
    LPAREN     shift and go to state 2

state 6
    expression -> expression PLUS . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER     shift and go to state 3
    LPAREN     shift and go to state 2

state 7
    expression -> expression DIVIDE . expression
    expression -> . expression PLUS expression
    expression -> . expression MINUS expression
    expression -> . expression TIMES expression
    expression -> . expression DIVIDE expression
    expression -> . NUMBER
    expression -> . LPAREN expression RPAREN

    NUMBER     shift and go to state 3
    LPAREN     shift and go to state 2

state 8
    expression -> LPAREN expression . RPAREN
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    RPAREN     shift and go to state 13
    PLUS       shift and go to state 6
    MINUS      shift and go to state 5
    TIMES      shift and go to state 4
    DIVIDE     shift and go to state 7

state 9
    expression -> expression TIMES expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $          reduce using rule 3
    PLUS       reduce using rule 3
    MINUS      reduce using rule 3
    TIMES      reduce using rule 3
    DIVIDE     reduce using rule 3
    RPAREN     reduce using rule 3

  ! PLUS       [ shift and go to state 6 ]
  ! MINUS      [ shift and go to state 5 ]
  ! TIMES      [ shift and go to state 4 ]
  ! DIVIDE     [ shift and go to state 7 ]

state 10
    expression -> expression MINUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $          reduce using rule 2
    PLUS       reduce using rule 2
    MINUS      reduce using rule 2
    RPAREN     reduce using rule 2
    TIMES      shift and go to state 4
    DIVIDE     shift and go to state 7

  ! TIMES      [ reduce using rule 2 ]
  ! DIVIDE     [ reduce using rule 2 ]
  ! PLUS       [ shift and go to state 6 ]
  ! MINUS      [ shift and go to state 5 ]

state 11
    expression -> expression PLUS expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $          reduce using rule 1
    PLUS       reduce using rule 1
    MINUS      reduce using rule 1
    RPAREN     reduce using rule 1
    TIMES      shift and go to state 4
    DIVIDE     shift and go to state 7

  ! TIMES      [ reduce using rule 1 ]
  ! DIVIDE     [ reduce using rule 1 ]
  ! PLUS       [ shift and go to state 6 ]
  ! MINUS      [ shift and go to state 5 ]

state 12
    expression -> expression DIVIDE expression .
    expression -> expression . PLUS expression
    expression -> expression . MINUS expression
    expression -> expression . TIMES expression
    expression -> expression . DIVIDE expression

    $          reduce using rule 4
    PLUS       reduce using rule 4
    MINUS      reduce using rule 4
    TIMES      reduce using rule 4
    DIVIDE     reduce using rule 4
    RPAREN     reduce using rule 4

  ! PLUS       [ shift and go to state 6 ]
  ! MINUS      [ shift and go to state 5 ]
  ! TIMES      [ shift and go to state 4 ]
  ! DIVIDE     [ shift and go to state 7 ]

state 13
    expression -> LPAREN expression RPAREN .

    $          reduce using rule 6
    PLUS       reduce using rule 6
    MINUS      reduce using rule 6
    TIMES      reduce using rule 6
    DIVIDE     reduce using rule 6
    RPAREN     reduce using rule 6

The different states that appear in this file are a representation of every possible sequence of valid input tokens allowed by the grammar. When receiving input tokens, the parser is building up a stack and looking for matching rules. Each state keeps track of the grammar rules that might be in the process of being matched at that point. Within each rule, the “.” character indicates the current location of the parse within that rule. In addition, the actions for each valid input token are listed. When a shift/reduce or reduce/reduce conflict arises, rules not selected are prefixed with an !. For example:

! TIMES      [ reduce using rule 2 ]
! DIVIDE     [ reduce using rule 2 ]
! PLUS       [ shift and go to state 6 ]
! MINUS      [ shift and go to state 5 ]

By looking at these rules (and with a little practice), you can usually track down the source of most parsing conflicts. It should also be stressed that not all shift/reduce conflicts are bad. However, the only way to be sure that they are resolved correctly is to look at parser.out.

Syntax Error Handling

If you are creating a parser for production use, the handling of syntax errors is important. As a general rule, you don’t want a parser to throw up its hands and stop at the first sign of trouble. Instead, you want it to report the error, recover if possible, and continue parsing so that all of the errors in the input get reported to the user at once. This is the standard behavior found in compilers for languages such as C, C++, and Java.

In PLY, when a syntax error occurs during parsing, the error is immediately detected (i.e., the parser does not read any more tokens beyond the source of the error). However, at this point, the parser enters a recovery mode that can be used to try and continue further parsing. As a general rule, error recovery in LR parsers is a delicate topic that involves ancient rituals and black-magic. The recovery mechanism provided by yacc.py is comparable to Unix yacc so you may want to consult a book like O’Reilly’s “Lex and Yacc” for some of the finer details.

When a syntax error occurs, yacc.py performs the following steps:

  1. On the first occurrence of an error, the user-defined p_error() function is called with the offending token as an argument. However, if the syntax error is due to reaching the end-of-file, p_error() is called with an argument of None. Afterwards, the parser enters an “error-recovery” mode in which it will not make future calls to p_error() until it has successfully shifted at least 3 tokens onto the parsing stack.
  2. If no recovery action is taken in p_error(), the offending lookahead token is replaced with a special error token.
  3. If the offending lookahead token is already set to error, the top item of the parsing stack is deleted.
  4. If the entire parsing stack is unwound, the parser enters a restart state and attempts to start parsing from its initial state.
  5. If a grammar rule accepts error as a token, it will be shifted onto the parsing stack.
  6. If the top item of the parsing stack is error, lookahead tokens will be discarded until the parser can successfully shift a new symbol or reduce a rule involving error.

Recovery and resynchronization with error rules

The most well-behaved approach for handling syntax errors is to write grammar rules that include the error token. For example, suppose your language had a grammar rule for a print statement like this:

def p_statement_print(p):
    'statement : PRINT expr SEMI'
    ...

To account for the possibility of a bad expression, you might write an additional grammar rule like this:

def p_statement_print_error(p):
    'statement : PRINT error SEMI'
    print("Syntax error in print statement. Bad expression")

In this case, the error token will match any sequence of tokens that might appear up to the first semicolon that is encountered. Once the semicolon is reached, the rule will be invoked and the error token will go away.

This type of recovery is sometimes known as parser resynchronization. The error token acts as a wildcard for any bad input text and the token immediately following error acts as a synchronization token.

It is important to note that the error token usually does not appear as the last token on the right in an error rule. For example:

def p_statement_print_error(p):
    'statement : PRINT error'
    print("Syntax error in print statement. Bad expression")

This is because the first bad token encountered will cause the rule to be reduced–which may make it difficult to recover if more bad tokens immediately follow.
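As another sketch of a resynchronizing error rule (assuming the LPAREN and RPAREN tokens from the earlier expression grammar), a closing parenthesis can serve as the synchronization token for a malformed parenthesized expression:

def p_expression_error(p):
    'expression : LPAREN error RPAREN'
    # RPAREN is the synchronization token here: everything between the
    # parentheses is discarded and parsing resumes after the ')'.
    print("Syntax error in parenthesized expression")
    p[0] = None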

Panic mode recovery

An alternative error recovery scheme is to enter a panic mode recovery in which tokens are discarded to a point where the parser might be able to recover in some sensible manner.

Panic mode recovery is implemented entirely in the p_error() function. For example, this function starts discarding tokens until it reaches a closing ‘}’. Then, it restarts the parser in its initial state:

def p_error(p):
    print("Whoa. You are seriously hosed.")
    if not p:
        print("End of File!")
        return

    # Read ahead looking for a closing '}'
    while True:
        tok = parser.token()             # Get the next token
        if not tok or tok.type == 'RBRACE':
            break
    parser.restart()

This function discards the bad token and tells the parser that the error was ok:

def p_error(p):
    if p:
        print("Syntax error at token", p.type)
        # Just discard the token and tell the parser it's okay.
        parser.errok()
    else:
        print("Syntax error at EOF")

More information on these methods is as follows:

parser.errok()
This resets the parser state so it doesn’t think it’s in error-recovery mode. This will prevent an error token from being generated and will reset the internal error counters so that the next syntax error will call p_error() again.
parser.token()
This returns the next token on the input stream.
parser.restart()
This discards the entire parsing stack and resets the parser to its initial state.

To supply the next lookahead token to the parser, p_error() can return a token. This might be useful if trying to synchronize on special characters. For example:

def p_error(p):
    # Read ahead looking for a terminating ";"
    while True:
        tok = parser.token()             # Get the next token
        if not tok or tok.type == 'SEMI':
            break
    parser.errok()

    # Return SEMI to the parser as the next lookahead token
    return tok

Keep in mind that in the above error handling functions, parser is an instance of the parser created by yacc(). You’ll need to save this instance someplace in your code so that you can refer to it during error handling.

Signalling an error from a production

If necessary, a production rule can manually force the parser to enter error recovery. This is done by raising the SyntaxError exception like this:

def p_production(p):
    'production : some production ...'
    raise SyntaxError

The effect of raising SyntaxError is the same as if the last symbol shifted onto the parsing stack was actually a syntax error. Thus, when you do this, the last symbol shifted is popped off of the parsing stack and the current lookahead token is set to an error token. The parser then enters error-recovery mode where it tries to reduce rules that can accept error tokens. The steps that follow from this point are exactly the same as if a syntax error were detected and p_error() were called.

One important aspect of manually setting an error is that the p_error() function will NOT be called in this case. If you need to issue an error message, make sure you do it in the production that raises SyntaxError.
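For instance, here is a sketch of a production that performs a hypothetical semantic check and reports its own message before raising SyntaxError (the check itself is purely illustrative):

def p_expression_divide(p):
    'expression : expression DIVIDE expression'
    # Hypothetical semantic check: reject a literal division by zero.
    # The message is printed here because p_error() will NOT be called.
    if p[3] == 0:
        print("Division by zero")
        raise SyntaxError
    p[0] = p[1] / p[3]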

Note: This feature of PLY is meant to mimic the behavior of the YYERROR macro in yacc.

When Do Syntax Errors Get Reported?

In most cases, yacc will handle errors as soon as a bad input token is detected on the input. However, be aware that yacc may choose to delay error handling until after it has reduced one or more grammar rules first. This behavior might be unexpected, but it’s related to special states in the underlying parsing table known as “defaulted states.” A defaulted state is a parsing condition where the same grammar rule will be reduced regardless of what valid token comes next on the input. For such states, yacc chooses to go ahead and reduce the grammar rule without reading the next input token. If the next token is bad, yacc will eventually get around to reading it and report a syntax error. It’s just a little unusual in that you might see some of your grammar rules firing immediately prior to the syntax error.

Usually, the delayed error reporting with defaulted states is harmless (and there are other reasons for wanting PLY to behave in this way). However, if you need to turn this behavior off for some reason, you can clear the defaulted states table like this:

parser = yacc.yacc()
parser.defaulted_states = {}

Disabling defaulted states is not recommended if your grammar makes use of embedded actions as described in Section 6.11.

General comments on error handling

For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable technique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy to recover and continue parsing. Panic mode recovery is really only useful in certain specialized applications where you might want to discard huge portions of the input text to find a valid restart point.

Line Number and Position Tracking

Position tracking is often a tricky problem when writing compilers. By default, PLY tracks the line number and position of all tokens. This information is available using the following functions:

p.lineno(num). Return the line number for symbol num

p.lexpos(num). Return the lexing position for symbol num

For example:

def p_expression(p):
    'expression : expression PLUS expression'
    line  = p.lineno(2)       # line number of the PLUS token
    index = p.lexpos(2)       # Position of the PLUS token

As an optional feature, yacc.py can automatically track line numbers and positions for all of the grammar symbols as well. However, this extra tracking requires extra processing and can significantly slow down parsing. Therefore, it must be enabled by passing the tracking=True option to yacc.parse(). For example:

yacc.parse(data,tracking=True)

Once enabled, the lineno() and lexpos() methods work for all grammar symbols. In addition, two further methods can be used:

p.linespan(num). Return a tuple (startline,endline) with the starting and ending line number for symbol num.

p.lexspan(num). Return a tuple (start,end) with the starting and ending positions for symbol num.

For example:

def p_expression(p):
    'expression : expression PLUS expression'
    p.lineno(1)        # Line number of the left expression
    p.lineno(2)        # line number of the PLUS operator
    p.lineno(3)        # line number of the right expression
    ...
    start, end   = p.linespan(3)    # Start,end lines of the right expression
    starti, endi = p.lexspan(3)     # Start,end positions of right expression

Note: The lexspan() function only returns the range of values up to the start of the last grammar symbol.

Although it may be convenient for PLY to track position information on all grammar symbols, this is often unnecessary. For example, if you are merely using line number information in an error message, you can often just key off of a specific token in the grammar rule. For example:

def p_bad_func(p):
    'funccall : fname LPAREN error RPAREN'
    # Line number reported from LPAREN token
    print("Bad function call at line", p.lineno(2))

Similarly, you may get better parsing performance if you only selectively propagate line number information where it’s needed using the p.set_lineno() method. For example:

def p_fname(p):
    'fname : ID'
    p[0] = p[1]
    p.set_lineno(0, p.lineno(1))

PLY doesn’t retain line number information from rules that have already been parsed. If you are building an abstract syntax tree and need to have line numbers, you should make sure that the line numbers appear in the tree itself.
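As a sketch of that idea, a rule can copy the operator token’s line number into the node it builds so that the information survives after the rule has been reduced (the tuple layout here is just illustrative):

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    # Store the operator's line number in the node itself so it is still
    # available long after this rule has been reduced.
    p[0] = ('binop', p[2], p[1], p[3], p.lineno(2))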

AST Construction

yacc.py provides no special functions for constructing an abstract syntax tree. However, such construction is easy enough to do on your own.

A minimal way to construct a tree is to create and propagate a tuple or list in each grammar rule function. There are many possible ways to do this, but one example would be something like this:

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    p[0] = ('binary-expression', p[2], p[1], p[3])

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = ('group-expression', p[2])

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = ('number-expression', p[1])

Another approach is to create a set of data structures for different kinds of abstract syntax tree nodes and assign nodes to p[0] in each rule. For example:

class Expr:
    pass

class BinOp(Expr):
    def __init__(self, left, op, right):
        self.left = left
        self.right = right
        self.op = op

class Number(Expr):
    def __init__(self, value):
        self.value = value

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    p[0] = BinOp(p[1], p[2], p[3])

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = Number(p[1])

The advantage to this approach is that it may make it easier to attach more complicated semantics, type checking, code generation, and other features to the node classes.
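For instance, a minimal evaluator can be written as a separate traversal over the BinOp and Number classes above (a sketch, assuming the operator token values are the literal characters '+', '-', '*', and '/'):

def evaluate(node):
    # Walk the tree built from the BinOp and Number classes above.
    if isinstance(node, Number):
        return node.value
    if isinstance(node, BinOp):
        ops = {'+': lambda a, b: a + b,
               '-': lambda a, b: a - b,
               '*': lambda a, b: a * b,
               '/': lambda a, b: a / b}
        return ops[node.op](evaluate(node.left), evaluate(node.right))
    raise TypeError("Unknown node: %r" % (node,))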

To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes. For example:

class Node:
    def __init__(self, type, children=None, leaf=None):
        self.type = type
        if children:
            self.children = children
        else:
            self.children = []
        self.leaf = leaf

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    p[0] = Node("binop", [p[1], p[3]], p[2])
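With that generic structure, a single traversal function handles every node type. Here is a sketch (assuming every grammar rule builds a Node for its result):

def print_tree(node, indent=0):
    # Preorder dump of the generic Node structure above.
    print(" " * indent + node.type, node.leaf if node.leaf is not None else "")
    for child in node.children:
        print_tree(child, indent + 4)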

Embedded Actions

The parsing technique used by yacc only allows actions to be executed at the end of a rule. For example, suppose you have a rule like this:

def p_foo(p): "foo : A B C D" print("Parsed a foo", p[1],p[2],p[3],p[4])

In this case, the supplied action code only executes after all of the symbols A, B, C, and D have been parsed. Sometimes, however, it is useful to execute small code fragments during intermediate stages of parsing. For example, suppose you wanted to perform some action immediately after A has been parsed. To do this, write an empty rule like this:

def p_foo(p): "foo : A seen_A B C D" print("Parsed a foo", p[1],p[3],p[4],p[5]) print("seen_A returned", p[2])def p_seen_A(p): "seen_A :" print("Saw an A = ", p[-1]) # Access grammar symbol to left p[0] = some_value # Assign value to seen_A

In this example, the empty seen_A rule executes immediately after A is shifted onto the parsing stack. Within this rule, p[-1] refers to the symbol on the stack that appears immediately to the left of the seen_A symbol. In this case, it would be the value of A in the foo rule immediately above. Like other rules, a value can be returned from an embedded action by assigning it to p[0].

The use of embedded actions can sometimes introduce extra shift/reduce conflicts. For example, this grammar has no conflicts:

def p_foo(p): """foo : abcd | abcx"""def p_abcd(p): "abcd : A B C D"def p_abcx(p): "abcx : A B C X"

However, if you insert an embedded action into one of the rules like this:

def p_foo(p): """foo : abcd | abcx"""def p_abcd(p): "abcd : A B C D"def p_abcx(p): "abcx : A B seen_AB C X"def p_seen_AB(p): "seen_AB :"

an extra shift/reduce conflict will be introduced. This conflict is caused by the fact that the same symbol C appears next in both the abcd and abcx rules. The parser can either shift the symbol (abcd rule) or reduce the empty rule seen_AB (abcx rule).

A common use of embedded rules is to control other aspects of parsing such as scoping of local variables. For example, if you were parsing C code, you might write code like this:

def p_statements_block(p):
    "statements : LBRACE new_scope statements RBRACE"
    # Action code
    ...
    pop_scope()        # Return to previous scope

def p_new_scope(p):
    "new_scope :"
    # Create a new scope for local variables
    s = new_scope()
    push_scope(s)
    ...

In this case, the embedded action new_scope executes immediately after a LBRACE ({) symbol is parsed. This might adjust internal symbol tables and other aspects of the parser. Upon completion of the rule statements_block, code might undo the operations performed in the embedded action (e.g., pop_scope()).
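PLY itself provides no scoping machinery, so the helpers new_scope(), push_scope(), and pop_scope() above are left to you. Here is one possible sketch using a stack of dictionaries as the symbol table (the declare() and lookup() helpers are likewise illustrative):

# A sketch of the scope helpers used above; all of these names are
# application-defined, not part of PLY.
scope_stack = [{}]                  # The global scope sits at the bottom

def new_scope():
    return {}

def push_scope(s):
    scope_stack.append(s)

def pop_scope():
    return scope_stack.pop()

def declare(name, value):
    scope_stack[-1][name] = value   # Bind a name in the innermost scope

def lookup(name):
    for scope in reversed(scope_stack):   # Innermost scope wins
        if name in scope:
            return scope[name]
    raise NameError(name)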

Miscellaneous Yacc Notes

  1. By default, yacc.py relies on lex.py for tokenizing. However, an alternative tokenizer can be supplied as follows:

    parser.parse(lexer=x)

    In this case, x must be a Lexer object that minimally has an x.token() method for retrieving the next token. If an input string is given to yacc.parse(), the lexer must also have an x.input() method. (A minimal sketch of such an object appears after this list.)

  2. To print copious amounts of debugging during parsing, use:

    parser.parse(input_text, debug=True)
  3. Since LR parsing is driven by tables, the performance of the parser is largely independent of the size of the grammar. The biggest bottlenecks will be the lexer and the complexity of the code in your grammar rules.

  4. yacc() also allows parsers to be defined as classes and as closures (see the section on alternative specification of lexers). However, be aware that only one parser may be defined in a single module (source file). There are various error checks and validation steps that may issue confusing error messages if you try to define multiple parsers in the same source file.
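As promised in item 1, here is a minimal sketch of a hand-written tokenizer object that can be passed via the lexer argument. The class and token names are hypothetical; the only requirements are a token() method (and input() if you hand parse() a string), returning objects with type, value, lineno, and lexpos attributes:

class Tok:
    # Minimal token object with the attributes yacc.py expects.
    def __init__(self, type, value, lineno=0, lexpos=0):
        self.type = type
        self.value = value
        self.lineno = lineno
        self.lexpos = lexpos

class ManualLexer:
    # Hypothetical hand-written tokenizer for whitespace-separated
    # numbers and '+' signs.
    def input(self, text):
        self._toks = iter(self._tokenize(text))

    def token(self):
        return next(self._toks, None)        # None signals end of input

    def _tokenize(self, text):
        pos = 0
        for part in text.split():
            if part.isdigit():
                yield Tok('NUMBER', int(part), lexpos=pos)
            elif part == '+':
                yield Tok('PLUS', part, lexpos=pos)
            pos += len(part) + 1

# Usage (assuming a parser whose grammar uses NUMBER and PLUS):
# result = parser.parse("3 + 4", lexer=ManualLexer())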

Multiple Parsers and Lexers

In advanced parsing applications, you may want to have multiple parsers and lexers.

As a general rule this isn’t a problem. However, to make it work, you need to carefully make sure everything gets hooked up correctly. First, make sure you save the objects returned by lex() and yacc(). For example:

lexer  = lex.lex()       # Return lexer object
parser = yacc.yacc()     # Return parser object

Next, when parsing, make sure you give the parse() function a reference to the lexer it should be using. For example:

parser.parse(text,lexer=lexer)

If you forget to do this, the parser will use the last lexer created–which is not always what you want.

Within lexer and parser rule functions, these objects are also available. In the lexer, the “lexer” attribute of a token refers to the lexer object that triggered the rule. For example:

def t_NUMBER(t):
    r'\d+'
    ...
    print(t.lexer)           # Show lexer object

In the parser, the “lexer” and “parser” attributes refer to the lexer and parser objects respectively:

def p_expr_plus(p):
    'expr : expr PLUS expr'
    ...
    print(p.parser)          # Show parser object
    print(p.lexer)           # Show lexer object

If necessary, arbitrary attributes can be attached to the lexer or parser object. For example, if you wanted to have different parsing modes, you could attach a mode attribute to the parser object and look at it later.
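A sketch of that idea (the "mode" attribute and the rule shown here are hypothetical; only the p.parser attribute itself is provided by PLY):

def p_statement_expr(p):
    'statement : expression'
    # p.parser is the parser object running this rule, so an
    # application-defined attribute such as "mode" is visible here.
    if p.parser.mode == 'strict':
        ...                          # hypothetical mode-dependent behavior
    p[0] = p[1]

parser = yacc.yacc()
parser.mode = 'strict'               # Attach an arbitrary attribute
# parser.parse(data, lexer=lexer)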

Advanced Debugging

Debugging a compiler is typically not an easy task. PLY provides some diagnostic capabilities through the use of Python’s logging module. The next two sections describe this:

Debugging the lex() and yacc() commands

Both the lex() and yacc() commands have a debugging mode that can be enabled using the debug flag. For example:

lex.lex(debug=True)
yacc.yacc(debug=True)

Normally, the output produced by debugging is routed to either standard error or, in the case of yacc(), to a file parser.out. This output can be more carefully controlled by supplying a logging object. Here is an example that adds information about where different debugging messages are coming from:

# Set up a logging object
import logging
logging.basicConfig(
    level = logging.DEBUG,
    filename = "parselog.txt",
    filemode = "w",
    format = "%(filename)10s:%(lineno)4d:%(message)s"
)
log = logging.getLogger()

lex.lex(debug=True, debuglog=log)
yacc.yacc(debug=True, debuglog=log)

If you supply a custom logger, the amount of debugging information produced can be controlled by setting the logging level. Typically, debugging messages are either issued at the DEBUG, INFO, or WARNING levels.

PLY’s error messages and warnings are also produced using the logging interface. This can be controlled by passing a logging object using the errorlog parameter:

lex.lex(errorlog=log)
yacc.yacc(errorlog=log)

If you want to completely silence warnings, you can either pass in a logging object with an appropriate filter level or use the NullLogger object defined in either lex or yacc. For example:

yacc.yacc(errorlog=yacc.NullLogger())

Run-time Debugging

To enable run-time debugging of a parser, use the debug option to parse(). This option can either be an integer (which turns debugging on or off) or an instance of a logger object. For example:

log = logging.getLogger()
parser.parse(input, debug=log)

If a logging object is passed, you can use its filtering level to control how much output gets generated. The INFO level is used to produce information about rule reductions. The DEBUG level will show information about the parsing stack, token shifts, and other details. The ERROR level shows information related to parsing errors.

For very complicated problems, you should pass in a logging object that redirects to a file where you can more easily inspect the output after execution.
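A sketch of that setup, mirroring the earlier logging configuration (the file name is arbitrary, and parser and input_text are assumed to come from earlier in your program):

import logging

# Send the (very verbose) run-time trace to a file for later inspection.
logging.basicConfig(
    filename = "parsedebug.txt",
    filemode = "w",
    level = logging.DEBUG,
)
log = logging.getLogger()
result = parser.parse(input_text, debug=log)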

Using Python -OO Mode

Because of PLY’s reliance on docstrings, it is not compatible with -OO mode of the interpreter (which strips docstrings). If you want to support this, you’ll need to write a decorator or some other tool to attach docstrings to functions. For example:

def _(doc):
    def decorate(func):
        func.__doc__ = doc
        return func
    return decorate

@_("assignment : expr PLUS expr")
def p_assignment(p):
    ...

PLY does not provide such a decorator by default.

Where to go from here?

The examples directory of the PLY distribution contains several simple examples. Please consult a compilers textbook for the theory and underlying implementation details of LR parsing.

