Programming Languages
Module 4
Xingang (Ian) Fang
Lexical and Syntax Analysis
Lexical and syntax analysis are key parts of language implementation
Based on formal syntax descriptions (e.g. BNF/EBNF)
Two main phases:
Lexical analysis
Based on a language that describes lexemes
small-scale language constructs: lexemes and token codes
Syntax analysis
Based on a language that describes syntax
large-scale language constructs: grammar and parse trees
Separate lexical analysis out for simplicity, efficiency and portability
Lexical Analysis
Acts as pattern matcher to identify lexemes (find strings in a language of a certain grammar)
Front-end to syntax analyzer
Lexical analyzer: a.k.a. Lexer, Scanner, Tokenizer
Tasks
Breaks input char stream into lexemes
Assigns token codes to lexemes
Three implementation approaches:
Describe patterns using formal descriptive language and use tools to generate a lexical analyzer. E.g. Lex, Flex
Write code to implement state diagrams
Hand-construct table-driven implementation
Lexical Analysis Example
Input: result = old_sum - value / 100;
Lexeme |
Token Code |
|---|---|
result |
IDENT |
= |
ASSIGN_OP |
old_sum |
IDENT |
- |
SUB_OP |
value |
IDENT |
/ |
DIV_OP |
100 |
Int_LIT |
; |
SEMICOLON |
State Diagrams for Lexical Analysis
State Diagram: directed graphs representing finite automata
Some naming rule cannot be described by a finite automata
Nodes: states
Arcs: transitions on input characters
Use character classes (e.g., LETTER, DIGIT) to simplify
Credit: Concepts of Programming Languages, R. Sebesta
Convert State Diagrams to Code
Approach 2: use control flow statements
A typical method
a lex function to get next lexeme and assign token code
an addChar function to add the current char to end of the current lexeme
a getChar function to get next character and set token code
a lookup function to check if the current lexeme is an operator
a switch statement or if-else chain to manage transitions
TextBook example of a Lexer
Lexer written in C
abuse of global variables
will be easier to work with in object-oriented languages
outputs token codes rather than token name
no whitespace handling
no error handling
no line number tracking
think about how to improve!
Quick Check for Lexical Analysis
Where to handle errors, like bad identifier names?
Give one example of bad input?
Where to handle reserved like if, for, etc?
How about if I want to allow underscore in identifier names?
Quick Check Answers for Lexical Analysis
Where to handle errors? Nowhere. Errors are passed to the syntax analysis.
Give one example of bad input? 23abc as a bad identifier.
Where to handle reserved like if, for, etc? End of identifier case in the lex function.
How about if I want to allow underscore in identifier names? In the identifier case in the lex function.
Syntax Analysis (Parsing)
Two tasks:
Check input program for syntax correctness
Build parse tree for valid programs
Two main approaches:
Top-down parsing (LL: left-to-right scan, left-most derivation)
Bottom-up parsing (LR: left-to-right scan, right-most derivation)
These methods only work with context-free grammars!
In practice, require further processing to handle context-sensitive constructs
Declare before use, type checking, function overloading, scoping, etc. are sign of a context-sensitive grammar
Recursive-descent Parsing
Intuitive conversion from EBNF or BNF
Each terminal means the lexer/scanner returns a lexeme
Each nonterminal has a corresponding function/method
- means branching
{}* means repetition (EBNF only)
Unexpected token or end of input: error
The process:
Get the next lexeme
Call the top-most non-terminal function to start parsing
It will recursively call other functions until it reaches a terminal and find a match
Example Link
Limitation of recursive-descent parsing
Problems with LL grammars:
Left recursion
A -> Aa | bcauses infinite recursionAmbiguity of which rule to expand when
A -> aB | aCwhen the next terminal symbol is a
Solutions:
Choose other parsing methods
Reconstruct rules
Bottom-Up Parsing
Builds parse tree from leaves upward
Most common: LR parsing
Uses shift-reduce algorithm:
Shift: Push next input token onto stack
Reduce: Replace handle with nonterminal
More powerful than LL but more complex
Usually table-driven implementation
LR Parsing
Key components:
Parse stack with grammar and state symbols
ACTION table: Controls shift/reduce decisions
GOTO table: Determines next state after reduction
Input buffer with tokens
Advantages:
Works for larger class of grammars
Detects errors quickly
Suitable for all programming languages
Summary
Lexical analysis identifies basic language elements
Syntax analysis verifies structure and builds parse trees
Top-down parsing is simpler but more restricted
Bottom-up parsing is more powerful but complex
Both approaches used in real compilers
Modern parsers use generated parse tables
Shift-Reduce Parser Demonstration
Grammar
The grammar for our expression language:
E → E + T | T T → T * F | F F → (E) | idLeft-recursive and ambiguous (not suitable for recursive-descent parsing)
Simplified terminal symbols for a simpler lexer
It models a language:
+, *, and (): operators
id: the only terminal symbol allowed
Examples:
id
id + id
id + id * id
id * id + id
(id + id) * id
LR(1) Parsing Tables
+-------------------------------------------+ +-------------------------+
| ACTION TABLE | | GOTO TABLE |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| State | id | + | * | ( | ) | $ | | State | E | T | F |
+=======+=====+=====+=====+=====+=====+=====+ +=======+=====+=====+=====+
| 0 | s5 | | | s4 | | | | 0 | 1 | 2 | 3 |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 1 | | s6 | | | |acc | | 1 | | | |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 2 | | r2 | s7 | | | r2 | | 2 | | | |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 3 | | r4 | r4 | | r4 | r4 | | 3 | | | |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 4 | s5 | | | s4 | | | | 4 | 8 | 2 | 3 |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 5 | | r6 | r6 | | r6 | r6 | | 5 | | | |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 6 | s5 | | | s4 | | | | 6 | | 2 | 3 |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 7 | s5 | | | s4 | | | | 7 | | | 3 |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 8 | | s6 | | | s9 | | | 8 | | | |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 9 | | r5 | r5 | | r5 | r5 | | 9 | | | |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
Note: Empty cells in the Action table mean error
Example Parse: id + id * id
Parse steps
Stack Input Action Rule Applied
----------------------------------------------------------------------
0 id+id*id$ shift 5 -
0 id 5 +id*id$ reduce by F→id GOTO(0,F)=3
0 F 3 +id*id$ reduce by T→F GOTO(0,T)=2
0 T 2 +id*id$ reduce by E→T GOTO(0,E)=1
0 E 1 +id*id$ shift 6 -
0 E 1 + 6 id*id$ shift 5 -
0 E 1 + 6 id 5 *id$ reduce by F→id GOTO(6,F)=3
0 E 1 + 6 F 3 *id$ reduce by T→F GOTO(6,T)=2
0 E 1 + 6 T 2 *id$ shift 7 -
0 E 1 + 6 T 2 * 7 id $ shift 5 -
0 E 1 + 6 T 2 * 7 id 5 $ reduce by F→id GOTO(7,F)=3
0 E 1 + 6 T 2 * 7 F 3 $ reduce by T→T*F GOTO(6,T)=2
0 E 1 + 6 T 2 $ reduce by E→E+T GOTO(0,E)=1
0 E 1 $ accept -
----------------------------------------------------------------------
How to Use the Tables
ACTION Table Usage:
‘s’ means shift: push symbol and next state
‘r’ means reduce: pop all RHS items, push LHS and new state from GOTO
‘acc’ means accept the input
GOTO Table Usage:
After reduction, use current state and non-terminal to find next state
Only used after reduce actions
Parse Process:
Look at current state (top of state stack) and next input symbol
Find action in ACTION table
For reduce actions, pop everything to be reduced and use GOTO table with current state (after popping) and non-terminal reduced to
Continue until accept or error
Error Handling:
Error when no valid action exists in table
Blank entries (-) in tables indicate error
Parser stops on error