Programming Languages

Module 4

Xingang (Ian) Fang

Lexical and Syntax Analysis

Lexical and syntax analysis are key parts of language implementation
Based on formal syntax descriptions (e.g. BNF/EBNF)
Two main phases:
- Lexical analysis
  - Based on a language that describes lexemes
  - small-scale language constructs: lexemes and token codes
- Syntax analysis
  - Based on a language that describes syntax
  - large-scale language constructs: grammar and parse trees
Separate lexical analysis out for simplicity, efficiency and portability

Lexical Analysis

Acts as pattern matcher to identify lexemes (find strings in a language of a certain grammar)
Front-end to syntax analyzer
Lexical analyzer: a.k.a. Lexer, Scanner, Tokenizer
Tasks
- Breaks input char stream into lexemes
- Assigns token codes to lexemes
Three implementation approaches:
1. Describe patterns using formal descriptive language and use tools to generate a lexical analyzer. E.g. Lex, Flex
2. Write code to implement state diagrams
3. Hand-construct table-driven implementation

Lexical Analysis Example

Input: result = old_sum - value / 100;

Lexeme	Token Code
result	IDENT
=	ASSIGN_OP
old_sum	IDENT
-	SUB_OP
value	IDENT
/	DIV_OP
100	Int_LIT
;	SEMICOLON

State Diagrams for Lexical Analysis

State Diagram: directed graphs representing finite automata
- Some naming rule cannot be described by a finite automata
Nodes: states
Arcs: transitions on input characters
Use character classes (e.g., LETTER, DIGIT) to simplify

Credit: Concepts of Programming Languages, R. Sebesta

Convert State Diagrams to Code

Approach 2: use control flow statements
A typical method
- a lex function to get next lexeme and assign token code
- an addChar function to add the current char to end of the current lexeme
- a getChar function to get next character and set token code
- a lookup function to check if the current lexeme is an operator
- a switch statement or if-else chain to manage transitions
TextBook example of a Lexer
- Lexer written in C
- abuse of global variables
- will be easier to work with in object-oriented languages
- outputs token codes rather than token name
- no whitespace handling
- no error handling
- no line number tracking
- think about how to improve!

Quick Check for Lexical Analysis

Where to handle errors, like bad identifier names?
Give one example of bad input?
Where to handle reserved like if, for, etc?
How about if I want to allow underscore in identifier names?

Quick Check Answers for Lexical Analysis

Where to handle errors? Nowhere. Errors are passed to the syntax analysis.
Give one example of bad input? 23abc as a bad identifier.
Where to handle reserved like if, for, etc? End of identifier case in the lex function.
How about if I want to allow underscore in identifier names? In the identifier case in the lex function.

Syntax Analysis (Parsing)

Two tasks:
1. Check input program for syntax correctness
2. Build parse tree for valid programs
Two main approaches:
1. Top-down parsing (LL: left-to-right scan, left-most derivation)
2. Bottom-up parsing (LR: left-to-right scan, right-most derivation)
These methods only work with context-free grammars!
- In practice, require further processing to handle context-sensitive constructs
- Declare before use, type checking, function overloading, scoping, etc. are sign of a context-sensitive grammar

Recursive-descent Parsing

Intuitive conversion from EBNF or BNF
Each terminal means the lexer/scanner returns a lexeme
Each nonterminal has a corresponding function/method
means branching
{}* means repetition (EBNF only)
Unexpected token or end of input: error
The process:
1. Get the next lexeme
2. Call the top-most non-terminal function to start parsing
3. It will recursively call other functions until it reaches a terminal and find a match
Example Link

Limitation of recursive-descent parsing

Problems with LL grammars:
- Left recursion A -> Aa | b causes infinite recursion
- Ambiguity of which rule to expand when A -> aB | aC when the next terminal symbol is a
Solutions:
- Choose other parsing methods
- Reconstruct rules

Bottom-Up Parsing

Builds parse tree from leaves upward
Most common: LR parsing
Uses shift-reduce algorithm:
- Shift: Push next input token onto stack
- Reduce: Replace handle with nonterminal
More powerful than LL but more complex
Usually table-driven implementation

LR Parsing

Key components:
- Parse stack with grammar and state symbols
- ACTION table: Controls shift/reduce decisions
- GOTO table: Determines next state after reduction
- Input buffer with tokens
Advantages:
- Works for larger class of grammars
- Detects errors quickly
- Suitable for all programming languages

Summary

Lexical analysis identifies basic language elements
Syntax analysis verifies structure and builds parse trees
Top-down parsing is simpler but more restricted
Bottom-up parsing is more powerful but complex
Both approaches used in real compilers
Modern parsers use generated parse tables

Shift-Reduce Parser Demonstration

Grammar

The grammar for our expression language:

E → E + T | T
T → T * F | F
F → (E) | id

Left-recursive and ambiguous (not suitable for recursive-descent parsing)
Simplified terminal symbols for a simpler lexer

It models a language:
- +, *, and (): operators
- id: the only terminal symbol allowed
Examples:
- id
- id + id
- id + id * id
- id * id + id
- (id + id) * id

LR(1) Parsing Tables

+-------------------------------------------+ +-------------------------+
|               ACTION TABLE                | |         GOTO TABLE      |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| State | id  | +   | *   | (   | )   | $   | | State | E   | T   | F   |
+=======+=====+=====+=====+=====+=====+=====+ +=======+=====+=====+=====+
| 0     | s5  |     |     | s4  |     |     | | 0     | 1   | 2   | 3   |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 1     |     | s6  |     |     |     |acc  | | 1     |     |     |     |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 2     |     | r2  | s7  |     |     | r2  | | 2     |     |     |     |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 3     |     | r4  | r4  |     | r4  | r4  | | 3     |     |     |     |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 4     | s5  |     |     | s4  |     |     | | 4     | 8   | 2   | 3   |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 5     |     | r6  | r6  |     | r6  | r6  | | 5     |     |     |     |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 6     | s5  |     |     | s4  |     |     | | 6     |     | 2   | 3   |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 7     | s5  |     |     | s4  |     |     | | 7     |     |     | 3   |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 8     |     | s6  |     |     | s9  |     | | 8     |     |     |     |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+
| 9     |     | r5  | r5  |     | r5  | r5  | | 9     |     |     |     |
+-------+-----+-----+-----+-----+-----+-----+ +-------+-----+-----+-----+

Note: Empty cells in the Action table mean error

Example Parse: id + id * id

Parse steps

Stack                  Input         Action              Rule Applied
----------------------------------------------------------------------
0                      id+id*id$     shift 5             -
0 id 5                 +id*id$       reduce by F→id      GOTO(0,F)=3
0 F 3                  +id*id$       reduce by T→F       GOTO(0,T)=2
0 T 2                  +id*id$       reduce by E→T       GOTO(0,E)=1
0 E 1                  +id*id$       shift 6             -
0 E 1 + 6              id*id$        shift 5             -
0 E 1 + 6 id 5          *id$         reduce by F→id      GOTO(6,F)=3
0 E 1 + 6 F 3           *id$         reduce by T→F       GOTO(6,T)=2
0 E 1 + 6 T 2           *id$         shift 7             -
0 E 1 + 6 T 2 * 7 id     $           shift 5             -
0 E 1 + 6 T 2 * 7 id 5   $           reduce by F→id      GOTO(7,F)=3
0 E 1 + 6 T 2 * 7 F 3    $           reduce by T→T*F     GOTO(6,T)=2
0 E 1 + 6 T 2            $           reduce by E→E+T     GOTO(0,E)=1
0 E 1                    $           accept              -
----------------------------------------------------------------------

How to Use the Tables

ACTION Table Usage:
- ‘s’ means shift: push symbol and next state
- ‘r’ means reduce: pop all RHS items, push LHS and new state from GOTO
- ‘acc’ means accept the input
GOTO Table Usage:
- After reduction, use current state and non-terminal to find next state
- Only used after reduce actions
Parse Process:
1. Look at current state (top of state stack) and next input symbol
2. Find action in ACTION table
3. For reduce actions, pop everything to be reduced and use GOTO table with current state (after popping) and non-terminal reduced to
4. Continue until accept or error
Error Handling:
- Error when no valid action exists in table
- Blank entries (-) in tables indicate error
- Parser stops on error