07-01: Predictive Parsing Part 1

Like recursive-descent but parser can “predict” which production to use:
- By looking at the next few tokens (lookahead);
- No backtracking.
Predictive parsers accept LL(k) grammars(Left-to-right Left-most derivation k tokens lookahead);
In recursive-descent:
- At each step, many choices of production to use;
- Backtracking used to undo bad choices.
In LL(1):
- At each step, only one choice of production.
But can require left-factoring to eliminate the common prefix of multiple productions for one non terminal.

07-01: Predictive Parsing Part 2

Algorithm to use parsing table:
- Method similar to recursive descent, except:
  1. For the leftmost non-terminal S;
  2. We look at the next input token a;
  3. And choice the production shown at [S, a].
- A stack records frontier of parse tree:
  1. Non-terminals that have yet to be expanded;
  2. Terminals that have yet to matched against the input;
  3. Top of stack = leftmost pending terminal or non-terminal.
- Reject on reaching error state;
- Accept on end of input & empty stack.

Code:

initialize stack = < S, $ > and next
repeat
  case stack of
 < X, rest > :
   If T[X, *next] = Y1...Yn
     then stack <- < Y1...Yn rest >;
   else error();
 < t, rest > :
   If t == *next++
     then stack <- < rest >;
   else error();
until stack == < >

07-02: First Sets

Consider non-terminal A, production $A \to α$ , & token t;
T[A, t] = α, in two cases:
- If $α \to\ast tβ$ :
  1. α can derive a t in the first position;
  2. We say that $t \isin First(α).$
- if $A \to α$ and $α \to\ast ε$ and $S \to\ast βAtδ$ :
  1. Useful if stack has A, input is t, and A cannot derive t;
  2. In this case only option is to get rid of A (by deriving ε):
    - Can work only if t can follow A in at least one derivation.
  3. We say $t \isin Follow(A).$
Definition: $First(X) = {t\ \vert\ X\ \to\ast\ tα} \cup {ε\ \vert\ X\ \to\ast ε}$ , X - arbitrary string.
Algorithm sketch:
- First(t) = {t}, t - terminal;
- $ε \isin First(X)$ :
  1. if $X \to ε$ ;
  2. if $X \to A_1 ... A_n\ and\ ε \isin First(A_i)\ for\ 1 \leq i \leq n$ .
- $First(α) \subseteq First(X)\ if\ X \to A_1 .. A_{nα}\ and\ ε \isin First(A_i)\ for\ 1 \leq i \leq n$ .

07-03: Follow Sets

Definition: $Follow(X) = \{t\ \vert\ S \to\ast βXtδ\}$ .
Intuition:
- if $X \to AB$ then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B);
- if $B \to\ast ε$ then $Follow(X) \subseteq Follow(A)$ .
- if S is the start symbol then $\$ \isin Follow(S)$ .
Algorithm sketch:
- $\$ \isin Follow(S)$ ;
- $First(β) - \{ε\} \subseteq Follow(X)$
  1. For each production $A \to αXβ$ .
- $Follow(A) \subseteq Follow(X)$
  1. For each production $A \to αXβ$ where $ε \isin First(β)$ .

07-04: LL1 Parsing Tables

Construct a parsing table T for CGG G:
For each production $A \to α$ in G do:
- For each terminal $t \isin First(α)$ do:
  1. $T[A, t] = α$
- If $ε \isin First(α)$ , for each $t \isin Follow(A)$ do:
  1. $T[A, t] = α$
- If $ε \isin First(α)$ and $\$ \isin Follow(A)$ do:
  1. $T[A, \$] = α$
If any entry in the parsing table is multiply defined then G is not LL(1). A grammar is definitely not LL(1) if it is not left factored, is left recursive and ambiguous;
Most programming languages GFGs are not LL(1).

07-05: Bottom-Up Parsing

Bottom-up parsing is more general than (deterministic) top-down parsing
- And just as efficient;
- Builds on ideas in top-down parsing.
Bottom-up is the preferred method;
Bottom-up parsers don’t need left-factored grammars;
Bottom up parsing reduces a string to the start symbol by inverting productions;
The productions, read backwards, trace a rightmost derivation;
A bottom-up parser traces a rightmost derivation in reverse.

07-06: Shift-Reduce Parsing

Let αβω be a step of a bottom-up parse;
Assume the next reduction is by $X \to β$ ;
Then ω is a string of terminals;
Idea: Split string into two substrings:
- Right substring is as yet unexamined by parsing;
- Left substring has terminals and non-terminals;
- The dividing point is marked by a |.
Shift: Move | one place to the right:
- Shift a terminal to the left string (read a terminal).
Reduce: Apply an inverse production at the right end of the left string:
- If $A \to xy$ is a production, then $Cbxy\ \vert\ ijk\ \to\ CbA\ \vert\ ijk$ .
Left string can be implemented by a stack:
- Top of the stack is the |.
Shift pushes a terminal on the stack;
Reduce:
- pops symbols off of the stack;
- pushes a non-terminal on the stack.
In a given state, more than one action (shift or reduce) may lead to a valid parse;
If it is legal to shift or reduce, there is a shift-reduce conflict;
If it legal to reduce by two different productions, there is a reduce-reduce conflict.

08-01: Handles

Intuition: Want to reduce only if the result can still be reduced to the start symbol:
- Assume a rightmost derivation $S \to\ast αXω \to αβω$ ;
- Then αβ is a handle of αβω.
Handles formalize the intuition:
- A handle is a reduction that also allows further reductions back to the start symbol.
We only want to reduce at handles;
In shift-reduce parsing, handles appear only at the top of the stack, never inside;
Handles are never to the left of the rightmost non-terminal:
- Therefore, shift-reduce moves are sufficient; the | need never move left.
Bottom-up parsing algorithms are based on recognizing handles.

08-02: Recognizing Handles

α is a viable prefix if there is an ω such that $α \vert ω$ is a state of a shift-reduce parser:
- A viable prefix does not extend past the right end of the handle;
- It’s a viable prefix because it is a prefix of the handle;
- As long as a parser has viable prefixes on the stack no parsing error has been detected.
For any grammar, the set of viable prefixes is a regular language;
An item is a production with a ”.” somewhere on the rhs:
- The only item for $X \to ε$ is $X -> .$
- Items are ofter called LR(0) items.
The stack may have many prefixes of rhs’s: Prefix(1) Prefix(2) … Prefix(n-1) Prefix(n)
Let Prefix(i) be a prefix of rhs of $X(i) \to α(i)$ :
- Prefix(i) will eventually reduce to X(i);
- The missing part of α(i-1) start with X(i);
- i.e $X(i-1) \to Prefix(i-1)X(i)β$ for some β.
Recursively, Prefix(k+1) … Prefix(n) eventually reduces to the missing part of α(k);
Idea: To recognize viable prefixes, we must:
- Recognize a sequence of partial rhs’s of productions, where
- Each partial rhs can eventually reduce to part of the missing suffix of its predecessor.

08-03: Recognizing Viable Prefixes

Add a dummy production $S' \to S$ to G;
The NFA states are the items of G:
- Including the extra production.
For item $E \to α.Xβ$ add transition $E \to α.Xβ \to_{X} E \to αX.β$ ;
For item $E \to α.Xβ$ and production $X \to γ$ add:
- $E -> α.Xβ* ->_{ε} X -> .γ$
Every state is an accepting state;
Start state is $S' \to .S$ .

08-04: Valid Items

Item $X \to β.γ$ is valid for a viable prefix αβ if $S' \to\ast αXω \to αβγω$ by a right-most derivation;
After parsing αβ, the valid items are the possible tops of the stack of items;
An item I is valid for a viable prefix α if the DFA recognizing viable prefixes terminates on input α in a state s containing I;
The items in s describe what the top of item stack might be after reading input α.

08-05: SLR Parsing

LR(0) Parsing: Assume:
- stack contains α;
- next input is t;
- DFA on input α terminates in state s.
Reduce by $X -> β$ if:
- s contains item $X \to β.$
Shift if:
- s contains item $X \to β.tω$ ;
- equivalent to saying s has a transition labeled t.
LR(0) has a reduce/reduce conflict if:
- Any state has two reduce items:
1. $X \to β.$ and $X \to ω.$
LR(0) has a shift/reduce conflict if:
- Any state has a reduce item and a shift item:
1. $X \to β.$ and $Y \to ω.tδ$
SLR Idea:
- Reduce by $X \to β$ if:
1. s contains item $X \to β.$
2. $t \isin Follow(X)$
SLR Parsing Algorithm:
- Let M be DFA for viable prefix of G;
- Let $\vert\ x_1 ... x_n\$$ be initial configuration;
- Repeat until configuration is S|$:
1. Let $α\ \vert\ ω$ be current configuration;
2. Run M on current stack α;
3. if M rejects α, report parsing error:
  - Stack α is not a viable prefix.
4. if M accepts α with item I, let a be next input:
  - Shift if $X\ \to\ β.aγ\ \isin\ I$ ;
  - Reduce if $X \to β.\ and\ a \isin Follow(X)$ ;
  - Report parsing error if neither applies.
    - if there is a conflict in the last step, grammar is not SLR(k).

08-07: SLR Improvements

Rerunning the viable prefix automaton on the stack at each step is wasteful;
Remember the state of the automaton on each prefix of the stack;
Change stack to contain pairs <Symbol, DFA state>;
For a stack <sym(1), state(1)> … <sym(n), state(n)> state(n) is the final state of the DFA on sym(1) … sym(n);
Detail: The bottom of the stack is <any, start> where:
- any is any dummy symbol;
- start is the start state of the DFA.
Define goto[i, A] = j if $state(i) \to_{A} state(j)$ ;
Goto is just the transition function of the DFA:
- one of two parsing tables.
For each state s(i) and terminal a:
- If s(i) has item $X \to α.aβ$ and goto[i, a] = j then action[i, a] = shift j;
- If s(i) has item $X \to α.$ and $a \isin Follow(X)$ and $X \neq S'$ then $action[i, a] = reduce X \to α$ ;
- If s(i) has item $S' \to S.$ then action[i, $] = accept;
- Otherwise, action[i, a] = error.

Algorithm:

Let I = w$ be initial input
Let j = 0
Let DFA state 1 have item S' -> .S
Let stack = <dummy, 1>
repeat
 case action[top_state(stack), I[j]] of
   shift k: push <I[j++], k>
   reduce X -> A:
     pop |A| pairs,
     push <X, goto[top_state(stack), X]>
   accept: halt normally
   error: halt and report error

Some common constructs are not SLR(1);
LR(1) is more powerful:
- Build lookahead into the items;
- $[T \to .int*T,\$]$ means:
  1. After seeing $T \to_{int}\ast T$ reduce if lookahead is $.
- More accurate than just using follow sets.