\authorinfo
Steven Obua
Recursive Mind (www.recursivemind.com)
[email protected]
Parameterized Local Lexing
Abstract
Building on the concept of local lexing the concept of parameterized local lexing is introduced.
1 Motivation
We previously introduced local lexing as a new semantics for integrated lexing and parsing, and provided an algorithm (based on Earley’s Algorithm) for it locallexing . This algorithm can handle arbitrary context-free grammars. In principle it can even deal with infinite grammars, but to harness this capability in practice we need a finite representation of such possibly infinite grammars. Parameterized local lexing delivers such a finite representation, and comes with an algorithm tuned to that representation. Of course the finite representation as presented in this paper is not the only possible one, but it is simple and intuitive and yet powerful.
2 Definition
We call a tuple
[TABLE]
a parameterized local lexing iff:
Φ is a non-empty set of parameters.
N and T are disjoint sets of nonterminals and terminals.
S∈N and ϕ\textslstart∈Φ are the start symbol and start parameter, respectively.
R is a set of parameterized rules. Each rule has the form
[TABLE]
where \textslN,\textslX1,…,\textslXk∈N, and
f1,…,fk,fk+1
are partial functions with the signatures f_{i}:\Phi^{2i-1}\mathrel{\ooalign{\hfil\mapstochar\mkern 5.0mu\hfil\cr\to\cr}}\Phi for i∈{1,…,k+1}.
Σ is a set of characters.
Lex is a function which for each (t,α,D,k) such that
t∈T, α∈Φ, D∈Σ∗ and k∈{0,…,∣D∣} returns a set consisting of tuples
(t,α,β,c) such that β∈Φ, c∈Σ∗, k+∣c∣≤∣D∣ and
ci=Dk+i for all i such that 0≤i≤∣c∣−1.
Sel takes two sets A,B⊆T×Φ×Φ×Σ∗
such that A⊆B and returns a set
\textslSel(A,B) such that A⊆\textslSel(A,B)⊆B.
3 Semantics
The semantics of parameterized local lexing is provided by translating it to ordinary local lexing.
The result is a context-free grammar (N,T,R,⊤)
together with a lexer \textslLex and selector \textslSel operating on the same character set Σ as the parameterized local lexing. The resulting nonterminals and terminals of the new grammar are defined as follows:
[TABLE]
For elements (N,α,β)∈N and (t,α,β)∈T we also
write Nβα and tβα, respectively. Here α is called the input parameter, and β is called
the output parameter.
The nonterminal ⊥ represents failure and therefore does not appear on the left hand side of any rule in R.
The nonterminal ⊤ is the start symbol of the new grammar and appears only on the left hand side of the rules in the following subset of R:
[TABLE]
Given u≥1 partial functions f1,…,fu with signatures f_{i}:\Phi^{2i-1}\mathrel{\ooalign{\hfil\mapstochar\mkern 5.0mu\hfil\cr\to\cr}}\Phi for i∈{1,…,u}, we define the set
⟨f1,…,fu⟩ to consist of all sequences ρ∈Φ2u such that
[TABLE]
With each parameterized rule r=\textslNfk+1→\textslX1f1…\textslXkfk we associate
its induced set of rules r⊆R. Firstly, r contains all rules
[TABLE]
such that αα1β1…αkβkβ∈⟨f1,…,fk+1⟩.
Secondly, we also need to somehow take into account those cases where some fi happens to be undefined on its arguments.
While these cases do not affect the language L
of the induced context-free grammar, they possibly do affect its prefix language Lprefix. We choose to let r contain also all rules
[TABLE]
such that
[TABLE]
The induced set of rules R is thus defined as
[TABLE]
Defining the induced lexer is straightforward:
[TABLE]
There is an obvious bijection U:T×Φ×Φ×Σ∗→T×Σ∗
between parameterized tokens and ordinary tokens, given by U(t,α,β,c)=(tβα,c). Lifted to a bijection between token sets, we use it to define the induced selector:
[TABLE]
A character sequence D∈Σ∗ is said to be in the character language of the parameterized local lexing iff ℓℓ(D)=∅ with respect to the induced local lexing. Furthermore, we can associate with each D∈Σ∗ the set of its outputs σ(D)⊆Φ via
[TABLE]
Obviously, ℓℓ(D)=∅ iff σ(D)=∅.
4 Algorithm
Given Φ, N, T and R are all finite, parameterized local lexing can be implemented simply by reducing it via its semantics to ordinary local lexing, and then applying the Earley-based local lexing algorithm (ELLA). This is rather awkward though as
the induced grammar is potentially much larger than the original grammar. In particular, ∣R∣ might be much larger than ∣R∣. Furthermore, we would also like to have the option of doing parameterized local lexing for an infinite set of parameters Φ like the set of natural numbers, but usually this would lead to an infinite set of rules R.
Luckily it is straightforward to derive from ELLA an algorithm for parameterized local lexing (PELLA) which avoids such a translation to ordinary local lexing but works directly on parameterized items. A parameterized item is a tuple (r,d,i,j,ρ),
such that r=\textslNfk+1→\textslX1f1…\textslXkfk∈R, 0≤d≤k and 0≤i≤j≤∣D∣, where D∈Σ∗ is the input under consideration. The sequence ρ∈Φ2(d+1) records the choices of parameters that have been made so far, and therefore we demand that
the invariant ρ∈⟨f1,…,fd+1⟩ holds for each parameterized item.
Before describing how PELLA operates on parameterized items, let us establish a correspondence between parameterized items and the items of the induced ordinary local lexing. Just how each parameterized rule corresponds to a set of ordinary rules, each parameterized item x corresponds to a set of ELLA items x. Firstly, for each rule
q=\textslNβα→\textslX1β1α1…\textslXkβkαk∈r the set x contains all items
(q,d,i,j) such that
[TABLE]
Here takens denotes the sequence resulting from taking the first n elements of the sequence s. Secondly, for each rule
q=\textslNβα→\textslX1β1α1…\textslXhβhαh ⊥∈r the set x contains all items
(q,d,i,j) such that d≤h−1 and
[TABLE]
Theorem 4.1**.**
Let x be a parameterized item. Then x=∅.
Proof.
Let x=(r,d,i,j,ρ), and r=\textslNfk+1→\textslX1f1…\textslXkfk.
We know that ρ∈⟨f1,…,fd+1⟩.
If there is ρ′∈Φ2(k−d) such that
ξ=ρρ′∈⟨f1,…,fk+1⟩ then x is not empty as
(\textslNξ2k+1ξ0→\textslXξ2ξ1…\textslXξ2kξ2k−1,d,i,j)∈x.
Otherwise there must be h with d+1≤h<k and ρ′∈Φ2(h−d−1) such that
ξ=ρρ′∈⟨f1,…,fh⟩, together with some γ∈Φ such that
fh+1 is undefined at ξγ. This means that for any β∈Φ, we have
(\textslNβξ0→\textslXξ2ξ1…\textslXγξ2h−1⊥,d,i,j)∈x.
∎
Given an item set I, we define its induced item set I as
[TABLE]
The idea of PELLA is that it operates on item sets I just as ELLA would operate on I. For each of the original building blocks
Init′, Predict′, Complete′, Tokens′ and Scan′ of ELLA we define the corresponding
building blocks Init, Predict, Complete, Tokens and Scan of PELLA as shown in Figure 1
(the use of the • notation for PELLA items is analogous to its use for ELLA items).
Apart from this adaptation of the building blocks to a parameterized setting, PELLA is defined via exactly the same equations as ELLA is, as shown in Figure 2.
Theorem 4.2** (Correctness of PELLA).**
[TABLE]
Indeed, we can prove the correctness of PELLA by exploiting the fact that PELLA simulates ELLA. The proof works by explaining in detail what we mean by this simulation.
To this end, we need a notion of correspondence between PELLA item sets I and ELLA item sets I′. Intuitively, I=I′ is an obvious candidate for
such a relation, but this does not work: I′ may contain items of the form (⊤→w1\makebox[7.0pt]\textbulletw2,i,j) or
(N→w\makebox[7.0pt]\textbullet⊥,i,j), both of which will never be in I. Therefore, we define the correspondence I∼I′ instead as I=Norm(I′)
where Norm(I′)={(N→w1\makebox[7.0pt]\textbulletw2,i,j)∈I′∣N=⊤∧w2=⊥}.
Indeed, with this choice of correspondence the formulas in Figure 3 hold and show that the building blocks of PELLA simulate the building blocks of ELLA.
In the following we will prove the formulas in Figure 3.
Theorem 4.3**.**
[TABLE]
Proof.
From the definitions it follows that
[TABLE]
We first show Norm(Predict′0Init′)⊆Init.
So assume that (Sβϕ\textslstart→\makebox[7.0pt]\textbulletw,0,0)∈Norm(Predict′0Init′). There must be r∈R such that
(Sβϕ\textslstart→w)∈r, and r=Sfk+1→\textslX1f1…\textslXkfk for some f1,…,fk+1
and some \textslX1,…,\textslXk.
If w does not end with ⊥, we know there must be α1,…,αk and β1,…,βk such that
(Sβϕ\textslstart→w)=(Sβϕ\textslstart→\textslX1β1α1…\textslXkβkαk), and we also know then that f1 is defined on ϕ\textslstart, and that
either k=0∧β=f1(ϕ\textslstart) or k>0∧α1=f1(ϕ\textslstart).
For the parameterized item x=(Sfk+1→\makebox[7.0pt]\textbullet\textslX1f1…\textslXkfk,0,0,ϕ\textslstartf1(ϕ\textslstart)) this implies
that x∈Init, and that (Sβϕ\textslstart→\makebox[7.0pt]\textbulletw,0,0)∈x⊆Init.
If on the other hand w ends with ⊥, then k>0 and there must be h∈{1,…,k}, α1,…,αh and β1,…,βh such that
(Sβϕ\textslstart→w)=(Sβϕ\textslstart→\textslX1β1α1…\textslXhβhαh⊥). We also know then that f1 is defined on ϕ\textslstart, and that
α1=f1(ϕ\textslstart). For the parameterized item x=(Sfk+1→\makebox[7.0pt]\textbullet\textslX1f1…\textslXkfk,0,0,ϕ\textslstartf1(ϕ\textslstart)) this implies that x∈Init, and that (Sβϕ\textslstart→\makebox[7.0pt]\textbulletw,0,0)∈x⊆Init.
We now prove Init⊆Norm(Predict′0Init′). So assume x′∈Init.
Then there must be x∈Init such that x′∈x. We know that
[TABLE]
for some \textslX1,…,\textslXu and some f1,…,fu. From this x′∈Norm(Predict′0Init′)
immediately follows.
∎
Theorem 4.4**.**
Let x be a parameterized item such that
[TABLE]
(and v≥d+1). Then for each γ∈Φ there is x′∈x with
[TABLE]
for some β∈Φ and some w∈(N∪T)∗.
Proof.
The proof is basically the same as the proof of Theorem 4.1, only that instead of extending ρ we extend ργ this time.
∎
Theorem 4.5**.**
Assume (\textslNβα→w)∈R.
If w ends with ⊥ then this implies ∣w∣≥2.
Furthermore, for any d with
[TABLE]
and for any 0≤i≤j≤∣D∣ there is a parameterized item
[TABLE]
such that (\textslNβα→w,d,i,j)∈x.
Proof.
Assume w does not end with ⊥. Then there is a parameterized rule
r=\textslNf∣w∣+1→\textslX1f1…\textslX∣w∣f∣w∣∈R
such that (\textslNβα→w)∈r, and there are α1,β1,…,α∣w∣,β∣w∣ such that
w=\textslX1β1α1…\textslX∣w∣β∣w∣α∣w∣ and
αα1β1…α∣w∣β∣w∣β∈⟨f1,…,f∣w∣+1⟩. Setting
x=(r,d,i,j,ρ) where ρ=take2(d+1)(αα1β1…α∣w∣β∣w∣β)
yields the desired result.
On the other hand, assume w ends with ⊥. Then there is a parameterized rule
r=\textslNfk+1→\textslX1f1…\textslXkfk∈R
such that (\textslNβα→w)∈r and 1≤∣w∣−1≤k. In particular there are
α1,β1,…,α∣w∣−1,β∣w∣−1 such that w=\textslXβ1α1…\textslXβ∣w∣−1α∣w∣−1⊥
and αα1β1…α∣w∣−1∈⟨f1,…,f∣w∣−1⟩.
Setting x=(r,d,i,j,ρ) where ρ=take2(d+1)(αα1β1…α∣w∣−1)
yields the desired result in this case as well.
∎
Theorem 4.6**.**
Assume I∼I′. Then
[TABLE]
Proof.
We first prove PredictkI⊆Norm(Predict′kI′), so assume
x′∈PredictkI. Then there must be x∈PredictkI such that
x′∈x. If x∈I then because of I∼I′ it automatically follows that
x′∈Norm(I′)⊆Norm(Predict′kI′). So assume x∈/I. This means that
[TABLE]
and there is a parameterized item y where
[TABLE]
such that \textslXd+1=\textslM. From x′∈x it follows that
[TABLE]
for some γ∈Φ and some a∈(N∪T)∗. From Theorem 4.4 we obtain
an item y′ with
[TABLE]
for some β∈Φ and some w∈(N∪T)∗. Because I∼I′ implies
y′∈I′ it follows immediately that x′∈Predict′kI′. As x′ is an induced item, this implies
x′∈Norm(Predict′kI′).
Now we prove Norm(Predict′kI′)⊆PredictkI. Assume
x′∈Norm(Predict′kI′). If x′∈I′ then x′∈I and we are done, so assume x′∈/I′.
This means that x′=(\textslMγα→\makebox[7.0pt]\textbulleta,k,k) for some \textslMγα→a∈R,
and there is y′=(\textslNδβ→b\makebox[7.0pt]\textbullet\textslMγα,i,k)∈I′. This means that there is
y∈I with y′∈y. From Theorem 4.5 we obtain an x with x′∈x and it follows straight
from the definitions that x∈Predictk{y}.
∎
Theorem 4.7**.**
Assume I∼I′. Then
[TABLE]
Proof.
We first prove CompletekI⊆Norm(Complete′kI′), so assume
x′∈CompletekI. Then there must be x∈CompletekI such that
x′∈x. We can assume x∈/I as otherwise the result follows immediately. This means that there are parameterized items y and z
such that x, y and z have the shape
[TABLE]
such that \textslXd+1=M and ρ2d+1=ξ0
and ρ′=ρξ2u+1.
Moving the dot in x′ one position to the left results in an item y′, and obviously y′∈y.
From Theorem 4.1 we obtain z′∈z. Because of I∼I′ we have
{y′,z′}⊆Norm(I′). Furthermore it is easy to see that
x′∈Complete′k{y′,z′}. This concludes this direction of the proof.
For the other direction
assume x′∈Norm(Complete′kI′). We need to show that x′∈CompletekI.
We can assume x′∈/I′ as otherwise x′∈I follows immediately. Then there exist y′,z′∈I′
such that x′, y′ and z′ have the shape
[TABLE]
where a=⊥ as x′∈Norm(…). Because of I∼I′ there exist y,z∈I with y′∈y, z′∈z and
[TABLE]
Because of a=⊥ we know that fd+2 is defined on ρ′=ρβd+1 and thus x∈Completek{y,z} where
[TABLE]
It is clear that x′∈x, which concludes the proof.
∎
Theorem 4.8**.**
Assume I∼I′. Then
[TABLE]
Proof.
Let us first introduce two abbreviations V and V′:
[TABLE]
Unfolding the definitions of Tokens and Tokens′ shows that the theorem statement is equivalent to
[TABLE]
which according to the definition of \textslSel is equivalent to
[TABLE]
This means that if we can prove
[TABLE]
we have proven the theorem.
We first prove U(V)⊆V′, so assume (tβα,c)∈U(V), or equivalently
(t,α,β,c)∈V. This means that there is a parameterized item
[TABLE]
with \textslXd+1=t, ρ2d+1=α and (t,α,β,c)∈\textslLex(t,α,D,k).
Theorem 4.4 yields an item x′∈x⊆I⊆I′ such that
[TABLE]
for some δ and some w. This proves (tβα,c)∈V′.
To prove V′⊆U(V), assume (tβα,c)∈V′. Then there must be an item
[TABLE]
such that (tβα,c)∈\textslLex(tβα)(D,k), or equivalently
[TABLE]
Because the dot in x′ appears before a terminal symbol we know that x′∈Norm(I′), thus there is an x∈I with
[TABLE]
such that x′∈x. This implies \textslX∣a∣+1=t and ρ2∣a∣+1=α. This proves (t,α,β,c)∈V.
∎
Theorem 4.9**.**
Assume I∼I′. Then
[TABLE]
Proof.
The proof is straightforward and similar to the proof of Theorem 4.7.
∎
Apart from using parameterized building blocks, PELLA as shown in Figure 2 is identical to ELLA as shown in Figure 4. Given that we now know that
all of PELLA’s building blocks simulate those of ELLA as described in Figure 3, it is easy to see that the relationship I∼I′ holds for the
result I of PELLA and the result I′ of ELLA. We show this in the following through a sequence of simple theorems.
Theorem 4.10**.**
Assume Iu∼Iu′ for all u∈{0,1,2,…}. Then
[TABLE]
Proof.
[TABLE]
∎
We say that a function f simulates a function g iff I∼I′ implies fI∼gI′ for any I and I′.
Theorem 4.11**.**
If f simulates f′ and g simulates g′ then g∘f simulates g′∘f′.
Proof.
From I∼I′ follows fI∼f′I′, and from that
g(fI)∼g′(f′I′) follows. ∎
Theorem 4.12**.**
Assume that f simulates g. Then for any n∈{0,1,2,…} we have that fn simulates gn.
Proof.
This is an immediate consequence of Theorem 4.11.∎
Theorem 4.13**.**
Assume that f simulates g. Then limitf simulates limitg.
Proof.
We need to show that from I∼I′ it follows that
[TABLE]
which follows immediately from Theorems 4.10 and 4.12.
∎
Theorem 4.14**.**
πkT* simulates πk′ U(T).*
Proof.
We know that Predictk simulates Predict′k (Theorem 4.6),
Completek simulates Complete′k (Theorem 4.7) and
ScanTk simulates Scan′U(T)k (Theorem 4.9).
From this and Theorems 4.11 and 4.13 the result follows immediately.
∎
Theorem 4.15**.**
J00∼J′00**
Proof.
We have π0′∅Init′=π0′∅(Predict′0Init′), which together
with Theorems 4.3 and 4.14 implies J00∼J′00.
∎
Theorem 4.16**.**
Assume Jk0∼J′k0. Then
[TABLE]
Proof.
We first prove Jku∼J′ku and U(Tku)=T′ku by induction over u. The base case u=0 follows trivially
from our assumption Jk0∼J′k0 and the fact that U(∅)=∅. For the induction step u→u+1
we apply Theorem 4.8 and the induction hypothesis and obtain U(Tku+1)=T′ku+1. From this, together with
Theorem 4.14 and the induction hypothesis,
we obtain Jku+1∼J′ku+1.
From \mathcal{J}_{k}^{u}\sim{\mathcal{J}^{\prime}}_{k}^{\,u}\quad\text{for all u\in{0,1,2,\ldots}} and Theorem 4.10 follows then immediately Ik∼Ik′.
∎
Theorem 4.17**.**
[TABLE]
Proof.
The first three formulas follow immediately by induction over k and application of Theorem 4.15 and Theorem 4.16 for the base case, and Theorem 4.14
and Theorem 4.16 for the induction step, respectively.
We then deduce I=I∣D∣∼I∣D∣′=I′.
∎
We are now in a position to prove the correctness of PELLA.
Proof of Theorem 4.2.
We first introduce the abbreviation
[TABLE]
Then our goal is to prove σ(D)=P.
To prove σ(D)⊆P, assume β∈σ(D). Then there is a path p∈ℓℓ(D) such that
⊤⇒Sβϕ\textslstart⇒∗ [p]. This means that there is an item
x′=(Sβϕ\textslstart→w,0,∣D∣) which is p-valid, i.e. x′∈⟨P⟩.
From the correctness proof of ELLA we know that ⟨P⟩=I′ and thus x′∈I′, and the shape of x′ implies even x′∈Norm(I′). Therefore Theorem 4.17 allows us to deduce x′∈I, and this
means that there is an x∈I such that x′∈x and such that x has the shape
[TABLE]
From x′∈x follows immediately that ρ0=ϕ\textslstart and ρ2u+1=β, and therefore
β∈P.
To prove the other direction, P⊆σ(D), assume we have an x∈I such that
x=(Sfu+1→\textslX1f1…\textslXufu\makebox[7.0pt]\textbullet,0,∣D∣,ρ)
and ρ0=ϕ\textslstart. We need to prove ρ2u+1∈σ(D). Theorem 4.1 tells us that
there is an x′∈x, and this x′ has necessarily the form
[TABLE]
From Theorem 4.17 and the correctness of ELLA follows
x′∈x⊆I⊆I′=⟨P⟩.
Thus there is a path p∈P such that x′ is p-valid, i.e. there is v∈{0,…,∣p∣} such that
[TABLE]
As part of the correctness proof of ELLA we have previously seen that if we drop the first v empty tokens of p, the result
q=pv…p∣p∣−1 will still be in P (Theorem 5.2 in locallexing ). We deduce
[TABLE]
and q∈ℓℓ(D), and thus ρ2u+1∈σ(D).
∎