Learning with Partially Ordered Representations

Jane Chandlee; Rémi Eyraud; Jeffrey Heinz; Adam Jardine; Jonathan Rawski

arXiv:1906.07886·cs.FL·June 25, 2019


TL;DR

This paper introduces a novel approach to grammar learning using partially ordered string representations, enabling more flexible modeling of shared properties at string positions and improving learning efficiency.

Contribution

It presents a new model-theoretic framework for grammars with shared, multi-property positions and an algorithm that efficiently learns the most general grammar from positive examples.

Findings

01

Structures are shown to be partially ordered.

02

The learning algorithm effectively prunes the hypothesis space.

03

It finds the most general grammar covering the data.

Abstract

This paper examines the characterization and learning of grammars defined with enriched representational models. Model-theoretic approaches to formal language theory traditionally assume that each position in a string belongs to exactly one unary relation. We consider unconventional string models where positions can have multiple, shared properties, which are arguably useful in many applications. We show the structures given by these models are partially ordered, and present a learning algorithm that exploits this ordering relation to effectively prune the hypothesis space. We prove this learning algorithm, which takes positive examples as input, finds the most general grammar which covers the data.


Learning with Partially Ordered Representations

Jane Chandlee

Tri-Co Department of Linguistics

Haverford College

[email protected]

Rémi Eyraud

QARMA team, LIS

Aix-Marseille University

[email protected]

Jeffrey Heinz

Department of Linguistics

Institute for Advanced Computational Science

Stony Brook University

[email protected]

Adam Jardine

Department of Linguistics

Rutgers University

[email protected]

Jonathan Rawski

Department of Linguistics

Institute for Advanced Computational Science

Stony Brook University

[email protected]


1 Introduction

Foundational connections between formal languages, finite-state automata, and logic have been known for decades (Büchi, 1960; Thomas, 1997). Logical approaches are advantageous since they flexibly admit different representations. In many domains, such as biological sequencing or linguistics, shared properties of symbols in sequences provide information currently ignored by string-based inference algorithms, which largely focus on learning automata (de la Higuera, 2010). Here we explore the idea that domain-specific knowledge can be encoded representationally via model theory (Libkin, 2004), and show how these representations can facilitate pattern learning.

This paper synthesizes results in grammatical inference and model theory to present a novel algorithm which learns classes of formal languages using enriched representations of strings. In fact, our model-theoretic approach immediately generalizes these results to arbitrary data structures. Here we are concerned with the learning of those formal languages which can be defined via a set of structural constraints, such as the Strictly $k$ -Local and Strictly $k$ -Piecewise languages (Rogers and Pullum, 2011; Rogers et al., 2010). Models of strings in the languages must not contain these forbidden structures (Rogers et al., 2013). Specifically, we define a learner whose hypothesis space is structured as a partial order by the relational signature of the particular model theory. We show how to traverse this space bottom-up from positive data to find a grammar which covers the data with the most general constraints.

The paper is structured as follows: Section 2 provides mathematical preliminaries in model theory. Section 3 characterizes ordering relations over these structures. Section 4 generalizes the grammars employed in string extension and lattice-based learning (Heinz, 2010; Heinz et al., 2012) to show how these model theoretic structures can define classes of formal languages. Section 5 discusses some entailments our learning algorithm takes advantage of. Section 6 defines a learning problem and criteria for selecting adequate solutions. Section 7 presents a general-to-specific, bottom-up algorithm which provably satisfies the learning criteria. Section 8 concludes the paper.

2 Preliminaries

2.1 Elements of Language Theory

The set of all possible finite strings of symbols from a finite alphabet $\Sigma$ and the set of strings of length $\leq n$ are $\Sigma^{*}$ and $\Sigma^{\leq n}$ , respectively. The unique empty string is represented with $\lambda$ . The length of a string $w$ is $|w|$ , so $|\lambda|=0$ . If $u$ and $v$ are two strings then we denote their concatenation with $uv$ . If $w$ is a string and $\sigma$ is the $i$ th symbol in $w$ then $w_{i}=\sigma$ , so $(abcd)_{2}=b$ .

The set of prefixes of $w$ , $\mathtt{Pref}(w)$ , is $\{p\in\Sigma^{*}\mid(\exists s\in\Sigma^{*})[w=ps]\}$ , the set of suffixes of $w$ , $\mathtt{Suff}(w)$ , is $\{s\in\Sigma^{*}\mid(\exists p\in\Sigma^{*})[w=ps]\}$ , the set of substrings, $\mathtt{Substr}(w)$ , is $\{u\in\Sigma^{*}\mid(\exists l,r\in\Sigma^{*})[w=lur]\}$ , and the set of subsequences, $\mathtt{Subseq}(w)$ , is $\{u_{1}u_{2}\cdots u_{n}\mid(\exists v_{0},v_{1},\ldots,v_{n}\in\Sigma^{*})[w=v_{0}u_{1}v_{1}\cdots u_{n}v_{n}]\}$ .
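The four sets just defined can be enumerated directly for short strings. The following is a minimal sketch in Python; the function names are illustrative choices and are not notation from the paper.

```python
from itertools import combinations

def prefixes(w):
    # Pref(w): every p such that w = ps for some s.
    return {w[:i] for i in range(len(w) + 1)}

def suffixes(w):
    # Suff(w): every s such that w = ps for some p.
    return {w[i:] for i in range(len(w) + 1)}

def substrings(w):
    # Substr(w): every u such that w = lur for some l, r.
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

def subsequences(w):
    # Subseq(w): keep any subset of positions, preserving their order.
    return {"".join(w[i] for i in idx)
            for n in range(len(w) + 1)
            for idx in combinations(range(len(w)), n)}
```

Note that every substring is a subsequence but not conversely: `"ac"` is a subsequence of `"abc"` but not a substring of it. The subsequence enumeration is exponential in $|w|$ and is only meant for small examples.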

2.2 Elements of Finite Model Theory

Model theory, combined with logic, provides a powerful way to study and understand mathematical objects with structures (Enderton, 2001). In this paper we only consider finite relational models (Libkin, 2004) of strings in $\Sigma^{*}$ .

Definition 1 (Models).

A model signature is a tuple $S=\langle D;R_{1},R_{2},\ldots,R_{m}\rangle$ where the domain $D$ is a finite set, and each $R_{i}$ is an $n_{i}$ -ary relation over the domain. A model for a set of objects $\Omega$ is a total, one-to-one function from $\Omega$ to structures whose type is given by a model signature.

For example, a conventional model for strings in $\Sigma^{*}$ is given by the signature $\Gamma^{\lhd}\overset{def}{=}\langle D;\lhd,[R_{\sigma}]_{\sigma\in\Sigma}\rangle$ and the function $M^{\lhd}:\Sigma^{*}\to\Gamma^{\lhd}$ such that $M^{\lhd}(w)\overset{def}{=}\langle D^{w};\lhd,[R^{w}_{\sigma}]_{\sigma\in\Sigma}\rangle$ where $D^{w}\overset{def}{=}\{1,\ldots,|w|\}$ is the domain, $\lhd\overset{def}{=}\{(i,i+1)\in D\times D\}$ is the successor relation which orders the elements of the domain, and $[R^{w}_{\sigma}]_{\sigma\in\Sigma}$ is a set of $|\Sigma|$ unary relations such that for each $\sigma\in\Sigma$ , $R^{w}_{\sigma}\overset{def}{=}\{i\in D^{w}\mid w_{i}=\sigma\}$ . We will usually omit the superscript $w$ since it will be clear from the context.

For example, with $\Sigma=\{a,b,c\}$ and the model above for strings, we have $M^{\lhd}(abba)=\big{\langle}D=\{1,2,3,4\};\lhd=\{(1,2),(2,3),(3,4)\},R_{a}=\{1,4\},R_{b}=\{2,3\},R_{c}=\emptyset\big{\rangle}~{}.$

Figure 1 illustrates $M^{\lhd}(abba)$ on the left.

Another conventional model is the precedence model, with the signature $\Gamma^{<}\overset{def}{=}\langle D;<,[R_{\sigma}]_{\sigma\in\Sigma}\rangle$ . It differs from the successor model only in that the order relation is defined with general precedence $<\overset{def}{=}\{(i,j)\in D\times D\mid i<j\}$ (Büchi, 1960; McNaughton and Papert, 1971; Rogers et al., 2013). Under this signature, the string $abba$ has the following model.

$M^{<}(abba)=\big{\langle}D=\{1,2,3,4\};<=\{(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)\},R_{a}=\{1,4\},R_{b}=\{2,3\},R_{c}=\emptyset\big{\rangle}$ .

Figure 1 illustrates $M^{<}(abba)$ on the right.
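As a concrete illustration of the two conventional models, the sketch below builds $M^{\lhd}(w)$ and $M^{<}(w)$ as plain Python data. The representation (a triple of a domain set, a set of ordered pairs, and a dictionary of unary relations) is an assumption made for this example only.

```python
def successor_model(w, alphabet):
    # Domain {1, ..., |w|}, successor pairs (i, i+1), one unary relation per symbol.
    domain = set(range(1, len(w) + 1))
    succ = {(i, i + 1) for i in range(1, len(w))}
    labels = {s: {i for i in domain if w[i - 1] == s} for s in alphabet}
    return domain, succ, labels

def precedence_model(w, alphabet):
    # Same domain and unary relations, but the order is general precedence i < j.
    domain = set(range(1, len(w) + 1))
    prec = {(i, j) for i in domain for j in domain if i < j}
    labels = {s: {i for i in domain if w[i - 1] == s} for s in alphabet}
    return domain, prec, labels
```

Calling `successor_model("abba", "abc")` reproduces the structure written out above for $M^{\lhd}(abba)$, and `precedence_model("abba", "abc")` reproduces $M^{<}(abba)$.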

The model-theoretic framework is not unique to strings. It extends to arbitrary data structures by expanding parts of the model signature. For example, Rogers (2003) describes a model-theoretic characterization of trees of arbitrary dimensionality where the domain $D$ is a Gorn tree domain (Gorn, 1967). This is a hereditarily prefix closed set $D$ of node addresses, that is to say, for every $d\in D$ with $d=\alpha i$ , where $i\in\mathbb{N}$ , $\alpha\in\mathbb{N}^{*}$ it holds that $\alpha\in D$ , and for every $d\in D$ with $d=\alpha i\neq\alpha 0,$ then $\alpha(i-1)\in D$ .

In this view, a string may be called a one-dimensional or unary-branching tree, since it has one axis along which its nodes are ordered. In a standard tree model signature, the set of nodes is ordered by two binary relations, ``dominance'' and ``immediate left-of''. Suppose $s$ is the mother of two nodes $t$ and $u$ in some standard tree, and also assume that $t$ precedes $u$ . Then we might say that $s$ dominates the string $tu$ . Standard or two-dimensional trees, then, relate nodes to one-dimensional trees (strings) by immediate dominance. A three-dimensional tree relates nodes to two-dimensional, i.e. standard trees, corresponding to Tree-Adjoining Grammar derivations. In general, a $d$ -dimensional tree is a set of nodes ordered by $d$ dominance relations such that the $n$ -th dominance relation relates nodes to $(n-1)$ -dimensional trees (for $d=1$ , single nodes are zero-dimensional trees).

While a Gorn tree domain as written encodes these dominance and precedence relations implicitly, we may explicitly write them out model-theoretically so that a signature for a $\Sigma$ -labeled 2- $d$ tree is $\Gamma^{\lhd\prec}=\langle D;\lhd,\prec,[R_{\sigma}]_{\sigma\in\Sigma}\rangle$ where $\lhd$ is the ``immediate dominance'' relation and $\prec$ is the ``immediate left-of'' relation. Model signatures that include transitive closure relations of each of these have also been studied.

2.3 Unconventional Word Models

Whereas Rogers (2003) generalized conventional word models to trees, here we generalize word models in a different way. Conventional string models are the successor and precedence models introduced previously. What makes these models conventional is the unary relations which essentially label each domain element with a single, mutually exclusive, property: the property of being some $\sigma\in\Sigma$ .

In contrast, unconventional models for strings recognize that distinct alphabetic symbols may share properties, and expand the model signature by including these properties as unary relations (Strother-Garcia et al., 2016; Vu et al., 2018). For example, a conventional model of $\Sigma=\{\mathtt{a,\ldots,z,A,\ldots,Z}\}$ would include 52 unary relations, one for each lowercase and capital letter. On the other hand, an unconventional model might only include 27: 26 for the letters, and one unary relation Capital. Then, letters A and a share the `A' property and A additionally has the property of being Capital.
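The 27-relation model described here can be sketched as follows. This is a hypothetical encoding for illustration: the function name `capital_model` and the key `"capital"` are assumptions, and the successor order is used only for concreteness.

```python
import string

def capital_model(w):
    # Unconventional successor model over a-z and A-Z:
    # 26 letter relations shared by both cases, plus one "capital" relation.
    domain = set(range(1, len(w) + 1))
    succ = {(i, i + 1) for i in range(1, len(w))}
    unary = {c: {i for i in domain if w[i - 1].lower() == c}
             for c in string.ascii_lowercase}
    unary["capital"] = {i for i in domain if w[i - 1].isupper()}
    return domain, succ, unary
```

For the string `"Aa"`, both positions satisfy the letter relation `a`, while only position 1 satisfies `capital`, so a single position can carry more than one unary property.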

In linguistics, speech sounds are commonly decomposed into binary features based on their phonetic properties. So the set of segments $\{$ z,Z,d,b,g,… $\}$ all share the property +Voice, meaning the vocal cords are activated, while the segments $\{$ s,S,t,p,k,… $\}$ share the property -Voice, meaning the vocal cords are not activated. Thus unconventional models may refer to individual features in defining grammatical constraints, rather than each individual segment.

Different representations of strings and trees provide a unified perspective on well-known subclasses of the regular languages from a model-theoretic and logical perspective (Thomas, 1997; Rogers et al., 2013). However, they also open up new doors for grammatical inference by allowing one to consider other models for strings (Strother-Garcia et al., 2016; Vu et al., 2018).

3 Subfactors, Superfactors, Ideals and Filters

We sometimes refer to the model of a string $w$ as a structure. However, structures are more general in that they correspond to any mathematical structure conforming to the model signature. As such, while a model of a string $w$ will always be a structure, a structure will not always be a model of a string $w$ . The size of a structure $S$ , denoted $|S|$ , coincides with the cardinality of its domain.

We next wish to introduce a partial ordering over structures. To do so, we must define the terms connected, restriction, and factor. For each structure $S=\langle D;\lhd,R_{1},\ldots R_{n}\rangle$ let the binary ``connectedness'' relation $C$ be defined as follows.

$C\overset{def}{=}\big{\{}(x,y)\in D\times D\mid\exists i\in\{1\ldots n\},\exists(x_{1}\ldots x_{m})\in R_{i},\exists s,t\in\{1\ldots m\},x=x_{s},y=x_{t}\big{\}}$

Informally, domain elements $x$ and $y$ belong to $C$ provided they belong to some non-unary relation. Let $C^{*}$ denote the symmetric transitive closure of $C$ .

Definition 2 (Connected structure).

A structure $S=\langle D;\lhd,R_{1},R_{2},\ldots,R_{n}\rangle$ is connected iff for all $x,y\in D$ , $(x,y)\in C^{*}$ .

For example, $M^{\lhd}(abba)$ above is a connected structure. However, the structure $S_{ab,~{}ba}$ shown below, which is identical to $M^{\lhd}(abba)$ except that it omits the pair $(2,3)$ from the order relation, is not connected, since none of $(1,3)$ , $(1,4)$ , $(2,3)$ , or $(2,4)$ belongs to $C^{*}$ . $S_{ab,~{}ba}=\big{\langle}D=\{1,2,3,4\};\lhd=\{(1,2),(3,4)\},R_{a}=\{1,4\},R_{b}=\{2,3\},R_{c}=\emptyset\big{\rangle}$

[Figure: the structure $S_{ab,~{}ba}$ , with positions 1 to 4 labeled a, b, b, a and $\lhd$ edges only from 1 to 2 and from 3 to 4.]

Note that no string in $\Sigma^{*}$ has structure $S_{ab,~{}ba}$ as its model.
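Connectedness in the sense of Definition 2 can be checked by merging all domain elements that occur together in some tuple and then counting the resulting components. The sketch below uses a small union-find for this. It assumes relations are given as sets of tuples (unary relations as 1-tuples) and, as a simplifying assumption, treats the empty and one-element domains as connected.

```python
def is_connected(domain, relations):
    # relations: iterable of sets of tuples over the domain.
    parent = {x: x for x in domain}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for rel in relations:
        for tup in rel:
            root = find(tup[0])
            for x in tup[1:]:
                # Elements sharing a tuple end up in the same component.
                parent[find(x)] = root
    return len({find(x) for x in domain}) <= 1
```

On the two structures above, `is_connected` accepts $M^{\lhd}(abba)$ and rejects $S_{ab,ba}$, because removing the pair $(2,3)$ splits the domain into the components $\{1,2\}$ and $\{3,4\}$.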

Definition 3.

$A=\langle D^{A};\lhd,R_{1}^{A},\ldots,R_{n}^{A}\rangle$ is a restriction of $B=\langle D^{B};\lhd,R_{1}^{B},\ldots,R_{n}^{B}\rangle$ iff $D^{A}\subseteq D^{B}$ and for each $m$ -ary relation $R_{i}$ , we have $R_{i}^{A}=\{(x_{1}\ldots x_{m})\in R_{i}^{B}\mid x_{1},\ldots,x_{m}\in D^{A}\}$ .

Informally, one identifies a subset $A$ of the domain of $B$ and strips $B$ of all elements and relations which are not wholly within $A$ . What is left is a restriction of $B$ to $A$ .
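Definition 3 amounts to filtering every relation down to the tuples that lie wholly inside the chosen sub-domain. A minimal sketch, assuming relations are stored in a dictionary from (hypothetical) relation names to sets of tuples:

```python
def restrict(relations, sub_domain):
    # Keep only the tuples whose components all belong to sub_domain.
    return {name: {t for t in tuples if all(x in sub_domain for x in t)}
            for name, tuples in relations.items()}

# M^<|(abba) with unary relations written as 1-tuples.
M_ABBA = {
    "succ": {(1, 2), (2, 3), (3, 4)},
    "a": {(1,), (4,)},
    "b": {(2,), (3,)},
    "c": set(),
}
```

Restricting `M_ABBA` to the sub-domain `{2, 3}` leaves the single successor pair `(2, 3)` and the two `b` positions, which is the model of the substring $bb$ up to renaming of positions.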

Definition 4.

Structure $A$ is a subfactor of structure $B$ ( $A\sqsubseteq B$ ) if $A$ is connected, there exists a restriction of $B$ denoted $B^{\prime}$ , and there exists $h:A\rightarrow B^{\prime}$ such that for all $a_{1},\ldots a_{m}\in A$ and for all $R_{i}$ in the model signature: if $h(a_{1}),\ldots h(a_{m})\in B^{\prime}$ and $R_{i}(a_{1},\ldots a_{m})$ holds in $A$ then $R_{i}(h(a_{1}),\ldots h(a_{m}))$ holds in $B^{\prime}$ . If $A\sqsubseteq B$ we also say that $B$ is a superfactor of $A$ .

In other words, properties that hold of the connected structure $A$ also hold in a related way within $B$ .

If $A\sqsubseteq B$ and $|A|=k$ then we say $A$ is a $k$ -subfactor of $B$ . For all $w\in\Sigma^{*}$ , and for any model $M$ of $\Sigma^{*}$ , let the subfactors of $w$ be $\mathtt{Subfact}(M,w)=\{A\mid A\sqsubseteq M(w)\}$ and the $k$ -subfactors of $w$ be $\mathtt{Subfact}_{k}(M,w)=\{A\mid A\sqsubseteq M(w),~{}|A|\leq k\}$ . We also define $\mathtt{Subfact}(M,\Sigma^{*})$ to be $\bigcup_{w\in\Sigma^{*}}\mathtt{Subfact}(M,w)$ and $\mathtt{Subfact}_{k}(M,\Sigma^{*})$ to be $\bigcup_{w\in\Sigma^{*}}\mathtt{Subfact}_{k}(M,w)$ . When $M$ is understood from context, we write $\mathtt{Subfact}(w)$ instead of $\mathtt{Subfact}(M,w)$ . We define the sets of superfactors $\mathtt{Supfact}(M,w)$ and $\mathtt{Supfact}(M,\Sigma^{*})$ similarly.

Observe that $(\mathtt{Subfact}(M,w),\sqsubseteq)$ is a partially ordered set (poset). The next definitions and remark establish that models of strings are principal elements of ideals and filters.

Definition 5 (Ideals).

A subset $I$ of a poset is an ideal iff

• $I$ is non-empty;

• for every $x$ in $I$ , $y\leq x$ implies that $y$ is in $I$ ; and

• for every $x,y$ in $I$ , there exists some element $z$ in $I$ such that $x\leq z$ and $y\leq z$ .

The dual of an ideal is a filter.

Definition 6 (Filters).

A subset $F$ of a poset is a filter iff

• $F$ is non-empty;

• for every $x$ in $F$ , $x\leq y$ implies that $y$ is in $F$ ; and

• for every $x,y$ in $F$ , there exists some element $z$ in $F$ such that $z\leq x$ and $z\leq y$ .

Definition 7 (Principal Ideals, Filters and Elements).

For any poset $\langle X,\leq\rangle$ , the smallest filter containing $x\in X$ is a principal filter and $x$ is the principal element of this filter. Similarly, the smallest ideal containing $x\in X$ is a principal ideal and $x$ is the principal element of this ideal.

Remark 1.

Given a model $M$ of $\Sigma^{*}$ and $k>0$ , $\mathtt{Subfact}_{k}(M,w)$ is a principal ideal in $\mathtt{Subfact}(M,\Sigma^{*})$ whose principal element is $M(w)$ . $\mathtt{Supfact}_{k}(M,w)$ is a principal filter in $\mathtt{Supfact}(M,\Sigma^{*})$ whose principal element is $M(w)$ . The empty structure $\langle\emptyset;\emptyset,\ldots\emptyset\rangle$ is a subfactor of every structure in $\mathtt{Subfact}(M,\Sigma^{*})$ .

The next two propositions show how this representational perspective unifies the treatment of substrings and subsequences. They are subfactors under the successor and precedence models, respectively. A string $x=x_{1}\cdots x_{n}$ is a substring of $y$ iff there exist $l,r$ such that $y=lxr$ . String $x$ is a subsequence of $y$ iff there exist $v_{0},v_{1},\ldots v_{n}$ such that $y=v_{0}x_{1}v_{1}\cdots x_{n}v_{n}$ .

Proposition 1 (Substrings are subfactors under $M^{\lhd}$ ).

For all strings $x,y\in\Sigma^{*}$ , $x$ is a substring of $y$ iff $M^{\lhd}(x)\sqsubseteq M^{\lhd}(y)$ .

Proof.

Note that the result trivially holds for $x=\lambda$ : we restrict ourselves to the case $x\neq\lambda$ . Let $M^{\lhd}(x)=\langle D^{x};\lhd,[R^{x}_{\sigma}]\rangle$ and $M^{\lhd}(y)=\langle D^{y};\lhd,[R^{y}_{\sigma}]\rangle$

( $\Rightarrow$ ). Suppose $x$ is a substring of $y$ : there exist $l,r$ such that $y=lxr=\sigma_{1}\ldots\sigma_{|l|}\sigma_{|l|+1}\ldots\sigma_{|l|+|x|}\sigma_{|l|+|x|+1}\ldots\sigma_{|l|+|x|+|r|}$ . This implies that, for all $i$ , $1\leq i\leq|x|$ , $d\in R^{y}_{\sigma_{|l|+i}}$ iff $d\in R^{x}_{\sigma_{i}}$ . Thus, if we set the isomorphism $\phi$ to be such that $\phi(i)=|l|+i$ for $1\leq i\leq|x|$ , we have that $\phi(M^{\lhd}(x))$ is a restriction of $M^{\lhd}(y)$ , and therefore $M^{\lhd}(x)\sqsubseteq M^{\lhd}(y)$ by definition.

( $\Leftarrow$ ). Let $y$ be the sequence of letters $\sigma_{1}\ldots\sigma_{|y|}$ and suppose $M^{\lhd}(x)\sqsubseteq M^{\lhd}(y)$ : there exists an isomorphism $\phi:\{1,\ldots,|x|\}\to\{1,\ldots,|y|\}$ such that $\phi(M^{\lhd}(x))$ is a restriction of $M^{\lhd}(y)$ . This means that $\phi(D^{x})\subseteq D^{y}$ and for all $\sigma$ : $\phi(R^{x}_{\sigma})=\{\phi(i)\in R^{y}_{\sigma}\mid\phi(i)\in\phi(D^{x})\}$ (Definition 3). This implies that $x=\sigma_{\phi(1)}\ldots\sigma_{\phi(|x|)}$ . Given that $\lhd=\{(i,i+1)\in D\times D\}$ , we have $\phi(i+1)=\phi(i)+1$ and thus there exist $l$ and $r$ in $\Sigma^{*}$ such that $y=l\sigma_{\phi(1)}\ldots\sigma_{\phi(|x|)}r=lxr$ . ∎

Proposition 2 (Subsequences are subfactors under $M^{<}$ ).

For all strings $x,y\in\Sigma^{*}$ , $x$ is a subsequence of $y$ iff $M^{<}(x)\sqsubseteq M^{<}(y)$ .

Proof.

We leave this proof to the Reader since it is of similar nature to the previous one. ∎
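For string-like structures, the subfactor relation of Definition 4 can be tested by brute force. The sketch below makes several simplifying assumptions that are not in the paper: a structure is a list with one set of unary properties per position, the positions are totally ordered, and the map $h$ is taken to be a contiguous window under the successor model or an increasing choice of positions under the precedence model. The names `is_subfactor` and `conventional` are hypothetical.

```python
from itertools import combinations

def is_subfactor(a, b, order="successor"):
    # a, b: lists of sets of unary properties, one set per position.
    n, m = len(a), len(b)
    if n > m:
        return False
    if order == "successor":
        # Candidate images are contiguous windows of b.
        candidates = (range(s, s + n) for s in range(m - n + 1))
    else:
        # Candidate images are increasing position tuples of b.
        candidates = combinations(range(m), n)
    # Every property holding at position i of a must hold at its image in b.
    return any(all(a[i] <= b[j] for i, j in enumerate(pos)) for pos in candidates)

def conventional(w):
    # Conventional model: exactly one property (the symbol) per position.
    return [{c} for c in w]
```

With conventional models this agrees with Propositions 1 and 2 on small cases: `aa` is not a substring of `abba` but is a subsequence of it. With shared properties the same test handles partially specified positions, for example `[{"a"}]` is a subfactor of `[{"a", "capital"}]` but not the reverse.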

4 Grammars, Languages, and Language Classes

Factors can define grammars, formal languages, and classes of formal languages. Usually a model signature provides the vocabulary for some logical language. Sentences in this logical language define sets of strings as follows. The language of a sentence $\phi$ is all and only those strings whose models satisfy $\phi$ . Within the regular languages, many well-known subregular classes can be characterized logically in this way (McNaughton and Papert, 1971; Rogers and Pullum, 2011; Rogers et al., 2013; Thomas, 1997).

Intuitively, the grammars we are interested in consist of a finite list of forbidden subfactors, whose largest size is bounded by $k$ . Strings in the language of this grammar are those which do not contain any forbidden subfactors. In this way these grammars are like logical expressions which are ``conjunctions of negative literals'' (Rogers et al., 2013) where the negative literals are played by the forbidden factors.

Each forbidden subfactor is a principal element of a filter and the language is all strings whose models are not in any of these filters. For each $k$ , there is a class of languages including all and only those languages that can be defined in this way. For example, the Strictly $k$ -Local (SLk) and Strictly $k$ -Piecewise languages can be defined in this way; they are languages which forbid finitely many substrings or subsequences, respectively (Garcia et al., 1990; Rogers et al., 2010). Formally:

Definition 8.

Let $k$ be some positive integer, and $M$ a model of $\Sigma^{*}$ with signature $\Gamma$ . A grammar $G$ is a subset of $\mathtt{Subfact}_{k}(M,\Sigma^{*})$ . The language of $G$ is $L(G)=\{w\in\Sigma^{*}\mid\mathtt{Subfact}_{k}(M,w)\cap G=\emptyset\}$ . The class of languages $\mathcal{L}(M,k)=\{L\mid\exists G\subseteq\mathtt{Subfact}_{k}(M,\Sigma^{*}),L(G)=L\}$ .

The elements of $G$ are principal elements of filters, and are called forbidden subfactors.

As an example, let $\Sigma=\{a,b,c\}$ and consider $G=\{M^{\lhd}(aa),M^{\lhd}(bb),M^{\lhd}(c)\}$ . $L(G)$ contains exactly the strings over $\{a,b\}$ in which $a$ and $b$ alternate, including every string in $(ab)^{+}$ and $(ba)^{+}$ , because the substrings $aa$ , $bb$ , and $c$ are all forbidden. This language belongs to $\mathcal{L}(M^{\lhd},2)$ .
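Membership in $L(G)$ under Definition 8 can be checked mechanically for the conventional successor model, where non-empty $k$-subfactors correspond to substrings of length at most $k$ (Proposition 1). The sketch below represents each forbidden subfactor by its string, and assumes no word-boundary markers, as in the definition above; the helper names are illustrative.

```python
from itertools import product

def k_factors(w, k):
    # Non-empty substrings of w of length at most k.
    return {w[i:j] for i in range(len(w))
            for j in range(i + 1, min(i + k, len(w)) + 1)}

def in_language(w, grammar, k):
    # w is in L(G) iff none of its k-factors is forbidden.
    return k_factors(w, k).isdisjoint(grammar)

G = {"aa", "bb", "c"}
members = sorted(
    ("".join(p) for n in range(4) for p in product("abc", repeat=n)
     if in_language("".join(p), G, 2)),
    key=lambda s: (len(s), s),
)
```

Enumerating all strings of length at most 3 in this way gives `members`, the strings of that length that contain none of the forbidden factors.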

Proposition 3.

For each $w\in L(G)$ and each $g\in G$ , $\mathtt{Subfact}(M,w)$ has an empty intersection with $\mathtt{Supfact}(g)$ .

Proof.

Suppose there exists $A\in\mathtt{Subfact}_{k}(\Sigma^{*})$ such that $A\sqsubseteq M(w)$ and $g\sqsubseteq A$ . This implies that $g\sqsubseteq M(w)$ and thus that $\mathtt{Subfact}_{k}(M,w)\cap G\neq\emptyset$ which contradicts Definition 8. ∎

In other words, the principal ideal of $M(w)$ is disjoint from the principal filters of the elements of $G$ .

5 Grammatical Entailments

Given a grammar $G$ , we call a subfactor $s$ in $\mathtt{Subfact}(\Sigma^{*})$ ungrammatical if it belongs to a principal filter of any element of $G$ . Subfactors that are not ungrammatical are called grammatical. Lemma 14 ensures that grammaticality is downward entailing, in the sense that if a model of the word $M(w)$ is not contained in the principal filters of the elements of the grammar, then neither are the subfactors of $M(w)$ . But it also ensures that ungrammaticality is upward entailing: if a model of the word $M(w)$ belongs to the principal filters of the elements of the grammar, then all of the superfactors of $M(w)$ in that filter are likewise contained.

In this way, the ideals and filters within a particular model noted above give rise to these entailment properties of grammaticality with respect to the hypothesis space. If the learner constructs filters, then the grammar $G$ will allow structures such that language membership is downward entailing with respect to the grammar $G$ , and language non-membership is upward entailing with respect to the grammar $G$ .

5.1 Example: Text Capitalization

As an example, consider capitalized letters as discussed above. In an unconventional word model, each capital letter at some position $x$ is represented as satisfying one of the relations $R\in\{\mathtt{a}(x),\mathtt{b}(x),\ldots,\mathtt{z}(x)\}$ as well as the unary relation $\mathtt{capital}(x)$ . Thus the relation $\mathtt{a}(x)$ is true of both lowercase $\mathtt{a}$ and uppercase $\mathtt{A}$ , but $\mathtt{a}(x)\land\mathtt{capital}(x)$ is only true of uppercase $\mathtt{A}$ . Note also that in this model no position $x$ of a structure can satisfy both predicates $\mathtt{a}(x)$ and $\mathtt{b}(x)$ . We return to this point in §7.

Figure 2 showcases the relationship among these structures under a model $M$ . The structure for $\mathtt{A}$ , $[\mathtt{capital},\mathtt{a}]$ , contains as subfactors $[\mathtt{capital}]$ , $[\mathtt{a}]$ , [], and the empty structure (not shown). The empty structure is a subfactor of [], and [] in turn is a subfactor of $[\mathtt{capital}]$ and $[\texttt{a}]$ . The subfactor $[\mathtt{a}]$ contains the subfactor [], the domain element with no relations, but has superfactors [capital,a], which has one domain element and two relations, and [a][], which has two domain elements, and the first satisfying the property a. Subfactors and superfactors are listed above and below each other, respectively, with lines between them. Members of one ideal are noted with a blue checkmark, and members of a filter are noted by a red asterisk.

Applying this to the example in Figure 3, if the structure $\mathtt{[capital,a]}$ is grammatical, then all of its subfactors, such as [capital], [a], and [], are grammatical. Since those are grammatical, each of their subfactors is also grammatical, which in this case is just [ $\emptyset$ ], shown in blue in Figure 3. Conversely, if the structure [a][] is known to be ungrammatical, then any structure which has it as a subfactor is also ungrammatical (in this example, [capital,a][], shown in red in Figure 3). To see the importance, consider a language whose strings contain only lowercase letters. In a conventional model, the grammar would need 26 forbidden factors (A,B,C,…), but the ``capital'' model bans just one, [capital].
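The capitalization example can be made concrete with a small sketch. Here a single position is represented by its set of unary properties, so the one-position subfactors of a position are the subsets of that set. The helpers `features`, `position_subfactors`, and `allowed` are hypothetical names introduced for illustration.

```python
from itertools import combinations

def features(ch):
    # Properties of one letter in the 27-relation "capital" model.
    return {ch.lower()} | ({"capital"} if ch.isupper() else set())

def position_subfactors(props):
    # All one-position structures below a position with the given properties.
    items = sorted(props)
    return {frozenset(c) for n in range(len(items) + 1)
            for c in combinations(items, n)}

def allowed(word, forbidden):
    # A word is allowed if no position contains a forbidden one-position factor.
    return not any(f <= features(ch) for ch in word for f in forbidden)

ONLY_LOWERCASE = [frozenset({"capital"})]
```

The position for `A` has exactly the four one-position subfactors named in the text, and the single forbidden factor `[capital]` is enough to reject every word containing an uppercase letter.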

5.2 Example: Long Distance Linguistic Dependencies

As another example, sequences of speech sounds as mentioned earlier may be decomposed into binary features based on their phonetic properties like anteriority ( $\pm$ ant: whether it occurs in the anterior of the vocal tract), stridency ( $\pm$ str: whether it produces a high-intensity fricative noise), or voicing ( $\pm$ voi: whether it activates the vocal cords), among others (Hayes, 2009). Each sound at some position $x$ is represented as satisfying relations $R\in\{\mathtt{\pm voi}(x),\mathtt{\pm str}(x),\ldots,\mathtt{\pm ant}(x)\}$ . Thus the relation $\mathtt{+str}(x)$ is true of both the sound s, as in the first sound of ``sue'', and S, as in ``shoe'', but $\mathtt{+str}(x)\land\mathtt{-ant}(x)$ is only true of S.

Note also that in this model no position $x$ of a structure can satisfy both predicates $\mathtt{+str}(x)$ and $\mathtt{-str}(x)$ . We return to this point in §7 below. We again use square brackets to delimit the domain elements and write the unary features within them, so a model representation like $\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]\left[\shortstack{+str\\ -ant\\ [-.7em]}\right]$ has the following visual representation:

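[Figure: two domain elements, the first with properties +str and +ant and the second with properties +str and -ant, related by $<$ .]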

To ease the exposition, we will use square brackets to delimit the domain elements and write the unary relations within them instead of specifying the model in mathematical detail. In an unconventional subsequence word model, then, one possible structure of the subsequence s…S is written $\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]\left[\shortstack{+str\\ -ant\\ [-.7em]}\right]$ .

In many languages, the presence of certain segments is dependent on the presence of another segment. In Samala, subsequences like s…s are allowed but s…S are not, so words like hasxintilawas are allowed but words like hasxintilawaS are not (Hansson, 2010). In an unconventional model, banning structures of the form [+str][+str] is insufficient, since all these segments share that stridency property, while a structure like $\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]\left[\shortstack{+str\\ -ant\\ [-.7em]}\right]$ will distinguish them, since it disallows stridents which disagree on the $\pm$ ant $(x)$ relations. The structure [+ant][-ant], however, is insufficient, since consonants like p,b,m have that feature, and it would incorrectly ban acceptable strings. To see the importance, a conventional string model must ban multiple sibilant factors sS,zS,sZ,zZ, while an unconventional model need ban only one, $\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]\left[\shortstack{+str\\ -ant\\ [-.7em]}\right]$ .
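The Samala pattern can be checked with a small precedence-model sketch. The feature table below is a deliberately incomplete assumption made for illustration: only s and S receive their stridency and anteriority values, and every other segment is simply marked `-str`. It is not a phonological analysis from the paper.

```python
from itertools import combinations

# Hypothetical, minimal feature table (assumption for this example).
FEATURES = {"s": {"+str", "+ant"}, "S": {"+str", "-ant"}}
DEFAULT = {"-str"}

def model(word):
    # One set of unary properties per position.
    return [FEATURES.get(ch, DEFAULT) for ch in word]

def contains(factor, structure):
    # Precedence-model subfactor test: some increasing choice of positions
    # satisfies every property required by the factor.
    return any(all(factor[i] <= structure[j] for i, j in enumerate(pos))
               for pos in combinations(range(len(structure)), len(factor)))

BANNED = [{"+str", "+ant"}, {"+str", "-ant"}]      # [+str,+ant][+str,-ant]
TOO_GENERAL = [{"+str"}, {"+str"}]                 # [+str][+str]
```

Under these assumptions `BANNED` occurs in hasxintilawaS but not in hasxintilawas, while `TOO_GENERAL` occurs in both, which matches the observation that [+str][+str] alone cannot separate the two words.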

Figure 4 showcases the relationship among these structures under a precedence model $M^{<}$ . The structure for $\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]\left[\shortstack{+str}\right]$ contains as subfactors (among others) $\left[\shortstack{+str}\right]\left[\shortstack{+str}\right]$ , $\left[\shortstack{+str}\right]\left[\right]$ , [], and the empty structure (not shown). The empty structure is a subfactor of [], and [] in turn is a subfactor of $[\mathtt{+ant}]$ and $[\texttt{-str}]$ , and so on. If the structure $\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]$ is grammatical, then all of its subfactors are grammatical, and so are their subfactors, in turn. Conversely, if the structure $\left[\shortstack{+str\\ +ant\\ [-.7em]}\right]\left[\shortstack{+str\\ -ant\\ [-.7em]}\right]$ is known to be ungrammatical, then any structure which has it as a subfactor is also ungrammatical (for example, $\left[\shortstack{+voi\\ +str\\ +ant\\ [-2.5ex]}\right]\left[\shortstack{+str\\ -ant\\ [-1.5ex]}\right]$ , where the first segment is also voiced, +voi), shown in red in Figure 4.

These ideals and filters give the learner an advantage when confronting hypothesis spaces under a particular model. In particular, they allow the learner to prune vast swathes of the hypothesis space as it searches for principal elements of filters. If a learner identifies one structure as being grammatical, the learner may infer that all of its subfactors are also grammatical and need not consider them. Alternatively, if the learner knows a structure is ungrammatical, it may infer that every structure in the principal filter above it is also ungrammatical.

Generally, these reductions can be exponential: an alphabet of size $2^{n}$ can be represented with $n$ unary relations in the model signature. However, this exponential reduction does not necessarily make learning any easier. The reason for this is that the size of $\mathtt{Subfact}_{k}(M,\Sigma^{*})$ equals $\sum_{i=1}^{k}(2^{n})^{i}$ where $n$ is the number of unary relations. Since a grammar is defined as a subset of $\mathtt{Subfact}_{k}(M,\Sigma^{*})$ , the number of considered grammars is thus very large. Therefore, the problem of how to search this space effectively is paramount.
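To get a feel for the sizes involved, the count stated above can be evaluated for small parameters. The sketch below simply computes $\sum_{i=1}^{k}(2^{n})^{i}$ and the number of subsets of a set of that size; the function names are illustrative.

```python
def num_k_factors(n, k):
    # Size of Subfact_k as stated in the text: sum over i = 1..k of (2^n)^i.
    return sum((2 ** n) ** i for i in range(1, k + 1))

def num_grammars(n, k):
    # A grammar is any subset of the k-subfactors.
    return 2 ** num_k_factors(n, k)
```

With only $n=3$ unary relations and $k=2$ there are $8+64=72$ candidate subfactors and therefore $2^{72}$ candidate grammars, which illustrates why exhaustive search is not an option even for small signatures.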

6 The Learning Problem

For some $M,k$ , is $\mathcal{L}(M,k)$ learnable from positive data? The short answer is Yes (Heinz, 2010; Heinz et al., 2012). The solution presented in these papers can be thought of as using the function $\mathtt{Subfact}_{k}(M,w)$ to identify permissible $k$ -factors in words $w$ in the positive data. The $k$ -factors that are not permissible are forbidden. With sufficient positive data, such a learning algorithm will converge to a grammar that generates any target language in the class. While this solution is sound in theory, when the space of $k$ -factors is very large, it is not practical. Here, we make clear the problem the learning algorithm solves.
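The baseline strategy described in this paragraph can be sketched for the conventional successor model: collect every $k$-factor observed in the positive sample and forbid all the others. This is only an illustration of that idea under the assumption that factors are plain substrings; it is not the algorithm developed later in the paper, and the function names are hypothetical.

```python
from itertools import product

def k_factors(w, k):
    # Non-empty substrings of w of length at most k.
    return {w[i:j] for i in range(len(w))
            for j in range(i + 1, min(i + k, len(w)) + 1)}

def learn_forbidden(sample, alphabet, k):
    # Permissible factors are those seen in the data; everything else is forbidden.
    observed = set().union(*(k_factors(w, k) for w in sample))
    all_factors = {"".join(p) for n in range(1, k + 1)
                   for p in product(alphabet, repeat=n)}
    return all_factors - observed
```

For the sample `["abab", "ba"]` over $\{a,b,c\}$ with $k=2$, the result contains `c` together with every 2-factor that mentions `c`, as well as `aa` and `bb`. The learner must enumerate all $|\Sigma|+|\Sigma|^{2}+\cdots+|\Sigma|^{k}$ factors, which is what becomes impractical when the factor space is large.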

We state the learning problem not in terms of converging to a correct grammar in the limit as previously studied, but instead in terms of returning an `adequate' grammar given a finite positive sample. Determining what counts as an adequate grammar is what De Raedt (2008) calls a Quality Criterion.

Definition 9 (The Learning Problem).

Fix $\Sigma$ , model $M$ , and positive integer $k$ . For any language $L\in\mathcal{L}(M,k)$ and for any finite $D\subseteq L$ , return a grammar $G$ such that

1. $G$ is consistent, that is, it covers the data: $D\subseteq L(G)$ ;

2. $L(G)$ is a smallest language in $\mathcal{L}$ which covers the data: so for all $L\in\mathcal{L}$ where $D\subseteq L$ , we have $L(G)\subseteq L$ ; and

3. $G$ includes structures $S$ that are restrictions of structures $S^{\prime}$ included in other grammars $G^{\prime}$ that also satisfy (1) and (2): for all $G^{\prime}$ satisfying the first two criteria and for all $S^{\prime}\in G^{\prime}$ , there exists $S\in G$ such that $S\sqsubseteq S^{\prime}$ .

The first criterion is self-explanatory. The second criterion is motivated by Angluin's (1980) analysis of identification in the limit. The third criterion requires that the grammar contain the most "general" subfactors. An example will help illustrate this criterion.

Consider again the grammar $G=\{M^{\lhd}(aa),M^{\lhd}(bb),M^{\lhd}(c)\}$ with $\Sigma=\{a,b,c\}$ . $L(G)$ is the same as $L(H)$ where $H=\{M^{\lhd}(aa),M^{\lhd}(bb),M^{\lhd}(ac),M^{\lhd}(bc),M^{\lhd}(cc),M^{\lhd}(ca),M^{\lhd}(cb)\}$ . In $H$ all the forbidden subfactors are of size 2, whereas $G$ encapsulates all of the 2-factors in $H$ which include $c$ with a single 1-factor $M^{\lhd}(c)$ . Both grammars $G$ and $H$ may satisfy criteria (1) and (2) but $H$ would not satisfy criterion (3) because of $G$ .
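The relationship between $G$ and $H$ can be checked by brute force on short words. The sketch below makes a simplifying assumption: it reads each forbidden structure as a plain forbidden substring and uses no word-boundary markers, which may differ from the conventions of the model $M^{\lhd}$ fixed earlier in the paper. The names `avoids` and `disagreements` are ours.

```python
from itertools import product

G = {"aa", "bb", "c"}
H = {"aa", "bb", "ac", "bc", "cc", "ca", "cb"}


def avoids(word: str, forbidden: set) -> bool:
    """True when `word` contains none of the forbidden substrings."""
    return not any(f in word for f in forbidden)


def disagreements(max_len: int) -> list:
    """Words over {a, b, c} of length 0..max_len on which G and H differ."""
    return [w
            for n in range(max_len + 1)
            for w in map("".join, product("abc", repeat=n))
            if avoids(w, G) != avoids(w, H)]


differing = disagreements(4)
```

Under this simplified reading the two grammars agree on every word of length two or more, since any such word containing `c` also contains one of the 2-factors listed in $H$. The one-letter word `c` is the only case the check flags, because it has no 2-factor at all; how that case is treated depends on the boundary conventions of the model.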

7 A Bottom-Up Learning Algorithm

De Raedt (2008) identifies two directions of inference: specific-to-general (i.e., "top-down" in the partial order considered here) and general-to-specific (i.e., "bottom-up"). Since we are trying to find the most general subfactors, top-down inference has the potential to consider exponentially many more subfactors than bottom-up inference. It makes more sense to traverse bottom-up, that is, from the most general subfactors possible to the most specific. Additionally, once a subfactor is identified as an element of the grammar, none of its superfactors (elements of its principal filter) need to be considered further.

A bottom-up learner is shown in Algorithm 1. Its input is a positive data sample $D$ and an integer $k$ that identifies the upper bound on the size of the subfactors.

The algorithm makes use of a queue $Q$ , which is initialized to contain just the empty structure $s_{0}$ . It also initializes two empty sets: $G$ , the grammar that will ultimately be returned, and $V$ , the set of "visited subfactors". The subfactors in $Q$ are considered one at a time, in order, and as each subfactor $s$ is considered it is added to $V$ . If $s$ is not a subfactor of the model of any word in the positive sample $D$ (i.e., not contained by any data point in $D$ ), then it is added to the grammar $G$ .

If $s$ is a subfactor of the sample, it is sent to the function $\mathtt{NextSupFact}$ , which returns a set of least superfactors for $s$ . For concreteness, $\mathtt{NextSupFact}(s)$ may be defined formally as follows:

$\mathtt{NextSupFact}(s)=\{S\in\mathtt{Subfact}_{k}(\Sigma^{*})\mid s\sqsubset S,\neg\exists\,S^{\prime}[s\sqsubset S^{\prime}\sqsubset S]\}$ , where $s\sqsubset S$ denotes strict containment, that is, $s\sqsubseteq S$ and $S\not\sqsubseteq s$ .

In practice, $\mathtt{NextSupFact}$ will be defined constructively so that each subfactor in $\mathtt{Subfact}_{k}(\Sigma^{*})$ is constructed only once, as needed. Thus, not only is there no need to store the whole set $\mathtt{Subfact}_{k}(\Sigma^{*})$ in memory, but the set $V$ may be excluded from the algorithm as well.

This set of superfactors is then filtered by the following criteria: they must be smaller than $k+1$ , they must contain no element of $G$ as a subfactor, and they must not have been previously considered (i.e., they cannot be in $V$ ). Those structures that survive this filter are added to $Q$ . This procedure continues until there are no more structures left to consider in $Q$ .
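The procedure described above can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself, and it rests on several assumptions of ours: a structure is a tuple of `frozenset`s (one set of unary properties per position, ordered by successor); `next_sup_fact` adds one property to one position or attaches one property-free position at either end; a candidate is kept only if each position fits inside some letter of a supplied `alphabet`; subfactors are marked as visited when enqueued rather than when dequeued; and the grammar filter is re-checked when a subfactor is dequeued. All function names are hypothetical.

```python
from collections import deque
from itertools import chain


def contains(big, small):
    """True if `small` is a subfactor of `big`: some contiguous window of `big`
    carries, position by position, at least the properties listed in `small`."""
    return any(
        all(small[j] <= big[i + j] for j in range(len(small)))
        for i in range(len(big) - len(small) + 1)
    )


def next_sup_fact(s, properties):
    """Assumed least superfactors of `s`: add one property to one position,
    or attach one property-free position at either end."""
    out = {s[:i] + (pos | {p},) + s[i + 1:]
           for i, pos in enumerate(s) for p in properties - pos}
    empty = frozenset()
    return out | {s + (empty,), (empty,) + s}


def learn(data, k, alphabet):
    """Breadth-first search upward from the empty structure; returns the
    least structures (up to k positions) contained in no word of `data`."""
    properties = frozenset(chain.from_iterable(alphabet))

    def possible(s):
        # Each position must fit inside some letter of the alphabet.
        return all(any(pos <= letter for letter in alphabet) for pos in s)

    s0 = ()
    queue, visited, grammar = deque([s0]), {s0}, set()
    while queue:
        s = queue.popleft()
        if any(contains(s, g) for g in grammar):
            continue  # re-check: a subfactor of s joined the grammar meanwhile
        if not any(contains(w, s) for w in data):
            grammar.add(s)  # unattested: forbid it and do not expand it
            continue
        for t in next_sup_fact(s, properties):
            if (len(t) <= k and t not in visited and possible(t)
                    and not any(contains(t, g) for g in grammar)):
                visited.add(t)
                queue.append(t)
    return grammar


def word(w):
    """Conventional model of a string: one singleton property set per symbol."""
    return tuple(frozenset({ch}) for ch in w)


alphabet = [frozenset({"a"}), frozenset({"b"}), frozenset({"c"})]
learned = learn([word("ab"), word("ba"), word("abab")], 2, alphabet)
```

On this small sample the sketch returns the structures corresponding to `c`, `aa`, and `bb`. Once `c` is found to be unattested it is never expanded, so no two-position structure containing `c` enters the grammar.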

Theorem 1.

For any $L\in\mathcal{L}(M,k)$ , and any finite set $D\subseteq L$ provided as input to Algorithm 1, it returns a grammar $G$ satisfying Definition 9.

Proof.

Consider any $x\in D$ . Algorithm 1 only adds elements to $G$ that are not subfactors of $x$ , so $x\not\in\mathtt{Supfact}(G)$ . Thus $x\in L(G)$ and $D\subseteq L(G)$ , satisfying Condition (1).

Consider any $L^{\prime}\in\mathcal{L}$ with $D\subseteq L^{\prime}$ . To show $L=L(G)\subseteq L^{\prime}$ , consider any $w\in L$ . Then $\mathtt{Subfact}(w)\subseteq\mathtt{Subfact}(D)$ and $\mathtt{Subfact}(D)\subseteq\mathtt{Subfact}(L^{\prime})$ since $D\subseteq L^{\prime}$ . Then $\mathtt{Subfact}(w)\subseteq\mathtt{Subfact}(L^{\prime})$ . Hence, $w\in L^{\prime}$ , and so $L\subseteq L^{\prime}$ , satisfying Condition (2).

For Condition (3), we use the fact that elements in the grammar $G$ were in $Q$ at some point. Suppose $s,s^{\prime}$ are subfactors such that $s\in G$ , $s^{\prime}\sqsubseteq s$ , and $(\neg\exists x\in D)[s^{\prime}\sqsubseteq M(x)]$ . Since $s\in G$ , then at some point $s\in Q$ .

If $s^{\prime}\sqsubseteq s$ then $s^{\prime}$ will be added to $Q$ before $s$ is generated by $\mathtt{NextSupFact}$ . Because $Q$ is a queue, $s^{\prime}$ will also be removed from $Q$ before $s$ is generated by $\mathtt{NextSupFact}$ . Since $s^{\prime}$ is not contained by any $M(x)$ with $x\in D$ , it will be added to $G$ . When $s$ is generated by $\mathtt{NextSupFact}$ , it will not pass the filter because it fails the second criterion since $s^{\prime}\sqsubseteq s$ and $s^{\prime}\in G$ . Then $s$ is never added to $Q$ , and therefore $s\notin G$ , contra our original assumption. Thus Condition (3) is satisfied. ∎

One aspect of the algorithm to highlight is that when a subfactor $g$ is added to $G$ , it is not expanded: the elements of $\mathtt{NextSupFact}(g)$ are not added to $Q$ on its account. In this way, finding elements of $G$ prunes the remainder of the space to be searched (see Figure 5). This does not mean that no element in the principal filter of $g$ is ever generated by $\mathtt{NextSupFact}$ , since some of these elements may belong to $\mathtt{NextSupFact}(x)$ for other subfactors $x$ in $Q$ . We expect subfactors on the "border" of $\mathtt{Supfact}(g)$ to be generated in this way (and then filtered out). This pruning, especially when the subfactors are quite general, can significantly reduce the remaining space to be traversed.

In regard to efficiency, in the worst case, the elements of $G$ are all very specific subfactors and are greatest elements of $\mathtt{Subfact}_{k}(\Sigma^{*})$ . In this case, every subfactor in $\mathtt{Subfact}_{k}(\Sigma^{*})$ will be added to $Q$ and the time complexity is thus exponential. However, we are primarily interested in the case when $\mathtt{Subfact}_{k}(D)$ is a small proportion of $\mathtt{Subfact}_{k}(\Sigma^{*})$ . This constitutes an example of data sparsity. In this case, we believe the elements of the target grammar will be much "lower" in the partial order and thus will be found much more quickly. Determining what conditions on $\mathtt{Subfact}_{k}(D)$ and $\mathtt{Subfact}_{k}(\Sigma^{*})$ result in a running time polynomial in the size of $D$ is a focus of current research activity.

Another area of active research is developing a recipe for the $\mathtt{NextSupFact}$ function for models with a successor or precedence order relation and arbitrary unary relations. The basic idea underlying the bottom-up algorithm is to develop a spanning tree for the poset $\mathtt{Subfact}(\Sigma^{*})$ and to traverse this tree in a breadth-first manner. The function $\mathtt{NextSupFact}$ helps control this search. Ideally, $\mathtt{NextSupFact}$ would only generate each subfactor once, which obviates the need to store visited subfactors in $V$ . This can be accomplished to some extent in different ways. For incompatible unary relations, like $\mathtt{a}$ and $\mathtt{b}$ in our capitalization example, $\mathtt{NextSupFact}$ can be defined to prevent adding property $\mathtt{a}$ to a position that already satisfies property $\mathtt{b}$ .

For compatible unary relations, like $\mathtt{a}$ and $\mathtt{capital}$ in our capitalization example, an ordering over the unary relations such as $\mathtt{a}<\mathtt{b}<\mathtt{capital}$ can help eliminate generating the same subfactor in different ways. For example, if $\mathtt{NextSupFact}$ is defined to only add unary relations that are "greater" than those a position already has, then it would only output [ $\mathtt{capital,a}$ ] given the subfactor [ $\mathtt{a}$ ] as input. On the other hand, when given as input the subfactor [ $\mathtt{capital}$ ], it could not add any unary relation to this position.
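The two outcomes in this example, one output for [ $\mathtt{a}$ ] and none for [ $\mathtt{capital}$ ], can be reproduced with a small sketch of how a single position might be extended. It is illustrative only: the names `ORDER`, `INCOMPATIBLE`, and `extend_position` are hypothetical, and the rule implemented (add only properties that come later in the order than every property already present, skipping incompatible ones) is one possible reading of the idea, not a definition taken from the paper.

```python
ORDER = ["a", "b", "capital"]            # assumed order: a < b < capital
INCOMPATIBLE = {frozenset({"a", "b"})}   # a and b cannot share a position


def extend_position(pos: frozenset) -> list:
    """Extensions of one position by a single property that is later in
    ORDER than every property already present and compatible with them."""
    top = max((ORDER.index(p) for p in pos), default=-1)
    out = []
    for p in ORDER[top + 1:]:
        if any(frozenset({p, q}) in INCOMPATIBLE for q in pos):
            continue  # e.g. never add b to a position that already has a
        out.append(pos | {p})
    return out


from_a = extend_position(frozenset({"a"}))
from_capital = extend_position(frozenset({"capital"}))
```

Here `from_a` holds the single position with both `a` and `capital`, and `from_capital` is empty, so the position [ $\mathtt{capital,a}$ ] is reached from [ $\mathtt{a}$ ] only and never a second time from [ $\mathtt{capital}$ ].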

8 Conclusion

In this paper, we considered the problem of learning formal languages defined as the complement of the union of finitely many principal filters, whose principal elements make up the grammar. This is one way to characterize the Strictly $k$ -Local and Strictly $k$ -Piecewise languages, but the generalization here lets us consider enriched representations of strings where different elements in a string can be said to share properties. It also lets us learn the shortest forbidden substrings in $SL_{k}$ (Ron et al., 1996). Enriched representations are useful in many applications where domain-specific knowledge is available and should be taken advantage of. They have a drawback, however: the number of subfactors is large, which makes identifying the principal elements of the filters difficult. This paper showed that the partial ordering of the subfactors motivates a bottom-up learning algorithm which finds the least subfactors whose filters do not include the positive data.

Acknowledgments

We would like to thank James Rogers for very helpful discussion on the notion of subfactor. This work was supported by NIH grant #R01HD87133-01 to JH.

Bibliography


Dana Angluin. 1980. Inductive inference of formal languages from positive data. Information and Control, 45:117–135.

J. Richard Büchi. 1960. Weak second-order arithmetic and finite automata. Mathematical Logic Quarterly, 6(1-6):66–92.

Luc De Raedt. 2008. Logical and Relational Learning. Springer-Verlag Berlin Heidelberg.

Herbert B. Enderton. 2001. A Mathematical Introduction to Logic, 2nd edition. Academic Press.

Pedro Garcia, Enrique Vidal, and José Oncina. 1990. Learning locally testable languages in the strict sense. In Proceedings of the Workshop on Algorithmic Learning Theory, pages 325–338.

Saul Gorn. 1967. Explicit definitions and linguistic dominoes. In Systems and Computer Science, pages 77–115, Toronto. University of Toronto Press.

Gunnar Hansson. 2010. Consonant Harmony: Long-Distance Interaction in Phonology. Number 145 in University of California Publications in Linguistics. University of California Press, Berkeley, CA. Available on-line (free) at eScholarship.org.

Bruce Hayes. 2009. Introductory Phonology. Wiley-Blackwell.
