Bounds in Query Learning

Hunter Chase; James Freitag

arXiv:1904.10122·cs.LG·April 24, 2019

TL;DR

This paper develops new combinatorial tools to analyze query learning complexity, providing bounds, simplified proofs for learnability of language classes, and algorithms for efficient learning in various models, including randomized settings.

Contribution

Introduces new combinatorial quantities for concept classes, offering bounds and simplified proofs for learnability, and algorithms for efficient query learning in multiple models.

Findings

01. New bounds on learning complexity in query models

02. Efficient algorithms for learning regular languages

03. Connections between query learning and model theory

Abstract

We introduce new combinatorial quantities for concept classes, and prove lower and upper bounds for learning complexity in several models of query learning in terms of various combinatorial quantities. Our approach is flexible and powerful enough to give new and very short proofs of the efficient learnability of several prominent examples (e.g. regular languages and regular $\omega$-languages), in some cases also producing new bounds on the number of queries. In the setting of equivalence plus membership queries, we give an algorithm which learns a class in polynomially many queries whenever any such algorithm exists. We also study equivalence query learning in a randomized model, producing new bounds on the expected number of queries required to learn an arbitrary concept. Many of the techniques and notions of dimension draw inspiration from or are related to notions from…

Equations (84)

$$\mathcal{C}_{i}^{(x,j)}:=\{A\in\mathcal{C}_{i}\mid\chi_{A}(x)=j\},$$

$$B_{i}:=\{x\in X\mid\operatorname{Ldim}(\mathcal{C}_{i}^{(x,1)})\geq\operatorname{Ldim}(\mathcal{C}_{i}^{(x,0)})\}.$$

$$\mathcal{C}_{i+1}:=\{A\in\mathcal{C}_{i}\mid\chi_{A}(x_{i})\neq\chi_{B_{i}}(x_{i})\}$$

$$\mathcal{C}=(\mathcal{C}^{(x_{0},1-B(x_{0}))})\cup\dots\cup(\mathcal{C}^{(x_{c-1},1-B(x_{c-1}))}),$$

$$\operatorname{C}(\mathcal{C},\mathcal{H})<\infty\quad\text{iff}\quad\operatorname{C}(\mathcal{C},\mathcal{H})\leq n.$$

$$\mathcal{C}=\mathcal{H}=\{\{a,b,c\},\{a,b,d\},\{a,c,d,e\},\{b,c,d,e\}\}.$$

$$A^{\prime}(a_{\tau})=\begin{cases}0&|\tau|=d\\\text{undefined}&\text{otherwise}\end{cases}$$

$$A(x)=\begin{cases}1&x\text{ belongs to more than }\frac{c-1}{c}|\mathcal{C}_{i}|\text{ many }C\in\mathcal{C}_{i}\\0&x\text{ belongs to less than }\frac{1}{c}|\mathcal{C}_{i}|\text{ many }C\in\mathcal{C}_{i}\\\text{undefined}&\text{otherwise.}\end{cases}$$

$$V_{i}^{(x,j)}:=\{B\in V_{i}\mid\chi_{B}(x)=j\},$$

$$A_{i}(x)=\begin{cases}0&\operatorname{Ldim}(V_{i}^{(x,0)})=\operatorname{Ldim}(V_{i})\\1&\operatorname{Ldim}(V_{i}^{(x,1)})=\operatorname{Ldim}(V_{i})\\\text{undefined}&\text{otherwise.}\end{cases}$$

$$\operatorname{Ldim}(V_{i})\geq\operatorname{Ldim}(V_{i}^{(x_{1},1)})\geq\operatorname{Ldim}(V_{i}^{(x_{0},0)})=\operatorname{Ldim}(V_{i}),$$

$$V_{i+1}:=\{B\in V_{i}\mid\chi_{B}(x_{i})\neq\chi_{B_{i}}(x_{i})\}.$$

$$\mathcal{C}=(\mathcal{C}^{(x_{0},1-B(x_{0}))})\cup\dots\cup(\mathcal{C}^{(x_{c-1},1-B(x_{c-1}))}),$$

$$\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})\geq\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{P}(X))\geq\log\left(\frac{4}{3}\right)\cdot\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{P}(X)).$$

$$\operatorname{LC}^{EQ+MQ}(\mathcal{C}_{n},\mathcal{H}_{n})\geq\log\left(\frac{4}{3}\right)\cdot d_{n},$$

$$\mathcal{C}=\mathcal{C}_{\phi}=\{\phi(M;b)\mid b\in M\}.$$

$$p(y):=\{\phi^{opp}(y;a)\mid a\in A_{1}\}\cup\{\neg\phi^{opp}(y;a)\mid a\in A_{0}\}$$

$$A(x)=\begin{cases}0&x\in A_{0}\\1&x\in A_{1}\\\text{unspecified}&\text{otherwise.}\end{cases}$$

$$\mathcal{H}=\mathcal{H}_{\phi}:=\{\phi(M;b^{\prime})\mid b^{\prime}\in N\}.$$

$\phi(x;y)$ be $E(x,y)\land x\neq y$

$$|\mathrm{FDFA}(n,m)|\leq|\mathrm{DFA}_{2}(n)|\cdot|\mathrm{DFA}_{2}(m)|^{n}.$$

$$\operatorname{Ldim}(\mathrm{FDFA}(n,m))\leq\log(|\mathrm{DFA}_{2}(n)|\cdot|\mathrm{DFA}_{2}(m)|^{n})$$

$$u(A,a)=\operatorname{Ldim}(\mathcal{C})-\operatorname{Ldim}(\mathcal{C}_{a=A(a)}).$$

$$u(A,a)+u(B,a)\geq 1.$$

$$d(A,B)+d(B,A)$$

$$\sum_{a\in\Delta(A_{i},A_{i+1})}\frac{\mu(a)}{\mu(\Delta(A_{i},A_{i+1}))}u(A_{i},a)\leq\frac{1}{2},$$

$$\sum_{a\in\Delta(A_{i},A_{i+1})}\mu(a)u(A_{i},a)\leq\frac{1}{2}\mu(\Delta(A_{i},A_{i+1})).$$

$$X=\{A_{1},\dots,A_{l}\}.$$

$$D(G,H):=\{a\in X\mid\forall A_{1},B_{1}\in G,\ \forall A_{2},B_{2}\in H,\ A_{1}(a)=B_{1}(a),\ A_{2}(a)=B_{2}(a),\ A_{1}(a)\neq A_{2}(a)\}.$$

$$\sum_{i=1}^{l}\ \sum_{\substack{G,H\text{ a partition of }X,\\A_{i}\in G,\ A_{i+1}\in H}}\ \sum_{a\in D(G,H)}\mu(a)u(A_{i},a).$$

Taxonomy

Topics: Machine Learning and Algorithms · Algorithms and Data Compression · Optimization and Search Problems

Full text

Bounds in Query Learning

Hunter Chase

Department of Mathematics, UIC, Chicago IL

and

James Freitag

Department of Mathematics, UIC, Chicago IL

Abstract.

We introduce new combinatorial quantities for concept classes, and prove lower and upper bounds for learning complexity in several models of query learning in terms of various combinatorial quantities. Our approach is flexible and powerful enough to give new and very short proofs of the efficient learnability of several prominent examples (e.g. regular languages and regular $\omega$-languages), in some cases also producing new bounds on the number of queries. In the setting of equivalence plus membership queries, we give an algorithm which learns a class in polynomially many queries whenever any such algorithm exists.

We also study equivalence query learning in a randomized model, producing new bounds on the expected number of queries required to learn an arbitrary concept. Many of the techniques and notions of dimension draw inspiration from or are related to notions from model theory, and these connections are explained. We also use techniques from query learning to mildly improve a result of Laskowski regarding compression schemes.

Partially supported by NSF grant no. 1700095

1. Introduction

Fix a set $X$ and denote by $\mathcal{P}(X)$ the collection of all subsets of $X$. A concept class $\mathcal{C}$ on $X$ is a subset of $\mathcal{P}(X)$; we will also sometimes call $\mathcal{C}$ a set system on $X$. In the equivalence query (EQ) learning model, a learner attempts to identify a target set $A\in\mathcal{C}$ by means of a series of data requests called equivalence queries. The learner has full knowledge of $\mathcal{C}$, as well as a hypothesis class $\mathcal{H}$ with $\mathcal{C}\subseteq\mathcal{H}\subseteq\mathcal{P}(X)$. An equivalence query consists of the learner submitting a hypothesis $B\in\mathcal{H}$ to a teacher, who returns either yes if $A=B$, or a counterexample $x\in A\triangle B$. In the former case, the learner has learned $A$, and in the latter case, the learner uses the new information to update and submit a new hypothesis. In sections 2 and 3, the teacher may be assumed to be adversarial, and the worst-case number of queries required to learn any concept is analyzed. In section 4, we consider the case in which the teacher selects counterexamples randomly according to a fixed but arbitrary distribution.

We will also consider learning with equivalence and membership queries (EQ+MQ). In a membership query, a learner submits a single element $x$ from the base set $X$ to the teacher, who returns the value $A(x)$ , where $A$ is the target concept. In this setting, the learner may choose to make either type of query at any stage, submitting any $x\in X$ for a membership query or submitting any $B\in\mathcal{H}$ for an equivalence query. The learner learns the target concept $A$ when they submit $A$ as an equivalence query.

With Theorems 2.6 and 2.24, we give upper bounds for the number of queries required for EQ and EQ+MQ learning a class $\mathcal{C}$ with hypotheses $\mathcal{H}$ in terms of the Littlestone dimension of $\mathcal{C}$, denoted $\operatorname{Ldim}(\mathcal{C})$, and the consistency dimension of $\mathcal{C}$ with respect to $\mathcal{H}$, denoted $\operatorname{C}(\mathcal{C},\mathcal{H})$. We also give lower bounds for the number of required queries in terms of these quantities. In the EQ+MQ setting, the bounds are tight enough to completely characterize when a problem is efficiently learnable. Littlestone dimension is well-known in learning theory [18] and model theory. (In model theory, Littlestone dimension is called Shelah 2-rank; see [10] for additional details.)

Consistency dimension and the related notion of strong consistency dimension are more subtle; we discuss them in detail in section 2. When $\mathcal{H}$ is taken to be $\mathcal{P}(X)$, $\operatorname{C}(\mathcal{C},\mathcal{H})=1$; for various examples of set systems with $\mathcal{H}=\mathcal{C}$, one has $\operatorname{C}(\mathcal{C},\mathcal{H})=\infty$. In section 2.2, we define a new invariant, the consistency threshold of $\mathcal{C}$, and provide a construction (for arbitrary $\mathcal{C}$) of a hypothesis class $\mathcal{H}$ which is not much more complicated than $\mathcal{C}$ (of the same Littlestone dimension as $\mathcal{C}$) such that $\operatorname{C}(\mathcal{C},\mathcal{H})\leq\operatorname{Ldim}(\mathcal{C})+1$. In section 2.3, we compare our bounds and invariants to those previously appearing in the literature.

Theorems 2.6 and 2.24 can be used to establish efficient learnability in specific applied settings if one can obtain appropriate bounds on Littlestone dimension and consistency dimension. Let $(\mathcal{C}_{n},\mathcal{H}_{n})$ be a collection of concept and hypothesis classes which depends on some parameter $n$ . Typically, we are thinking of finite classes which grow with $n$ . We prove that whenever $\mathcal{C}_{n}$ can be learned by an algorithm using polynomially many membership queries and equivalence queries from $\mathcal{H}_{n}$ , there must be polynomial bounds on Littlestone and consistency dimension. Moreover, whenever such an algorithm exists, the algorithm given in Theorem 2.24 accomplishes this.

Finally, to close section 2, we explain the connection between strong consistency dimension and a model theoretic property called the finite cover property (fcp), or rather its negation, referred to henceforth as the nfcp. We show that if $\mathcal{C}$ is the set system given by uniform instances of a fixed first order formula $\phi$ , and $\mathcal{H}$ is the collection of externally $\phi$ -definable sets, then $(\mathcal{C},\mathcal{H})$ has finite strong consistency dimension if and only if $\phi$ has the nfcp.

In section 3 we demonstrate the practicality of our approach by providing simple and fast proofs of the efficient learnability of regular languages and certain $\omega$ -languages, reproving results of [1, 5, 12, 11]. Besides the conceptual simplicity of the approach, the bounds in learning complexity resulting from our algorithm have some novel aspects. For instance, our bounds have no dependence on the length of the strings provided to the learner as counterexamples, in contrast to existing algorithms.

In section 4 we turn to a randomized variant of EQ-learning in which the teacher is required to choose counterexamples randomly from a known probability distribution on $X$. Angluin and Dohrn [4] show that for a concept class of size $n$, there is an algorithm in which the expected number of queries to learn any concept is at most $\log_{2}(n)$. It is natural to wonder whether there is a notion of dimension which can be used to bound the expected number of queries. In fact, Angluin and Dohrn [4, Theorem 25] already consider this, and show that the VC-dimension of the concept class is a lower bound on the expected number of queries. However, [4, Theorem 26], using an example of [18], shows that the VC-dimension cannot provide an upper bound for the number of queries. We show that the Littlestone dimension provides such an upper bound: we give an algorithm whose expected number of queries to learn any concept is bounded linearly in the Littlestone dimension.

In section 5, we introduce compression schemes for concept classes. Specifically, the notion we work with is equivalent to $d$-compression with $b$ extra bits (of Floyd and Warmuth [13]). In [17], Laskowski and Johnson proved that the concept class corresponding to a stable formula has an extended $d$-compression for some $d$. Later, a result of Laskowski appearing as [14, Theorem 4.1.3] showed that one can take $d$ equal to the Shelah 2-rank (Littlestone dimension) and use $2^{d}$ many reconstruction functions. We show that $d+1$ many reconstruction functions suffice.

2. A combinatorial characterization of EQ-learnability

Often, one assumes that $X$ is finite, and the emphasis is placed on finding bounds on the number of queries it may take to learn any $A\in\mathcal{C}$ . We also consider the case where $X$ is infinite, for which we give the following definition.

Definition 2.1.

Let $\mathcal{C}$ and $\mathcal{H}$ be set systems on a set $X$ . $\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ if there exists some $n<\omega$ and some algorithm to submit hypotheses from $\mathcal{H}$ such that any concept $A\in\mathcal{C}$ is learnable in at most $n$ equivalence queries, given any teacher returning counterexamples. Let $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$ be the least such $n$ if $\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ , and $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})=\infty$ otherwise.

$\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$ is called the learning complexity, representing the optimal number of queries needed in the worst-case scenario.

Similarly, $\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ and membership queries if there exists some $n<\omega$ and some algorithm to submit membership queries from $X$ or equivalence queries from $\mathcal{H}$ such that any concept $A\in\mathcal{C}$ is learnable in at most $n$ equivalence queries. The learning complexity is defined similarly and is denoted by $\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})$ .

2.1. EQ-learnability from Littlestone and consistency dimension

Proposition 2.2.

[18, Theorems 5 and 6] If $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})\leq d+1$, then $\operatorname{Ldim}(\mathcal{C})\leq d$. If $\mathcal{H}=\mathcal{P}(X)$, then the converse holds.

Proof.

Suppose $\operatorname{Ldim}(\mathcal{C})\geq d+1$ . We show that we can force the learner to use at least $d+2$ equivalence queries. Construct a binary element tree of height $d+1$ with proper labels from $\mathcal{C}$ witnessing $\operatorname{Ldim}(\mathcal{C})\geq d+1$ . Given the first hypothesis $H_{0}$ from the learner, return the element on the 0th level on the tree as a counterexample. Continue this, returning the element on the $i$ th level along the path consistent with previous counterexamples as the counterexample to hypothesis $H_{i}$ . We will return $d+1$ counterexamples, and the learner still requires one more hypothesis to identify the concept. Since this will occur for one of the proper labels $A$ of the binary element tree, we have forced the learner to use at least $d+2$ equivalence queries for some $A\in\mathcal{C}$ .

Suppose $\operatorname{Ldim}(\mathcal{C})=d<\infty$. Let $\mathcal{C}_{0}=\mathcal{C}$. Inductively define $\mathcal{C}_{i}$, $i=1,\ldots,d$, as follows. Given $\mathcal{C}_{i}$, for any $x\in X$ and $j\in\{0,1\}$, let

$$\mathcal{C}_{i}^{(x,j)}:=\{A\in\mathcal{C}_{i}\mid\chi_{A}(x)=j\},$$

where $\chi_{A}$ is the characteristic function of $A$. Let

$$B_{i}:=\{x\in X\mid\operatorname{Ldim}(\mathcal{C}_{i}^{(x,1)})\geq\operatorname{Ldim}(\mathcal{C}_{i}^{(x,0)})\}.$$

Submit $B_{i}$ as the hypothesis. If $B_{i}$ is correct, we are done. Otherwise, we receive a counterexample $x_{i}$. Set

$$\mathcal{C}_{i+1}:=\{A\in\mathcal{C}_{i}\mid\chi_{A}(x_{i})\neq\chi_{B_{i}}(x_{i})\}$$

to be the concepts which have the correct label for $x_{i}$. Observe that at each stage, $\operatorname{Ldim}(\mathcal{C}_{i+1})<\operatorname{Ldim}(\mathcal{C}_{i})$. Therefore, if we make $d$ queries without correctly identifying the target, then we must have $\operatorname{Ldim}(\mathcal{C}_{d})=0$. Then $\mathcal{C}_{d}$ is a singleton, which must be the target concept, and submitting it is the $(d+1)$st query. ∎
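The strategy in the second half of the proof can be sketched in Python for small finite classes. This is a minimal illustration rather than code from the paper: the names `ldim`, `learn_eq`, and `min_diff_teacher` are hypothetical, concepts are represented as frozensets over a finite tuple `domain`, the empty class is assigned Littlestone dimension $-1$ by convention, and hypotheses are drawn from $\mathcal{P}(X)$.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ldim(cls, domain):
    """Littlestone dimension of a finite class (a frozenset of frozensets).

    Assumed convention: the empty class has dimension -1.
    """
    if not cls:
        return -1
    best = 0
    for x in domain:
        with_x = frozenset(a for a in cls if x in a)
        without_x = cls - with_x
        if with_x and without_x:
            best = max(best, 1 + min(ldim(with_x, domain), ldim(without_x, domain)))
    return best


def learn_eq(concepts, domain, teacher):
    """Learn a target with hypotheses from P(X); returns (target, number of queries)."""
    version = frozenset(concepts)  # C_i: concepts consistent with all counterexamples so far
    queries = 0
    while True:
        # B_i: include x when the "x in A" side is at least as complex as the other side.
        hyp = frozenset(
            x for x in domain
            if ldim(frozenset(a for a in version if x in a), domain)
            >= ldim(frozenset(a for a in version if x not in a), domain)
        )
        queries += 1
        x = teacher(hyp)
        if x is None:  # the teacher answered "yes"
            return hyp, queries
        # C_{i+1}: keep the concepts that disagree with the hypothesis at x.
        version = frozenset(a for a in version if (x in a) != (x in hyp))


def min_diff_teacher(target):
    """A simple (not necessarily adversarial) teacher: least element of the symmetric difference."""
    def teacher(hyp):
        diff = target ^ hyp
        return min(diff) if diff else None
    return teacher


domain = tuple(range(5))
singletons = frozenset(frozenset({x}) for x in domain)
d = ldim(singletons, domain)
for target in singletons:
    learned, used = learn_eq(singletons, domain, min_diff_teacher(target))
    assert learned == target and used <= d + 1
```

On the class of singletons the first hypothesis is the empty set, which matches the remark that allowing the learner to guess $\emptyset$ forces the teacher to reveal the target.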

Notice in particular that if $\operatorname{Ldim}(\mathcal{C})=\infty$ , then $\mathcal{C}$ cannot be learned with equivalence queries, even with $\mathcal{H}=\mathcal{P}(X)$ . The assumption that $\mathcal{H}=\mathcal{P}(X)$ makes learning straightforward, but this may be too strong for many settings. However, without some additional hypotheses on $\mathcal{H}$ , learnability may already be hopeless, even for very simple set systems. For instance, let $\mathcal{C}$ be the set of singletons. If $\mathcal{H}=\mathcal{C}$ , then we may take as long as $|X|$ to learn if $X$ is finite, or never learn at all if $X$ is infinite. However, if the learner is allowed to guess $\emptyset$ , this forces the teacher to identify the target singleton.

The strategy of Proposition 2.2 permeates both learnability and non-learnability proofs; identifying a specific set amounts to reducing the Littlestone dimension of the family of possible concepts to 0; actually submitting the target concept before the Littlestone dimension reaches 0 can be thought of as a best-case scenario that we cannot rely on. Non-learnability then amounts to an inability to reduce the Littlestone dimension of the family of possible concepts to 0 through a series of finitely many equivalence queries. The main purpose of this section is to give precise conditions on $\mathcal{H}$ and $\mathcal{C}$ which characterize learnability.

Definition 2.3.

Given a set $X$ , a partially specified subset $A$ of $X$ is a partial function $A:X\rightarrow\{0,1\}$ .

• Say $x\in A$ if $A(x)=1$, $x\notin A$ if $A(x)=0$, and membership of $x$ is unspecified otherwise. The domain of $A$, $\operatorname{dom}(A)$, is $A^{-1}(\{0,1\})$. Call $A$ total if $\operatorname{dom}(A)=X$. We identify subsets $A\subseteq X$ with total partially specified subsets. The size of $A$, $|A|$, is the cardinality of $\operatorname{dom}(A)$.

• Given two partially specified subsets $A$ and $B$, write $A\sqsubseteq B$ if $A$ and $B$ agree on $\operatorname{dom}(A)$; call $A$ a restriction of $B$ and $B$ an extension of $A$.

• Given a set $Y\subseteq\operatorname{dom}(A)$, the restriction $A|_{Y}$ of $A$ to $Y$ is the partial function where $A|_{Y}(x)=A(x)$ for all $x\in Y$, and is unspecified otherwise.

• Given a set system $\mathcal{C}$ on $X$, $A$ is $n$-consistent with $\mathcal{C}$ if every size $n$ restriction of $A$ has an extension in $\mathcal{C}$. Otherwise, say $A$ is $n$-inconsistent. $A$ is finitely consistent with $\mathcal{C}$ if every restriction of $A$ of finite size has an extension in $\mathcal{C}$; that is, $A$ is $n$-consistent with $\mathcal{C}$ for all $n<\omega$.
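The notions in this definition can be made concrete for a finite base set. The sketch below is illustrative only: it models a partially specified subset as a Python dict from elements to 0 or 1, and the helper names `as_partial`, `restrict`, `is_restriction`, and `n_consistent` are our own, not the paper's.

```python
from itertools import combinations


def as_partial(subset, domain):
    """View a subset of the domain as a total partially specified subset."""
    return {x: int(x in subset) for x in domain}


def restrict(a, ys):
    """The restriction A|_Y, for Y a subset of dom(A)."""
    return {x: a[x] for x in ys}


def is_restriction(a, b):
    """A ⊑ B: B is defined on dom(A) and agrees with A there."""
    return all(x in b and b[x] == v for x, v in a.items())


def n_consistent(a, n, concepts, domain):
    """True when every size-n restriction of A has an extension in the class."""
    totals = [as_partial(c, domain) for c in concepts]
    return all(
        any(is_restriction(restrict(a, ys), t) for t in totals)
        for ys in combinations(list(a), n)
    )


domain = range(4)
singletons = [{x} for x in domain]
empty = as_partial(set(), domain)  # the all-zero total function, i.e. the empty set

# Any three zeros extend to the remaining singleton, but the empty set is not a singleton.
three_ok = n_consistent(empty, 3, singletons, domain)
four_ok = n_consistent(empty, 4, singletons, domain)
```

Here the empty set is 3-consistent but 4-inconsistent with the singletons on a four-element set, so $n$-consistency for small $n$ does not force membership in the class.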

The following definition is a translation into set systems of a definition that first appeared in [8].

Definition 2.4.

The consistency dimension of $\mathcal{C}$ with respect to $\mathcal{H}$ , denoted $\operatorname{C}(\mathcal{C},\mathcal{H})$ , is the least integer $n$ such that for every subset $A\subseteq X$ (viewed as a total partially specified subset), if $A$ is $n$ -consistent with $\mathcal{C}$ , then $A\in\mathcal{H}$ . If no such $n$ exists, then say $\operatorname{C}(\mathcal{C},\mathcal{H})=\infty$ .

Observe that $\operatorname{C}(\mathcal{C},\mathcal{H})=1$ iff $\mathcal{H}$ shatters the set of all elements $x\in X$ such that there are $A_{0}$ and $A_{1}$ in $\mathcal{C}$ with $x\notin A_{0}$ but $x\in A_{1}$. (Recall that a set system $\mathcal{C}$ shatters a set $A$ if, for all $B\subseteq A$, there is $C\in\mathcal{C}$ such that $C\cap A=B$.) In this case, it is possible to learn any concept in $\mathcal{C}$ in at most $\operatorname{Ldim}(\mathcal{C})+1$ equivalence queries, using the method of Proposition 2.2. So we may assume that $\operatorname{C}(\mathcal{C},\mathcal{H})>1$.
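For a finite base set, Definition 2.4 can be evaluated by brute force. The following sketch is a hypothetical helper, not part of the paper; it assumes $X$ is finite and $\mathcal{C}\subseteq\mathcal{H}$, represents sets as frozensets, and enumerates all subsets of $X$, so it is only usable on very small examples.

```python
from itertools import combinations


def subsets(domain):
    """All subsets of a finite domain, as frozensets."""
    items = list(domain)
    return [frozenset(s) for r in range(len(items) + 1) for s in combinations(items, r)]


def n_consistent(a, n, concepts, domain):
    """Every choice of n points of X is labelled by A the way some concept labels it."""
    return all(
        any(all((y in c) == (y in a) for y in ys) for c in concepts)
        for ys in combinations(list(domain), n)
    )


def consistency_dimension(concepts, hypotheses, domain):
    """Least n <= |X| such that every n-consistent subset lies in the hypothesis class.

    Returns None if no such n <= |X| exists; this cannot happen when C is a subset of H,
    because the |X|-consistent subsets are exactly the members of C.
    """
    hyps = {frozenset(h) for h in hypotheses}
    concepts = [frozenset(c) for c in concepts]
    for n in range(1, len(list(domain)) + 1):
        if all(a in hyps for a in subsets(domain) if n_consistent(a, n, concepts, domain)):
            return n
    return None


domain = range(4)
singletons = [frozenset({x}) for x in domain]
c_proper = consistency_dimension(singletons, singletons, domain)
c_with_empty = consistency_dimension(singletons, singletons + [frozenset()], domain)
c_all = consistency_dimension(singletons, subsets(domain), domain)
```

For the singletons on a four-element set this gives 4 when $\mathcal{H}=\mathcal{C}$, 2 once the empty set is added to $\mathcal{H}$, and 1 when $\mathcal{H}=\mathcal{P}(X)$.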

Lemma 2.5.

Suppose that for each $i<n$ , $\mathcal{C}_{i}$ is a concept class on $X$ and $\mathcal{H}_{i}$ is a hypothesis class on $X$ . Suppose that $\operatorname{LC}^{EQ}(\mathcal{C}_{i},\mathcal{H}_{i})=m_{i}$ . Then $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})\leq\sum_{i<n}m_{i}$ , where $\mathcal{C}:=\cup_{i<n}\mathcal{C}_{i}$ and $\mathcal{H}:=\cup_{i<n}\mathcal{H}_{i}$ .

Proof.

We give the proof for $n=2$ ; then the result for $n>2$ follows easily by induction.

To learn a target concept $A\in\mathcal{C}=\mathcal{C}_{0}\cup\mathcal{C}_{1}$ with hypotheses from $\mathcal{H}=\mathcal{H}_{0}\cup\mathcal{H}_{1}$ , begin by assuming that $A\in\mathcal{C}_{0}$ . Attempt to learn $A$ by making guesses from $\mathcal{H}_{0}$ , according to the procedure by which any concept in $\mathcal{C}_{0}$ is learnable in at most $m_{0}$ many queries. If, after making $m_{0}$ many queries, we have failed to learn $A$ , then we conclude that $A\notin\mathcal{C}_{0}$ , whence $A\in\mathcal{C}_{1}$ . We can then learn $A$ in at most $m_{1}$ many additional queries with guesses from $\mathcal{H}_{1}$ . ∎

We can now give an upper bound for the learning complexity in terms of Littlestone dimension and consistency dimension.

Theorem 2.6.

Suppose $\operatorname{Ldim}(\mathcal{C})=d<\infty$ and $1<\operatorname{C}(\mathcal{C},\mathcal{H})=c<\infty$ . Then $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})\leq c^{d}$ .

Proof.

We proceed by induction on $d$ . The base case, $d=0$ , is trivial, as then $\mathcal{C}$ is a singleton.

For the inductive step, let $\operatorname{Ldim}(\mathcal{C})=d+1$. Suppose there is some element $x$ such that $\operatorname{Ldim}(\mathcal{C}\cap x)<d+1$ and $\operatorname{Ldim}(\mathcal{C}\setminus x)<d+1$, where $\mathcal{C}\cap x:=\{A\in\mathcal{C}\,|\,x\in A\}$ and $\mathcal{C}\setminus x:=\{A\in\mathcal{C}\,|\,x\notin A\}$. Then by induction, any concept in $\mathcal{C}\cap x$ can be learned in at most $c^{d}$ queries with guesses from $\mathcal{H}$, and the same is true for $\mathcal{C}\setminus x$. Then by Lemma 2.5, any concept in $\mathcal{C}$ can be learned in at most $2c^{d}\leq c^{d+1}$ equivalence queries.

If no such $x$ exists, then for all $x$ , either $\operatorname{Ldim}(\mathcal{C}\cap x)=d+1$ or $\operatorname{Ldim}(\mathcal{C}\setminus x)=d+1$ . Let $B$ be such that $x\in B$ iff $\operatorname{Ldim}(\mathcal{C}\cap x)=d+1$ .

If $B\in\mathcal{H}$, then we submit $B$ as our query. If we are incorrect, then by choice of $B$, the class $\mathcal{C}^{\prime}$ of concepts consistent with the counterexample $x_{0}$ will have Littlestone dimension $\leq d$. By induction, any concept in $\mathcal{C}^{\prime}$ can be learned in at most $c^{d}$ many queries, and so we learn the target concept in at most $c^{d}+1\leq c^{d+1}$ queries.

If $B\notin\mathcal{H}$, then, since $\operatorname{C}(\mathcal{C},\mathcal{H})=c$, there are some $x_{0},\ldots,x_{c-1}$ such that there is no $A\in\mathcal{C}$ with $B|_{\{x_{0},\ldots,x_{c-1}\}}\sqsubseteq A$. Then, with notation as in the proof of Proposition 2.2,

$$\mathcal{C}=(\mathcal{C}^{(x_{0},1-B(x_{0}))})\cup\dots\cup(\mathcal{C}^{(x_{c-1},1-B(x_{c-1}))}),$$

and $\operatorname{Ldim}(\mathcal{C}^{(x_{i},1-B(x_{i}))})\leq d$ for each $i$ . Then, by induction, for each $i$ , any concept in $\mathcal{C}^{(x_{i},1-B(x_{i}))}$ can be learned in at most $c^{d}$ many queries with guesses from $\mathcal{H}$ . By Lemma 2.5, any concept in $\mathcal{C}$ can be learned in at most $c^{d+1}$ many queries with guesses from $\mathcal{H}$ . ∎
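The case analysis in this proof can be summarized as a recurrence. Writing $T(d)$ for the query bound obtained at Littlestone dimension $d$ with consistency dimension at most $c$ (the symbol $T$ is our notation, not the paper's), the three cases contribute the three terms below, and $c\geq 2$ together with $T(d)\geq 1$ absorbs them into a single factor of $c$.

```latex
\begin{align*}
T(0) &= 1, \\
T(d+1) &\leq \max\{\, 2\,T(d),\; T(d)+1,\; c\,T(d) \,\}
  && \text{(splitting element; $B\in\mathcal{H}$; $B\notin\mathcal{H}$)} \\
       &\leq c\,T(d)
  && \text{(since $c\geq 2$ and $T(d)\geq 1$)}, \\
\text{hence}\quad T(d) &\leq c^{d}.
\end{align*}
```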

On the other hand, Proposition 2.2 gives a lower bound of $\operatorname{Ldim}(\mathcal{C})+1\leq\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$ . There is also a lower bound for learning complexity in terms of consistency dimension:

Proposition 2.7.

[8, Theorem 2] Suppose there is some partially specified subset $A$ which is $n$-consistent with $\mathcal{C}$ but which does not have a total extension in $\mathcal{H}$. Then $n<\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$.

Proof.

By hypothesis, given any equivalence query $H$ , the teacher can find some $x\in\operatorname{dom}(A)$ such that $H(x)\neq A(x)$ . Moreover, since $A$ is $n$ -consistent with $\mathcal{C}$ , the teacher is able to return a counterexample of this form for the first $n$ equivalence queries. Thus $\mathcal{C}$ cannot be learned with fewer than $n+1$ equivalence queries from $\mathcal{H}$ . ∎

In particular, if $\operatorname{C}(\mathcal{C},\mathcal{H})\geq c$ , then there is some subset $A$ which is $(c-1)$ -consistent with $\mathcal{C}$ but which does not belong to $\mathcal{H}$ . Then $c\leq\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$ . So $\operatorname{C}(\mathcal{C},\mathcal{H})\leq\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$ . In fact, we will obtain a stronger bound using strong consistency dimension in section 2.3.

Furthermore, if $\operatorname{C}(\mathcal{C},\mathcal{H})=\infty$ , then $\mathcal{C}$ cannot be learned with equivalence queries from $\mathcal{H}$ . Combining Theorem 2.6 and Propositions 2.2 and 2.7, we obtain the following:

Theorem 2.8.

$\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ iff $\operatorname{Ldim}(\mathcal{C})<\infty$ and $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$.
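On very small finite examples, the bounds combined in this theorem can be compared against the exact value of $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$, computed as the value of the game between learner and adversarial teacher. The function below is a hypothetical brute-force sketch under our own assumptions (finite $X$, sets as frozensets, a deterministic learner, a teacher that may choose any counterexample consistent with some remaining concept); it is not an algorithm from the paper.

```python
from functools import lru_cache
import math


def learning_complexity(concepts, hypotheses, domain):
    """Worst-case number of equivalence queries under optimal play (math.inf if unlearnable)."""
    hyps = tuple(frozenset(h) for h in hypotheses)
    dom = tuple(domain)

    @lru_cache(maxsize=None)
    def cost(version):
        # version: nonempty frozenset of concepts still consistent with all answers so far.
        best = math.inf
        for h in hyps:
            worst = 0  # if no counterexample keeps a concept alive, the teacher must say "yes"
            stalls = False
            for x in dom:
                # Concepts that remain possible if x is returned as a counterexample to h.
                rest = frozenset(a for a in version if (x in a) != (x in h))
                if rest == version:
                    stalls = True  # this query teaches nothing; the teacher can repeat it forever
                    break
                if rest:
                    worst = max(worst, cost(rest))
            if not stalls:
                best = min(best, 1 + worst)
        return best

    return cost(frozenset(frozenset(c) for c in concepts))


domain = range(4)
singletons = [frozenset({x}) for x in domain]
lc_proper = learning_complexity(singletons, singletons, domain)
lc_with_empty = learning_complexity(singletons, singletons + [frozenset()], domain)
```

For the singletons on a four-element set, the value is 4 with $\mathcal{H}=\mathcal{C}$ and drops to 2 once the empty set is allowed as a hypothesis, consistent with the lower bound $\operatorname{Ldim}(\mathcal{C})+1$ and the upper bound $c^{d}$.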

2.2. Obtaining finite consistency dimension

We have established that finite consistency dimension is essential for EQ-learning. The central question we answer in this subsection is: given $\mathcal{C}$ , can one obtain a hypothesis class $\mathcal{H}$ which is not much more complicated than $\mathcal{C}$ with the property that $\operatorname{C}(\mathcal{C},\mathcal{H})$ is finite?

Definition 2.9.

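Fix a set system $\mathcal{C}$ on a set $X$. $\mathcal{C}$ has consistency threshold $n<\infty$ if, given any hypothesis class $\mathcal{H}\supset\mathcal{C}$, we have that

$$\operatorname{C}(\mathcal{C},\mathcal{H})<\infty\quad\text{iff}\quad\operatorname{C}(\mathcal{C},\mathcal{H})\leq n.$$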

Lemma 2.10.

Suppose $A$ is a partially specified subset finitely consistent with $\mathcal{C}$ . Then there is a total extension $A^{\prime}\sqsupseteq A$ finitely consistent with $\mathcal{C}$ .

Proof.

Let $X=\{x_{\alpha}\,|\,\alpha<|X|\}$ be a well-ordering of $X$ . Let $A_{0}=A$ . We inductively define a $\sqsubseteq$ -chain of partially specified subsets $A_{\alpha}$ , where each $A_{\alpha}$ is defined on $\operatorname{dom}(A)\cup\{x_{\xi}\,|\,\xi<\alpha\}$ and is finitely consistent with $\mathcal{C}$ . For $\alpha$ a limit ordinal, set $A_{\alpha}=\cup_{\xi<\alpha}A_{\xi}$ . It is clear that $A_{\alpha}$ is finitely consistent with $\mathcal{C}$ if all $A_{\xi}$ for $\xi<\alpha$ are.

At any successor stage $\alpha+1$, if $x_{\alpha}\in\operatorname{dom}(A_{\alpha})$, set $A_{\alpha+1}=A_{\alpha}$. Otherwise, we must extend $A_{\alpha}$ to $x_{\alpha}$ while remaining finitely consistent with $\mathcal{C}$. Assume for contradiction that neither $B_{0}:=A_{\alpha}\cup\{x_{\alpha}\mapsto 0\}$ nor $B_{1}:=A_{\alpha}\cup\{x_{\alpha}\mapsto 1\}$ is finitely consistent with $\mathcal{C}$. Then there are finite sets $Y_{0},Y_{1}\subseteq\operatorname{dom}(A_{\alpha})$ such that $B_{0}|_{Y_{0}\cup\{x_{\alpha}\}}$ and $B_{1}|_{Y_{1}\cup\{x_{\alpha}\}}$ have no extension in $\mathcal{C}$. But $A_{\alpha}|_{Y_{0}\cup Y_{1}}$ has an extension $B$ in $\mathcal{C}$, and, depending on the value of $B(x_{\alpha})$, $B$ must be an extension of either $B_{0}|_{Y_{0}\cup\{x_{\alpha}\}}$ or $B_{1}|_{Y_{1}\cup\{x_{\alpha}\}}$, a contradiction. So $A_{\alpha}$ has a finitely consistent extension to $x_{\alpha}$, and we set $A_{\alpha+1}$ to be such an extension.

We then take $A^{\prime}=\cup_{\xi<|X|}A_{\xi}$ . ∎

Proposition 2.11.

Let $\mathcal{C},\mathcal{H}$ be set systems and let $A$ be a partially specified subset. The following are equivalent:

(i) $A$ is finitely consistent with $\mathcal{C}$.

(ii) If $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$, then there is a total extension $A^{\prime}\sqsupseteq A$ in $\mathcal{H}$.

Proof.

(i) $\Rightarrow$ (ii): Let $A^{\prime}\sqsupseteq A$ be a total extension finitely consistent with $\mathcal{C}$ . If $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ , then $A^{\prime}\in\mathcal{H}$ .

(ii) $\Rightarrow$ (i): We show the contrapositive. Suppose that $A$ is not finitely consistent with $\mathcal{C}$ , witnessed by some size $n$ restriction $A_{0}$ , which is a $\sqsubseteq$ -minimal such restriction. We find some $\mathcal{H}$ such that $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ but $\mathcal{H}$ contains no total extension of $A$ . Let $\mathcal{H}$ be the collection of all (total partially specified) subsets which are not extensions of $A_{0}$ . So $A$ has no total extension in $\mathcal{H}$ . We claim that $\operatorname{C}(\mathcal{C},\mathcal{H})\leq n$ . Indeed, observe that given any (total partially specified) subset $B$ that is $n$ -consistent with $\mathcal{C}$ , we have $A_{0}\not\sqsubseteq B$ , and then $B\in\mathcal{H}$ .

∎

In particular, if $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ , then $\mathcal{H}$ contains all finitely consistent subsets. That is, extensions of all finitely consistent partially specified subsets (equivalently, by Lemma 2.10, all finitely consistent total partially specified subsets) are necessary to obtain $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ . Consistency threshold classifies when this is a sufficient condition.

Proposition 2.12.

The following are equivalent:

(i) $\mathcal{C}$ has consistency threshold $\leq n<\infty$.

(ii) For all (total partially specified) subsets $A$, if $A$ is $n$-consistent with $\mathcal{C}$, then $A$ is finitely consistent with $\mathcal{C}$.

(iii) If $\mathcal{H}$ contains all finitely consistent (total partially specified) subsets, then $\operatorname{C}(\mathcal{C},\mathcal{H})\leq n$.

Proof.

(i) $\Rightarrow$ (ii): Assume for contradiction that there is some total $A$ which is $n$ -consistent but not finitely consistent. Let $m$ be minimal such that $A$ is $m$ -inconsistent. Then there is a size $m$ restriction $A^{\prime}\sqsubseteq A$ that has no extension in $\mathcal{C}$ . Then let $\mathcal{H}$ contain all subsets which do not extend $A^{\prime}$ .

We claim that $\operatorname{C}(\mathcal{C},\mathcal{H})=m$ . Note that $A$ witnesses that $\operatorname{C}(\mathcal{C},\mathcal{H})\geq m$ . On the other hand, observe that given any partially specified subset $B$ that is $m$ -consistent with $\mathcal{C}$ , we have $A^{\prime}\not\sqsubseteq B$ , and then it is easy to see that $B$ has a total extension in $\mathcal{H}$ .

(ii) $\Rightarrow$ (iii): If $\mathcal{H}$ contains all finitely consistent subsets, and all $n$ -consistent subsets are finitely consistent, then $\operatorname{C}(\mathcal{C},\mathcal{H})\leq n$ holds immediately.

(iii) $\Rightarrow$ (i): By Proposition 2.11, if $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ , then $\mathcal{H}$ already has all finitely consistent subsets. Then $\operatorname{C}(\mathcal{C},\mathcal{H})\leq n$ . ∎

In particular, if $\mathcal{C}$ has finite consistency threshold, then $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ iff $\mathcal{H}$ contains all finitely consistent subsets.

Corollary 2.13.

Suppose $\mathcal{C}$ does not have finite consistency threshold. Then for arbitrarily large $n$ , there is some total subset $A_{n}$ which is $n$ -consistent but not $(n+1)$ -consistent with $\mathcal{C}$ .

Finite consistency threshold is not strictly necessary to provide a positive answer to the central question of this subsection; nevertheless, it does identify a clear qualitative dividing line. When $\mathcal{C}$ has finite consistency threshold, $\mathcal{H}$ only needs to contain all finitely consistent subsets; letting $\mathcal{H}_{\infty}$ be the set of all finitely consistent subsets, we obtain a minimum hypothesis class such that learning is possible.

Where $\mathcal{C}$ does not have finite consistency threshold, more is required; we must add some hypotheses which are inconsistent with the concepts in $\mathcal{C}$ , and there is no minimal $\mathcal{H}$ such that learning is possible. However, for each $m$ , we can replace “finitely consistent” with “ $m$ -consistent” to obtain a class $\mathcal{H}_{m}$ such that $\operatorname{C}(\mathcal{C},\mathcal{H}_{m})\leq m$ —let $\mathcal{H}_{m}$ be the collection of all subsets which are $m$ -consistent with $\mathcal{C}$ . Note that $\mathcal{H}_{m}$ is clearly the minimum hypothesis class such that $\operatorname{C}(\mathcal{C},\mathcal{H})\leq m$ .

Note that for all $m$ , $\mathcal{H}_{\infty}\subseteq\mathcal{H}_{m}$ . By Proposition 2.12, if $\mathcal{C}$ has consistency threshold $n$ , then for all $m\geq n$ , $\mathcal{H}_{m}=\mathcal{H}_{n}=\mathcal{H}_{\infty}$ . If $\mathcal{C}$ does not have finite consistency threshold, there is no minimal $\mathcal{H}$ such that $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ ; by Corollary 2.13, if $\operatorname{C}(\mathcal{C},\mathcal{H})=m$ , then there is $m^{\prime}\geq m$ such that $\mathcal{H}_{m^{\prime}}\subsetneq\mathcal{H}$ .

By choosing $m$ appropriately, given any $\mathcal{C}$ , we can find a hypothesis class such that $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ without increasing the Littlestone dimension; that is, $\operatorname{Ldim}(\mathcal{H})=\operatorname{Ldim}(\mathcal{C})$ .

Theorem 2.14.

Suppose $\operatorname{Ldim}(\mathcal{C})=d<\infty$ . Then there is $\mathcal{H}$ such that $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ and $\operatorname{Ldim}(\mathcal{H})=\operatorname{Ldim}(\mathcal{C})$ . Furthermore, we can find such an $\mathcal{H}$ such that $\operatorname{C}(\mathcal{C},\mathcal{H})\leq\operatorname{Ldim}(\mathcal{C})+1$ .

Proof.

Fix some $m>d=\operatorname{Ldim}(\mathcal{C})$ . Let $\mathcal{H}_{m}$ be the collection of all subsets which are $m$ -consistent with $\mathcal{C}$ . It is immediate that $\operatorname{C}(\mathcal{C},\mathcal{H}_{m})\leq m<\infty$ .

Assume for contradiction that $\operatorname{Ldim}(\mathcal{H}_{m})>\operatorname{Ldim}(\mathcal{C})$ . Consider a binary element tree of height $d+1$ that can be properly labeled with elements of $\mathcal{H}_{m}$ ; in particular, there is some leaf which cannot be labeled with an element of $\mathcal{C}$ . Consider such a leaf. The path through the binary element tree to this leaf defines a partially specified subset $A$ that is $(d+1)$ -inconsistent with $\mathcal{C}$ . In particular, any total extension is $(d+1)$ -inconsistent, so $m$ -inconsistent, and so does not belong to $\mathcal{H}_{m}$ . This contradicts our ability to label the leaf with an element of $\mathcal{H}_{m}$ .

In particular, recall that when $\mathcal{C}$ has finite consistency threshold $n$ , $A$ is $n$ -consistent with $\mathcal{C}$ iff it is finitely consistent with $\mathcal{C}$ . So setting $\mathcal{H}_{m}$ as above with $m$ at least the finite consistency threshold amounts to setting $\mathcal{H}_{m}$ to be the collection of all finitely consistent partially specified subsets. In this case, $\operatorname{Ldim}(\mathcal{H}_{m})=\operatorname{Ldim}(\mathcal{C})$ even if $m\leq d$ , as increasing the Littlestone dimension requires adding something inconsistent with $\mathcal{C}$ .

Regardless of whether $\mathcal{C}$ has finite consistency threshold, we can let $m=d+1$ . Then $\operatorname{C}(\mathcal{C},\mathcal{H}_{m})\leq m=d+1$ . ∎
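For finite classes, the construction in the proof of Theorem 2.14 can be checked by brute force. The following Python sketch is an illustration only, not part of the argument: the names `ldim`, `is_m_consistent` and `H_m` are ours, concepts are encoded as frozensets over a finite domain `X`, and we read "$m$-consistent" as "every restriction to at most $m$ points extends to a concept", which is an assumption about the earlier definition.

```python
from functools import lru_cache
from itertools import combinations, product


def ldim(concepts, X):
    """Littlestone dimension of a finite class of frozensets over the domain X."""
    @lru_cache(maxsize=None)
    def rec(cls):
        if not cls:
            return -1  # convention for the empty class
        best = 0
        for x in X:
            yes = frozenset(A for A in cls if x in A)
            no = cls - yes
            if yes and no:  # x can label the root of a binary element tree
                best = max(best, 1 + min(rec(yes), rec(no)))
        return best
    return rec(frozenset(concepts))


def is_m_consistent(A, concepts, X, m):
    """True if every restriction of the total subset A to at most m points
    agrees with some concept (checking size min(m, |X|) is enough)."""
    k = min(m, len(X))
    return all(
        any(all((x in B) == (x in A) for x in S) for B in concepts)
        for S in combinations(X, k)
    )


def H_m(concepts, X, m):
    """The class of all total subsets of X that are m-consistent with the concepts."""
    subsets = (
        frozenset(x for x, bit in zip(X, bits) if bit)
        for bits in product((0, 1), repeat=len(X))
    )
    return {A for A in subsets if is_m_consistent(A, concepts, X, m)}


# Hypothetical toy instance: the three singletons over a three-point domain.
X = (0, 1, 2)
C = [frozenset({x}) for x in X]
print(ldim(C, X), ldim(H_m(C, X, 2), X), ldim(H_m(C, X, 1), X))
```

In this toy instance $d=1$, so taking $m=2>d$ keeps the Littlestone dimension fixed, whereas $m=1\leq d$ admits every subset and raises it.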

2.3. From consistency to strong consistency

From an algorithms perspective, the result of Theorem 2.6 is unsatisfactory, since it is exponential in $\operatorname{Ldim}(\mathcal{C})$ . We give an example to show that, without modification, we cannot expect a significant improvement.

Example 2.15.

Fix $c>2$ and $d$ . Let $\{a_{\tau}\,|\,\tau\in[c]^{i},\,1\leq i\leq d\}$ be distinct elements indexed by finite nonempty sequences of length at most $d$ from $[c]$ . For $\sigma\in[c]^{d}$ , let $B_{\sigma}=\{a_{\tau}\,|\,\tau\subseteq\sigma\}$ . Let $\mathcal{C}=\{B_{\sigma}\,|\,\sigma\in[c]^{d}\}$ . Then $\operatorname{Ldim}(\mathcal{C})=d$ .

If we take $\mathcal{C}$ to also be our hypothesis class, then $\operatorname{C}(\mathcal{C},\mathcal{C})=c+1$ . Indeed, the (total partially specified) subset $A=\{a_{0}\}$ is $c$ -consistent but not $(c+1)$ -consistent with $\mathcal{C}$ , witnessed by the restriction of $A$ to $\{a_{0},a_{0,0},\ldots,a_{0,c-1}\}$ , so $\operatorname{C}(\mathcal{C},\mathcal{C})\geq c+1$ . On the other hand, if $A$ is a subset $(c+1)$ -consistent with $\mathcal{C}$ , then, by induction on the length of $\tau$ , for each $1\leq i\leq d$ , $A$ contains exactly one $a_{\tau}$ with $|\tau|=i$ , so $A\in\mathcal{C}$ .

However, it may take as many as $c^{d}$ equivalence queries to learn; if the teacher returns $a_{\sigma}$ as a counterexample to each hypothesis $B_{\sigma}$ , then the learner can only eliminate $B_{\sigma}$ .
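The adversarial teacher of Example 2.15 can be simulated directly. The sketch below is ours and rests on simple encoding assumptions: the element $a_{\tau}$ is represented by the tuple $\tau$, the concept $B_{\sigma}$ by the set of nonempty initial segments of $\sigma$, and the learner is a naive one that always proposes some concept still in its version space.

```python
from itertools import product


def build_class(c, d):
    """Map each sigma in [c]^d to B_sigma, the set of its nonempty initial segments."""
    return {
        sigma: frozenset(sigma[:i] for i in range(1, d + 1))
        for sigma in product(range(c), repeat=d)
    }


def worst_case_queries(c, d):
    """Count the equivalence queries used against the teacher that answers
    the hypothesis B_sigma with the counterexample a_sigma."""
    version_space = build_class(c, d)
    queries = 0
    while True:
        sigma = next(iter(version_space))  # learner proposes B_sigma
        queries += 1
        if len(version_space) == 1:
            return queries  # only one candidate is left, so it is the target
        # The counterexample a_sigma says "a_sigma is not in the target".
        # Only B_sigma contains a_sigma, so only B_sigma is eliminated.
        version_space = {
            s: B for s, B in version_space.items() if sigma not in B
        }


print(worst_case_queries(3, 2))
```

Each counterexample removes a single candidate, so the loop runs once per concept, matching the count $c^{d}$ in the text.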

The most promising modification is the following variant of consistency dimension, which also appeared in [8] in a slightly different form.

Definition 2.16.

The strong consistency dimension of $\mathcal{C}$ with respect to $\mathcal{H}$ , denoted $\operatorname{SC}(\mathcal{C},\mathcal{H})$ , is the least integer $n$ such that for every partially specified subset $A$ , if $A$ is $n$ -consistent with $\mathcal{C}$ , then $A$ has an extension in $\mathcal{H}$ . If no such $n$ exists, then say $\operatorname{SC}(\mathcal{C},\mathcal{H})=\infty$ .

We therefore make the stronger requirement that all partially specified subsets that are $n$ -consistent have an extension in $\mathcal{H}$ , rather than just all total partially specified subsets. It is immediate from the definition that $\operatorname{C}(\mathcal{C},\mathcal{H})\leq\operatorname{SC}(\mathcal{C},\mathcal{H})$ . At the smallest levels, consistency dimension and strong consistency dimension are equal.
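On a finite domain both dimensions can be computed by exhaustive search, which is a convenient way to experiment with the definitions. The following Python sketch is ours and is only meant as an illustration: partially specified subsets are encoded as dictionaries from points to 0/1, concepts and hypotheses as frozensets, and "$n$-consistent" is taken to mean that every restriction to at most $n$ points has an extension in $\mathcal{C}$ (an assumption about the earlier definition). The helper names are hypothetical.

```python
from itertools import combinations, product


def extends(total, partial):
    """True if the total subset (a frozenset) agrees with the partial assignment."""
    return all((x in total) == bool(v) for x, v in partial.items())


def is_n_consistent(partial, concepts, n):
    """Every restriction of `partial` to at most n points extends to a concept."""
    pts = list(partial)
    k = min(n, len(pts))  # larger restrictions imply the smaller ones
    return all(
        any(extends(c, {x: partial[x] for x in S}) for c in concepts)
        for S in combinations(pts, k)
    )


def partial_subsets(X, total_only):
    """Enumerate total assignments, or all partial ones when total_only is False."""
    values = (0, 1) if total_only else (0, 1, None)
    for assignment in product(values, repeat=len(X)):
        yield {x: v for x, v in zip(X, assignment) if v is not None}


def dimension(X, concepts, hypotheses, strong):
    """C(C, H) when strong is False, SC(C, H) when strong is True.
    Returns None if no n up to |X| works (the dimension is infinite)."""
    for n in range(1, len(X) + 1):
        if all(
            any(extends(h, A) for h in hypotheses)
            for A in partial_subsets(X, total_only=not strong)
            if is_n_consistent(A, concepts, n)
        ):
            return n
    return None


# Hypothetical toy instance: singletons, with or without the empty hypothesis.
X = (0, 1, 2)
C = [frozenset({x}) for x in X]
H = C + [frozenset()]
print(dimension(X, C, C, strong=False), dimension(X, C, C, strong=True))
print(dimension(X, C, H, strong=False), dimension(X, C, H, strong=True))
```

The search is exponential in $|X|$, so it is only practical for very small domains.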

Proposition 2.17.

If $\operatorname{C}(\mathcal{C},\mathcal{H})=1$ , then $\operatorname{SC}(\mathcal{C},\mathcal{H})=1$ . If $\operatorname{C}(\mathcal{C},\mathcal{H})=2$ , then $\operatorname{SC}(\mathcal{C},\mathcal{H})=2$ .

Proof.

Observe that $\operatorname{C}(\mathcal{C},\mathcal{H})=1$ iff $\operatorname{SC}(\mathcal{C},\mathcal{H})=1$ iff $\mathcal{H}$ shatters the set of all elements $x\in X$ such that there are $A_{0}$ and $A_{1}$ in $\mathcal{C}$ such that $x\notin A_{0}$ but $x\in A_{1}$ .

Suppose that $\operatorname{C}(\mathcal{C},\mathcal{H})=2$ . Let $A$ be a partially specified subset that is 2-consistent with $\mathcal{C}$ . We wish to find a total extension of $A$ in $\mathcal{H}$ . It suffices to find a total extension $B\sqsupseteq A$ that is 2-consistent with $\mathcal{C}$ .

Let $X=\{x_{\alpha}\,|\,\alpha<|X|\}$ be a well-ordering of $X$ . Let $A_{0}=A$ . We inductively define a $\sqsubseteq$ -chain of partially specified subsets $A_{\alpha}$ , where each $A_{\alpha}$ is defined on $\operatorname{dom}(A)\cup\{x_{\xi}\,|\,\xi<\alpha\}$ and is 2-consistent with $\mathcal{C}$ . For $\alpha$ a limit ordinal, set $A_{\alpha}=\cup_{\xi<\alpha}A_{\xi}$ . It is clear that $A_{\alpha}$ is 2-consistent with $\mathcal{C}$ if all $A_{\xi}$ for $\xi<\alpha$ are.

At any successor stage $\alpha+1$ , if $x_{\alpha}\in\operatorname{dom}(A_{\alpha})$ , set $A_{\alpha+1}=A_{\alpha}$ . Otherwise, we must extend $A_{\alpha}$ to $x_{\alpha}$ while remaining 2-consistent with $\mathcal{C}$ . Assume for contradiction that neither $B_{0}:=A_{\alpha}\cup\{x_{\alpha}\mapsto 0\}$ nor $B_{1}:=A_{\alpha}\cup\{x_{\alpha}\mapsto 1\}$ is 2-consistent with $\mathcal{C}$ . Then there are $y_{0}$ , $y_{1}\in\operatorname{dom}(A_{\alpha})$ such that $B_{0}|_{\{y_{0},x_{\alpha}\}}$ and $B_{1}|_{\{y_{1},x_{\alpha}\}}$ have no extension in $\mathcal{C}$ . But $A_{\alpha}|_{\{y_{0},y_{1}\}}$ has an extension $B$ in $\mathcal{C}$ , and $B$ must be an extension of either $B_{0}|_{\{y_{0},x_{\alpha}\}}$ or $B_{1}|_{\{y_{1},x_{\alpha}\}}$ , a contradiction. So $A_{\alpha}$ has a 2-consistent extension to $x_{\alpha}$ , and we set $A_{\alpha+1}$ to be such an extension.

We then take $\cup_{\xi<|X|}A_{\xi}$ to be our total extension. ∎

As the following examples show, consistency dimension and strong consistency dimension may differ when $\operatorname{C}(\mathcal{C},\mathcal{H})\geq 3$ .

Example 2.18.

Let $X=\{a,b,c,d,e\}$ . Let

[TABLE]

One can verify that $\operatorname{C}(\mathcal{C},\mathcal{H})=3$ , but the partially specified subset $\{a,b,c,d\}$ with $e$ unspecified witnesses that $\operatorname{SC}(\mathcal{C},\mathcal{H})>3$ .

Example 2.19.

Continuing Example 2.15, observe that $\operatorname{SC}(\mathcal{C},\mathcal{C})=c^{d}$ . In particular, the partially specified subset $A^{\prime}$ given by

[TABLE]

witnesses that $\operatorname{SC}(\mathcal{C},\mathcal{C})>c^{d}-1$ . Then we learn in at most $\operatorname{SC}(\mathcal{C},\mathcal{C})$ many queries. Moreover, this demonstrates that consistency dimension and strong consistency dimension can differ by an arbitrarily large amount (allowing $\operatorname{Ldim}(\mathcal{C})$ to vary), and that strong consistency dimension may even be exponentially larger than consistency dimension.

Strong consistency dimension, like consistency dimension, categorizes equivalence query learning:

Theorem 2.20.

$\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ iff $\operatorname{Ldim}(\mathcal{C})<\infty$ and $\operatorname{SC}(\mathcal{C},\mathcal{H})<\infty$ . In particular, $\operatorname{SC}(\mathcal{C},\mathcal{H})\leq\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$ .

Proof.

For the reverse direction, use Theorem 2.6 and the observation that $\operatorname{C}(\mathcal{C},\mathcal{H})\leq\operatorname{SC}(\mathcal{C},\mathcal{H})$ .

For the forward direction, use Propositions 2.2 and 2.7. In particular, if $\operatorname{SC}(\mathcal{C},\mathcal{H})\geq c$ , then there is a partially specified subset $A$ that is $(c-1)$ -consistent with $\mathcal{C}$ but which has no total extension in $\mathcal{H}$ . Then, by Proposition 2.7, $c\leq\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})$ . ∎

Corollary 2.21.

Suppose $\operatorname{Ldim}(\mathcal{C})<\infty$ . Then $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ iff $\operatorname{SC}(\mathcal{C},\mathcal{H})<\infty$ .

The distinction between consistency dimension and strong consistency dimension is subtle, and many previous results hold with little to no modification if one replaces consistency dimension with strong consistency dimension. On the other hand, our work in Section 3 will reveal the practical difficulties associated with strong consistency dimension in complicated concept classes.

We have already seen in Theorem 2.20 that strong consistency dimension provides a better lower bound for learning complexity. In the finite case, it is known that strong consistency dimension also gives a stronger upper bound:

Theorem 2.22.

[8, Theorem 2] Suppose $\mathcal{C}$ is finite. Then $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})\leq\lceil\operatorname{SC}(\mathcal{C},\mathcal{H})\cdot\ln|\mathcal{C}|\rceil$ .

Proof.

As this was originally framed in the setting where concepts were represented by strings, we give an abbreviated translation of the original proof into the language of set systems. This proof demonstrates the utility of constructing a partial hypothesis and taking some complete extension.

Let $c=\operatorname{SC}(\mathcal{C},\mathcal{H})$ . At stage $i$ , let $\mathcal{C}_{i}\subseteq\mathcal{C}$ be the set of remaining possible target concepts. Let $A_{i}$ be the partially specified subset given by

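$$A_{i}(x)=\begin{cases}1&\text{if }|\{B\in\mathcal{C}_{i}\mid x\notin B\}|<\frac{1}{c}|\mathcal{C}_{i}|,\\0&\text{if }|\{B\in\mathcal{C}_{i}\mid x\in B\}|<\frac{1}{c}|\mathcal{C}_{i}|,\\\text{undefined}&\text{otherwise.}\end{cases}$$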

Observe that $A_{i}$ is $c$ -consistent with $\mathcal{C}$ : given any $Y:=\{x_{0},\ldots,x_{c-1}\}\subseteq\operatorname{dom}(A_{i})$ , for each $j$ , fewer than $\frac{1}{c}|\mathcal{C}_{i}|$ many remaining concepts disagree with $A_{i}$ on $x_{j}$ , so fewer than $c\cdot\frac{1}{c}|\mathcal{C}_{i}|=|\mathcal{C}_{i}|$ many concepts disagree with $A_{i}$ on some $x_{j}$ . So some concept agrees with $A_{i}$ on $Y$ , and hence $A_{i}$ is $c$ -consistent.

So we can find some $B\in\mathcal{H}$ such that $B\sqsupseteq A_{i}$ , and we submit $B$ as our hypothesis. By choice of $A_{i}$ , if we receive a counterexample, we will have $|\mathcal{C}_{i+1}|\leq\frac{c-1}{c}|\mathcal{C}_{i}|$ . Repeating this $\lceil c\cdot\ln|\mathcal{C}|\rceil$ many times is enough to identify and submit the target concept. ∎
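The procedure in this proof is short enough to run on small examples. The Python sketch below is ours, not the original algorithm of [8]: the threshold rule for the partial hypothesis is read off from the counting argument above (a point is specified only when fewer than a $\frac{1}{c}$ fraction of the remaining concepts disagree), the teacher is simulated by returning the least point of the symmetric difference, and the caller is assumed to pass a value `c` at least $\operatorname{SC}(\mathcal{C},\mathcal{H})$ and at least 2.

```python
import math


def extends(total, partial):
    """True if the total subset (a frozenset) agrees with the partial assignment."""
    return all((x in total) == bool(v) for x, v in partial.items())


def learn_by_weighted_majority(X, concepts, hypotheses, c, target):
    """Return the number of equivalence queries used to identify `target`.
    Assumes c >= max(2, SC(concepts, hypotheses)) and target in concepts."""
    remaining = list(concepts)
    queries = 0
    while True:
        partial = {}
        for x in X:
            containing = sum(1 for B in remaining if x in B)
            missing = len(remaining) - containing
            if c * containing < len(remaining):
                partial[x] = 0  # fewer than a 1/c fraction contain x
            elif c * missing < len(remaining):
                partial[x] = 1  # fewer than a 1/c fraction omit x
        # Strong consistency guarantees that some hypothesis extends `partial`.
        hypothesis = next(h for h in hypotheses if extends(h, partial))
        queries += 1
        if hypothesis == target:
            return queries
        x = min(hypothesis ^ target)  # simulated teacher's counterexample
        remaining = [B for B in remaining if (x in B) == (x in target)]


# Hypothetical toy instance: singletons as concepts, plus the empty hypothesis.
X = (0, 1, 2)
C = [frozenset({x}) for x in X]
H = [frozenset()] + C
bound = math.ceil(2 * math.log(len(C)))
print([learn_by_weighted_majority(X, C, H, 2, t) for t in C], bound)
```

If `c` is smaller than the strong consistency dimension, the search for an extending hypothesis may fail, which is exactly where the assumption of the theorem is used.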

In light of Example 2.19, one hopes that improved bounds on learning can be found in terms of strong consistency dimension and Littlestone dimension when $\mathcal{C}$ is infinite. We are unable to show this presently, but offer some evidence in this direction:

Proposition 2.23.

Suppose $\operatorname{Ldim}(\mathcal{C})=d<\infty$ and $\operatorname{SC}(\mathcal{C},\mathcal{H})=2<\infty$ . Then $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{H})=d+1$ .

Proof.

We know by Proposition 2.2 that $d+1$ is a lower bound. We show that it is also an upper bound.

Let $V_{0}=\mathcal{C}$ . Inductively define $V_{i}$ , $i=1,\ldots,d$ as follows. Given $V_{i}$ , for any $x\in X$ and $j\in\{0,1\}$ , let

$$V_{i}^{(x,j)}:=\{B\in V_{i}\mid\chi_{B}(x)=j\},$$

where $\chi_{B}$ is the characteristic function on $B$ . Construct the partially specified subset $A_{i}$ where

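$$A_{i}(x)=\begin{cases}0&\text{if }\operatorname{Ldim}(V_{i}^{(x,0)})=\operatorname{Ldim}(V_{i}),\\1&\text{if }\operatorname{Ldim}(V_{i}^{(x,1)})=\operatorname{Ldim}(V_{i}),\\\text{undefined}&\text{otherwise.}\end{cases}$$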

We claim that $A_{i}$ has an extension in $\mathcal{H}$ . By our assumption that $\operatorname{SC}(\mathcal{C},\mathcal{H})=2$ , it suffices to check that $A_{i}$ is 2-consistent with $V_{i}$ . Suppose for contradiction that there are $x_{0},x_{1}\in\operatorname{dom}(A_{i})$ such that, without loss of generality, $A_{i}(x_{0})=A_{i}(x_{1})=0$ , but there is no extension of $A_{i}|_{\{x_{0},x_{1}\}}$ in $V_{i}$ . Then observe that $V_{i}^{(x_{0},0)}\subseteq V_{i}^{(x_{1},1)}$ , whence

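$$\operatorname{Ldim}(V_{i})=\operatorname{Ldim}(V_{i}^{(x_{0},0)})\leq\operatorname{Ldim}(V_{i}^{(x_{1},1)})\leq\operatorname{Ldim}(V_{i}),$$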

so $\operatorname{Ldim}(V_{i}^{(x_{1},1)})=\operatorname{Ldim}(V_{i})$ . But we also have $\operatorname{Ldim}(V_{i}^{(x_{1},0)})=\operatorname{Ldim}(V_{i})$ , a contradiction, as we could then construct a binary element tree with proper labels from $V_{i}$ of height $\operatorname{Ldim}(V_{i})+1$ with $x_{1}$ at the root.

Let $B_{i}\in\mathcal{H}$ be a total extension of $A_{i}$ . Submit $B_{i}$ as the hypothesis. If $B_{i}$ is correct, we are done. Otherwise, we receive a counterexample $x_{i}$ . Set

$$V_{i+1}:=\{A\in V_{i}\mid\chi_{A}(x_{i})\neq\chi_{B_{i}}(x_{i})\}.$$

Observe that at each stage, $\operatorname{Ldim}(V_{i+1})<\operatorname{Ldim}(V_{i})$ . Therefore, if we make $d$ queries without correctly identifying the target, then we must have $\operatorname{Ldim}(V_{d})=0$ . Then $V_{d}$ is a singleton, which must be the target concept.

∎
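For a finite class, the learner in the proof of Proposition 2.23 can be run as written. The following Python sketch is an illustration under our own encoding (frozensets for concepts, a dictionary for the partial hypothesis, a simulated teacher returning the least point of disagreement); the function names are hypothetical, and the rule "specify $x\mapsto j$ exactly when restricting to $\chi(x)=j$ does not lower the Littlestone dimension" is our reading of the construction of $A_{i}$ from the surrounding argument.

```python
from functools import lru_cache


def ldim(concepts, X):
    """Littlestone dimension of a finite class of frozensets over the domain X."""
    @lru_cache(maxsize=None)
    def rec(cls):
        if not cls:
            return -1  # convention for the empty class
        best = 0
        for x in X:
            yes = frozenset(A for A in cls if x in A)
            no = cls - yes
            if yes and no:
                best = max(best, 1 + min(rec(yes), rec(no)))
        return best
    return rec(frozenset(concepts))


def learn_sc2(X, concepts, hypotheses, target):
    """Equivalence-query learner; assumes SC(concepts, hypotheses) <= 2.
    Returns the number of queries used to identify `target`."""
    V = frozenset(concepts)
    queries = 0
    while True:
        d = ldim(V, X)
        partial = {}
        for x in X:
            for j in (0, 1):
                side = frozenset(B for B in V if (x in B) == bool(j))
                if ldim(side, X) == d:  # at most one j can satisfy this
                    partial[x] = j
        hypothesis = next(
            h for h in hypotheses
            if all((x in h) == bool(v) for x, v in partial.items())
        )
        queries += 1
        if hypothesis == target:
            return queries
        x = min(hypothesis ^ target)  # simulated counterexample
        V = frozenset(B for B in V if (x in B) != (x in hypothesis))


# Hypothetical toy instance with Ldim = 1 and strong consistency dimension 2.
X = (0, 1, 2)
C = [frozenset({x}) for x in X]
H = [frozenset()] + C
print(ldim(C, X), [learn_sc2(X, C, H, t) for t in C])
```

Every counterexample lowers the Littlestone dimension of the version space, so the loop stops after at most $d+1$ queries.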

The proof of Proposition 2.23 uses strong consistency in a key way, as the hypothesis is generated by extending a certain partially specified subset. Nevertheless, the conclusion holds under the assumption that $\operatorname{C}(\mathcal{C},\mathcal{H})=2$ , due to Proposition 2.17.

2.4. Adding membership queries and efficient learning of finite classes

Consistency dimension was originally derived from the notion of polynomial certificates, which was used to characterize learning with equivalence and membership queries in the finite case by [15]. The following is an improvement of the upper bound on EQ+MQ learning complexity of $\lceil\operatorname{C}(\mathcal{C},\mathcal{H})\log_{2}|\mathcal{C}|\rceil$ implicit in the proof of Theorem 3.1.1 in [15] (stated explicitly in [8]). Our bound replaces $\log_{2}|\mathcal{C}|$ with $\operatorname{Ldim}(\mathcal{C})$ .

Theorem 2.24.

Suppose $\operatorname{Ldim}(\mathcal{C})=d<\infty$ and $\operatorname{C}(\mathcal{C},\mathcal{H})=c<\infty$ . Then $\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})\leq c^{\prime}d+1$ , where $c^{\prime}=\max\{1,c-1\}$ .

Proof.

The algorithm is similar to that of Theorem 2.6. However, the applications of Lemma 2.5 are replaced with membership queries.

We proceed by induction on $d$ . The base case, $d=0$ , is trivial, as then $\mathcal{C}$ is a singleton. Suppose there is some element $x$ such that $\operatorname{Ldim}(\mathcal{C}\cap x)<d+1$ and $\operatorname{Ldim}(\mathcal{C}\setminus x)<d+1$ , where $\mathcal{C}\cap x:=\{A\in\mathcal{C}\,|\,x\in A\}$ and $\mathcal{C}\setminus x:=\{A\in\mathcal{C}\,|\,x\notin A\}$ . Then by induction, any concept in $\mathcal{C}\cap x$ can be learned in at most $c^{\prime}d+1$ queries with guesses from $\mathcal{H}$ , and the same is true for $\mathcal{C}\setminus x$ . Submit $x$ as a membership query. This tells us whether the target concept lies in $\mathcal{C}\cap x$ or $\mathcal{C}\setminus x$ , and then we require at most $c^{\prime}d+1$ many queries, for a total of $c^{\prime}d+2\leq c^{\prime}(d+1)+1$ many queries.

If no such $x$ exists, then for all $x$ , either $\operatorname{Ldim}(\mathcal{C}\cap x)=d+1$ or $\operatorname{Ldim}(\mathcal{C}\setminus x)=d+1$ . Let $B$ be such that $x\in B$ iff $\operatorname{Ldim}(\mathcal{C}\cap x)=d+1$ .

If $B\in\mathcal{H}$ , then we submit $B$ as our query. If we are incorrect, then by choice of $B$ , the class $\mathcal{C}^{\prime}$ of concepts consistent with the counterexample $x_{0}$ will have Littlestone dimension $\leq d$ . By induction, any concept in $\mathcal{C}^{\prime}$ can be learned in at most $c^{\prime}d+1$ many queries, and so we learn the target in at most $c^{\prime}d+2\leq c^{\prime}(d+1)+1$ queries.

If $B\notin\mathcal{H}$ , then, since $\operatorname{C}(\mathcal{C},\mathcal{H})=c$ , there are some $x_{0},\ldots,x_{c-1}$ such that there is no $A\in\mathcal{C}$ such that $B|_{\{x_{0},\ldots,x_{c-1}\}}\sqsubseteq A$ . (Observe that this cannot happen when $c=1$ . In fact, Proposition 2.17 and the proof of Proposition 2.23 imply that this cannot even happen when $c=2$ . In particular, $c^{\prime}=c-1$ .) Then, with notation as in the proof of Proposition 2.2,

$$\mathcal{C}=\bigcup_{i=0}^{c-1}\mathcal{C}^{(x_{i},1-B(x_{i}))},$$

and $\operatorname{Ldim}(\mathcal{C}^{(x_{i},1-B(x_{i}))})\leq d$ for each $i$ . By induction, any concept in each $\mathcal{C}^{(x_{i},1-B(x_{i}))}$ can be learned in at most $c^{\prime}d+1$ many queries. By submitting $x_{0},\ldots,x_{c-2}$ as membership queries, we can determine some $i$ such that the target belongs to $\mathcal{C}^{(x_{i},1-B(x_{i}))}$ (if the result of each membership query on $x_{j}$ is $B(x_{j})$ , then we know that $i=c-1$ ). We therefore learn in at most $c^{\prime}d+1+(c-1)=c^{\prime}(d+1)+1$ many queries. ∎
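The three cases of this proof translate into a short loop for finite classes. The Python sketch below is ours and makes several simplifying assumptions: concepts are frozensets over a tuple `X` of comparable points, the target is passed in so that the teacher can be simulated, and in the case $B\notin\mathcal{H}$ the points $x_{0},\ldots,x_{c-1}$ are found by searching for a smallest set on which $B$ has no extension in the current version space. All function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def ldim(concepts, X):
    """Littlestone dimension of a finite class of frozensets over the domain X."""
    @lru_cache(maxsize=None)
    def rec(cls):
        if not cls:
            return -1
        best = 0
        for x in X:
            yes = frozenset(A for A in cls if x in A)
            no = cls - yes
            if yes and no:
                best = max(best, 1 + min(rec(yes), rec(no)))
        return best
    return rec(frozenset(concepts))


def learn_eq_mq(X, concepts, hypotheses, target):
    """Return (learned concept, total number of queries)."""
    V, H, queries = frozenset(concepts), set(hypotheses), 0

    def side(x, j):
        return frozenset(A for A in V if (x in A) == bool(j))

    while True:
        d = ldim(V, X)
        if d == 0:
            (only,) = V
            return only, queries + 1  # final equivalence query
        split = next((x for x in X
                      if ldim(side(x, 0), X) < d and ldim(side(x, 1), X) < d),
                     None)
        if split is not None:  # case 1: one membership query lowers Ldim
            queries += 1
            V = side(split, split in target)
            continue
        B = frozenset(x for x in X if ldim(side(x, 1), X) == d)
        if B in H:  # case 2: B is a legal equivalence query
            queries += 1
            if B == target:
                return B, queries
            x0 = min(B ^ target)
            V = side(x0, x0 not in B)
            continue
        # Case 3: B is not a hypothesis. Find points where no concept follows B.
        pts = next(S for k in range(1, len(X) + 1)
                   for S in combinations(X, k)
                   if not any(all((x in A) == (x in B) for x in S) for A in V))
        chosen = pts[-1]  # used if all earlier answers agree with B
        for x in pts[:-1]:
            queries += 1  # membership query on x
            if (x in target) != (x in B):
                chosen = x
                break
        V = side(chosen, chosen not in B)


# Hypothetical toy instance: proper learning of singletons on three points.
X = (0, 1, 2)
C = [frozenset({x}) for x in X]
print([learn_eq_mq(X, C, C, t)[1] for t in C])
```

In the toy instance $c=3$ and $d=1$, so the bound $c^{\prime}d+1$ of the theorem equals 3.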

We have a lower bound on learning complexity in terms of consistency dimension in this setting analogous to Proposition 2.7:

Proposition 2.25.

Suppose there is some (total) subset $A$ which is $n$ -consistent with $\mathcal{C}$ but which does not have a total extension in $\mathcal{H}$ . Then $n<\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})$ . In particular, $\operatorname{C}(\mathcal{C},\mathcal{H})\leq\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})$ .

Proof.

We first show that $n<\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})$ . If the learner submits $x$ as a membership query, the teacher returns $A(x)$ if possible, that is, if there is a concept $B\in\mathcal{C}$ which agrees with the previous data and satisfies $B(x)=A(x)$ .

By hypothesis, given any equivalence query $H$ , the teacher can find some $x\in\operatorname{dom}(A)$ such that $H(x)\neq A(x)$ , and the teacher returns a counterexample of this form if possible, that is, if there is a concept $B\in\mathcal{C}$ which agrees with the previous data and satisfies $B(x)=A(x)$ .

Moreover, since $A$ is $n$ -consistent with $\mathcal{C}$ , the teacher is able to return data of this form for the first $n$ queries. Thus $\mathcal{C}$ cannot be learned with fewer than $n+1$ membership queries and equivalence queries from $\mathcal{H}$ .

From this, it follows that $\operatorname{C}(\mathcal{C},\mathcal{H})\leq\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})$ . ∎

Finally, putting together the various upper and lower bounds from this section we give a characterization of those problems efficiently learnable by equivalence and membership queries:

Theorem 2.26.

Let $(\mathcal{C}_{n},\mathcal{H}_{n})_{n\in\mathbb{N}}$ be a family of concept classes and hypothesis classes, respectively. Let $c_{n}=\operatorname{C}(\mathcal{C}_{n},\mathcal{H}_{n}).$ Let $d_{n}=\operatorname{Ldim}(\mathcal{C}_{n}).$ The following are equivalent:

**(i): **

$\operatorname{LC}^{EQ+MQ}(\mathcal{C}_{n},\mathcal{H}_{n})$ is bounded by a polynomial in $n$ .

**(ii): **

$c_{n}$ and $d_{n}$ are bounded by a polynomial in $n$ .

**(iii): **

The algorithm from Theorem 2.24 learns $\mathcal{C}_{n}$ using polynomially many (in $n$ ) membership queries and equivalence queries from $\mathcal{H}_{n}$ .

Proof.

(ii) $\Rightarrow$ (iii) follows immediately from Theorem 2.24, and (iii) $\Rightarrow$ (i) follows by definition of learning complexity.

(i) $\Rightarrow$ (ii): In Proposition 2.25, we showed that $\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})\geq\operatorname{C}(\mathcal{C},\mathcal{H}),$ so it follows that if $c_{n}$ is not polynomially bounded then neither is $\operatorname{LC}^{EQ+MQ}(\mathcal{C}_{n},\mathcal{H}_{n})$ .

Now suppose that $d_{n}$ is not polynomially bounded. By [7, Theorem 2.1] (the inequality of [7] gives a lower bound for $\operatorname{LC}^{EQ+MQ}$ which improved on the lower bound of $\frac{\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{P}(X))}{\log(1+\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{P}(X)))}$ from [20, Theorem 3]; in fact, Theorem 3 of [20] suffices for our purposes), we have

[TABLE]

By [18, Theorems 5 and 6], we can replace $\operatorname{LC}^{EQ}(\mathcal{C},\mathcal{P}(X))$ with $\operatorname{Ldim}(\mathcal{C}).$ Thus:

[TABLE]

from which it follows that $\operatorname{LC}^{EQ+MQ}(\mathcal{C}_{n},\mathcal{H}_{n})$ is not polynomially bounded. ∎

Finally, the upper and lower bounds of this section also yield a characterization of which infinite classes are learnable in finitely many equivalence and membership queries.

Corollary 2.27.

$\operatorname{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})<\infty$ iff $\operatorname{Ldim}(\mathcal{C})<\infty$ and $\operatorname{C}(\mathcal{C},\mathcal{H})<\infty$ .

2.5. The negation of the finite cover property

One can compare set systems with finite strong consistency dimension to the model-theoretic classes of formulas and theories without the finite cover property, which we define below. Informally, the negation of the finite cover property allows for a specific quantitative bound for applications of compactness.

Definition 2.28.

Fix a first order theory $T$ . A formula $\phi(x;y)$ in the language of $T$ does not have the finite cover property (nfcp) if there is $n=n(\phi)$ such that for all $\mathcal{M}\models T$ , and every $p\subseteq\{\phi(x;a),\neg\phi(x;a)\,|\,a\in M\}$ , the following holds: if every $q\subseteq p$ of size $n$ is consistent, then $p$ is consistent. We let $n(\phi)$ denote the minimal such $n$ .

$T$ does not have the finite cover property if all formulas $\phi(x;y)$ do not have the finite cover property. (The definition of nfcp on the formula level given here is stronger than the original formulation in [26], but it gives an equivalent characterization on the level of theories.)

Consider the setting where $\mathcal{C}$ is generated by a formula $\phi(x;y)$ , that is, $\mathcal{M}\models T$ and

$$\mathcal{C}_{\phi}:=\{\phi(M,b)\mid b\in M\}.$$

That is, $\mathcal{C}_{\phi}$ consists of the $\phi$ -definable sets. Suppose $\phi^{opp}(y;x)=\phi(x;y)$ does not have the finite cover property, witnessed by some $n=n(\phi^{opp})$ . Then, given any disjoint $A_{0},A_{1}\subseteq X$ , if every size $n$ subset of

$$p(y):=\{\phi(a;y)\mid a\in A_{1}\}\cup\{\neg\phi(a;y)\mid a\in A_{0}\}$$

is consistent, then $p(y)$ is consistent. We can identify this partial type with the partially specified subset $A$ where

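$$A(x)=\begin{cases}1&\text{if }x\in A_{1},\\0&\text{if }x\in A_{0},\\\text{undefined}&\text{otherwise.}\end{cases}$$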

By passing to an $|M|^{+}$ -saturated extension $\mathcal{N}\succ\mathcal{M}$ to obtain a larger parameter set, we can find $b^{\prime}\in\mathcal{N}$ satisfying $p(y)$ . Then $\phi(M,b^{\prime})$ is a total extension of $A$ .

Supposing we have passed to an $|\mathcal{M}|^{+}$ saturated extension $\mathcal{N}$ , we can let

$$\mathcal{H}_{\phi}:=\{\phi(M,b)\mid b\in N\}.$$

That is, $\mathcal{H}_{\phi}$ consists of all externally $\phi$ -definable subsets of $M$ , as $N$ contains realizations of all consistent partial $\phi^{opp}$ -types over $M$ . By the compactness theorem, this means that $N$ contains realizations of all finitely consistent partial $\phi^{opp}$ -types over $M$ . Having identified partially specified subsets of $M$ with their corresponding $\phi^{opp}$ -type, this amounts to observing that $\mathcal{H}_{\phi}$ contains total extensions of all finitely consistent partially specified subsets, equivalently, contains all finitely consistent total subsets.

This gives a model-theoretic motivation to the strategy suggested by Proposition 2.12. Adding all finitely consistent subsets to $\mathcal{H}$ amounts to saturating $\mathcal{N}$ so as to realize all $\phi^{opp}$ -types over $\mathcal{M}$ .

If $\phi^{opp}$ has nfcp with $n(\phi^{opp})=n$ , then the finitely consistent partial types are exactly the $n$ -consistent types. Then $\mathcal{H}_{\phi}$ contains total extensions of all $n$ -consistent partially specified subsets, so $\operatorname{SC}(\mathcal{C}_{\phi},\mathcal{H}_{\phi})=n$ . Note that $\phi^{opp}$ -types witnessing that $n$ is the minimal value at which $\phi^{opp}$ has nfcp give partially specified subsets witnessing that $\operatorname{SC}(\mathcal{C}_{\phi},\mathcal{H}_{\phi})\not<n$ . This reflects a variant of Proposition 2.12 for strong consistency dimension.

In particular, formulas $\phi$ such that $\phi^{opp}$ has nfcp provide a rich family of examples where $\mathcal{C}_{\phi}$ has finite (strong) consistency threshold. That is, for such $\phi$ , it is necessary and sufficient for $\mathcal{H}$ to contain all externally $\phi$ -definable subsets (that is, all total finitely consistent partially specified subsets) to obtain $\operatorname{SC}(\mathcal{C},\mathcal{H})<\infty$ . On the other hand, when $\phi^{opp}$ has the finite cover property, the externally definable sets are no longer sufficient, and one must venture beyond the sets $\phi$ is capable of cutting out to obtain $\operatorname{SC}(\mathcal{C},\mathcal{H})<\infty$ (that is, by adding some sets which are inconsistent).

Furthermore, Littlestone dimension of $\phi(x;y)$ (that is, the Littlestone dimension of $\mathcal{C}_{\phi}$ ) is expressible as a first-order property. So we will have $\operatorname{Ldim}(\mathcal{C})=\operatorname{Ldim}(\mathcal{H})$ . So when the context is a set system $\mathcal{C}$ generated by a stable formula $\phi(x;y)$ with $\phi^{opp}(y;x)$ nfcp, we can obtain a set system $\mathcal{H}$ such that $\operatorname{SC}(\mathcal{C},\mathcal{H})<\infty$ , but $\mathcal{H}$ is not much more complicated than the original set system - $\mathcal{H}$ has the same Littlestone dimension as $\mathcal{C}$ . This is essentially the content of Theorem 2.14 when $\mathcal{C}$ has finite consistency threshold.

We give an example from model theory where $\phi^{opp}$ has the fcp.

Example 2.29.

Let $\mathcal{M}$ be a structure in the language $\{E\}$ , where $E$ is an equivalence relation with one class of size $n$ for each $n\in\mathbb{N}$ , possibly with some infinite classes. Let

$$\phi(x;y):=E(x,y)\wedge x\neq y,$$

and let $\mathcal{C}=\mathcal{C}_{\phi}$ .

Suppose $a_{1},\ldots,a_{d}$ are the elements belonging to the equivalence class of size $d$ . Then the $\phi^{opp}$ -type $\{\phi(a_{i},y)\,|\,i\leq d\}$ is $(d-1)$ -consistent but $d$ -inconsistent. Since there are equivalence classes of arbitrarily large size, these witness that $\phi^{opp}$ is not nfcp. One can check that $\operatorname{Ldim}(\mathcal{C}_{\phi})=2$ .

In any $|M|^{+}$ -saturated elementary extension $\mathcal{N}$ of $\mathcal{M}$ , no additional elements are added to the finite equivalence classes already present in $\mathcal{M}$ , though $\mathcal{N}$ adds new infinite classes and new elements to any existing infinite classes.

An attempt to learn $\mathcal{C}_{\phi}$ by equivalence queries following the strategy of Theorem 2.6 would be as follows. We are attempting to identify some $c\in M$ . Letting $a_{0}$ be an element in a new infinite equivalence class in $\mathcal{N}$ , we guess $\phi(M,a_{0})=\emptyset$ . Then any counterexample will identify an element belonging to the equivalence class of $c$ . If $c$ belongs to an infinite class, then we can find some $a_{1}\in N$ which is a new element of this class. Then $\phi(M,a_{1})=\{b\in M\,|\,E(b,c)\}$ . Then $c$ is the only available counterexample, and we submit the correct concept $\phi(M,c)$ at our next turn. However, if $c$ belongs to the finite class of size $n$ , then $N$ has no new elements in this class. Then the relevant queries, which are of the form $\phi(M,a)$ for $a$ in the class of $c$ , are already present in $\mathcal{C}_{\phi}$ . Then we are essentially attempting to identify a singleton from a set of size $n$ , and it is clear that the process could take up to $n$ additional guesses.

3. Efficient learnability of regular languages

In a seminal paper, [1] showed that regular languages are efficiently learnable with equivalence queries plus membership queries, and in this section, we will use Theorem 2.24 to give an alternate short proof of this fact. (In the following sections, we only make use of proper equivalence queries, that is, $\mathcal{H}=\mathcal{C}$ . We shall therefore let $\operatorname{C}(\mathcal{C}):=\operatorname{C}(\mathcal{C},\mathcal{C})$ , which we will call the consistency dimension of $\mathcal{C}$ , with analogous notation for strong consistency dimension.) Let $\mathcal{L}_{n,m}$ be the class of binary regular languages on strings of length at most $m$ specified by a deterministic finite automaton on at most $n$ nodes. The $\mathcal{L}^{*}$ algorithm of [1] specifically uses $\mathcal{O}(n)$ equivalence queries and $\mathcal{O}(mn^{2})$ membership queries. We let $DFA_{2}(n)$ denote the collection of (equivalence classes of) deterministic finite automata accepting binary strings and having at most $n$ nodes. The proof of the next proposition is straightforward.

Proposition 3.1.

The Littlestone dimension of $DFA_{2}(n)$ is at most $(1+o(1))\,n\log n$ .

Proof.

In [16, Proposition 1], it is shown that $|DFA_{2}(n)|\leq\frac{n^{2n}2^{n}n}{n!}\leq 2^{(1+o(1))n\log n}.$ Since a class of Littlestone dimension $d$ has at least $2^{d}$ elements, it follows that the Littlestone dimension of $DFA_{2}(n)$ is at most $(1+o(1))\,n\log n$ . ∎
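The size of the counting bound quoted from [16] is easy to evaluate numerically. Since a class of Littlestone dimension $d$ must contain at least $2^{d}$ concepts, the base-2 logarithm of the count is an upper bound on the dimension. The Python sketch below is ours (the function name is hypothetical) and simply compares that logarithm with $n\log_{2}n$ for a few values of $n$ .

```python
import math


def log2_dfa_count_bound(n):
    """log2 of n^(2n) * 2^n * n / n!, the counting bound quoted from [16]."""
    log2_factorial = math.lgamma(n + 1) / math.log(2)  # log2(n!)
    return 2 * n * math.log2(n) + n + math.log2(n) - log2_factorial


for n in (10, 100, 1000, 10000):
    bound = log2_dfa_count_bound(n)
    ratio = bound / (n * math.log2(n))
    print(n, round(bound, 1), round(ratio, 3))
```

By Stirling's formula, $\log_{2}n!=n\log_{2}n-n\log_{2}e+O(\log n)$ , so the printed ratio decreases slowly toward 1 as $n$ grows.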

The proof of the following proposition reveals the connection between consistency and the Myhill-Nerode theorem.

Proposition 3.2.

$\operatorname{C}(DFA_{2}(n))\leq 2\binom{n+1}{2}=n(n+1)$ .

Proof.

Fix a subset $C$ of binary strings and $x,y$ binary strings. We say that $z$ is a ( $C$ -) distinguishing extension of $x$ and $y$ if $xz\in C$ but $yz\notin C$ or vice versa. If $x$ and $y$ have no distinguishing extension, then we say $x$ and $y$ are $C$ -equivalent, and write $x\sim_{C}y$ . The Myhill-Nerode theorem [23] says that a subset of binary strings of length $m$ is the accept set of a finite automaton with at most $n$ nodes if and only if the number of $\sim_{C}$ classes is at most $n$ . Thus, given any subset $C$ of the binary strings of length $m$ which is not a regular language recognized by an automaton with at most $n$ nodes, there are at least $n+1$ $\sim_{C}$ -classes of elements. Pick representatives $x_{0},\ldots,x_{n}$ from $n+1$ classes, and for each $i<j$ , pick some $z_{ij}$ that is a distinguishing extension of $x_{i}$ and $x_{j}$ . Then restricting $C$ to the partial assignment on $\{x_{k}z_{ij}\,|\,i<j,\,k=i,j\}$ , a domain of size at most $2\binom{n+1}{2}=n(n+1)$ that witnesses that $x_{i}\not\sim_{C}x_{j}$ for all $i\neq j$ , we can see that this restriction is inconsistent with the class of regular languages recognized by automata with at most $n$ nodes. Therefore $\operatorname{C}(DFA_{2}(n))\leq n(n+1)$ . (Note that the same proof shows that the consistency dimension of $DFA_{m}(n)$ is also at most $n(n+1)$ .) ∎
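The witness constructed in this proof can be produced mechanically for a concrete language. The Python sketch below is an illustration under our own assumptions: the language is given as a membership predicate `in_C` on arbitrary binary strings, distinguishing extensions are searched only among strings of length at most `max_len`, and the function names are hypothetical. It returns the restriction of $C$ to the points $x_{k}z_{ij}$ , or `None` if fewer than $n+1$ pairwise distinguishable strings are found within the search bound.

```python
from itertools import product


def strings(max_len, alphabet="01"):
    """All strings over the alphabet of length at most max_len, shortest first."""
    for k in range(max_len + 1):
        for letters in product(alphabet, repeat=k):
            yield "".join(letters)


def distinguishing_extension(in_C, x, y, max_len):
    """Some z with exactly one of xz, yz in C, or None if the search finds none."""
    for z in strings(max_len):
        if in_C(x + z) != in_C(y + z):
            return z
    return None


def inconsistency_witness(in_C, n, max_len):
    """Partial assignment on at most n(n+1) strings that no automaton with
    at most n states can agree with, following the proof of Proposition 3.2."""
    reps = []
    for x in strings(max_len):
        if all(distinguishing_extension(in_C, x, r, max_len) is not None
               for r in reps):
            reps.append(x)  # x lies in a new equivalence class
            if len(reps) == n + 1:
                break
    if len(reps) < n + 1:
        return None
    domain = set()
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            z = distinguishing_extension(in_C, reps[i], reps[j], max_len)
            domain.update({reps[i] + z, reps[j] + z})
    return {w: in_C(w) for w in sorted(domain)}


# Hypothetical example: strings whose number of 1s is divisible by 3.
def mod3(s):
    return s.count("1") % 3 == 0


print(inconsistency_witness(mod3, 2, 3))
```

The example language has three equivalence classes, so a witness against two-state automata exists, while asking for a witness against three-state automata returns `None`.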

Now, by Theorem 2.24 and the previous two results, it follows that:

Theorem 3.3.

The class $\mathcal{L}_{n,m}$ is learnable in at most $\mathcal{O}(n\log n)$ equivalence queries and at most $\mathcal{O}\left(n\log n\right)\cdot n(n+1)$ membership queries.

It is interesting to note that, in contrast to $\mathcal{L}^{*}$, when using the algorithm from Theorem 2.24 there is no dependence on $m$, the length of the binary strings which the teacher is allowed to provide as counterexamples. (Footnote 9: We should also note that $\mathcal{L}^{*}$ was improved by Schapire to give a better bound on membership queries, still depending on $m$ [25].)

Theorem 2.6 now implies that $\mathcal{L}_{n,m}$ is learnable in at most $(n(n+1))^{\mathcal{O}(n\log n)}$ equivalence queries. Theorem 2.22 shows that a finite class $\mathcal{C}$ is learnable in at most $\lceil\operatorname{SC}(\mathcal{C})\cdot\ln|\mathcal{C}|\rceil$ equivalence queries. Since [3] showed that $\mathcal{L}_{n,m}$ is not learnable in polynomially many equivalence queries, it follows that $\operatorname{SC}(\mathcal{L}_{n,m})$ cannot be polynomial in $n,m$.

3.1. Learning $\omega$ -languages

In this section, we consider the natural extension to languages on infinite strings indexed by $\omega$, called $\omega$-languages. For an alphabet $\Sigma$, we denote by $\Sigma^{\omega}$ the strings of symbols from $\Sigma$ of order type $\omega$. Similar to the previous section, we consider an automaton, which consists of the collection $\mathbb{A}=(\Sigma,Q,q_{0},\delta),$ where $Q$ is a finite collection of states, $q_{0}$ is the initial state, and $\delta:Q\times\Sigma\rightarrow 2^{Q}$ is a transition rule. To form a language, an automaton is equipped with an acceptance criterion. (Footnote 10: Numerous acceptance criteria have been extensively studied in the literature, and we refer the reader to [5, 12, 11] for overviews.) Fix a subset $F\subseteq Q$. A run of a Büchi automaton is accepting if and only if it visits the set $F$ infinitely often. An $\omega$-language is $\omega$-regular if it is recognized by a non-deterministic Büchi automaton. A run of a co-Büchi automaton is accepting if and only if it visits $F$ only finitely often. Let $\psi:Q\rightarrow\{1,\ldots,k\}$ be a function, which we think of as a coloring of the states of the automaton. Let $c$ be the minimum color which is visited infinitely often. A run of a parity automaton is accepting if and only if $c$ is odd.

Two $\omega$ -regular languages are equivalent if they agree on the set of periodic words [21], which allows for the possibility of recognizing the $\omega$ -language using finitary automata. This is the approach of [5, 12], whose notation we follow closely. A family of DFAs (FDFA) $\mathcal{F}$ is a pair $(Q,P)$ where $Q$ is a DFA with $|Q|$ states and $P$ is a collection of $|Q|$ many DFAs, which we refer to as progress DFAs - one DFA $P_{q}$ for each state $q$ of $Q$ . Given a pair of finite words, $(u,v)$ , a run of our family of DFAs consists of running $Q$ on $u$ , then running $P_{Q(u)}$ on $v$ where $Q(u)$ is the ending state of $Q$ on $u$ . The pair $(u,v)$ can be used to represent an infinite periodic word $uv^{\omega}$ .
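The run of an FDFA on a pair $(u,v)$ can be sketched as follows. This is a minimal illustration: the classes `DFA` and `FDFA`, their field names, and the convention that the pair is accepted exactly when the progress DFA ends in an accepting state are assumptions made for the example, not definitions fixed by the text above.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    start: int
    delta: dict      # maps (state, symbol) to the next state
    accepting: set

    def run(self, word):
        """Return the state reached after reading word from the start state."""
        q = self.start
        for c in word:
            q = self.delta[(q, c)]
        return q

    def accepts(self, word):
        return self.run(word) in self.accepting

@dataclass
class FDFA:
    leading: DFA
    progress: dict   # maps each leading state q to its progress DFA P_q

    def accepts(self, u, v):
        """Run the leading DFA on u, then the progress DFA P_{Q(u)} on v."""
        q = self.leading.run(u)
        return self.progress[q].accepts(v)

# Example: a one-state leading DFA whose progress DFA accepts v containing a 1,
# so (u, v) is accepted exactly when u v^omega has infinitely many 1s.
leading = DFA(0, {(0, "0"): 0, (0, "1"): 0}, set())
has_one = DFA(0, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 1}, {1})
inf_ones = FDFA(leading, {0: has_one})
```

Keeping one progress DFA per leading state mirrors the requirement that $P$ contains $|Q|$ many DFAs.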

Let $FDFA(n,m)$ be the class of families of deterministic finite automata where the leading automaton has at most $n$ nodes and the progress automata each have at most $m$ nodes. It is not quite true that, once an $\omega$-regular language has been reduced to an FDFA, one can use $\mathcal{L}^{*}$ directly to learn the various DFAs in the family [5, section 4]. It is also not completely obvious what the bounds for Littlestone and consistency dimension are in terms of the DFAs in the family, but the next two results give such bounds, which imply the efficient learnability of $\omega$-regular languages.

Proposition 3.4.

The class $FDFA(n,m)$ has Littlestone dimension at most $\mathcal{O}(n\log n+nm\log m).$

Proof.

The number of FDFAs of size $(n,m)$ is clearly at most $|DFA_{2}(n)|\cdot|DFA_{2}(m)|^{n}.$ That is

$$|FDFA(n,m)|\leq|DFA_{2}(n)|\cdot|DFA_{2}(m)|^{n}.$$

It follows that

$$\operatorname{Ldim}(FDFA(n,m))\leq\log|DFA_{2}(n)|+n\log|DFA_{2}(m)|,$$

and using [16, Proposition 1], the desired bound follows. ∎

Proposition 3.5.

$\operatorname{C}(FDFA(n,m))\leq 2\binom{n(m+1)}{2}=\mathcal{O}(n^{2}m^{2})$ .

Proof.

A run of an FDFA on $(u,v)$ can be simulated by the run of an appropriate automaton in the class $DFA_{3}(n\cdot(m+1)).$ To see this, input the word $u\$v$, where $\$$ is a new symbol (recall we are assuming $u,v$ are binary), to a DFA which has the same diagram as the FDFA but with an edge labeled with $\$$ from each state of the leading automaton to the initial state of the corresponding progress DFA. Now it follows by Proposition 3.2 that the consistency dimension of $FDFA(n,m)$ is at most $2\binom{n(m+1)}{2}.$ ∎
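The simulation in the proof can be sketched by building the three-letter automaton explicitly. This is an illustrative construction with hypothetical names (`combine`, `accepts`, `states`); automata are plain dictionaries, and undefined moves (such as a second `$`) are treated as rejecting, which is an assumption of the sketch rather than something stated in the proof.

```python
def combine(lead_start, lead_delta, progress):
    """Build one DFA over {0, 1, $} from an FDFA.
    progress maps each leading state q to a triple (start, delta, accepting)."""
    delta, accepting = {}, set()
    for (q, c), q2 in lead_delta.items():
        delta[(("L", q), c)] = ("L", q2)
    for q, (p_start, p_delta, p_acc) in progress.items():
        # the new symbol jumps from leading state q to the start of P_q
        delta[(("L", q), "$")] = ("P", q, p_start)
        for (p, c), p2 in p_delta.items():
            delta[(("P", q, p), c)] = ("P", q, p2)
        accepting |= {("P", q, p) for p in p_acc}
    return ("L", lead_start), delta, accepting

def accepts(dfa, word):
    state, delta, accepting = dfa
    for c in word:
        if (state, c) not in delta:   # undefined move: reject
            return False
        state = delta[(state, c)]
    return state in accepting

def states(dfa):
    start, delta, _ = dfa
    return {start} | {s for (s, _c) in delta} | set(delta.values())

# Example with n = 2 leading states and m = 2 states per progress DFA.
lead_delta = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
has_one = (0, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 1}, {1})
has_zero = (0, {(0, "1"): 0, (0, "0"): 1, (1, "0"): 1, (1, "1"): 1}, {1})
example = combine(0, lead_delta, {0: has_one, 1: has_zero})
```

In this example the combined automaton has $n+nm=n(m+1)$ states, which is the size used in the bound.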

Using the previous two results together with Theorem 2.24, one can deduce the efficient learnability of $FDFA(n,m)$ :

Theorem 3.6.

The class $FDFA(n,m)$ is learnable in at most $\mathcal{O}(n\log n+n\cdot m\log m)$ equivalence queries and at most $\mathcal{O}(\log n+m\log m)\cdot n^{3}m^{2}$ membership queries.

We have formulated our bounds in terms of the number of states in the FDFA corresponding to a given $\omega$ -language. In [5, 12] bounds on the number of states of FDFAs in terms of the number of states of automata for $\omega$ -languages with various acceptors are given. Specifically, the following bounds hold:

(1) When $\mathcal{A}$ is a deterministic Büchi (DBA) or co-Büchi (DCA) automaton with $n$ states, there is an equivalent FDFA of size at most $(n,2n)$ [12, 5.3].

(2) When $\mathcal{A}$ is a deterministic parity automaton (DPA) with $n$ states and $k$ colors, there is an equivalent FDFA of size at most $(n,kn)$ [12, 5.4].

(3) When $\mathcal{A}$ is a nondeterministic Büchi automaton (NBA) with $n$ states, there is an equivalent FDFA of size at most $(2^{\mathcal{O}(n\log n)},2^{\mathcal{O}(n\log n)})$.

Any NBA can be translated into a DPA, and so (2) yields the efficient learnability of $\omega$-regular languages in terms of the number of states in a DPA (this translation also yields (3)). However, the translation from NBA to DPA is known to require an exponential increase in the number of states in general [24]. From an FDFA of size at most $(n,k)$ there is a translation into an NBA with at most $\mathcal{O}(n^{2}k^{3})$ states [12, Theorem 5.8], and so it follows that the exponential increase in states in moving from NBAs to FDFAs is necessary [12, Theorem 5.6].

Finally, we mention that [6] define restricted classes of $\omega$-languages for which right-congruence is fully informative, and isolate numerous classes (e.g. for each type of acceptor from the previous subsection) of $\omega$-languages for which an infinitary variant of the Myhill-Nerode theorem holds. This variant of Myhill-Nerode is sufficient to bound the consistency dimension (and thus establish the learnability) of the classes in terms of the number of right equivalence classes of $\sim_{\mathcal{L}}$, similar to the proof of Proposition 3.2.

4. Random counterexamples and EQ-learning

In section 2 we characterized learnability by equivalence queries in terms of Littlestone dimension and strong consistency dimension. The setting of equivalence query learning [2] as described in section 2 deals with worst-case bounds for algorithmic identification of concepts by a learner. In this section, we follow [4] and analyze a slightly different situation, in which the teacher selects the counterexamples at random, and we seek to bound the expected number of queries. [4] worked specifically with concept classes coming from boolean matrices, which was convenient for their notation. Our formulation is equivalent, but we use slightly different notation.

Throughout this section, let $X$ be a finite set, let $\mathcal{C}$ be a set system on $X$ , and let $\mu$ be a probability measure on $X$ . For $A,B\in\mathcal{C}$ , let $\Delta(A,B)=\{x\in X\,|\,A(x)\neq B(x)\}$ denote the symmetric difference of $A$ and $B$ .

Definition 4.1.

We denote, by $\mathcal{C}_{\bar{x}=\bar{i}}$ for $\bar{x}\in X^{n}$ and $\bar{i}\in\{0,1\}^{n}$ , the set system $\{A\in\mathcal{C}\,|\,A(x_{j})=i_{j},\,j=1,\ldots,n\}.$ For $A\in\mathcal{C}$ and $a\in X$ , we let

$$u(A,a):=\operatorname{Ldim}(\mathcal{C})-\operatorname{Ldim}(\mathcal{C}_{a=A(a)}).$$

For any $a\in X,$ either $\mathcal{C}_{a=1}$ or $\mathcal{C}_{a=0}$ has Littlestone dimension strictly less than that of $\mathcal{C}$ and so:

Lemma 4.2.

For $A,B\in\mathcal{C}$ and $a\in X$ with $A(a)\neq B(a),$

$$u(A,a)+u(B,a)\geq 1.$$

Next, we define a directed graph which is similar to the elimination graph of [4].

Definition 4.3.

We define the thicket query graph $G_{TQ}(\mathcal{C},\mu)$ to be the weighted directed graph on vertex set $\mathcal{C}$ such that the directed edge from $A$ to $B$ has weight $d(A,B)$ equal to the expected value of $\operatorname{Ldim}(\mathcal{C})-\operatorname{Ldim}(\mathcal{C}_{x=B(x)})$ over $x\in\Delta(A,B)$ with respect to the distribution $\mu|_{\Delta(A,B)}.$ (Footnote 11: Here one should think of the query by the learner as being $A$, and the actual hypothesis being $B$. The teacher samples from $\Delta(A,B)$, and the learner now knows the value of the hypothesis on $x$.)

Definition 4.4.

The query rank of $A\in\mathcal{C}$ is defined as: $\inf_{B\in\mathcal{C}}(d(A,B)).$
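For small finite classes, the edge weights $d(A,B)$ and the query rank can be computed by brute force. The sketch below is illustrative only: concepts are encoded as 0/1 tuples indexed by $X=\{0,\ldots,|X|-1\}$, `mu` is a list of point masses assumed to be positive on every point, and the infimum in the query rank is taken over $B\neq A$ (an assumption, since $\Delta(A,A)$ is empty). All function names are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def _ldim(cls):
    """Littlestone dimension of a frozenset of 0/1 tuples (-1 if empty)."""
    if len(cls) <= 1:
        return len(cls) - 1
    best = 0
    for x in range(len(next(iter(cls)))):
        c0 = frozenset(A for A in cls if A[x] == 0)
        c1 = cls - c0
        if c0 and c1:
            # a mistake tree rooted at x needs subtrees on both sides
            best = max(best, 1 + min(_ldim(c0), _ldim(c1)))
    return best

def ldim(concepts):
    return _ldim(frozenset(concepts))

def restrict(cls, x, value):
    """The subclass C_{x=value}."""
    return frozenset(A for A in cls if A[x] == value)

def d(cls, mu, A, B):
    """Expected drop in Ldim when x is drawn from mu restricted to
    Delta(A, B) and the learner learns the value B(x)."""
    delta = [x for x in range(len(A)) if A[x] != B[x]]
    total = sum(mu[x] for x in delta)
    base = ldim(cls)
    return sum(mu[x] * (base - ldim(restrict(cls, x, B[x]))) for x in delta) / total

def query_rank(cls, mu, A):
    return min(d(cls, mu, A, B) for B in cls if B != A)

# Example: threshold concepts on three points with the uniform measure.
thresholds = frozenset({(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)})
mu = [1 / 3, 1 / 3, 1 / 3]
ranks = {A: query_rank(thresholds, mu, A) for A in thresholds}
```

The recursion for `_ldim` is exponential in general, so this is only meant for experimenting with very small classes.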

Lemma 4.5.

For any $A\neq B\in\mathcal{C}$ , $d(A,B)+d(B,A)\geq 1.$

Proof.

Noting that $\Delta(A,B)=\Delta(B,A),$ and using Lemma 4.2:

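$$d(A,B)+d(B,A)=\sum_{a\in\Delta(A,B)}\frac{\mu(a)}{\mu(\Delta(A,B))}\bigl(u(A,a)+u(B,a)\bigr)\geq\sum_{a\in\Delta(A,B)}\frac{\mu(a)}{\mu(\Delta(A,B))}=1.$$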

∎

Definition 4.6.

[4, Definition 14] Let $G$ be a weighted directed graph and $l\in\mathbb{N},\,l>1.$ A deficient $l$ -cycle in $G$ is a sequence $v_{0},\ldots v_{l-1}$ of distinct vertices such that for all $i\in[l]$ , $d(v_{i},v_{(i+1)\,(\mod l)})\leq\frac{1}{2}$ with strict inequality for at least one $i\in[l]$ .

The next result is similar to Theorem 16 (the case $l=3$) and Theorem 17 (the case $l>3$) of [4], but our proof is rather different (note that the case $l=2$ follows easily from Lemma 4.5).

Theorem 4.7.

The thicket query graph $G_{TQ}(\mathcal{C},\mu)$ has no deficient $l$-cycles for $l\geq 2.$

The analogue of Theorem 16 can be adapted in a very similar manner to the technique employed by [4]. However, the analogue of the proof of Theorem 17 falls apart in our context; the reason is that Lemma 4.2 is analogous to Lemma 6 of [4] (and Lemma 4.5 is analogous to Lemma 13 of [4]), but our lemmas involve inequalities instead of equations. The inductive technique of [4, Theorem 17] is to shorten deficient cycles by considering the weights of a particular edge in the elimination graph along with the weight of the edge in the opposite direction. Since one of those weights being large forces the other to be small (by the equalities of their lemmas), the induction naturally separates into two useful cases. In our thicket query graph, things are much less tightly constrained: one weight of an edge being large does not force the weight of the edge in the opposite direction to be small. However, the technique employed in our proof seems to be flexible enough to adapt to prove Theorems 16 and 17 of [4].

Proof.

Suppose the vertices in the deficient $l$-cycle are $A_{0},\ldots,A_{l-1}$.

By the definition of deficient cycles and $d(-,-),$ we have, for each $i\in\mathbb{Z}/l\mathbb{Z}$, that

[TABLE]

so clearing the denominator we have

[TABLE]

Note that throughout this argument, the indices are being calculated modulo $l$. Notice that for at least one value of $i$, the inequality in (4.1) must be strict.

Let $G,H$ be a partition of

[TABLE]

Now define

[TABLE]

The following fact follows from the definition of $\Delta(A,B)$ and $D(-,-)$ .

Fact 4.8.

The set $\Delta(A_{i},A_{i+1})$ is the disjoint union, over all partitions of $\mathcal{X}$ into two pieces $G,H$ such that $A_{i}\in G$ and $A_{i+1}\in H$ of the sets $D(G,H).$

Now, take the sum of the inequalities 4.1 as $i$ ranges from $1$ to $l$ . On the LHS of the resulting sum, we obtain

[TABLE]

On the RHS of the resulting sum we obtain

[TABLE]

Given a partition $G,H$ of $\{A_{1},\ldots,A_{l}\}$ we note that the term $D(G,H)=D(H,G)$ appears exactly once as an element of the above sum for a fixed value of $i$ exactly when $A_{i}\in G$ and $A_{i+1}\in H$ or $A_{i}\in H$ and $A_{i+1}\in G.$

Consider the partition $G,H$ of $\mathcal{X}$. Suppose that $A_{j},A_{j+1},\ldots,A_{k}$ is a block of elements each contained in $G$, and that $A_{j-1},A_{k+1}$ are in $H$. Now consider the terms $i=j-1$ and $i=k$ of the above sums (in each of which $D(G,H)$ appears).

On the left hand side, we have $\sum_{a\in D(G,H)}\mu(a)u(A_{j-1},a)$ and $\sum_{a\in D(G,H)}\mu(a)u(A_{k},a)$. Note that for $a\in D(G,H)$, we have $a\in\Delta(A_{j-1},A_{k}).$ So, by Lemma 4.2, we have

[TABLE]

On the RHS, we have

[TABLE]

For each partition $G,H$ of $\mathcal{X}$, the terms appearing in the above sum occur in pairs as above by Fact 4.8, and so the LHS is at least as large as the RHS of the sum of inequalities (4.1), which is impossible, since one of the inequalities must have been strict by the definition of a deficient cycle. ∎

Theorem 4.9.

There is at least one element $A\in\mathcal{C}$ with query rank at least $\frac{1}{2}$ .

Proof.

If not, then for every element $A\in\mathcal{C}$, there is some element $B\in\mathcal{C}$ such that $d(A,B)<\frac{1}{2}$. So, pick, for each $A\in\mathcal{C}$, an element $f(A)$ such that $d(A,f(A))<\frac{1}{2}.$ Now, fix $A\in\mathcal{C}$ and consider the sequence of elements of $\mathcal{C}$ given by $(f^{i}(A))$; since $\mathcal{C}$ is finite, at some point the sequence repeats itself. So, take a list of elements $B,f(B),\ldots,f^{n}(B)=B$. By construction, this yields a deficient cycle, contradicting Theorem 4.7. ∎

4.1. The thicket max-min algorithm

In this subsection we show how to use the lower bound on query rank proved in Theorem 4.9 to give an algorithm which yields the correct concept in linearly (in the Littlestone dimension) many queries from $\mathcal{C}$. The approach is fairly straightforward: essentially, the learner repeatedly queries the highest query rank concept. The approach is similar to that taken in [4, Section 5] but with query rank in place of their notion of informative.

Now we informally describe the thicket max-min algorithm. At stage $i$, the learner has a current concept class $\mathcal{C}_{i}.$ The learner picks the query

[TABLE]

The algorithm halts if the learner has picked the actual concept $C$ . If not, the teacher returns a random element $a_{i}\in\Delta(A,C)$ at which point the learner knows the value of $C(a_{i}).$ Then

$$\mathcal{C}_{i+1}:=(\mathcal{C}_{i})_{a_{i}=C(a_{i})}.$$

Let $T(\mathcal{C})$ be the expected number of queries before the learner correctly identifies the target concept.
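The informal description above can be turned into a small simulation. This is a sketch under stated assumptions rather than a definitive implementation: concepts are 0/1 tuples, `mu` is a list of positive point masses, the query at each stage is a concept maximizing $\min_{B\neq A}d(A,B)$ computed inside the current class, and ties are broken arbitrarily. The names `ldim`, `d` and `max_min_learn` are hypothetical.

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(cls):
    """Littlestone dimension of a frozenset of 0/1 tuples (-1 if empty)."""
    if len(cls) <= 1:
        return len(cls) - 1
    best = 0
    for x in range(len(next(iter(cls)))):
        c0 = frozenset(A for A in cls if A[x] == 0)
        if c0 and c0 != cls:
            best = max(best, 1 + min(ldim(c0), ldim(cls - c0)))
    return best

def d(cls, mu, A, B):
    """Edge weight d(A, B) of the thicket query graph of cls."""
    delta = [x for x in range(len(A)) if A[x] != B[x]]
    total = sum(mu[x] for x in delta)
    def drop(x):
        return ldim(cls) - ldim(frozenset(g for g in cls if g[x] == B[x]))
    return sum(mu[x] * drop(x) for x in delta) / total

def max_min_learn(concepts, mu, target, rng):
    """Return the number of equivalence queries made before querying target."""
    cls = frozenset(concepts)
    queries = 0
    while True:
        def rank(A):
            return min((d(cls, mu, A, B) for B in cls if B != A), default=0)
        query = max(cls, key=rank)          # highest query rank concept
        queries += 1
        if query == target:
            return queries
        # the teacher samples a counterexample from Delta(query, target)
        delta = [x for x in range(len(query)) if query[x] != target[x]]
        a = rng.choices(delta, weights=[mu[x] for x in delta])[0]
        cls = frozenset(g for g in cls if g[a] == target[a])

# Example: eight threshold concepts on seven points, uniform measure.
concepts = [tuple(1 if i < k else 0 for i in range(7)) for k in range(8)]
mu = [1 / 7] * 7
rng = random.Random(0)
counts = [max_min_learn(concepts, mu, t, rng) for t in concepts]
```

Each wrong query is removed from the current class by the counterexample, so the loop makes at most $|\mathcal{C}|$ queries; averaging `counts` over many seeds gives an empirical estimate of $T(\mathcal{C})$.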

Theorem 4.10.

The expected number of queries to learn a concept in a class $\mathcal{C}$ is less than or equal to $2\operatorname{Ldim}(\mathcal{C}).$

Proof.

The expected drop in the Littlestone dimension of the concept class induced by any query before the algorithm terminates is at least $\frac{1}{2}$ by Theorem 4.9; so the probability that the drop in the Littlestone dimension is positive is at least $\frac{1}{2}$ for any given query. So, from $2n$ queries, one expects at least $n$ drops in Littlestone dimension. ∎

We give a rough bound on the probability that the algorithm has not terminated after a certain number of queries. Since a query can reduce the Littlestone dimension of the induced concept class by at most $\operatorname{Ldim}(\mathcal{C})$ and the expected drop is at least $\frac{1}{2}$ , the probability that a query reduces the Littlestone dimension is at least $\frac{1}{2\operatorname{Ldim}(\mathcal{C})}$ . Then the probability that the Littlestone dimension of the induced concept class after $n$ queries is positive is at most the probability of fewer than $\operatorname{Ldim}(\mathcal{C})$ many successes in the binomial distribution with probability $\frac{1}{2\operatorname{Ldim}(\mathcal{C})}$ and $n$ trials. It follows by Hoeffding’s inequality that the probability that the algorithm has not terminated after $n$ steps is at most

[TABLE]

5. Compression schemes and stability

In this section, we follow the notation and definitions given in [14] on compression schemes, a notion due to Littlestone and Warmuth [19]. Roughly speaking, $\mathcal{C}$ admits a $d$ -dimensional compression scheme if, given any finite subset $F$ of $X$ and some $f\in\mathcal{C}$ , there is a way of encoding the set $F$ with only $d$ -many elements of $F$ in such a way that $F$ can be recovered. We will give a formal definition, but we note that numerous variants of this idea appear throughout the literature. For instance:

• Size $d$-array compression [9].

• Extended compression schemes with $b$ extra bits [13].

The next definition, which is the notion of compression we will work with in this section, is equivalent to the notion of a $d$-compression with $b$ extra bits (of Floyd and Warmuth) [17, see Proposition 2.1].

Definition 5.1.

We say that a concept class $\mathcal{C}$ has a $d$-compression if there is a compression function $\kappa:\mathcal{C}_{fin}\rightarrow X^{d}$ and a finite set $\mathcal{R}$ of reconstruction functions $\rho:X^{d}\rightarrow 2^{X}$ such that for any $f\in\mathcal{C}_{fin}$

(1) $\kappa(f)\subseteq dom(f)$

(2) $f=\rho(\kappa(f))|_{dom(f)}$ for at least one $\rho\in\mathcal{R}.$

We work with the above notion mainly because it is the notion used in [14], and our goal is to improve a result of Laskowski appearing there [14, Theorem 4.1.3]. In [17], Laskowski and Johnson prove that the concept class corresponding to a stable formula has an extended $d$ -compression for some $d$ . The precise value of $d$ is not determined, but was conjectured to be the Littlestone dimension. A later unpublished result of Laskowski appearing as [14, Theorem 4.1.3] in fact showed that one could take $d$ equal to the Shelah 2-rank (Littlestone dimension) and uses $2^{d}$ many reconstruction functions. In Theorem 5.4, we will show that $d+1$ many reconstruction functions suffice.

The question of Johnson and Laskowski is the analogue (for Littlestone dimension) of a well-known open question from VC-theory [13]: is there a bound $A(d)$ linear in $d$ such that every class of VC-dimension $d$ has a compression scheme of size at most $A(d)$? In general there is known to be a bound which is at most exponential in $d$ [22].

Definition 5.2.

Suppose $\operatorname{Ldim}(\mathcal{C})=d$ . Given a partial function $f$ , say that $f$ is exceptional for $\mathcal{C}$ if for all $a\in\operatorname{dom}(f)$ ,

$$\mathcal{C}_{(a,f(a))}:=\{g\in\mathcal{C}\,|\,g(a)=f(a)\}$$

has Littlestone dimension $d$ .

Definition 5.3.

Suppose $\operatorname{Ldim}(\mathcal{C})=d$ . Let $f_{\mathcal{C}}$ be the partial function given by

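$$f_{\mathcal{C}}(a):=\begin{cases}0&\text{if }\operatorname{Ldim}(\mathcal{C}_{(a,0)})=d,\\1&\text{if }\operatorname{Ldim}(\mathcal{C}_{(a,1)})=d,\\\text{undefined}&\text{otherwise.}\end{cases}$$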

It is clear that $f_{\mathcal{C}}$ extends any partial function exceptional for $\mathcal{C}$ .
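For small finite classes the two definitions above can be checked by brute force. The sketch below assumes the following reading: a partial function $f$ is exceptional when restricting $\mathcal{C}$ to $g(a)=f(a)$ keeps Littlestone dimension $d$ for every $a\in\operatorname{dom}(f)$, and $f_{\mathcal{C}}(a)$ is the (necessarily unique) bit whose restriction keeps dimension $d$, if there is one. Concepts are 0/1 tuples, partial functions are dictionaries, and all names are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(cls):
    """Littlestone dimension of a frozenset of 0/1 tuples (-1 if empty)."""
    if len(cls) <= 1:
        return len(cls) - 1
    best = 0
    for x in range(len(next(iter(cls)))):
        c0 = frozenset(A for A in cls if A[x] == 0)
        if c0 and c0 != cls:
            best = max(best, 1 + min(ldim(c0), ldim(cls - c0)))
    return best

def restrict(cls, a, bit):
    """The subclass of concepts g with g(a) = bit."""
    return frozenset(g for g in cls if g[a] == bit)

def is_exceptional(cls, f):
    """f is a dict {point: bit}; every restriction must keep Ldim equal to d."""
    dim = ldim(cls)
    return all(ldim(restrict(cls, a, bit)) == dim for a, bit in f.items())

def f_C(cls):
    """Partial function sending a to the bit whose restriction keeps Ldim d."""
    dim = ldim(cls)
    out = {}
    for a in range(len(next(iter(cls)))):
        for bit in (0, 1):
            if ldim(restrict(cls, a, bit)) == dim:
                out[a] = bit
    return out

# Example: a class of Littlestone dimension 1 on two points.
small = frozenset({(0, 0), (1, 0), (0, 1)})
canonical = f_C(small)
```

Both restrictions at a point cannot keep dimension $d$, since that would give a mistake tree of depth $d+1$, so `f_C` never overwrites an entry for a nonempty class.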

Theorem 5.4.

Any concept class $\mathcal{C}$ of Littlestone dimension $d$ has an extended $d$ -compression with $(d+1)$ -many reconstruction functions.

Proof.

If $d=0$ , then $\mathcal{C}$ is a singleton, and one reconstruction function suffices. So we may assume $d\geq 1$ .

Fix some $f\in\mathcal{C}_{fin}$ with domain $F$ . We will run an algorithm to construct a tuple of length at most $d$ from $F$ by adding one element at each step of the algorithm. During each step of the algorithm, we also have a concept class $\mathcal{C}_{i}$ , with $\mathcal{C}_{0}=\mathcal{C}$ initially.

If $f$ is exceptional in $\mathcal{C}_{i-1}$ , then the algorithm halts. Otherwise, pick either:

• $a_{i}\in F$ such that $f(a_{i})=1$ and

[TABLE]

has Littlestone dimension less than $\operatorname{Ldim}(\mathcal{C}_{i-1})$ . In this case, set $\mathcal{C}_{i}:=(\mathcal{C}_{i-1})_{(a_{i},1)}=\{g\,|\,g\in\mathcal{C}_{i-1},\,g(a_{i})=1\}.$

• $d_{i}\in F$ such that $f(d_{i})=0$ and

[TABLE]

has Littlestone dimension less than $\operatorname{Ldim}(\mathcal{C}_{i-1})$ . In this case, set $\mathcal{C}_{i}:=(\mathcal{C}_{i-1})_{(d_{i},0)}.$

We allow the algorithm to run for at most $d$ steps. There are two distinct cases. If our algorithm has run for $d$ steps, let $\kappa(f)$ be the tuple $(\bar{a},\bar{d})$ of all of the elements $a_{i}$ as above followed by all of the elements $d_{i}$ as above for $i=1,\ldots,d$ . By choice of $a_{i}$ and $d_{i}$ , this tuple consists of $d$ distinct elements. By construction the set

[TABLE]

has Littlestone dimension $0$, that is, there is a unique concept in this class. So, given $(c_{1},c_{2},\ldots,c_{d})\in X^{d}$ consisting of distinct elements, for $i=0,\ldots,d$, we let $\rho_{i}(c_{1},\ldots,c_{d})$ be some $g$ belonging to

[TABLE]

if such a $g$ exists. By construction, for some $i$ , the Littlestone dimension of the concept class $\{g\in\mathcal{C}\cap F\,|\,g(c_{j})=1\text{ for }j\leq i,\,g(c_{j})=0\text{ for }j>i\}$ is zero, and so $g$ is uniquely specified and will extend $f$ .

We handle cases where the algorithm halts early by augmenting two of the reconstruction functions $\rho_{0}$ and $\rho_{1}$ defined above. Because $\rho_{0}$ and $\rho_{1}$ have so far only been defined for tuples consisting of $d$ distinct elements, we can extend these to handle exceptional cases by generating tuples with duplicate elements.

If the algorithm stops at some step $i>1$ , then it has generated a tuple of length $i-1$ consisting of some elements $a_{j}$ and some elements $d_{k}$ . Let $\bar{a}$ consist of the elements $a_{j}$ chosen during the algorithm, and let $\bar{d}$ consist of the elements $d_{k}$ chosen during the running of the algorithm. Observe that $f$ is exceptional for $\mathcal{C}_{(\bar{a},\bar{d})}$ .

If $\bar{a}$ is not empty, with initial element $a^{\prime}$, then let $\kappa(f)=(\bar{a},a^{\prime},\bar{d},a^{\prime},\ldots,a^{\prime})\in F^{d}$. From this tuple, one can recover $(\bar{a},\bar{d})$ (assuming $\bar{a}$ is nonempty), so we let $\rho_{1}(\bar{a},a^{\prime},\bar{d},a^{\prime},\ldots,a^{\prime})$ be some total function extending $f_{\mathcal{C}_{(\bar{a},\bar{d})}}$, which itself extends $f$. So $\rho_{1}(\bar{a},\bar{d})$ extends $f$ whenever the algorithm halts before step $d$ is completed and some $a_{i}$ was chosen at some point.

If $\bar{a}$ is empty, then let $\kappa(f)=(\bar{d},d^{\prime},\ldots,d^{\prime})\in F^{d}$, where $d^{\prime}$ is the initial element of $\bar{d}$. From this tuple, one can recover $(\emptyset,\bar{d})$ (assuming $\bar{a}$ is empty), so we let $\rho_{0}(\bar{d},d^{\prime},\ldots,d^{\prime})$ be a total function extending $f_{\mathcal{C}_{(\emptyset,\bar{d})}}$, which itself extends $f$.

Finally, if the algorithm terminates during step 1, then it has generated the empty tuple. In this case, let $\kappa(f)=(c,\ldots,c)$ for some $c\in F$. Then $\operatorname{Ldim}(\mathcal{C})=\operatorname{Ldim}({\mathcal{C}}_{(c,l)})$ for some $l\in\{0,1\}$. In particular, if we have defined $\kappa(f^{\prime})=(c,\ldots,c)$ above for some $f^{\prime}$ where the algorithm only returns $c$ (rather than the empty tuple), then $1-l=f^{\prime}(c)\neq f(c)$, and so any such $f^{\prime}$ is handled by $\rho_{1-l}$. So we may overwrite $\rho_{l}$ to set $\rho(c,\ldots,c)$ to be a total function extending $f_{\mathcal{C}}$, which itself extends $f$. For any tuple output by our algorithm, one of the reconstruction functions produces an extension of the original concept.

∎

Bibliography

[1] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987.
[2] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.
[3] Dana Angluin. Negative results for equivalence queries. Machine Learning, 5(2):121–150, 1990.
[4] Dana Angluin and Tyler Dohrn. The power of random counterexamples. In International Conference on Algorithmic Learning Theory, pages 452–465, 2017.
[5] Dana Angluin and Dana Fisman. Learning regular omega languages. Theoretical Computer Science, 650:57–72, 2016.
[6] Dana Angluin and Dana Fisman. Regular omega-languages with an informative right congruence. arXiv preprint arXiv:1809.03108, 2018.
[7] Peter Auer and Philip M. Long. Simulating access to hidden information while learning. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 263–272. ACM, 1994.
[8] José L. Balcázar, Jorge Castro, David Guijarro, and Hans-Ulrich Simon. The consistency dimension and distribution-dependent learning from queries. Theoretical Computer Science, 288(2):197–215, 2002.
