Bounds in Query Learning
Hunter Chase, James Freitag

Hunter Chase
Department of Mathematics, UIC, Chicago IL
and
James Freitag
Department of Mathematics, UIC, Chicago IL
Abstract.
We introduce new combinatorial quantities for concept classes, and prove lower and upper bounds for learning complexity in several models of query learning in terms of various combinatorial quantities. Our approach is flexible and powerful enough to give new and very short proofs of the efficient learnability of several prominent examples (e.g. regular languages and regular ω-languages), in some cases also producing new bounds on the number of queries. In the setting of equivalence plus membership queries, we give an algorithm which learns a class in polynomially many queries whenever any such algorithm exists.
We also study equivalence query learning in a randomized model, producing new bounds on the expected number of queries required to learn an arbitrary concept. Many of the techniques and notions of dimension draw inspiration from or are related to notions from model theory, and these connections are explained. We also use techniques from query learning to mildly improve a result of Laskowski regarding compression schemes.
Partially supported by NSF grant no. 1700095
1. Introduction
Fix a set $X$ and denote by $\mathcal{P}(X)$ the collection of all subsets of $X$. A concept class $\mathcal{C}$ on $X$ (we will also sometimes call $\mathcal{C}$ a set system on $X$) is a subset of $\mathcal{P}(X)$. In the equivalence query (EQ) learning model, a learner attempts to identify a target set $A \in \mathcal{C}$ by means of a series of data requests called equivalence queries. The learner has full knowledge of $\mathcal{C}$, as well as a hypothesis class $\mathcal{H}$ with $\mathcal{C} \subseteq \mathcal{H} \subseteq \mathcal{P}(X)$. An equivalence query consists of the learner submitting a hypothesis $B \in \mathcal{H}$ to a teacher, who either returns yes if $A = B$, or a counterexample $x \in A \triangle B$. In the former case, the learner has learned $A$, and in the latter case, the learner uses the new information to update and submit a new hypothesis. In sections 2 and 3, the teacher may be assumed to be adversarial and the worst case number of queries required to learn any concept is analyzed. In section 4, we consider the case in which the teacher selects counterexamples randomly according to a fixed but arbitrary distribution.
We will also consider learning with equivalence and membership queries (EQ+MQ). In a membership query, a learner submits a single element $x \in X$ from the base set to the teacher, who returns the value $\chi_A(x)$, where $A$ is the target concept and $\chi_A$ is its characteristic function. In this setting, the learner may choose to make either type of query at any stage, submitting any $x \in X$ for a membership query or submitting any $B \in \mathcal{H}$ for an equivalence query. The learner learns the target concept $A$ when they submit $A$ as an equivalence query.
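As a concrete reading of the two query types, the following Python sketch models a teacher over a finite base set. The class name `Teacher`, its method names, and the rule for picking a counterexample (the least element of the symmetric difference under `repr` ordering) are illustrative assumptions and do not come from the paper; an adversarial or randomized teacher would choose counterexamples differently.

```python
from typing import FrozenSet, Hashable, Iterable, Optional


class Teacher:
    """Hypothetical teacher holding a fixed target concept (a finite set)."""

    def __init__(self, target: Iterable[Hashable]) -> None:
        self.target: FrozenSet[Hashable] = frozenset(target)
        self.equivalence_queries = 0
        self.membership_queries = 0

    def equivalence_query(self, hypothesis: Iterable[Hashable]) -> Optional[Hashable]:
        """Return None for "yes", else some element of target XOR hypothesis."""
        self.equivalence_queries += 1
        difference = self.target ^ frozenset(hypothesis)
        if not difference:
            return None
        # Assumption: any element of the symmetric difference is a legal answer;
        # we pick a deterministic one so that runs are reproducible.
        return min(difference, key=repr)

    def membership_query(self, x: Hashable) -> bool:
        """Return the value of the target's characteristic function at x."""
        self.membership_queries += 1
        return x in self.target
```

A learner in the EQ model only ever calls `equivalence_query`; a learner in the EQ+MQ model may interleave both calls and finishes once `equivalence_query` returns `None`.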
With Theorems 2.6 and 2.24, we give upper bounds for the number of queries required for EQ and EQ+MQ learning a class $\mathcal{C}$ with hypotheses $\mathcal{H}$ in terms of the Littlestone dimension of $\mathcal{C}$, denoted $\mathrm{Ldim}(\mathcal{C})$, and the consistency dimension of $\mathcal{C}$ with respect to $\mathcal{H}$, denoted $\mathrm{C}(\mathcal{C},\mathcal{H})$. We also give lower bounds for the number of required queries in terms of these quantities. In the EQ+MQ setting, the bounds are tight enough to completely characterize when a problem is efficiently learnable. Littlestone dimension is well-known in learning theory [18] and model theory. (In model theory, Littlestone dimension is called Shelah 2-rank; see [10] for additional details.)
Consistency dimension and the related notion of strong consistency dimension are more subtle, which we detail in section 2. When is taken to be , ; for various examples of set systems with , one has . In 2.2, we define a new invariant, the consistency threshold of , and provide a construction (for arbitrary ) of a hypothesis class which is not much more complicated than (of the same Littlestone dimension as ) such that In 2.3, we compare our bounds and invariants to those previously appearing in the literature.
Theorems 2.6 and 2.24 can be used to establish efficient learnability in specific applied settings if one can obtain appropriate bounds on Littlestone dimension and consistency dimension. Let be a collection of concept and hypothesis classes which depends on some parameter . Typically, we are thinking of finite classes which grow with . We prove that whenever can be learned by an algorithm using polynomially many membership queries and equivalence queries from , there must be polynomial bounds on Littlestone and consistency dimension. Moreover, whenever such an algorithm exists, the algorithm given in Theorem 2.24 accomplishes this.
Finally, to close section 2, we explain the connection between strong consistency dimension and a model theoretic property called the finite cover property (fcp), or rather its negation, referred to henceforth as the nfcp. We show that if is the set system given by uniform instances of a fixed first order formula , and is the collection of externally -definable sets, then has finite strong consistency dimension if and only if has the nfcp.
In section 3 we demonstrate the practicality of our approach by providing simple and fast proofs of the efficient learnability of regular languages and certain ω-languages, reproving results of [1, 5, 12, 11]. Besides the conceptual simplicity of the approach, the bounds in learning complexity resulting from our algorithm have some novel aspects. For instance, our bounds have no dependence on the length of the strings provided to the learner as counterexamples, in contrast to existing algorithms.
In section 4 we turn to a randomized variant of EQ-learning in which the teacher is required to choose counterexamples randomly from a known probability distribution on . [4] show that for a concept class of size , there is an algorithm in which the expected number of queries to learn any concept is at most It is natural to wonder whether there is a notion of dimension which can be used to bound the expected number of queries. In fact, Angluin and Dohrn [4, Theorem 25] already consider this, and show that the VC-dimension of the concept class is a lower bound on the number of expected queries. However, [4, Theorem 26], using an example of [18], shows that the VC-dimension cannot provide an upper bound for the number of queries. We show that the Littlestone dimension provides such an upper bound; we give an algorithm which yields a bound which is linear in the Littlestone dimension for the expected number of queries needed to learn any concept.
In section 5, we introduce compression schemes for concept classes. Specifically, the notion we work with is equivalent to -compression with extra bits (of Floyd and Warmuth [13]). In [17], Laskowski and Johnson proved that the concept class corresponding to a stable formula has an extended -compression for some . Later, a result of Laskowski appearing as [14, Theorem 4.1.3] in fact showed that one could take equal to the Shelah 2-rank (Littlestone dimension) and uses many reconstruction functions. We show that many reconstruction functions suffice.
2. A combinatorial characterization of EQ-learnability
Often, one assumes that is finite, and the emphasis is placed on finding bounds on the number of queries it may take to learn any . We also consider the case where is infinite, for which we give the following definition.
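Definition 2.1.
Let $\mathcal{C}$ and $\mathcal{H}$ be set systems on a set $X$. $\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ if there exists some $n < \omega$ and some algorithm to submit hypotheses from $\mathcal{H}$ such that any concept $A \in \mathcal{C}$ is learnable in at most $n$ equivalence queries, given any teacher returning counterexamples. Let $\mathrm{LC}^{EQ}(\mathcal{C},\mathcal{H})$ be the least such $n$ if $\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$, and $\mathrm{LC}^{EQ}(\mathcal{C},\mathcal{H}) = \infty$ otherwise.
$\mathrm{LC}^{EQ}(\mathcal{C},\mathcal{H})$ is called the learning complexity, representing the optimal number of queries needed in the worst-case scenario.
Similarly, $\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ and membership queries if there exists some $n < \omega$ and some algorithm to submit membership queries from $X$ or equivalence queries from $\mathcal{H}$ such that any concept $A \in \mathcal{C}$ is learnable in at most $n$ equivalence queries. The learning complexity is defined similarly and is denoted by $\mathrm{LC}^{EQ+MQ}(\mathcal{C},\mathcal{H})$.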
2.1. EQ-learnability from Littlestone and consistency dimension
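Proposition 2.2. [18, Theorems 5 and 6]
If $\mathrm{Ldim}(\mathcal{C}) \geq d$, then $\mathrm{LC}^{EQ}(\mathcal{C},\mathcal{H}) \geq d+1$. If $\mathcal{H} = \mathcal{P}(X)$, then the converse holds.
Proof.
Suppose $\mathrm{Ldim}(\mathcal{C}) \geq d$. We show that we can force the learner to use at least $d+1$ equivalence queries. Construct a binary element tree of height $d$ with proper labels from $\mathcal{C}$ witnessing $\mathrm{Ldim}(\mathcal{C}) \geq d$. Given the first hypothesis $B_0$ from the learner, return the element on the 0th level of the tree as a counterexample. Continue this, returning the element on the $i$th level along the path consistent with previous counterexamples as the counterexample to hypothesis $B_i$. We will return $d$ counterexamples, and the learner still requires one more hypothesis to identify the concept. Since this will occur for one of the proper labels of the binary element tree, we have forced the learner to use at least $d+1$ equivalence queries for some $A \in \mathcal{C}$.
Suppose $\mathrm{Ldim}(\mathcal{C}) = d$ and $\mathcal{H} = \mathcal{P}(X)$. Let $\mathcal{C}_0 = \mathcal{C}$. Inductively define $\mathcal{C}_i$, $B_i$ as follows. Given $\mathcal{C}_i$, for any $x \in X$ and $j \in \{0,1\}$, let
$$\mathcal{C}_i^{(x,j)} := \{A \in \mathcal{C}_i : \chi_A(x) = j\},$$
where $\chi_A$ is the characteristic function on $A$. Let
$$B_i := \{x \in X : \mathrm{Ldim}(\mathcal{C}_i^{(x,1)}) \geq \mathrm{Ldim}(\mathcal{C}_i^{(x,0)})\}.$$
Submit $B_i$ as the hypothesis. If $B_i$ is correct, we are done. Otherwise, we receive a counterexample $x_i$. Set
$$\mathcal{C}_{i+1} := \mathcal{C}_i^{(x_i,\, 1 - \chi_{B_i}(x_i))}$$
to be the concepts which have the correct label for $x_i$. Observe that at each stage, $\mathrm{Ldim}(\mathcal{C}_{i+1}) < \mathrm{Ldim}(\mathcal{C}_i)$. Therefore, if we make $d$ queries without correctly identifying the target, then we must have $\mathrm{Ldim}(\mathcal{C}_d) = 0$. Then $\mathcal{C}_d$ is a singleton, which must be the target concept. ∎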
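The second half of the proof can be sketched for finite classes as follows. This is a minimal brute-force illustration, assuming concepts are finite Python sets, that every subset of the base set may be submitted as a hypothesis, and that ties between the two Littlestone dimensions are broken towards including the element; the names `ldim`, `soa_learn` and `make_teacher` are hypothetical, and the recursion is exponential, so it is only meant for very small examples.

```python
from functools import lru_cache


def ldim(concepts, domain):
    """Littlestone dimension of a finite class; -1 for the empty class."""
    domain = tuple(domain)

    @lru_cache(maxsize=None)
    def go(cls):
        if not cls:
            return -1
        best = 0
        for x in domain:
            with_x = frozenset(a for a in cls if x in a)
            without_x = cls - with_x
            # x can sit at the root of a tree only if it splits the class.
            if with_x and without_x:
                best = max(best, 1 + min(go(with_x), go(without_x)))
        return best

    return go(frozenset(frozenset(a) for a in concepts))


def soa_learn(concepts, domain, equivalence_query):
    """Learn the target with hypotheses drawn from all subsets of the domain."""
    remaining = frozenset(frozenset(a) for a in concepts)
    queries = 0
    while True:
        # Put x in the hypothesis when the concepts containing x form the
        # side with the larger (or equal) Littlestone dimension.
        hypothesis = frozenset(
            x for x in domain
            if ldim([a for a in remaining if x in a], domain)
            >= ldim([a for a in remaining if x not in a], domain)
        )
        queries += 1
        x = equivalence_query(hypothesis)
        if x is None:
            return hypothesis, queries
        correct_label = x not in hypothesis  # the hypothesis was wrong at x
        remaining = frozenset(a for a in remaining if (x in a) == correct_label)


def make_teacher(target):
    """Teacher returning the least element of the symmetric difference."""
    target = frozenset(target)

    def equivalence_query(hypothesis):
        difference = target ^ hypothesis
        return min(difference) if difference else None

    return equivalence_query
```

For instance, on the five initial segments of `range(4)` the computed dimension is 2, and every target is identified within three equivalence queries.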
Notice in particular that if , then cannot be learned with equivalence queries, even with . The assumption that makes learning straightforward, but this may be too strong for many settings. However, without some additional hypotheses on , learnability may already be hopeless, even for very simple set systems. For instance, let be the set of singletons. If , then we may take as long as to learn if is finite, or never learn at all if is infinite. However, if the learner is allowed to guess , this forces the teacher to identify the target singleton.
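The singleton example can be made concrete with a small simulation. The sketch below assumes a finite base set of integers and a teacher that, whenever possible, returns an element of the submitted hypothesis as the counterexample, so that a learner restricted to singleton hypotheses eliminates only one candidate per query. The function names are hypothetical.

```python
def adversarial_singleton_teacher(target_element):
    """Teacher for the class of singletons with target {target_element}."""
    target = frozenset({target_element})

    def equivalence_query(hypothesis):
        hypothesis = frozenset(hypothesis)
        difference = target ^ hypothesis
        if not difference:
            return None
        # Prefer an element of the hypothesis: it only says "not this one".
        for x in sorted(difference):
            if x in hypothesis:
                return x
        return min(difference)

    return equivalence_query


def learn_with_singleton_hypotheses(domain, equivalence_query):
    """Hypotheses restricted to singletons: try them one by one."""
    for count, x in enumerate(domain, start=1):
        if equivalence_query(frozenset({x})) is None:
            return x, count
    raise ValueError("target is not a singleton of the given domain")


def learn_with_empty_hypothesis(equivalence_query):
    """Guess the empty set first; the counterexample must be the target element."""
    x = equivalence_query(frozenset())
    assert equivalence_query(frozenset({x})) is None
    return x, 2
```

With a six-element base set and the target placed last in the learner's enumeration order, the first learner needs six queries while the second needs two.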
The strategy of Proposition 2.2 permeates both learnability and non-learnability proofs; identifying a specific set amounts to reducing the Littlestone dimension of the family of possible concepts to 0; actually submitting the target concept before the Littlestone dimension reaches 0 can be thought of as a best-case scenario that we cannot rely on. Non-learnability then amounts to an inability to reduce the Littlestone dimension of the family of possible concepts to 0 through a series of finitely many equivalence queries. The main purpose of this section is to give precise conditions on and which characterize learnability.
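Definition 2.3.
Given a set $X$, a partially specified subset $A$ of $X$ is a partial function $A : X \to \{0,1\}$.
- Say $x \in A$ if $A(x) = 1$, $x \notin A$ if $A(x) = 0$, and membership of $x$ is unspecified otherwise. The domain of $A$, $\mathrm{dom}(A)$, is $A^{-1}(\{0,1\})$. Call $A$ total if $\mathrm{dom}(A) = X$. We identify subsets $A \subseteq X$ with total partially specified subsets. The size of $A$, $|A|$, is the cardinality of $\mathrm{dom}(A)$.
- Given two partially specified subsets $A$ and $B$, write $A \sqsubseteq B$ if $\mathrm{dom}(A) \subseteq \mathrm{dom}(B)$ and $A$ and $B$ agree on $\mathrm{dom}(A)$; call $A$ a restriction of $B$ and $B$ an extension of $A$.
- Given a set $Y \subseteq X$, the restriction of $A$ to $Y$ is the partial function $A|_Y$ where $A|_Y(x) = A(x)$ for all $x \in Y$, and is unspecified otherwise.
- Given a set system $\mathcal{C}$ on $X$, $A$ is $n$-consistent with $\mathcal{C}$ if every size $n$ restriction of $A$ has an extension in $\mathcal{C}$. Otherwise, say $A$ is $n$-inconsistent. $A$ is finitely consistent with $\mathcal{C}$ if every restriction of $A$ of finite size has an extension in $\mathcal{C}$; that is, $A$ is $n$-consistent with $\mathcal{C}$ for all $n < \omega$.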
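For finite examples, the notions in this definition can be checked by brute force. The sketch below represents a partially specified subset as a Python dict from elements to 0 or 1 (absent keys are unspecified) and a concept as a set. As an assumption of the sketch, a partial subset with fewer than $n$ specified elements is tested on its whole domain. The function names are hypothetical.

```python
from itertools import combinations


def is_extension(concept, partial):
    """True if the (total) concept agrees with the partial subset on its domain."""
    return all((x in concept) == bool(value) for x, value in partial.items())


def has_extension(partial, concepts):
    """True if some concept in the class extends the partial subset."""
    return any(is_extension(concept, partial) for concept in concepts)


def is_n_consistent(partial, concepts, n):
    """True if every restriction to n specified elements has an extension."""
    items = list(partial.items())
    size = min(n, len(items))  # assumption: small domains are tested whole
    return all(
        has_extension(dict(restriction), concepts)
        for restriction in combinations(items, size)
    )
```

For the class of singletons on a three-element set, the empty set (all three elements labelled 0) is 2-consistent but not 3-consistent, which is the phenomenon that separates consistency at different levels.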
The following definition is a translation into set systems of a definition that first appeared in [8].
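Definition 2.4.
The consistency dimension of $\mathcal{C}$ with respect to $\mathcal{H}$, denoted $\mathrm{C}(\mathcal{C},\mathcal{H})$, is the least integer $n$ such that for every subset $A \subseteq X$ (viewed as a total partially specified subset), if $A$ is $n$-consistent with $\mathcal{C}$, then $A \in \mathcal{H}$. If no such $n$ exists, then say $\mathrm{C}(\mathcal{C},\mathcal{H}) = \infty$.
Observe that $\mathrm{C}(\mathcal{C},\mathcal{H}) = 1$ iff $\mathcal{H}$ shatters the set of all elements $x \in X$ such that there are $A$ and $A'$ in $\mathcal{C}$ such that $x \in A$ but $x \notin A'$. (Recall that a set system $\mathcal{H}$ shatters a set $Y$ if, for all $Z \subseteq Y$, there is $B \in \mathcal{H}$ such that $B \cap Y = Z$.) In this case, it is possible to learn any concept in $\mathcal{C}$ in at most $\mathrm{Ldim}(\mathcal{C}) + 1$ equivalence queries, using the method of Proposition 2.2. So we may assume that $\mathrm{C}(\mathcal{C},\mathcal{H}) > 1$.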
Lemma 2.5.
Suppose that for each , is a concept class on and is a hypothesis class on . Suppose that . Then , where and .
Proof.
We give the proof for ; then the result for follows easily by induction.
To learn a target concept with hypotheses from , begin by assuming that . Attempt to learn by making guesses from , according to the procedure by which any concept in is learnable in at most many queries. If, after making many queries, we have failed to learn , then we conclude that , whence . We can then learn in at most many additional queries with guesses from . ∎
We can now give an upper bound for the learning complexity in terms of Littlestone dimension and consistency dimension.
Theorem 2.6.
Suppose and . Then .
Proof.
We proceed by induction on . The base case, , is trivial, as then is a singleton.
Suppose there is some element such that and , where and . Then by induction, any concept in can be learned in at most queries with guesses from , and the same is true for . Then by Lemma 2.5, any concept in can be learned in at most equivalence queries.
If no such exists, then for all , either or . Let be such that iff .
If , then we submit as our query. If we are incorrect, then by choice of , the class of concepts consistent with the counterexample will have Littlestone dimension . By induction, any concept in can be learned in at most many queries, and so we learn in at most queries.
If , then, since , there are some such that there is no such that . Then, with notation as in the proof of Proposition 2.2,
[TABLE]
and for each . Then, by induction, for each , any concept in can be learned in at most many queries with guesses from . By Lemma 2.5, any concept in can be learned in at most many queries with guesses from . ∎
On the other hand, Proposition 2.2 gives a lower bound of . There is also a lower bound for learning complexity in terms of consistency dimension:
Proposition 2.7. [8, Theorem 2]
Suppose there is some partially specified subset $A$ which is $n$-consistent with $\mathcal{C}$ but which does not have a total extension in $\mathcal{H}$. Then $\mathrm{LC}^{EQ}(\mathcal{C},\mathcal{H}) > n$.
Proof.
By hypothesis, given any equivalence query , the teacher can find some such that . Moreover, since is -consistent with , the teacher is able to return a counterexample of this form for the first equivalence queries. Thus cannot be learned with fewer than equivalence queries from . ∎
In particular, if , then there is some subset which is -consistent with but which does not belong to . Then . So . In fact, we will obtain a stronger bound using strong consistency dimension in section 2.3.
Furthermore, if , then cannot be learned with equivalence queries from . Combining Theorem 2.6 and Propositions 2.2 and 2.7, we obtain the following:
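Theorem 2.8.
$\mathcal{C}$ is learnable with equivalence queries from $\mathcal{H}$ iff $\mathrm{Ldim}(\mathcal{C}) < \infty$ and $\mathrm{C}(\mathcal{C},\mathcal{H}) < \infty$.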
2.2. Obtaining finite consistency dimension
We have established that finite consistency dimension is essential for EQ-learning. The central question we answer in this subsection is: given , can one obtain a hypothesis class which is not much more complicated than with the property that is finite?
Definition 2.9.
Fix a set system on a set . has consistency threshold if, given any hypothesis class , we have that
[TABLE]
Lemma 2.10.
Suppose is a partially specified subset finitely consistent with . Then there is a total extension finitely consistent with .
Proof.
Let be a well-ordering of . Let . We inductively define a -chain of partially specified subsets , where each is defined on and is finitely consistent with . For a limit ordinal, set . It is clear that is finitely consistent with if all for are.
At any successor stage , if , set . Otherwise, we must extend to while remaining finitely consistent with . Assume for contradiction that neither nor are finitely consistent with . Then there are finite sets such that and have no extension in . But has an extension in , and must be an extension of either or , a contradiction. So has a finitely consistent extension to , and we set to be such an extension.
We then take . ∎
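Proposition 2.11.
Let $\mathcal{C}$ be a set system on $X$ and let $A$ be a partially specified subset. The following are equivalent:
(i) $A$ is finitely consistent with $\mathcal{C}$.
(ii) For every hypothesis class $\mathcal{H}$, if $\mathrm{C}(\mathcal{C},\mathcal{H}) < \infty$, then there is a total extension of $A$ in $\mathcal{H}$.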
Proof.
(i) (ii): Let be a total extension finitely consistent with . If , then .
(ii) (i): We show the contrapositive. Suppose that is not finitely consistent with , witnessed by some size restriction , which is a -minimal such restriction. We find some such that but contains no total extension of . Let be the collection of all (total partially specified) subsets which are not extensions of . So has no total extension in . We claim that . Indeed, observe that given any (total partially specified) subset that is -consistent with , we have , and then .
∎
In particular, if , then contains all finitely consistent subsets. That is, extensions of all finitely consistent partially specified subsets (equivalently, by Lemma 2.10, all finitely consistent total partially specified subsets) are necessary to obtain . Consistency threshold classifies when this is a sufficient condition.
Proposition 2.12.
The following are equivalent:
**(i): **
* has consistency threshold .*
**(ii): **
For all (total partially specified) subsets , if is -consistent with , then is finitely consistent with .
**(iii): **
If contains all finitely consistent (total partially specified) subsets, then .
Proof.
(i) (ii): Assume for contradiction that there is some total which is -consistent but not finitely consistent. Let be minimal such that is -inconsistent. Then there is a size restriction that has no extension in . Then let contain all subsets which do not extend .
We claim that . Note that witnesses that . On the other hand, observe that given any partially specified subset that is -consistent with , we have , and then it is easy to see that has a total extension in .
(ii) (iii): If contains all finitely consistent subsets, and all -consistent subsets are finitely consistent, then holds immediately.
(iii) (i): By Proposition 2.11, if , then already has all finitely consistent subsets. Then . ∎
In particular, if has finite consistency threshold, then iff contains all finitely consistent subsets.
Corollary 2.13.
Suppose does not have finite consistency threshold. Then for arbitrarily large , there is some total subset which is -consistent but not -consistent with .
Finite consistency threshold is not strictly necessary to provide a positive answer to the central question of this subsection; nevertheless, it does identify a clear qualitative dividing line. When has finite consistency threshold, only needs to contain all finitely consistent subsets; letting be the set of all finitely consistent subsets, we obtain a minimum hypothesis class such that learning is possible.
Where does not have finite consistency threshold, more is required; we must add some hypotheses which are inconsistent with the concepts in , and there is no minimal such that learning is possible. However, for each , we can replace “finitely consistent” with “-consistent” to obtain a class such that —let be the collection of all subsets which are -consistent with . Note that is clearly the minimum hypothesis class such that .
Note that for all , . By Proposition 2.12, if has consistency threshold , then for all , . If does not have finite consistency threshold, there is no minimal such that ; by Corollary 2.13, if , then there is such that .
By choosing appropriately, given any , we can find a hypothesis class such that without increasing the Littlestone dimension; that is, .
Theorem 2.14.
Suppose . Then there is such that and . Furthermore, we can find such an such that .
Proof.
Fix some . Let be the collection of all subsets which are -consistent with . It is immediate that .
Assume for contradiction that . Consider a binary element tree of height that can be properly labeled with elements of ; in particular, there is some leaf which cannot be labeled with an element of . Consider such a leaf. The path through the binary element tree to this leaf defines a partially specified subset that is -inconsistent with . In particular, any total extension is -inconsistent, so -inconsistent, and so does not belong to . This contradicts our ability to label the leaf with an element of .
In particular, recall that when has finite consistency threshold , is -consistent with iff it is finitely consistent with . So setting as above with at least the finite consistency threshold amounts to setting to be the collection of all finitely consistent partially specified subsets. In this case, even if , as increasing the Littlestone dimension requires adding something inconsistent with .
Regardless of whether has finite consistency dimension, we can let . Then . ∎
2.3. From consistency to strong consistency
From an algorithms perspective, the result of Theorem 2.6 is unsatisfactory, since it is exponential in . We give an example to show that, without modification, we cannot expect a significant improvement.
Example 2.15.
Fix and . Let be distinct elements indexed by finite nonempty sequences of length at most from . For , let . Let . Then .
If we take to also be our hypothesis class, then . Indeed, the (total partially specified) subset is -consistent but not consistent with , witnessed by the restriction of to , so . On the other hand, if is a subset -consistent with , then, by induction on the length of , for each , contains exactly one with , so .
However, it may take as long as many equivalence queries to learn; if the teacher returns as a counterexample to hypothesis , then the learner can only eliminate .
The most promising modification is the following variant of consistency dimension, which also appeared in [8] in a slightly different form.
Definition 2.16.
The strong consistency dimension of $\mathcal{C}$ with respect to $\mathcal{H}$, denoted $\mathrm{SC}(\mathcal{C},\mathcal{H})$, is the least integer $n$ such that for every partially specified subset $A$, if $A$ is $n$-consistent with $\mathcal{C}$, then $A$ has an extension in $\mathcal{H}$. If no such $n$ exists, then say $\mathrm{SC}(\mathcal{C},\mathcal{H}) = \infty$.
We therefore make the stronger requirement that all partially specified subsets that are $n$-consistent be consistent, rather than just all total partially specified subsets. It is immediate from the definition that $\mathrm{C}(\mathcal{C},\mathcal{H}) \leq \mathrm{SC}(\mathcal{C},\mathcal{H})$. At the smallest levels, consistency dimension and strong consistency dimension are equal.
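For very small finite set systems, both dimensions can be computed by exhaustive search, which is a convenient way to experiment with the definitions. The sketch below assumes the dict representation of partially specified subsets (absent keys are unspecified), tests partial subsets with fewer than $n$ specified elements on their whole domain, and returns `None` if no level up to the size of the base set works. The function names are hypothetical and the search is exponential in the size of the base set.

```python
from itertools import combinations, product


def is_extension(concept, partial):
    """True if the concept agrees with the partial subset on its domain."""
    return all((x in concept) == bool(value) for x, value in partial.items())


def is_n_consistent(partial, concepts, n):
    """True if every restriction to n specified elements extends to a concept."""
    items = list(partial.items())
    size = min(n, len(items))
    return all(
        any(is_extension(c, dict(restriction)) for c in concepts)
        for restriction in combinations(items, size)
    )


def consistency_dimension(concepts, hypotheses, domain, strong=False):
    """Least n such that every n-consistent candidate has an extension in
    the hypothesis class. Candidates are total subsets, or all partially
    specified subsets when strong=True."""
    domain = list(domain)
    labels = (0, 1, None) if strong else (0, 1)
    candidates = [
        {x: v for x, v in zip(domain, choice) if v is not None}
        for choice in product(labels, repeat=len(domain))
    ]
    for n in range(1, len(domain) + 1):
        if all(
            any(is_extension(h, p) for h in hypotheses)
            for p in candidates
            if is_n_consistent(p, concepts, n)
        ):
            return n
    return None
```

On the class of singletons of a three-element set, taking the hypothesis class equal to the concept class gives dimension 3 in both senses, while adding the empty set as a hypothesis lowers both to 2.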
Proposition 2.17.
If , then . If , then .
Proof.
Observe that iff iff shatters the set of all elements such that there are and in such that but .
Suppose that . Let be a partially specified subset that is 2-consistent with . We wish to find a total extension of in . It suffices to find a total extension that is 2-consistent with .
Let be a well-ordering of . Let . We inductively define a -chain of partially specified subsets , where each is defined on and is 2-consistent with . For a limit ordinal, set . It is clear that is 2-consistent with if all for are.
At any successor stage , if , set . Otherwise, we must extend to while remaining 2-consistent with . Assume for contradiction that neither nor are 2-consistent with . Then there are , such that and have no extension in . But has an extension in , and must be an extension of either or , a contradiction. So has a 2-consistent extension to , and we set to be such an extension.
We then take to be our total extension. ∎
As the following examples show, consistency dimension and strong consistency dimension may differ when .
Example 2.18.
Let . Let
[TABLE]
One can verify that , but the partially specified subset with unspecified witnesses that .
Example 2.19.
Continuing Example 2.15, observe that . In particular, the partially specified subset given by
[TABLE]
witnesses that . Then we learn in at most many queries. Moreover, this demonstrates that consistency dimension and strong consistency dimension can differ by an arbitrarily large amount (allowing to vary), and that strong consistency dimension may even be exponentially larger than consistency dimension.
Strong consistency dimension, like consistency dimension, categorizes equivalence query learning:
Theorem 2.20.
* is learnable with equivalence queries from iff and . In particular, .*
Proof.
For the reverse direction, use Theorem 2.6 and the observation that .
For the forward direction, use Propositions 2.2 and 2.7. In particular, if , then there is a partially specified subset that is -consistent with but which has no total extension in . Then, by Proposition 2.7, . ∎
Corollary 2.21.
Suppose . Then iff .
The distinction between consistency dimension and strong consistency dimension is subtle, and many previous results hold with little to no modification if one replaces consistency dimension with strong consistency dimension. On the other hand, our work in section 3 will reveal the practical difficulties associated with strong consistency dimension in complicated concept classes.
We have already seen in Theorem 2.20 that strong consistency dimension provides a better lower bound for learning complexity. In the finite case, it is also known to give a stronger upper bound for learning complexity:
Theorem 2.22.
[8, Theorem 2]** Suppose is finite. Then .
Proof.
As this was originally framed in the setting where concepts were represented by strings, we give an abbreviated translation of the original proof into the language of set systems. This proof demonstrates the utility of constructing a partial hypothesis and taking some complete extension.
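Let $c = \mathrm{SC}(\mathcal{C},\mathcal{H})$. At stage $i$, let $\mathcal{C}_i$ be the set of remaining possible target concepts. Let $B_i$ be the partially specified subset given by
$$B_i(x) = \begin{cases} 0 & \text{if } |\{A \in \mathcal{C}_i : x \in A\}| < |\mathcal{C}_i|/c, \\ 1 & \text{if } |\{A \in \mathcal{C}_i : x \notin A\}| < |\mathcal{C}_i|/c, \\ \text{unspecified} & \text{otherwise.} \end{cases}$$
Observe that $B_i$ is $c$-consistent with $\mathcal{C}$: given any $x_1, \ldots, x_c \in \mathrm{dom}(B_i)$, for each $x_j$, fewer than $|\mathcal{C}_i|/c$ many remaining concepts disagree with $B_i$ on $x_j$, so fewer than $|\mathcal{C}_i|$ many concepts disagree with $B_i$ on some $x_j$. So some concept agrees with $B_i$ on $x_1, \ldots, x_c$. So $B_i$ is $c$-consistent.
So we can find some $B \in \mathcal{H}$ extending $B_i$, and we submit $B$ as our hypothesis. By choice of $B_i$, if we receive a counterexample, we will have $|\mathcal{C}_{i+1}| \leq (1 - 1/c)\,|\mathcal{C}_i|$. Repeating this $\lceil c \ln |\mathcal{C}| \rceil$ many times is enough to identify and submit the target concept. ∎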
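The procedure in this proof can be sketched for finite classes as follows. The thresholding rule used to build the partial hypothesis (label an element 0 or 1 when fewer than a $1/c$ fraction of the remaining concepts disagree with that label) is an assumption reconstructed from the argument above, and the sketch further assumes $c \geq 2$ and that $c$ is at least the strong consistency dimension, so that the partial hypothesis always has an extension in the hypothesis class. All function names are hypothetical.

```python
def partial_vote(remaining, domain, c):
    """Label x only when fewer than len(remaining)/c concepts disagree."""
    m = len(remaining)
    partial = {}
    for x in domain:
        containing = sum(1 for a in remaining if x in a)
        if containing * c < m:
            partial[x] = 0
        elif (m - containing) * c < m:
            partial[x] = 1
    return partial


def learn_by_partial_votes(concepts, hypotheses, domain, c, equivalence_query):
    """Equivalence-query learner submitting extensions of the partial vote."""
    remaining = [frozenset(a) for a in concepts]
    hypotheses = [frozenset(h) for h in hypotheses]
    queries = 0
    while True:
        partial = partial_vote(remaining, domain, c)
        # Raises StopIteration if no hypothesis extends the partial vote,
        # which the assumption on c is meant to rule out.
        hypothesis = next(
            h for h in hypotheses
            if all((x in h) == bool(v) for x, v in partial.items())
        )
        queries += 1
        x = equivalence_query(hypothesis)
        if x is None:
            return hypothesis, queries
        correct_label = x not in hypothesis
        remaining = [a for a in remaining if (x in a) == correct_label]


def make_teacher(target):
    """Teacher returning the least element of the symmetric difference."""
    target = frozenset(target)

    def equivalence_query(hypothesis):
        difference = target ^ hypothesis
        return min(difference) if difference else None

    return equivalence_query
```

On the singletons of an eight-element set with the empty set added as a hypothesis and `c = 2`, the learner first submits the empty set and then the target, well within the bound of $\lceil 2 \ln 8 \rceil = 5$ queries.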
In light of Example 2.19, one hopes that improved bounds on learning can be found in terms of strong consistency dimension and Littlestone dimension when is infinite. We are unable to show this presently, but offer some evidence in this direction:
Proposition 2.23.
Suppose and . Then .
Proof.
We know by Proposition 2.2 that is a lower bound. We show that it is also an upper bound.
Let . Inductively define , as follows. Given , for any and , let
[TABLE]
where is the characteristic function on . Construct the partially specified subset where
[TABLE]
We claim that has an extension in . By our assumption that , it suffices to check that is 2-consistent with . Suppose for contradiction that there are such that, without loss of generality, , but there is no extension of in . Then observe that , whence
[TABLE]
so . But we also have , a contradiction, as we could then construct a binary element tree with proper labels from of height with at the root.
Let be a total extension of . Submit as the hypothesis. If is correct, we are done. Otherwise, we receive a counterexample . Set
[TABLE]
Observe that at each stage, . Therefore, if we make queries without correctly identifying the target, then we must have . Then is a singleton, which must be the target concept.
∎
The proof of Proposition 2.23 uses strong consistency in a key way, as the hypothesis is generated by extending a certain partially specified subset. Nevertheless, the conclusion holds under the assumption that , due to Proposition 2.17.
2.4. Adding membership queries and efficient learning of finite classes
Consistency dimension was originally derived from the notion of polynomial certificates, which was used to characterize learning with equivalence and membership queries in the finite case by [15]. The following is an improvement of the upper bound on EQ+MQ learning complexity of implicit in the proof of Theorem 3.1.1 in [15] (stated explicitly in [8]). Our bound replaces with .
Theorem 2.24.
Suppose and . Then , where .
Proof.
The algorithm is similar to that of Theorem 2.6. However, the applications of Lemma 2.5 are replaced with membership queries.
We proceed by induction on . The base case, , is trivial, as then is a singleton. Suppose there is some element such that and , where and . Then by induction, any concept in can be learned in at most queries with guesses from , and the same is true for . Submit as a membership query. This tells us whether the target concept lies in or , and then we require at most many queries, for a total of many queries.
If no such exists, then for all , either or . Let be such that iff .
If , then we submit as our query. If we are incorrect, then by choice of , the class of concepts consistent with the counterexample will have Littlestone dimension . By induction, any concept in can be learned in at most many queries, and so we learn the target in at most queries.
If , then, since , there are some such that there is no such that . (Observe that this cannot happen when . In fact, Proposition 2.17 and the proof of Proposition 2.23 imply that this cannot even happen when . In particular, .) Then, with notation as in the proof of Proposition 2.2,
[TABLE]
and for each . By induction, any concept in each can be learned in at most many queries. By submitting as membership queries, we can determine some such that the target belongs to (if the result of each membership query on is , then we know that ). We therefore learn in at most many queries. ∎
We have a lower bound on learning complexity in terms of consistency dimension in this setting analogous to Proposition 2.7:
Proposition 2.25.
Suppose there is some (total) subset which is -consistent with but which does not have a total extension in . Then . In particular, .
Proof.
We first show that . If the learner submits as a membership query, the teacher returns if possible, that is, if there is a concept which agrees with the previous data and satisfies .
By hypothesis, given any equivalence query , the teacher can find some such that , and the teacher returns a counterexample of this form if possible, that is, if there is a concept which agrees with the previous data and satisfies .
Moreover, since is -consistent with , the teacher is able to return data of this form for the first queries. Thus cannot be learned with fewer than equivalence queries from .
From this, it follows that . ∎
Finally, putting together the various upper and lower bounds from this section we give a characterization of those problems efficiently learnable by equivalence and membership queries:
Theorem 2.26.
Let be a family of concept classes and hypothesis classes, respectively. Let Let The following are equivalent:
(i) is bounded by a polynomial in .
(ii) and are bounded by a polynomial in .
(iii) The algorithm from Theorem 2.24 learns in at most polynomially in many membership queries and equivalence queries in .
Proof.
(ii) ⇒ (iii) follows immediately from Theorem 2.24, and (iii) ⇒ (i) follows by definition of learning complexity.
(i) ⇒ (ii): In Proposition 2.25, we showed that so it follows that if is not polynomially bounded then neither is .
Now suppose that is not polynomially bounded. By [7, Theorem 2.1] (Footnote 5: The inequality of [7] gives a lower bound for which improved on the lower bound of from [20, Theorem 3]. In fact, Theorem 3 of [20] actually suffices for our purposes.) we have
[TABLE]
By [18, Theorems 5 and 6], we can replace with Thus:
[TABLE]
from which it follows that is not polynomially bounded. ∎
Finally, the upper and lower bounds of this section also yield a characterization of which infinite classes are learnable in finitely many equivalence and membership queries.
Corollary 2.27.
iff and .
2.5. The negation of the finite cover property
One can compare set systems with finite strong consistency dimension to the model-theoretic classes of formulas and theories without the finite cover property, which we define below. Informally, the negation of the finite cover property allows for a specific quantitative bound for applications of compactness.
Definition 2.28.
Fix a first order theory . A formula in the language of does not have the finite cover property (nfcp) if there is such that for all , and every , the following holds: if every of size is consistent, then is consistent. We let denote the minimal such .
does not have the finite cover property if all formulas do not have the finite cover property. (Footnote 6: The definition of nfcp on the formula level given here is stronger than the original formulation in [26], but it gives an equivalent characterization on the level of theories.)
Consider the setting where is generated by a formula , that is, and
[TABLE]
That is, consists of the -definable sets. Suppose does not have the finite cover property, witnessed by some . Then, given any disjoint , if every size subset of
[TABLE]
is consistent, then is consistent. We can identify this partial type with the partially specified subset where
[TABLE]
By passing to an -saturated extension to obtain a larger parameter set, we can find satisfying . Then is a total extension of .
Supposing we have passed to an saturated extension , we can let
[TABLE]
That is, consists of all externally -definable subsets of , as contains realizations of all consistent partial -types over . By the compactness theorem, this means that contains realizations of all finitely consistent partial -types over . Having identified partially specified subsets of with their corresponding -type, this amounts to observing that contains total extensions of all finitely consistent partially specified subsets, equivalently, contains all finitely consistent total subsets.
This gives a model-theoretic motivation to the strategy suggested by Proposition 2.12. Adding all finitely consistent subsets to amounts to saturating so as to realize all -types over .
If has nfcp with , then the finitely consistent partial types are exactly the -consistent types. Then contains total extensions of all -consistent partially specified subsets, so . Note that -types witnessing that is the minimal such at which has nfcp give partially specified subsets witnessing that . This reflects a variant of Proposition 2.12 for strong consistency dimension.
In particular, formulas such that has nfcp provide a rich family of examples where has finite (strong) consistency threshold. That is, for such , it is necessary and sufficient for to contain all externally -definable subsets (that is, all total finitely consistent partially specified subsets) to obtain . On the other hand, when has the finite cover property, the externally definable sets are no longer sufficient, and one must venture beyond the sets is capable of cutting out to obtain (that is, by adding some sets which are inconsistent).
Furthermore, Littlestone dimension of (that is, the Littlestone dimension of ) is expressible as a first-order property. So we will have . So when the context is a set system generated by a stable formula with nfcp, we can obtain a set system such that , but is not much more complicated than the original set system - has the same Littlestone dimension as . This is essentially the content of Theorem 2.14 when has finite consistency threshold.
We give an example from model theory where has the fcp.
Example 2.29.
Let be a structure in the language , where is an equivalence relation with one class of size for each , possibly with some infinite classes. Let
[TABLE]
and let .
Suppose are the elements belonging to the equivalence class of size . Then the -type is -consistent but inconsistent. Since there are equivalence classes of arbitrarily large size, these witness that is not nfcp. One can check that .
In any -saturated elementary extension of , no additional elements are added to the finite equivalence classes already present in , though adds new infinite classes and new elements to any existing infinite classes.
An attempt to learn by equivalence queries following the strategy of Theorem 2.6 would be as follows. We are attempting to identify some . Letting be an element in a new infinite equivalence class in , we guess . Then any counterexample will identify an element belonging to the equivalence class of . If belongs to an infinite class, then we can find some which is a new element of this class. Then . Then is the only available counterexample, and we submit the correct concept at our next turn. However, if belongs to the finite class of size , then has no new elements in this class. Then the relevant queries, which are of the form for in the class of , are already present in . Then we are essentially attempting to identify a singleton from a set of size , and it is clear that the process could take up to additional guesses.
3. Efficient learnability of regular languages
In a seminal paper, [1] showed that regular languages are efficiently learnable with equivalence queries plus membership queries, and in this subsection, we will use Theorem 2.24 to give an alternate short proof of this fact. (Footnote 7: In the following sections, we only make use of proper equivalence queries, that is, . We shall therefore let , which we will call the consistency dimension of (with analogous notation for strong consistency dimension).) Let be the class of binary regular languages on strings of length at most specified by a deterministic finite automaton on at most nodes. The algorithm of [1] specifically uses equivalence queries and membership queries. We let denote the collection of (equivalence classes of) deterministic finite automata accepting binary strings and having at most nodes. The proof of the next proposition is straightforward.
Proposition 3.1.
The Littlestone dimension of is at most .
Proof.
In [16, Proposition 1], it is shown that From this, it follows that the Littlestone dimension of is at most . ∎
The proof of the following proposition reveals the connection between consistency and the Myhill-Nerode theorem.
Proposition 3.2.
.
Proof.
Fix a subset of binary strings and binary strings. We say that is a (-) distinguishing extension of and if but or vice versa. If and have no distinguishing extension, then we say and are -equivalent, and write . The Myhill-Nerode theorem [23] says that a subset of binary strings of length is the accept set of a finite automaton with at most nodes if and only if the number of classes is at most . Thus, given any subset of the binary strings of length which is not a regular language recognized by an automaton with at most nodes, there are at least -classes of elements. Pick representatives from classes, and for each , pick some that is a distinguishing extension of and . Then restricting to the partial assignment on , a domain of size that witnesses that for all , we can see that this restriction is inconsistent with the class of regular languages recognized by automata with at most nodes. Therefore . (Footnote 8: Note that the same proof shows that the consistency dimension of is also at most .) ∎
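The construction in this proof can be illustrated with a minimal Python sketch. It is not the authors' procedure: the helper names (`binary_strings`, `distinguishing_extension`, `pick_representatives`, `witness_domain`) and the search bounds `max_len` and `max_ext_len` are hypothetical, and because extensions are only searched up to a bounded length, "no extension found" is merely evidence of equivalence, not a proof. The sketch collects pairwise distinguishable strings and then builds the small partial assignment that records one distinguishing extension per pair.

```python
from itertools import combinations, product


def binary_strings(max_len):
    """Yield all binary strings of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def distinguishing_extension(member, x, y, max_ext_len):
    """Return some z with member(x+z) != member(y+z), or None if none is found.

    Only extensions of length <= max_ext_len are searched (an assumption of
    this sketch), so None does not prove that x and y are equivalent.
    """
    for z in binary_strings(max_ext_len):
        if member(x + z) != member(y + z):
            return z
    return None


def pick_representatives(member, max_len, max_ext_len):
    """Greedily collect strings that are pairwise distinguishable."""
    reps = []
    for x in binary_strings(max_len):
        if all(distinguishing_extension(member, x, r, max_ext_len) is not None
               for r in reps):
            reps.append(x)
    return reps


def witness_domain(member, reps, max_ext_len):
    """Partial assignment recording one distinguishing extension per pair.

    For k representatives the domain has at most 2 * C(k, 2) strings.
    """
    assignment = {}
    for x, y in combinations(reps, 2):
        z = distinguishing_extension(member, x, y, max_ext_len)
        assignment[x + z] = member(x + z)
        assignment[y + z] = member(y + z)
    return assignment
```

For instance, with the toy language "the number of 1s is divisible by 3", calling `pick_representatives(lambda s: s.count("1") % 3 == 0, 3, 2)` finds three pairwise distinguishable strings, and `witness_domain` then returns a partial assignment on at most six strings that certifies they are pairwise distinguishable.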
Now, by Theorem 2.24 and the previous two results, it follows that:
Theorem 3.3.
The class is learnable in at most equivalence queries and at most membership queries.
It is interesting to note that contrary to , when using the algorithm from Theorem 2.24, there is no dependence on , the length of the binary strings which the teacher is allowed to provide as counterexamples. (Footnote 9: We should also note that was improved by Schapire to give a better bound on membership queries (still depending on ) [25].)
Theorem 2.6 now implies that is learnable in at most equivalence queries. Theorem 2.22 shows that a finite class is learnable in at most equivalence queries. Since [3] showed that is not learnable in polynomially many equivalence queries, it follows that cannot be polynomial in .
3.1. Learning -languages
In this section, we consider the natural extension to languages on infinite strings indexed by , called -languages. For an alphabet , we denote by , the strings of symbols from of order type . Similar to the previous section, we consider an automaton, which consists of the collection where is a finite collection of states, is the initial state, and is a transition rule. To form a language, an automaton is equipped with an acceptance criterion. (Footnote 10: Numerous acceptance criteria have been extensively studied in the literature, and we refer the reader to [5, 12, 11] for overviews.) Fix a subset . A run of a Büchi automaton is accepting if and only if it visits the set infinitely often. An -language is -regular if it is recognized by a non-deterministic Büchi automaton. A run of a co-Büchi automaton is accepting if and only if it visits only finitely often. Let be a function, which we think of as a coloring of the states of the automaton. Let be the minimum color which is visited infinitely often. A run of a parity automaton is accepting if and only if is odd.
Two -regular languages are equivalent if they agree on the set of periodic words [21], which allows for the possibility of recognizing the -language using finitary automata. This is the approach of [5, 12], whose notation we follow closely. A family of DFAs (FDFA) is a pair where is a DFA with states and is a collection of many DFAs, which we refer to as progress DFAs - one DFA for each state of . Given a pair of finite words, , a run of our family of DFAs consists of running on , then running on where is the ending state of on . The pair can be used to represent an infinite periodic word .
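The run of a family of DFAs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of a DFA, the names `run_dfa`, `fdfa_accepts`, `LEADING` and `PROGRESS`, and the acceptance convention (the pair is accepted when the selected progress DFA ends in one of its accepting states) are choices made for this sketch, and the toy automata are hypothetical.

```python
def run_dfa(dfa, word):
    """Return the state a DFA reaches after reading `word` from its start state."""
    state = dfa["start"]
    for symbol in word:
        state = dfa["delta"][(state, symbol)]
    return state


def fdfa_accepts(leading, progress, u, v):
    """Run the leading DFA on u, then the progress DFA of the end state on v.

    Assumed convention: the pair (u, v), standing for the periodic word
    u v v v ..., is accepted when that progress DFA ends in an accepting state.
    """
    end_state = run_dfa(leading, u)
    progress_dfa = progress[end_state]
    return run_dfa(progress_dfa, v) in progress_dfa["accepting"]


# Toy family for "the letter a occurs infinitely often" over {a, b}:
# one leading state, and a progress DFA that checks whether v contains an a.
LEADING = {"start": "q0", "delta": {("q0", "a"): "q0", ("q0", "b"): "q0"}}
PROGRESS = {
    "q0": {
        "start": "no_a",
        "accepting": {"seen_a"},
        "delta": {
            ("no_a", "a"): "seen_a",
            ("no_a", "b"): "no_a",
            ("seen_a", "a"): "seen_a",
            ("seen_a", "b"): "seen_a",
        },
    }
}
```

In this toy family, `fdfa_accepts(LEADING, PROGRESS, "bb", "ab")` holds because the period contains an `a`, while the pair `("a", "b")` is rejected because its period does not.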
Let be the class of families of deterministic finite automata where the leading automaton has at most nodes and the progress automata each have at most nodes. It is not quite true that, once an -regular language has been reduced to an FDFA, one can use directly to learn the various DFAs in the family [5, section 4]. It is also not completely obvious what the bounds for Littlestone and consistency dimension are in terms of the DFAs in the family, but the next two results give such bounds, which imply the efficient learnability of -regular languages.
Proposition 3.4.
The class has Littlestone dimension at most
Proof.
The number of FDFAs of size is clearly at most That is
[TABLE]
It follows that
[TABLE]
and using [16, Proposition 1], the desired bound follows. ∎
Proposition 3.5.
.
Proof.
A run of an FDFA on can be simulated by the run of an appropriate automaton in the class To see this, form the input word $u\$v$ from $u,v$ and add transitions from each state of the leading automaton to the initial state of the corresponding progress DFA. Now it follows by Proposition 3.2 that the consistency dimension of $FDFA(n,m)$ is at most $2\binom{n(m+1)}{2}$. ∎
Using the previous two results together with Theorem 2.24, one can deduce the efficient learnability of :
Theorem 3.6.
The class is learnable in at most equivalence queries and at most membership queries.
We have formulated our bounds in terms of the number of states in the FDFA corresponding to a given -language. In [5, 12] bounds on the number of states of FDFAs in terms of the number of states of automata for -languages with various acceptors are given. Specifically, the following bounds hold:
- (1) When is a deterministic Büchi (DBA) or co-Büchi (DCA) automaton with states, there is an equivalent FDFA of size at most [12, 5.3].
- (2) When is a deterministic parity automaton (DPA) with states and colors, there is an equivalent FDFA of size at most [12, 5.4].
- (3) When is a nondeterministic Büchi automaton (NBA) with states, there is an equivalent FDFA of size at most .
Any NBA can be translated into a DPA, and so (2) yields the efficient learnability of -regular languages in terms of the number of states in a DPA (this translation also yields (3)). However, the translation from NBA to DPA is known to require an exponential increase in the number of states in general [24]. From an FDFA of size at most there is a translation into an NBA with at most states [12, Theorem 5.8], and so it follows that the exponential increase in states in moving from NBAs to FDFAs is necessary [12, Theorem 5.6].
Finally, we mention that [6] define restricted classes of -languages for which right-congruence is fully informative, and isolate numerous classes (e.g. for each type of acceptor from the previous subsection) of -languages for which an infinitary variant of the Myhill-Nerode theorem holds. This variant of Myhill-Nerode is sufficient to bound the consistency dimension (and thus establish the learnability) of the classes in terms of the number of right equivalence classes of , similar to the proof of Proposition 3.2.
4. Random counterexamples and EQ-learning
In section 2 we characterized learnability by equivalence queries in terms of Littlestone dimension and strong consistency dimension. The setting of equivalence query learning [2] as described in section 2 deals with worst-case bounds for algorithmic identification of concepts by a learner. In this section, we follow [4] and analyze a slightly different situation, in which the teacher selects the counterexamples at random, and we seek to bound the expected number of queries. [4] worked specifically with concept classes coming from boolean matrices, which was convenient for their notation. Our formulation is equivalent, but we use slightly different notation.
Throughout this section, let be a finite set, let be a set system on , and let be a probability measure on . For , let denote the symmetric difference of and .
Definition 4.1.
We denote, by for and , the set system For and , we let
[TABLE]
For any either or has Littlestone dimension strictly less than that of and so:
Lemma 4.2.
For and with
[TABLE]
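As a small illustration of the observation preceding Lemma 4.2, the following Python sketch computes the Littlestone dimension of a finite set system by brute force and lets one check that fixing the value at any point leaves at least one of the two restricted classes with strictly smaller dimension. The encoding (a concept is a `frozenset` of points, a class is a `frozenset` of concepts), the function names, and the convention that the empty class has dimension -1 are assumptions of this sketch.

```python
from functools import lru_cache
from itertools import combinations


def restrict(concepts, x, label):
    """Concepts whose value at the point x equals `label` (True or False)."""
    return frozenset(c for c in concepts if (x in c) == label)


@lru_cache(maxsize=None)
def ldim(concepts, domain):
    """Littlestone dimension of a finite class, by exhaustive recursion.

    `concepts` is a frozenset of frozensets and `domain` a tuple of points.
    Convention of this sketch: the empty class has dimension -1.
    """
    if not concepts:
        return -1
    if len(concepts) == 1:
        return 0
    best = 0
    for x in domain:
        without_x = restrict(concepts, x, False)
        with_x = restrict(concepts, x, True)
        if without_x and with_x:  # x splits the class
            best = max(best, 1 + min(ldim(without_x, domain),
                                     ldim(with_x, domain)))
    return best


def power_set(domain):
    """All subsets of `domain`, as a class of concepts."""
    return frozenset(frozenset(s)
                     for r in range(len(domain) + 1)
                     for s in combinations(domain, r))
```

The recursion is exponential and only meant for toy classes. On the class of singletons over three points it returns 1, and on the full power set of a two-point domain it returns 2.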
Next, we define a directed graph which is similar to the elimination graph of [4].
Definition 4.3.
We define the thicket query graph to be the weighted directed graph on vertex set such that the directed edge from to has weight equal to the expected value of over with respect to the distribution . (Footnote 11: Here one should think of the query by the learner as being , and the actual hypothesis being . The teacher samples from , and the learner now knows the value of the hypothesis on .)
Definition 4.4.
The query rank of is defined as:
Lemma 4.5.
For any ,
Proof.
Noting that and using Lemma 4.2:
[TABLE]
∎
Definition 4.6.
[4, Definition 14] Let be a weighted directed graph and A deficient -cycle in is a sequence of distinct vertices such that for all , with strict inequality for at least one .
The next result is similar to Theorem 16 (the case ) and Theorem 17 (the case ) of [4], but our proof is rather different (note that the case follows easily from Lemma 4.5).
Theorem 4.7.
The thicket query graph has no deficient -cycles for
The analogue of Theorem 16 can be adapted in a very similar manner to the technique employed by [4]. However, the analogue of the proof of Theorem 17 falls apart in our context; the reason is that Lemma 4.2 is analogous to Lemma 6 of [4] (and Lemma 4.5 is analogous to Lemma 13 of [4]), but our lemmas involve inequalities instead of equations. The inductive technique of [4, Theorem 17] is to shorten deficient cycles by considering the weights of a particular edge in the elimination graph along with the weight of the edge in the opposite direction. Since one of those weights being large forces the other to be small (by the equalities of their lemmas), the induction naturally separates into two useful cases. In our thicket query graph, things are much less tightly constrained - one weight of an edge being large does not force the weight of the edge in the opposite direction to be small. However, the technique employed in our proof seems to be flexible enough to adapt to prove Theorems 16 and 17 of [4].
Proof.
Suppose the vertices in the deficient -cycle are .
By the definition of deficient cycles and we have, for each , that
[TABLE]
so clearing the denominator we have
[TABLE]
Note that throughout this argument, the coefficients are being calculated modulo . Notice that for at least one value of , the inequality in 4.1 must be strict.
Let be a partition of
[TABLE]
Now define
[TABLE]
The following fact follows from the definition of and .
Fact 4.8.
The set is the disjoint union, over all partitions of into two pieces such that and of the sets
Now, take the sum of the inequalities 4.1 as ranges from to . On the LHS of the resulting sum, we obtain
[TABLE]
On the RHS of the resulting sum we obtain
[TABLE]
Given a partition of we note that the term appears exactly once as an element of the above sum for a fixed value of exactly when and or and
Consider the partition of . Suppose that is a block of elements each contained in , and that are in . Now consider the terms and of the above sums (in each of which appears).
On the left hand side, we have and . Note that for , we have So, by Lemma 4.2, we have
[TABLE]
On the RHS, we have
[TABLE]
For each partition of , the terms appearing in the above sum occur in pairs as above by Fact 4.8, and so the LHS is at least as large as the RHS of the sum of inequalities 4.1, which is impossible, since one of the inequalities must have been strict by the definition of a deficient cycle. ∎
Theorem 4.9.
There is at least one element with query rank at least .
Proof.
If not, then for every element , there is some element such that . So, pick, for each , an element such that Now, fix and consider the sequence of elements of given by ; since is finite, at some point the sequence repeats itself. So, take a list of elements . By construction, this yields a deficient cycle, contradicting Theorem 4.7. ∎
4.1. The thicket max-min algorithm
In this subsection we show how to use the lower bound on query rank proved in Theorem 4.9 to give an algorithm which yields the correct concept in linearly (in the Littlestone dimension) many queries from . The approach is fairly straightforward—essentially the learner repeatedly queries the highest query rank concept. The approach is similar to that taken in [4, Section 5] but with query rank in place of their notion of informative.
Now we informally describe the thicket max-min algorithm. At stage , the learner is given information of a concept class The learner picks the query
[TABLE]
The algorithm halts if the learner has picked the actual concept . If not, the teacher returns a random element at which point the learner knows the value of Then
[TABLE]
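The informal description above can be turned into a small simulation. The following Python sketch is an interpretation, not the authors' code, and rests on several assumptions: the edge weight from a query to a target is read as the expected drop in Littlestone dimension when the counterexample is drawn from their symmetric difference; the measure is taken to be uniform on that symmetric difference; the query rank is the minimum edge weight over the other concepts; and ties are broken by a fixed ordering. All function names are hypothetical, and the brute-force `ldim` is only suitable for toy classes.

```python
import random
from fractions import Fraction


def ldim(concepts, domain):
    """Littlestone dimension of a finite class (frozenset of frozensets); -1 if empty."""
    if len(concepts) <= 1:
        return len(concepts) - 1
    best = 0
    for x in domain:
        inside = frozenset(c for c in concepts if x in c)
        outside = concepts - inside
        if inside and outside:
            best = max(best, 1 + min(ldim(inside, domain), ldim(outside, domain)))
    return best


def consistent(concepts, x, label):
    """Concepts agreeing with the revealed value `label` at the point x."""
    return frozenset(c for c in concepts if (x in c) == label)


def edge_weight(concepts, domain, query, target):
    """Expected Ldim drop if `query` is guessed and `target` is the truth.

    Assumes the counterexample is uniform on the symmetric difference.
    """
    total = ldim(concepts, domain)
    drops = [total - ldim(consistent(concepts, x, x in target), domain)
             for x in query ^ target]
    return Fraction(sum(drops), len(drops))


def query_rank(concepts, domain, query):
    """Worst-case (minimum) expected drop over all other possible targets."""
    return min(edge_weight(concepts, domain, query, b)
               for b in concepts if b != query)


def learn(concepts, domain, target, rng):
    """Repeatedly query a concept of maximal query rank until it is the target."""
    queries = 0
    while True:
        if len(concepts) == 1:
            guess = next(iter(concepts))
        else:
            ordered = sorted(concepts, key=sorted)  # fixed order for tie-breaking
            guess = max(ordered, key=lambda a: query_rank(concepts, domain, a))
        queries += 1
        if guess == target:
            return guess, queries
        x = rng.choice(sorted(guess ^ target))  # random counterexample
        concepts = consistent(concepts, x, x in target)
```

A typical call is `learn(concepts, domain, target, random.Random(0))`, which returns the identified concept together with the number of equivalence queries used. Each counterexample removes at least the queried concept, so the loop always terminates on a finite class that contains the target.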
Let be the expected number of queries before the learner correctly identifies the target concept.
Theorem 4.10.
The expected number of queries to learn a concept in a class is less than or equal to
Proof.
The expected drop in the Littlestone dimension of the concept class induced by any query before the algorithm terminates is at least by Theorem 4.9; so the probability that the drop in the Littlestone dimension is positive is at least for any given query. So, from queries, one expects at least drops in Littlestone dimension. ∎
We give a rough bound on the probability that the algorithm has not terminated after a certain number of queries. Since a query can reduce the Littlestone dimension of the induced concept class by at most and the expected drop is at least , the probability that a query reduces the Littlestone dimension is at least . Then the probability that the Littlestone dimension of the induced concept class after queries is positive is at most the probability of fewer than many successes in the binomial distribution with probability and trials. It follows by Hoeffding’s inequality that the probability that the algorithm has not terminated after steps is at most
[TABLE]
5. Compression schemes and stability
In this section, we follow the notation and definitions given in [14] on compression schemes, a notion due to Littlestone and Warmuth [19]. Roughly speaking, admits a -dimensional compression scheme if, given any finite subset of and some , there is a way of encoding the set with only -many elements of in such a way that can be recovered. We will give a formal definition, but we note that numerous variants of this idea appear throughout the literature. For instance:
- Size -array compression [9].
- Extended compression schemes with extra bits [13].
The next definition, which is the notion of compression we will work with in this section is equivalent to the notion of a -compression with extra bits (of Floyd and Warmuth) [17, see Proposition 2.1].
Definition 5.1.
We say that a concept class has an -compression if there is a compression function and a finite set of reconstruction functions such that for any
- (1)
- (2) for at least one
We work with the above notion mainly because it is the notion used in [14], and our goal is to improve a result of Laskowski appearing there [14, Theorem 4.1.3]. In [17], Laskowski and Johnson prove that the concept class corresponding to a stable formula has an extended -compression for some . The precise value of is not determined, but was conjectured to be the Littlestone dimension. A later unpublished result of Laskowski appearing as [14, Theorem 4.1.3] in fact showed that one could take equal to the Shelah 2-rank (Littlestone dimension) and uses many reconstruction functions. In Theorem 5.4, we will show that many reconstruction functions suffice.
The question of Johnson and Laskowski is the analogue (for Littlestone dimension) of a well-known open question from VC-theory [13]: is there a bound linear in such that every class of VC-dimension has a compression scheme of size at most ? In general there is known to be a bound which is at most exponential in [22].
Definition 5.2.
Suppose . Given a partial function , say that is exceptional for if for all ,
[TABLE]
has Littlestone dimension .
Definition 5.3.
Suppose . Let be the partial function given by
[TABLE]
It is clear that extends any partial function exceptional for .
Theorem 5.4.
Any concept class of Littlestone dimension has an extended -compression with -many reconstruction functions.
Proof.
If , then is a singleton, and one reconstruction function suffices. So we may assume .
Fix some with domain . We will run an algorithm to construct a tuple of length at most from by adding one element at each step of the algorithm. During each step of the algorithm, we also have a concept class , with initially.
If is exceptional in , then the algorithm halts. Otherwise, pick either:
- such that and
[TABLE]
has Littlestone dimension less than . In this case, set
- such that and
[TABLE]
has Littlestone dimension less than . In this case, set
We allow the algorithm to run for at most steps. There are two distinct cases. If our algorithm has run for steps, let be the tuple of all of the elements as above followed by all of the elements as above for . By choice of and , this tuple consists of distinct elements. By construction the set
[TABLE]
has Littlestone dimension 0, that is, there is a unique concept in this class. So, given consisting of distinct elements, for , we let be some belonging to
[TABLE]
if such a exists. By construction, for some , the Littlestone dimension of the concept class is zero, and so is uniquely specified and will extend .
We handle cases where the algorithm halts early by augmenting two of the reconstruction functions and defined above. Because and have so far only been defined for tuples consisting of distinct elements, we can extend these to handle exceptional cases by generating tuples with duplicate elements.
If the algorithm stops at some step , then it has generated a tuple of length consisting of some elements and some elements . Let consist of the elements chosen during the algorithm, and let consist of the elements chosen during the running of the algorithm. Observe that is exceptional for .
If is not empty, with initial element , then let . From this tuple, one can recover (assuming is nonempty), so we let be some total function extending , which itself extends . So extends whenever the algorithm halts before step is completed and some was chosen at some point. If is empty, then let , where is the initial element of . From this tuple, one can recover (assuming is empty), so we let be some total function extending , which itself extends . Finally, if the algorithm terminates during step 1, then it has generated the empty tuple. In this case, let for some . Then for some . In particular, if we have defined above for some where the algorithm only returns (rather than the empty tuple), then , and so any such is handled by . So we may overwrite to set to be a total function extending , which itself extends . For any tuple output by our algorithm, one of the reconstruction functions produces an extension of the original concept.
∎
References
- [1] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987.
- [2] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.
- [3] Dana Angluin. Negative results for equivalence queries. Machine Learning, 5(2):121–150, 1990.
- [4] Dana Angluin and Tyler Dohrn. The power of random counterexamples. In International Conference on Algorithmic Learning Theory, pages 452–465, 2017.
- [5] Dana Angluin and Dana Fisman. Learning regular omega languages. Theoretical Computer Science, 650:57–72, 2016.
- [6] Dana Angluin and Dana Fisman. Regular omega-languages with an informative right congruence. arXiv preprint arXiv:1809.03108, 2018.
- [7] Peter Auer and Philip M. Long. Simulating access to hidden information while learning. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 263–272. ACM, 1994.
- [8] José L. Balcázar, Jorge Castro, David Guijarro, and Hans-Ulrich Simon. The consistency dimension and distribution-dependent learning from queries. Theoretical Computer Science, 288(2):197–215, 2002.
