Exact Learning from an Honest Teacher That Answers Membership Queries
Nader H. Bshouty

TL;DR
This paper surveys methods for exactly learning functions from an honest teacher through membership queries, highlighting known results, techniques, and open challenges in the field.
Contribution
It provides a comprehensive overview of existing literature, techniques, and open problems in exact learning from membership queries.
Findings
Summarizes key results in exact learning from membership queries.
Discusses various techniques used in the literature.
Identifies open problems and future research directions.
Abstract
Consider a teacher that holds a function $f:X\to R$ from some class of functions $C$. The teacher can receive from the learner an element $d$ in the domain $X$ (a query) and returns the value of the function at $d$, $f(d)\in R$. The learner's goal is to find $f$ with a minimum number of queries, optimal time complexity, and optimal resources. In this survey, we present some of the results known from the literature, different techniques used, some new problems, and open problems.
Technion, Haifa, Israel
1 Introduction
Robert Dorfman’s 1943 paper introduced the field of Group Testing. The motivation arose during the Second World War, when the United States Public Health Service and the Selective Service embarked upon a large-scale project to weed out all syphilitic men called up for induction. However, syphilis testing back then was expensive, and testing every soldier individually would have been costly and inefficient. A basic breakdown of a test is: draw a sample from a given individual, perform the required tests, and determine the presence or absence of syphilis. Suppose we have $n$ soldiers. Then this method of testing leads to $n$ tests. Our goal is to achieve effective testing in a scenario where it does not make sense to test a very large number of people to find (say) only a handful of positives. The feasibility of a more effective testing scheme hinges on the following property: we can combine blood samples and test the combined sample to check whether at least one soldier has syphilis [277].
Let $S$ be the set of the $n$ soldiers and let $I\subseteq S$ be the set of the sick soldiers. Suppose we know that the number of sick soldiers, $|I|$, is bounded by some integer $d$. If $Q\subseteq S$ is the set of soldiers whose blood samples are combined, then the test is positive if and only if $Q\cap I$ is not empty. Thus, we can regard the set of sick soldiers $I$ as a Boolean function $f_I$ and the answer to the test “Is $Q\cap I$ not empty?” as $f_I(Q)=1$ if and only if $Q\cap I\neq\emptyset$. The goal is to identify the function $f_I$ (and therefore the sick soldiers) from a minimal number of substitutions (tests) and in optimal time. We can also identify the set of soldiers $S$ with the set $[n]=\{1,\ldots,n\}$ and regard each test $Q$ as an assignment $a\in\{0,1\}^n$, where $a_i=1$ if and only if the $i$th soldier's blood is in the test. Then the set $\{0,1\}^n$ is the set of all possible tests. The set of sick soldiers $I$ corresponds to a Boolean function $f_I:\{0,1\}^n\to\{0,1\}$ where $f_I(x)=\vee_{i\in I}x_i$ and $\vee$ is the Boolean or (disjunction). So this problem is also equivalent to the problem of identifying a hidden Boolean disjunction of up to $d$ variables with a minimal number of substitutions and in optimal time.
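To make the reduction concrete, here is a minimal Python sketch of adaptive group testing by binary splitting. It is an illustration only, not an algorithm taken from the survey: the names `make_test_oracle` and `find_sick`, and the representation of soldiers as the integers `0..n-1`, are assumptions made for this example. The oracle answers exactly the pooled test described above (does the pool contain at least one sick soldier?).

```python
from typing import Callable, List, Set, Tuple


def make_test_oracle(sick: Set[int]) -> Callable[[Set[int]], bool]:
    """Hypothetical teacher: a pooled test is positive iff the pool meets `sick`."""
    def test(pool: Set[int]) -> bool:
        return bool(pool & sick)
    return test


def find_sick(n: int, test: Callable[[Set[int]], bool]) -> Tuple[Set[int], int]:
    """Adaptive binary splitting; returns (sick soldiers found, number of tests)."""
    queries = 0
    found: Set[int] = set()

    def search(candidates: List[int]) -> None:
        # Precondition: the pool `candidates` is already known to be positive.
        nonlocal queries
        if len(candidates) == 1:
            found.add(candidates[0])
            return
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        queries += 1
        if test(set(left)):
            search(left)
            # The right half may or may not be positive, so it needs its own test.
            queries += 1
            if test(set(right)):
                search(right)
        else:
            # The left half is clean, so the positive result comes from the right.
            search(right)

    everyone = list(range(n))
    queries += 1
    if test(set(everyone)):
        search(everyone)
    return found, queries
```

With $d$ sick soldiers, this sketch uses roughly $d\log n$ pooled tests (up to a constant factor) instead of $n$ individual tests.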
Another interesting problem is learning a decision tree with a minimal number of queries. Suppose a restaurant owner wants to learn each customer’s taste preferences in food. For every customer, she offers a sample of a meal that the customer has never ordered before and then receives some feedback. The customer’s taste preference depends on some attributes of the food, for example, “sweet”, “sour”, “salty”, “umami”, “bitter”, “greasy”, “hot”, etc. The goal is to learn (find out) the customer’s taste preference from a minimal number of samples. Each sample can be regarded as a set of attributes. The customer’s taste preference is the objective function. This function depends on the attributes, and the value of the function is the customer’s feedback. In many cases, the target function can be described as a decision tree. See the example in Figure 1.
In the following subsection, we give a framework for the above problems and many other similar problems.
1.1 The Learning Model
Let the domain (instance space) be the set $X_n$ and the range be the set $R_n$. Let $C_n$ be a class of representations of functions $f:X_n\to R_n$ (target class, concept class). We are given a teacher (black box, opponent player, responder) that holds a (target) function (concept) $f$ from the class $C_n$. The learner (player, questioner) can ask the teacher membership queries (for Boolean functions, i.e., $R_n=\{0,1\}$) or substitution queries (for non-Boolean functions), i.e., it can send the teacher an element $d$ of the domain $X_n$ and the teacher returns $f(d)$. The learner knows $C_n$. Our (the learner's) ultimate goal is to write an (exact) learning algorithm that learns $C_n$ with a minimum number of queries and optimal resources. That is,
1. Input: The learning algorithm receives the input $n$ and has access to an oracle $\mathrm{MQ}_f$ that answers membership/substitution queries for the target function $f$.
2. Query complexity: It asks the teacher a minimum number of membership/substitution queries.
3. Exact learning: It either learns (finds, outputs) $h\in C_n$ such that $h$ is logically equivalent to $f$, $h\equiv f$ (proper learning), or learns $h\in H_n\supseteq C_n$ such that $h\equiv f$ (non-proper learning from $H_n$).
4. Resources complexity: It runs in linear/polynomial/optimal time complexity, optimal space complexity, an optimal number of random bits and/or other optimal resources.
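The four requirements above can be illustrated with a deliberately naive Python sketch. Everything in it is an assumption made for illustration: the class is given as an explicit dictionary of named hypotheses, the domain is enumerated in full, and the names `MembershipOracle`, `exact_learn` and `var_class` are hypothetical. The learner is a proper exact learner (it outputs a member of the class equivalent to the target), but it makes no attempt at optimal query or time complexity.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Assignment = Tuple[int, ...]
Concept = Callable[[Assignment], int]


class MembershipOracle:
    """Hypothetical teacher: answers f(x) for the hidden target and counts queries."""

    def __init__(self, target: Concept) -> None:
        self._target = target
        self.queries = 0

    def __call__(self, x: Assignment) -> int:
        self.queries += 1
        return self._target(x)


def exact_learn(concept_class: Dict[str, Concept],
                domain: Iterable[Assignment],
                mq: MembershipOracle) -> str:
    """Return the name of a hypothesis equivalent to the target.

    Assumes the target belongs to `concept_class`. A point is queried only
    when the surviving hypotheses disagree on it.
    """
    alive = dict(concept_class)
    for x in domain:
        if len({h(x) for h in alive.values()}) > 1:
            answer = mq(x)
            alive = {name: h for name, h in alive.items() if h(x) == answer}
    # All survivors now agree on the whole domain, so any of them will do.
    return sorted(alive)[0]


def var_class(n: int) -> Dict[str, Concept]:
    """Illustrative class: the n projection functions x1, ..., xn."""
    return {f"x{i + 1}": (lambda a, i=i: a[i]) for i in range(n)}
```

For example, with `cls = var_class(3)`, `domain = list(product((0, 1), repeat=3))` and `mq = MembershipOracle(cls["x2"])`, the call `exact_learn(cls, domain, mq)` identifies `x2`, and `mq.queries` records how many membership queries were asked.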
The following decision problems are also considered in the literature:
1. Equivalence test: Given two teachers, each holding a function from $C$. Test whether the two functions are equivalent.
2. Identity test from $H$: Given a teacher that has a function $f$ from $C$. Given a function $h\in H$. Test whether $f\equiv h$.
3. Zero test: Given a teacher that has a function $f$ from $C$. Test whether $f\equiv 0$.
The number of queries (query complexity) and the resource complexities are expressed as functions of $n$ and some other parameters that depend on the class being learned. In the literature, there are many other variations of the above problems, and we will mention some of them in this survey.
This problem has different names in different areas: Conditional and Unconditional Tests [208], Combinatorial Search [177], Interpolation [75], Combinatorial Group Testing [113], Exact Learning from Membership Queries [2], Inferring [152], Identifying [146], Test Recognition [138], Active Learning [243], Reconstruction [195] and Guessing Game [273]. The decision problems are also called Testing, Functional Verification, Teaching, Hitting Set, and, when $f$ is a polynomial, Black Box Polynomial Identity Testing (PIT) [239, 255].
There are many other learning models, but, throughout this survey, when we say exact learning or learning we mean exact learning from membership queries or substitution queries only.
In this survey, we present some of the results known from the literature, different techniques used and some open problems.
1.2 Domain and Range
Throughout this survey, we will omit the subscript $n$ from $X_n$ and $R_n$. In principle, the domain and the range can be any two sets, but since mathematical models can explain many natural phenomena, most of the sets considered in the literature are either finite or have some algebraic structure such as rings, fields, integers and real numbers.
Therefore, the domains and ranges considered in the literature are the following. The Boolean set, which can be either $\{0,1\}$, $\{-1,+1\}$, or the binary field $\mathbb{F}_2$. The finite discrete set, which can be any finite set or a finite set with some algebraic structure such as the ring of integers modulo $m$, $\mathbb{Z}_m$, or the finite field $\mathbb{F}_q$ with $q$ elements ($q$ is a power of a prime). The infinite discrete set, which can be any countably infinite set such as the set of integers $\mathbb{Z}$ or the set of rational numbers $\mathbb{Q}$. The infinite (uncountable) set, which can be any set with some algebraic structure such as the real numbers $\mathbb{R}$ or the complex numbers $\mathbb{C}$. Also, the Cartesian product of any finite number of the above sets is considered in the literature.
1.3 Classes of Functions
In this section, we will list the most studied classes in the literature, in different fields of computer science.
Boolean Function Classes: When the range of the function is $\{0,1\}$ we call the function a Boolean function. Here we will consider classes of Boolean functions when the domain is $\{0,1\}^n$. For any class $C$ defined below, when we say that $f$ is $C$, we mean that $f\in C$. Abusing the terminology, every function is regarded both as a representation of the function (formula) and as a function, and we will use both interchangeably.
The most studied classes in the literature are:
1. Variable (Var): The class Var is the class of functions $x_i$, $i\in[n]$, where for $a\in\{0,1\}^n$, $x_i(a)=a_i$. We also define Lit $=\{x_1,\ldots,x_n,\bar{x}_1,\ldots,\bar{x}_n\}$, the class of literals. Here $\bar{x}_i$ is the logic negation of $x_i$.
Learning the class Var is equivalent to playing the Rényi-Ulam game [224, 227, 265].
2. $d$-Monotone Clause ($d$-MClause) and MClause: The class $d$-MClause is the class of all functions $f_I:\{0,1\}^n\to\{0,1\}$ where $I\subseteq[n]$ and $|I|\le d$ such that $f_I(a)=1$ if and only if $a_i=1$ for some $i\in I$. When $I=\emptyset$ then $f_I\equiv 0$. Such a function can also be expressed as a logic monotone clause $\vee_{i\in I}x_i$, where $\vee$ is the logic “or” function (disjunction). We denote $n$-MClause by MClause.
Learning $d$-MClause is equivalent to group testing [111, 113, 114]. See many other equivalent problems in [226] and references therein.
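3. $d$-Clause and Clause: The class $d$-Clause is the class of all functions $f_{I,J}:\{0,1\}^n\to\{0,1\}$ where $I,J\subseteq[n]$, $I\cap J=\emptyset$ and $|I|+|J|\le d$ such that $f_{I,J}(a)=1$ if and only if $a_i=1$ for some $i\in I$ or $a_j=0$ for some $j\in J$. Such a function can be expressed as a logic clause $\left(\vee_{i\in I}x_i\right)\vee\left(\vee_{j\in J}\bar{x}_j\right)$. We denote $n$-Clause by Clause.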
4. $d$-Monotone Term ($d$-MTerm), $d$-Term, MTerm and Term: The same as the above classes, but replace $\vee$ with the logic “and” function $\wedge$ (conjunction). The functions in MTerm are sometimes called monomials, and the class MTerm is also denoted by Monomial. That is, a monomial is a conjunction of variables, i.e., $\wedge_{i\in I}x_i$ where $I\subseteq[n]$. Here we will sometimes use the arithmetic of the field $\mathbb{F}_2$ for $\wedge$ and write $x_i\wedge x_j$ as $x_ix_j$.
For a class $C$, the dual class of $C$ is the class
$$C^D=\{\overline{f(\bar{x}_1,\ldots,\bar{x}_n)}\ :\ f\in C\}.$$
Obviously, $(C^D)^D=C$, $d$-Clause$^D$ = $d$-Term and $d$-MClause$^D$ = $d$-MTerm.
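5. $d$-XOR and XOR: The same as the $d$-Term class, but replace $\wedge$ with the logic exclusive or function $\oplus$. Here, we will instead use the arithmetic of the finite field $\mathbb{F}_2$. Since $\bar{x}_i=x_i+1$, every function in XOR is of the form $\sum_{i\in I}x_i+\xi$ where $I\subseteq[n]$ and $\xi\in\mathbb{F}_2$.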
6. $d$-Junta: Let $f:\{0,1\}^n\to\{0,1\}$. A variable $x_i$ is said to be relevant in $f$ if there are two assignments $a,b\in\{0,1\}^n$ such that $a_i\neq b_i$, for all $j\neq i$ we have $a_j=b_j$, and $f(a)\neq f(b)$. The class $d$-Junta is the class of all Boolean functions with at most $d$ relevant variables. Such a function can be represented by a truth table of size $2^d$ over the relevant variables.
7. $d$-MJunta: For two assignments $a,b\in\{0,1\}^n$ we write $a\le b$ if for every $i\in[n]$, $a_i\le b_i$. A Boolean function $f$ is monotone if for every two assignments $a,b\in\{0,1\}^n$, if $a\le b$ then $f(a)\le f(b)$. It is easy to see that monotone functions are closed under disjunction and conjunction. That is, if $f$ and $g$ are monotone functions then $f\vee g$ and $f\wedge g$ are monotone functions.
The class $d$-MJunta is the class of all monotone functions in $d$-Junta. That is, the class of all monotone functions with at most $d$ relevant variables.
8. Decision Tree (DT): One of the important representations of Boolean functions is the decision tree. A decision tree formula is defined as follows: The constant functions $0$ and $1$ are decision trees. If $f_0$ and $f_1$ are decision trees then, for all $i\in[n]$,
$$f=[\text{if } x_i=0 \text{ then } f_0 \text{ else } f_1]$$
is a decision tree (this can also be expressed as $x_if_1+\bar{x}_if_0$ or $x_if_1\vee\bar{x}_if_0$). Every decision tree $f$ can be represented as a tree $T(f)$. If $f\equiv 1$ or $0$ then $T(f)$ is a node labeled with $1$ or $0$, respectively. If $f=$ [if $x_i=0$ then $f_0$ else $f_1$], then $T(f)$ has a root labeled with $x_i$ and has two outgoing edges. The first edge is labeled with $0$ and is pointing to the root of $T(f_0)$ and the second is labeled with $1$ and is pointing to the root of $T(f_1)$. See Figure 2.
The depth of the decision tree $f$ is the depth of the tree $T(f)$, that is, the number of edges of the longest path from the root to a leaf in the tree. The size of the decision tree is the number of leaves in $T(f)$, that is, the number of nodes in $T(f)$ that are labeled with $0$ and $1$.
Every Boolean function can be represented as a DT. The representation is not unique. The following are subclasses of DT.
(a) Depth $d$ Size $s$ Decision Tree (DT$_{d,s}$): The class DT$_{d,s}$ is the class of all decision trees of depth at most $d$ and size at most $s$.
(b) Depth $d$ Decision Tree (DT$_d$): The class DT$_d$ is the class of all decision trees of depth at most $d$. That is, DT$_d$ = DT$_{d,2^d}$.
(c) Monotone DT (MDT$_{d,s}$, MDT$_d$): The functions in the above classes that are monotone.
(d) Decision List (DL) [228]: The functions $f\in$ DT where every internal node in $T(f)$ is pointing to at least one leaf.
(e) Depth $d$ Decision List ($d$-DL): A $d$-DL is a decision list of depth at most $d$.
Learning decision trees is equivalent to solving problems in databases, decision table programming, concrete complexity theory, switching theory, pattern recognition, and taxonomy [206], and computer vision [23].
9. Disjunctive Normal Form (DNF): A DNF is another important representation of a Boolean function. A DNF formula is a formula of the form
$$f=T_1\vee T_2\vee\cdots\vee T_s$$
where each $T_i\in$ Term is a term. The size of $f$ is $s$.
Every Boolean function can be represented as a DNF. The representation is not unique. It is easy to see that every decision tree of size $s$ can be represented as a DNF of size at most $s$.
The subclasses of DNF considered in the literature are:
(a) $r$-DNF: The class of DNFs with terms from $r$-Term.
(b) $s$-term DNF: The class of DNFs with at most $s$ terms.
(c) $s$-term $r$-DNF: The class of DNFs with at most $s$ terms, each of which is an $r$-Term.
(d) Read-Once $C$: Here $C$ is one of the above classes. Read-Once $C$ is the class of functions $f$ in $C$ where each variable appears at most once in $f$.
(e) Read-Twice, Read-Thrice, Read-$k$ $C$: The class of functions $f$ in $C$ where each variable appears at most twice (resp. three times and $k$ times) in $f$.
10. Monotone DNF (MDNF): The class MDNF is the class of DNFs with monotone terms (i.e., terms in MTerm). Every monotone function (see the definition in item 7) has a monotone DNF representation. This representation is one of the most popular canonical structures for representing Boolean functions. If $f=M_1\vee M_2\vee\cdots\vee M_s$ where each $M_i$ is a monomial and no two monomials $M_i$ and $M_j$, $i\neq j$, satisfy $M_i\Rightarrow M_j$, then we say that $f$ is a reduced monotone DNF. Every monotone Boolean function has a unique representation as a reduced monotone DNF [1]. This representation is uniquely determined by the minterms of the function, that is, the assignments $a$ where $f(a)=1$ and flipping any entry of $a$ that is $1$ to $0$ changes the value of the function to zero. Each minterm of $f$ corresponds, one-to-one, to a monomial in the reduced monotone DNF representation of $f$. The following are subclasses of MDNF.
(a) $r$-MDNF: The class of MDNFs with monomials of size at most $r$. That is, terms from $r$-MTerm.
(b) MDNF: The class MDNF is $n$-MDNF.
(c) $s$-term MDNF: The class of MDNFs with at most $s$ monomials.
(d) $s$-term $r$-MDNF: The class of MDNFs with at most $s$ monomials of size at most $r$.
(e) Read-Once, Read-Twice, Read-Thrice, Read-$k$ $C$, where $C$ is one of the above classes: The class of functions $f$ in $C$ where each variable appears at most once (resp. twice, three times and $k$ times) in $f$.
Learning Monotone DNF and subclasses of Monotone DNF is equivalent to problems in computational biology that arise in whole-genome shotgun sequencing [11] and DNA physical mapping [144].
11. Conjunctive Normal Form (CNF): The class CNF is the dual class (see the definition in item 4) of DNF (where $\vee$ is replaced with $\wedge$ and vice versa). In a similar way as above we define the classes $r$-MCNF, MCNF, $s$-clause CNF, $s$-clause MCNF, $s$-clause $r$-CNF and $s$-clause $r$-MCNF.
12. CDNF: The class CDNF is the class of formulas of the form $(f,g)$ where $f$ is a DNF, $g$ is a CNF and $f\equiv g$. The size of $(f,g)$ is measured in terms of $s$ and $t$, where $f$ is an $s$-term DNF and $g$ is a $t$-clause CNF.
The following are subclasses of CDNF:
(a) CDNF$_{s,t}$: The class of formulas $(f,g)$ in CDNF where $f$ is a DNF of size at most $s$ and $g$ is a CNF of size at most $t$.
(b) $r$-CDNF$_{s,t}$: The class of CDNF$_{s,t}$ where $f$ is an $r$-DNF of size at most $s$ and $g$ is an $r$-CNF of size at most $t$.
(c) -CDNF: The class CDNF$_{s,t}$.
(d) MCDNF: The class of Monotone CDNF.
(e) MCDNF$_{s,t}$: The class of Monotone CDNF$_{s,t}$.
(f) -MCDNF: The class of Monotone -CDNF.
Learning CDNF is equivalent to problems in data mining, graph theory, and reasoning and knowledge representation [118].
13. Boolean Multivariate Polynomial (BMP): The class BMP is the class of multivariate polynomials over the binary field $\mathbb{F}_2$. That is, a function of the form
$$f=M_1+M_2+\cdots+M_s$$
where each $M_i\in$ MTerm is a monomial. The size of $f$ is $s$.
Every Boolean function can be represented as a BMP. The representation is unique. It is easy to see that every decision tree of size and depth can be represented as BMP of size at most .
(a) $r$-BMP: The class of BMPs with monomials of size at most $r$, i.e., in $r$-MTerm. This class is also called the class of multivariate polynomials of degree $r$ over $\mathbb{F}_2$.
(b) $s$-monomial BMP: The class of BMPs with at most $s$ monomials. This class is also called the class of sparse multivariate polynomials over $\mathbb{F}_2$.
(c) $s$-monomial $r$-BMP: The class of BMPs with at most $s$ monomials of size at most $r$. This class is also called the class of sparse multivariate polynomials of degree $r$ over $\mathbb{F}_2$.
14. XOR of Terms (XT): The class XT is the class of XORs of terms $T_1+T_2+\cdots+T_s$, where $T_i\in$ Term.
(a) $r$-XT: The class of XTs with terms of size at most $r$.
(b) $s$-term XT: The class of XTs with at most $s$ terms.
(c) $s$-term $r$-XT: The class of XTs with at most $s$ terms of size at most $r$.
Notice that XT with terms from MTerm is BMP. Since every term of size $r$ can be represented as a $2^r$-monomial $r$-BMP, every $s$-term $r$-XT is an $(s2^r)$-monomial $r$-BMP.
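15. Deterministic Finite Automaton (DFA) [210]: A DFA is a $5$-tuple $A=(Q,\Sigma,\delta,q_0,F)$ that can also be represented as a directed graph $G_A$ with labeled edges, where $Q$ is a finite set of states (the vertices), and $q_0\in Q$ is the start state. $\Sigma$ is a finite set of symbols called the alphabet. $\delta$ is the transition function $\delta:Q\times\Sigma\to Q$. The edge $(q,q')$ in $G_A$ is labeled with $\sigma\in\Sigma$ if and only if $\delta(q,\sigma)=q'$. This transition function defines, for every string $w=w_1w_2\cdots w_m\in\Sigma^*$, a unique path $p_0,p_1,\ldots,p_m$ in the graph (here, $m$ is the number of symbols in $w$) that starts from $p_0=q_0$ and, for every $i\in[m]$, satisfies $p_i=\delta(p_{i-1},w_i)$. We denote the final state in this path by $\delta(q_0,w)$. The set $F\subseteq Q$ is the set of accept states.
Every DFA $A$ defines a Boolean function $f_A:\Sigma^*\to\{0,1\}$ where $f_A(w)=1$ if and only if $\delta(q_0,w)\in F$. When $\Sigma=\{0,1\}$, a DFA for the Boolean function $f:\{0,1\}^n\to\{0,1\}$ is a DFA $A$ such that for every $w\in\{0,1\}^n$ we have $f(w)=1$ if and only if $f_A(w)=1$.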
16. Boolean Multiplicity Automata Function (BMAF) [244]: A Boolean Multiplicity Automata Function is a function of the form:
[TABLE]
where each is matrix that its entries are Boolean univariate polynomials in over , i.e., for , and . The size of a BMAF is defined as .
See [44] for other ways to represent this class.
17. Boolean Halfspace (Perceptron, Threshold) (BHS): A Boolean Halfspace is a function of the following form:
$$f(x)=\begin{cases}1 & \text{if } w_1x_1+w_2x_2+\cdots+w_nx_n\ge u\\ 0 & \text{otherwise}\end{cases}$$
where $w_1,\ldots,w_n,u$ are real numbers. The constants $w_1,\ldots,w_n$ are called the weights of the Halfspace, and $u$ is called the threshold. For $W\subseteq\mathbb{R}$ we define
(a) BHS($W$): The class of Boolean Halfspaces with weights $w_i\in W$.
(b) $d$-BHS($W$): The class of functions in BHS($W$) with at most $d$ relevant variables.
18. Boolean Circuit (BC) and Boolean Formula (BF): A Boolean circuit over the set of variables $\{x_1,\ldots,x_n\}$ is a directed acyclic graph where every node with indegree zero is called an input gate and is labeled by either a variable $x_i$ or a Boolean constant from $\{0,1\}$. Every other gate is either a node with indegree one that is labeled $\neg$ (unary NOT) or a node with indegree two that is labeled by either $\wedge$ (binary AND) or $\vee$ (binary OR). A Boolean formula is a circuit in which every gate has outdegree one.
The size of a Boolean circuit is the number of gates in it, and its depth is the length of the longest directed path in it.
(a) Monotone Boolean Circuit (MBC) and Monotone Boolean Formula (MBF): MBC and MBF are Boolean circuits and Boolean formulas, respectively, with no $\neg$ gate.
(b) Read-Once Formula (ROF): The class of Boolean read-once formulas. A Boolean read-once formula is a formula such that every input variable appears in at most one input gate.
(c) Monotone Read-Once Formula (MROF): The class of monotone read-once formulas.
(d) Read-Once, Read-Twice, Read-Thrice, Read-$k$ $C$, where $C$ is one of the above classes: The class of functions $f$ in $C$ where each variable appears at most once (resp. twice, three times and $k$ times) in $f$.
See other classes in [1, 2, 17, 26, 33, 34, 58, 73, 92, 93, 110, 118, 132, 158, 251].
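As an illustration of the decision tree and DNF representations defined above, the following Python sketch evaluates a decision tree and converts it into a DNF with one term per leaf labeled 1, so the number of terms never exceeds the number of leaves. The encoding is an assumption made for this example: a leaf is the integer `0` or `1`, and an internal node is a tuple `(i, f0, f1)` read as "if variable `i` is 0 then `f0` else `f1`", with variables indexed from 0.

```python
from itertools import product
from typing import List, Tuple, Union

# A leaf is 0 or 1; an internal node is (variable index, subtree for 0, subtree for 1).
Tree = Union[int, Tuple[int, "Tree", "Tree"]]
Term = Tuple[Tuple[int, int], ...]  # literals (i, v), meaning "x_i must equal v"


def evaluate(tree: Tree, a: Tuple[int, ...]) -> int:
    """Follow the path selected by assignment `a` down to a leaf."""
    while not isinstance(tree, int):
        i, f0, f1 = tree
        tree = f1 if a[i] else f0
    return tree


def size(tree: Tree) -> int:
    """Number of leaves."""
    if isinstance(tree, int):
        return 1
    _, f0, f1 = tree
    return size(f0) + size(f1)


def depth(tree: Tree) -> int:
    """Number of edges on the longest root-to-leaf path."""
    if isinstance(tree, int):
        return 0
    _, f0, f1 = tree
    return 1 + max(depth(f0), depth(f1))


def to_dnf(tree: Tree, path: Term = ()) -> List[Term]:
    """One term per 1-leaf: the conjunction of the literals on its path."""
    if isinstance(tree, int):
        return [path] if tree == 1 else []
    i, f0, f1 = tree
    return to_dnf(f0, path + ((i, 0),)) + to_dnf(f1, path + ((i, 1),))


def evaluate_dnf(dnf: List[Term], a: Tuple[int, ...]) -> int:
    """A DNF is 1 iff some term has all of its literals satisfied."""
    return int(any(all(a[i] == v for i, v in term) for term in dnf))
```

For instance, the tree `(0, (1, 0, 1), 1)` has three leaves and depth 2, and `to_dnf` turns it into two terms that agree with the tree on every assignment.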
Here are relations between some of the classes mentioned above.
[TABLE]
For two classes and we write (written as in the above diagram) if every function in of size is equivalent to a function in of size .
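The correspondence between minterms and the reduced monotone DNF, described in the Monotone DNF item above, can be checked by brute force on small examples. The Python sketch below is an illustration under stated assumptions: the function is given as a Python callable on 0/1 tuples, it is assumed to be monotone, and the names `minterms` and `reduced_mdnf` are hypothetical. It enumerates all $2^n$ assignments, so it is only meant for small $n$ and is not a query-efficient learner.

```python
from itertools import product
from typing import Callable, FrozenSet, List, Tuple

Assignment = Tuple[int, ...]


def minterms(f: Callable[[Assignment], int], n: int) -> List[Assignment]:
    """Assignments a with f(a) = 1 such that turning any 1-entry into 0 gives f = 0."""
    result = []
    for a in product((0, 1), repeat=n):
        if f(a) != 1:
            continue
        if all(f(a[:i] + (0,) + a[i + 1:]) == 0 for i in range(n) if a[i] == 1):
            result.append(a)
    return result


def reduced_mdnf(f: Callable[[Assignment], int], n: int) -> List[FrozenSet[int]]:
    """Each minterm yields one monomial: the set of variable indices set to 1."""
    return [frozenset(i for i in range(n) if a[i] == 1) for a in minterms(f, n)]
```

For the monotone function $(x_0\wedge x_1)\vee x_2$ on three variables, the sketch recovers the two monomials $\{x_0,x_1\}$ and $\{x_2\}$.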
As for functions that are not Boolean, the literature is poor in studying the exact learnability of classes of functions with finite discrete domain or/and range from membership queries only. On the other hand, there is a substantial body of literature on learning and testing arithmetic classes.
We now give some of the arithmetic classes defined in the literature
Arithmetic Classes: Arithmetic classes represent function where is an algebraic structure such as field or ring. For exact learning, the most investigated arithmetic classes in the literature are
-Linear Functions (-LF), where is an integer, and is the set of real numbers. An -LF is a function of the form where and for all . The class -LF is the class -LF and LF is the class -LF.
Learning -LF is equivalent to coin weighing problem [37] and signature coding problem [50]. 2. 2.
-Quadratic Functions (-QF), where is an integer and . A -QF is a function of the form where and is a symmetric matrix with at most non-zero entries from . The class -QF is the class -QF.
Learning -QF is equivalent to problems in molecular biology [55]. 3. 3.
Multivariate Polynomial (MP): Let be a field. A multivariate polynomial over is a function of the form
[TABLE]
where , and . The size of is . The term is called monomial. The monomial is called -monomial if . The multivariate polynomial is said to be of degree if for all , -sparse if and with -monomials if all its monomials are -monomials.
When the field is finite then every function can be represented as a multivariate polynomial. This fact is not true for infinite fields. 4. 4.
Multiplicity Automata Function: A Multiplicity Automata Function (MAF) over the field is a function of the form
[TABLE]
where each is matrix that its entries are linear functions in (i.e., where ) and . The size of a MAF is .
This class contains the class MP in a sense that every MP of size has a MAF of size .
See [44] for other representations of MAF. 5. 5.
Arithmetic Circuit (AC) and Arithmetic Formula (AF): An arithmetic circuit over the field $\mathbb{F}$ and the set of variables $\{x_1,\ldots,x_n\}$ is a directed acyclic graph where every node with indegree zero is called an input gate and is labeled by either a variable or a field element. Every other gate is labeled by either $+$ or $\times$; in the first case it is a sum gate and in the second a product gate. An arithmetic formula is a circuit in which every gate has outdegree one.
The size of a circuit is the number of gates in it, and its depth is the length of the longest directed path in it. The degree of a circuit is equal to the degree of the polynomial output by the circuit.
Arithmetic Read-Once Formula (AROF). An arithmetic read-once formula is a formula such that every input variable appears in at most one input gate.
Here are relations between some of the classes we’ve defined
[TABLE]
See other classes in [43, 239, 240, 255] and references therein.
1.4 Learning Algorithms and Complexity
The learning algorithm can be sequential or parallel, deterministic or randomized, and adaptive (AD), $r$-round ($r$-RAD) or non-adaptive (NAD).
In an adaptive algorithm, the queries can depend on the answers to the previous ones. In a non-adaptive algorithm, the queries are independent of the previous answers; therefore, one can ask all the queries in one parallel step. We say that an adaptive algorithm is $r$-round adaptive ($r$-RAD) if it runs in $r$ stages where each stage is non-adaptive. That is, the queries may depend on the answers to the queries in the previous stages but are independent of the answers to the queries of the current stage.
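The difference between adaptive and non-adaptive querying can be seen on the class Var, where the target is a single hidden variable. The Python sketch below is an illustration, not an algorithm quoted from the survey; the function names and the 0-based indexing are assumptions. The adaptive learner chooses each query after seeing the previous answer (binary search), while the non-adaptive learner fixes all of its queries in advance (query $j$ sets to 1 exactly the positions whose index has bit $j$ equal to 1) and decodes the index from the answers.

```python
from typing import Callable, Tuple

Assignment = Tuple[int, ...]


def make_var_oracle(hidden: int) -> Callable[[Assignment], int]:
    """Hypothetical teacher for the target x_hidden (0-based index)."""
    def mq(a: Assignment) -> int:
        return a[hidden]
    return mq


def learn_var_adaptive(n: int, mq: Callable[[Assignment], int]) -> Tuple[int, int]:
    """Binary search: each query depends on the earlier answers."""
    lo, hi = 0, n  # invariant: the hidden index lies in [lo, hi)
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        a = tuple(1 if lo <= i < mid else 0 for i in range(n))
        queries += 1
        if mq(a):
            hi = mid
        else:
            lo = mid
    return lo, queries


def learn_var_nonadaptive(n: int, mq: Callable[[Assignment], int]) -> Tuple[int, int]:
    """All queries are fixed up front, so they could be asked in one parallel step."""
    bits = (n - 1).bit_length()
    batch = [tuple((i >> j) & 1 for i in range(n)) for j in range(bits)]
    answers = [mq(a) for a in batch]
    index = sum(ans << j for j, ans in enumerate(answers))
    return index, len(batch)
```

For this particular class both strategies need about $\log_2 n$ queries; for other classes the gap between adaptive and non-adaptive query complexity can be larger.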
The randomized algorithm can be either Monte Carlo (MC) or Las Vegas (LV). A Monte Carlo algorithm is a randomized algorithm whose running time is deterministic, but whose output may be incorrect with probability at most $\delta$. A Las Vegas algorithm is a randomized algorithm that always gives a correct hypothesis. That is, it always produces a hypothesis that is equivalent to the target function. The complexity of a Las Vegas algorithm is measured by the expected running time, the expected number of queries and the expected number of rounds.
The goal is to ask the minimum number of queries and minimize the running time and space complexity of the algorithm and/or other resources such as the number of processors (for parallel algorithms) or the number of random bits (for randomized algorithms).
1.5 Polynomially, Efficiently and Optimally Learnable
In this subsection and the next, we try to unify the different definitions used in the literature for the efficiency of the query complexity and time complexity of exact learning algorithms. We will use the following new terminology, defined below: “learnable”, “polynomially learnable”, “efficiently learnable”, “almost optimally learnable” and “optimally learnable”.
Let be a class of functions. Let be the minimum number of membership queries that a learner, with unlimited computational power, needs to learn with algorithms of type . The algorithm type, , can be adaptive (AD), non-adaptive (NAD) or -round (-RAD). For example, we will use for the adaptive algorithm and for the non-adaptive algorithm. When the algorithm is randomized we also add, as a subscript, MC for Monte Carlo algorithms and LV for Las Vegas algorithms.
In complexity theory, a polynomial time algorithm is an algorithm that runs in polynomial time in the input size. In the exact learning model, the time complexity of learning the class is, at least, the query complexity, , which can be exponential in the target function size. Therefore, polynomial time learning algorithm for will be defined as a learning algorithm that asks queries and runs in time , where is the size of the elements in the domain . Such classes are called polynomially learnable or just learnable classes. This is the definition used in the literature for learnability of classes.
Since the time complexity of any learning algorithm for is at least we may say that learning algorithms that run in time are “efficient algorithms” in time. However, this is not true for the query complexity. We will argue here, by the following example, that the above definition of for the query complexity is not the best definition for query-efficiency of exact learning from membership queries.
Take for example the class -MClause. We will show in Subsection 4.5 that . Therefore, one would expect that a query-efficient learning algorithm for this class asks queries and not queries as defined above. The time complexity cannot be less than , so the definition of in the time complexity is passable.
Therefore, we will suggest the following definition for efficient learning. If the algorithm for learning asks queries (rather than ) and runs in time , then we call the class efficiently learnable. (We will not use the term “polynomially learnable” for this case to avoid confusion with the definition in the literature.)
Another concern with this new definition is that in many areas, (such as combinatorial group testing and game theory) membership query is considered to be very costly. Therefore, one must find polynomial time learning algorithms that ask a minimum number of queries. Therefore, we will introduce here two other definitions: If there is a learning algorithm for that asks queries and runs in time , then we call the class almost optimally learnable. If there is a learning algorithm for that asks queries and runs in time , then we call the class optimally learnable.
In many cases, the query complexity is a function of several parameters that are related to the class . For example, the query complexity of -MClause also depends on . We say that the query complexity of a learning algorithm is optimal (resp. almost optimal, efficient or polynomial) in some parameter if assuming the other parameters are constant, the query complexity of the algorithm is optimal (resp. almost optimal, efficient or polynomial). So a learning algorithm for -MClause that asks queries is efficient, optimal in and efficient in .
We say that the class is query-polynomially (resp. query-efficiently, almost query-optimally or query-optimally) learnable in time if the number of queries is as above (for polynomially, efficiently, almost optimally and optimally, respectively) but the time complexity is .
We summarize all the above definitions in the following table:
Terminology
Query Complexity
Time Complexity
Polynomially Learnable
or Learnable
Efficiently Learnable
Almost Optimally Learnable
Optimally Learnable
Optimally Learnable in
when the other parameters are constant
Query-Optimally Learnable
in time
1.6 Strongly Polynomially, Efficiently and Optimally Learnable
Let be a class of functions. Suppose there is an integer parameter and classes such that . We say that is strongly -polynomially learnable if there is a learning algorithm for such that, for every target function , the algorithm runs in time and asks at most queries. In the same way as in the above subsection we define strongly -efficiently learnable, strongly almost -optimally learnable, strongly -optimally learnable and * learnable in time *.
For example, it is known that -MClause. Obviously, MClause-MClause. For MClause, let be the minimum integer such that -MClause. That is, is the number of relevant variables of . The class MClause is adaptively strongly -optimally learnable if there is an adaptive algorithm that with a target function that is MClause, the algorithm runs in time and asks queries. That is, the algorithm runs in time even when is not known to the learner.
Recall that, for a function $f:\{0,1\}^n\to\{0,1\}$, we say that $x_i$ is relevant in $f$ if there is an assignment $a\in\{0,1\}^n$ such that $f(a)\neq f(a\oplus e_i)$, where $e_1,\ldots,e_n$ is the standard basis and $\oplus$ is the bitwise exclusive or. That is, if $f$ depends on the variable $x_i$. We say that $x_i$ is irrelevant in $f$ if it is not relevant in $f$. For a class $C$, let $C_d$ be the set of all functions in $C$ that have at most $d$ relevant variables. Then $C_1\subseteq C_2\subseteq\cdots\subseteq C_n=C$. A strongly $d$-efficiently learnable class will be called strongly attribute-efficiently learnable. The same definition applies for attribute-polynomially, almost attribute-optimally, attribute-optimally and attribute learnable in time $T$.
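The definition of a relevant variable can be checked directly on small examples. The Python sketch below is an illustration only: it assumes the function is available as a Python callable on 0/1 tuples, uses 0-based variable indices, and the names `flip` and `relevant_variables` are hypothetical. It evaluates the function on all $2^n$ assignments, so it demonstrates the definition rather than a query-efficient procedure.

```python
from itertools import product
from typing import Callable, List, Tuple

Assignment = Tuple[int, ...]


def flip(a: Assignment, i: int) -> Assignment:
    """Return a XOR e_i, i.e. `a` with its i-th entry flipped."""
    return a[:i] + (1 - a[i],) + a[i + 1:]


def relevant_variables(f: Callable[[Assignment], int], n: int) -> List[int]:
    """Indices i for which some assignment a satisfies f(a) != f(a XOR e_i)."""
    return [
        i for i in range(n)
        if any(f(a) != f(flip(a, i)) for a in product((0, 1), repeat=n))
    ]
```

For example, the function $x_0\oplus(x_2\wedge x_3)$ on five variables has exactly three relevant variables, so it belongs to the subclass of functions with at most three relevant variables.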
The definition of “strongly attribute-efficient” in [62] is equivalent to our definition of “strongly attribute-optimally learnable in ”.
1.7 Testing Problems
The following problems are also considered in the literature
Equivalence Testing: Given two teachers where each one has a function from . The learner can ask each one membership queries. Test whether the two functions are equivalent. The minimum number of queries is denoted by .
For non-adaptive algorithms, this is equivalent to constructing a set of assignments $S\subseteq X$ such that for every $f,g\in C$ with $f\not\equiv g$, there is $a\in S$ such that $f(a)\neq g(a)$. Such a set is called an equivalent test set or universal identification sequence [146]. Obviously, for deterministic algorithms,
[TABLE]
Also, it is easy to show that.
[TABLE]
We give a proof sketch of the latter for completeness
Proof
Let be an adaptive algorithm that learns . We run . Each time it asks a membership query , we ask both teachers that membership query. Since learns , for some assignment we get different answers. Therefore
Let be an adaptive algorithm for identity testing the class . Let be the teacher with the target function . We run with and a dummy teacher that always gives the same answer as as long as there is a function that is consistent with on all the answers to the queries. Finally, no other function is consistent with and is uniquely determined. Therefore ∎
Identity Testing from and Teaching Dimension: Given a teacher that has a function from . Given a function . The learner has and can ask the teacher membership queries. Test whether .
A non-adaptive algorithm is equivalent to constructing, for every , a set of assignments such that for every , there is such that . Such a set is called an identity test set for with respect to .
Notice that if then an identity testing set for uniquely determines . If then an identity testing set for gives a proof that does not belong to . Therefore, we will also call this set membership test set.
The maximum, over all the functions , of the minimum size identity test set for is denoted by or . When the set is called a teaching set, and is denoted by or and is called the teaching dimension of the concept , [142, 151, 249].
We have . To show this fact, suppose is an algorithm that adaptively learns . Run the algorithm with the target . The set of all the queries asked is a teaching set for .
For studies of the teaching dimension, see [142, 151, 157, 181, 216, 233, 242, 247, 249] and the references therein.
Constant Testing: Given a teacher that has a function from . Test, with membership queries, whether is not a constant function. This is equivalent to constructing a set of assignments such that, for every non-constant function , there is such that . Such a set is called a constant test set for . denotes the minimum size constant test set for .
Zero Testing: Given a teacher that has a function from . Test whether with membership queries. This is equivalent to constructing a set of assignments such that, for every non-zero function , there is such that . Such a set is called a zero test set or a hitting set for . The minimum size hitting set for is denoted by . It is easy to show that
[TABLE]
and
[TABLE]
where and . For Boolean functions, and , where is the exclusive or operation.
See other results in the above references and the references therein.
2 Bounds on OPT for Boolean Functions and Algorithms
In this section, we give some bounds for for classes of Boolean functions and some exponential time algorithms that are query-efficient for any class .
2.1 OPT for Adaptive Algorithms
We first state the following information-theoretic lower bound for deterministic learning algorithms. Throughout the paper, we write for .
Lemma 1
Let be a class of Boolean functions. Any deterministic learning algorithm for must ask at least membership queries. That is
[TABLE]
In fact, the bound is also true for Monte Carlo and Las Vegas algorithms. See, for example, [12].
Lemma 2
Let be any class of Boolean functions. Any Monte Carlo (and therefore, Las Vegas) randomized learning algorithm that learns with probability at least must ask at least membership queries. That is
[TABLE]
We now give upper and lower bounds for using the following combinatorial measure that is defined in [158, 208]. Let be a class of Boolean functions . Let be any function. We say that a set is a specifying set for with respect to if
[TABLE]
That is, there is at most one concept in that is consistent with on . Denote the minimum size of a specifying set for with respect to . The extended teaching dimension of is
[TABLE]
Let be any function. For every , there is an assignment such that . Therefore is a specifying set for and . Therefore
[TABLE]
Notice that for every Boolean function if is a specifying set for and is a function that is consistent with on then adding an assignment to where gives an identity testing set for with respect to . Therefore,
[TABLE]
In [158, 208], Moshkov proves the following bounds. Here, we will give another proof that gives, asymptotically, the same upper bound for .
Lemma 3
[158, 208] Let be any class of Boolean functions. Then
[TABLE]
and
[TABLE]
Proof
The second inequality in (4) follows from (3).
Consider the following algorithm. After the th query, the algorithm defines a set of all the functions that are consistent with the membership queries that were asked so far. Consider any . Now the algorithm searches for an assignment such that
[TABLE]
If such exists, then it asks the membership queries . Define . Obviously, in that case,
[TABLE]
If no such exists, then the algorithm finds a specifying set for , where “Majority” is the majority function. It then asks membership queries for all the assignments in . If the answers are consistent with on , then there is a unique concept consistent with the answers, and the algorithm outputs this concept. Otherwise, there is such that . It is easy to see that in that case
[TABLE]
Denote by the number of times when the algorithm is left without such an . Then the number of rounds in which it does find such an is
[TABLE]
and so the number of queries is upper bounded by
[TABLE]
When is chosen so that the last term becomes [math], then is roughly . In case of this , (5) becomes
[TABLE]
In [158, 208], Moshkov gives, for any two integers and , an example of a class where
[TABLE]
So the upper bound in the above lemma is the best possible.
See also the other dimensions, bounds, and techniques in [1, 2, 4, 19, 156, 213].
2.2 Constructing Adaptive Algorithms
Given a class of Boolean functions as an input, can one construct an algorithm that learns with queries? Obviously, with unlimited computational power, this can be done, so the question is: How close to can one get when polynomial time (or any other time) is allowed for the construction?
An exponential time algorithm follows from the following
[TABLE]
where . See also [18, 135]. Can one do it in time? Hyafil and Rivest, [168], show that the problem of finding is NP-Complete. Laber and Nogueira, [202], show that this problem does not admit an -approximation unless P=NP. The reduction of Laber and Nogueira, [202], of set cover to this problem with the inapproximability result of Dinur and Steurer [121] for set cover implies that it cannot be approximated to in polynomial time unless P=NP.
The query complexity of the algorithm in Lemma 3 is within a factor of
[TABLE]
from OPT. However, unfortunately, the problem of finding a minimum size specifying set for is NP-Hard, [9, 142, 249].
In [23], Arkin et al. give the following algorithm: Let be the set of functions that are consistent with the answers to the first membership queries. The th membership query of the algorithm is an assignment that partitions the set as evenly as possible, that is, an assignment that maximizes . Arkin et al. show in [23] that the query complexity of this algorithm is within a factor of from for some . Moshkov in [209] gives the exact ratio of . This algorithm runs in time . In particular, we have the following result. Here, we will give a very simple proof.
Lemma 4
[23]. There is a learning algorithm that runs in time and learns with at most
[TABLE]
queries.
In particular, any class is adaptively query-efficiently learnable in time poly.
Proof
First, we have
[TABLE]
where . This follows from the fact that every membership query can eliminate at most functions from . Therefore, there is such that
[TABLE]
Thus, using as the first query, eliminates at least functions from . After the th query the number of functions that remain (consistent with the answers to the membership queries) is at most
[TABLE]
So the number of queries of the algorithm is at most . By Lemma 1, .∎
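The greedy splitting strategy behind Lemma 4 can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: concepts are given explicitly as Python callables, the domain is a finite list, and the "as evenly as possible" criterion is read as maximizing the size of the smaller of the two answer classes. The names `halving_learner`, `version_space` and `balance` are hypothetical and do not come from [23].

```python
def halving_learner(concepts, domain, oracle):
    """Greedy splitting sketch: ask the query that splits the consistent set most evenly.

    concepts: list of callables mapping a domain element to 0 or 1 (explicit class).
    domain:   list of all possible queries.
    oracle:   the teacher, a callable answering membership queries for the target.
    Returns (a concept consistent with all answers, number of queries asked).
    """
    version_space = list(concepts)  # functions consistent with the answers so far
    queries = 0
    while len(version_space) > 1:
        def balance(a):
            # Size of the smaller answer class if we query a (our reading of
            # "partitions the set as evenly as possible").
            ones = sum(1 for f in version_space if f(a) == 1)
            return min(ones, len(version_space) - ones)

        a = max(domain, key=balance)
        if balance(a) == 0:
            break  # the remaining functions agree on the whole domain
        answer = oracle(a)
        queries += 1
        version_space = [f for f in version_space if f(a) == answer]
    return version_space[0], queries


# Usage: nine threshold functions f_t(x) = [x >= t] over the domain {0, ..., 7}.
domain = list(range(8))
concepts = [(lambda x, t=t: int(x >= t)) for t in range(9)]
hypothesis, used = halving_learner(concepts, domain, concepts[5])
print(used, [hypothesis(x) for x in domain])
```

For this toy class the strategy behaves like binary search, so the number of queries stays close to the information-theoretic bound of Lemma 1.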
Bshouty et al., [47], show that using the NP-oracle all Boolean classes are efficiently learnable in randomized expected polynomial time. See other results in [47].
2.3 OPT for Non-Adaptive Algorithms
In this subsection, we give some bounds for non-adaptive learning of classes of Boolean functions. The following result follows from (1) and (2).
Lemma 5
We have
[TABLE]
 is equal to half the minimum size equivalent test set for .
The set is an equivalent test set for if and only if is a hitting set for . That is,
[TABLE]
We now prove
Lemma 6
[142, 176]. There is a learning algorithm that runs in time and finds a hitting set for of size at most
[TABLE]
In particular,
There is a non-adaptive learning algorithm that runs in time and learns using at most
[TABLE]
queries.
Any class is non-adaptively query-efficiently learnable in time poly.
Proof
Define for every the set of all the functions such that . Now the hitting set problem is equivalent to the set cover problem, i.e., find the minimal number of elements such that . It is known that the greedy algorithm that at each stage, chooses the set that contains the largest number of uncovered elements, achieves an approximation ratio of , [84].
Now (8) follows from (7), Lemma 1 and 3 in Lemma 5.∎
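The greedy set-cover step used in the proof can be sketched as follows. This is a minimal sketch, assuming the class is given explicitly as a list of callables over a finite domain; the function name `greedy_hitting_set` is hypothetical.

```python
from itertools import product


def greedy_hitting_set(functions, domain):
    """Greedy set cover: repeatedly pick the assignment that hits the most un-hit functions.

    An assignment a "hits" f when f(a) != 0. Raises ValueError if some function
    is identically zero on the domain, since no hitting set exists in that case.
    """
    unhit = set(range(len(functions)))
    hitting = []
    while unhit:
        best = max(domain, key=lambda a: sum(1 for i in unhit if functions[i](a) != 0))
        newly_hit = {i for i in unhit if functions[i](best) != 0}
        if not newly_hit:
            raise ValueError("some function is identically zero on the domain")
        hitting.append(best)
        unhit -= newly_hit
    return hitting


# Usage: the three single-variable clauses x0, x1, x2 over {0,1}^3.
domain = list(product([0, 1], repeat=3))
clauses = [(lambda a, i=i: a[i]) for i in range(3)]
print(greedy_hitting_set(clauses, domain))
```

The running time of this sketch is polynomial in the number of functions and the domain size, which matches the explicit-input setting of the lemma but not a succinctly represented class.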
The above reduction shows that the problem of finding a small hitting set is equivalent to finding a small set cover, and therefore, the minimum hitting set problem cannot be approximated in polynomial time to within a factor of less than , [24, 121, 129, 234].
A hitting set for is also a hitting set for , except possibly for one function. This follows from the fact that if is a hitting set for then for every two distinct functions there is such that and therefore hits one of them. This implies that there is no learning algorithm that runs in time and non-adaptively learns with less than unless P=NP.
2.4 OPT for Classes of Small VC-dimension
We have seen in the previous subsection that the query complexity of non-adaptively learning the class is equal to the minimum size of a hitting set for , and that finding a small hitting set is equivalent to finding a small set cover. We now give another way to construct a small hitting set for classes with small Vapnik-Chervonenkis (VC) dimension. We first define the VC-dimension of a class .
For a class and a set , we say that is shattered by if for any there is a function such that for all we have and for all we have . The VC-dimension of a class , , is the maximum integer such that there is a set of size that is shattered by .
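For small explicit classes, the definition of shattering can be checked by brute force. The sketch below assumes concepts are 0/1-valued callables over a finite domain; the names `is_shattered` and `vc_dimension` are illustrative only, and the search is exponential in the domain size.

```python
from itertools import combinations, product


def is_shattered(points, concepts):
    """True if every 0/1 labelling of `points` is realized by some concept."""
    patterns = {tuple(f(x) for x in points) for f in concepts}
    return len(patterns) == 2 ** len(points)


def vc_dimension(domain, concepts):
    """Largest size of a shattered subset of the domain (brute force)."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(S, concepts) for S in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so no larger set can work
    return best


# Usage: threshold functions f_t(x) = [x >= t] over {0, ..., 7}.
domain = list(range(8))
thresholds = [(lambda x, t=t: int(x >= t)) for t in range(9)]
print(vc_dimension(domain, thresholds))
```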
Another way to construct a hitting set for is by choosing a distribution on the domain and then repeatedly choosing elements according to the distribution until we get a hitting set. This approach can also be used to prove an upper bound for the hitting set size.
Lemma 7
Suppose one can define a distribution over such that for every , . Then a set of
[TABLE]
elements chosen according to the distribution is a hitting set for with probability at least .
In particular,
[TABLE]
Proof
The result follows from the ε-net theorem [169]. See also Chapter 13 in [28]. ∎
In particular, this also gives the following upper bound for
Lemma 8
Suppose one can define a distribution over such that for every we have . Then a set of elements chosen according to the distribution is an equivalent test set for with probability at least .
In particular,
[TABLE]
Brönnimann and Goodrich, [51], show that there is an algorithm that runs in time and finds a hitting set for of size at most
[TABLE]
See also [128].
3 Reductions
In this section, we give some reductions that change an existing algorithm to an algorithm with a better query complexity and an algorithm in another learning model to an algorithm that learns from membership queries. We will show in the sequel how to apply those reductions.
3.1 Reductions for Adaptive Algorithms
In this subsection, we show some reductions from one exact adaptive learning algorithm to another one. Those reductions change the query complexity to be optimal in some of the parameters of the class.
For a class of functions we say that is projection closed if for any , and we have . That is, projecting any variable to any value keeps the function in the class . We say that the class is embedding closed if for any and any we have . We note here that almost all the classes considered in the literature are projection and embedding closed.
Here we use the parameter for the number of relevant variables. For class , the class contains all the functions in with at most relevant variables. All the classes below are projection closed. It is easy to show that for projection closed class , and contains at least literals. Therefore, .
In [62], Blum et al. show
Lemma 9
[62] Let be a class that is projection closed. If is adaptively learnable in time with queries and there is a constant testing set for of size that can be constructed in time then is adaptively learnable in time with
[TABLE]
queries.
In particular, if (when the other parameters of the class are constants), then is optimally learnable in .
The results in [57] show that if the class is also embedding closed then . Therefore, with Lemma 9, we get the query complexity
[TABLE]
When the class is also embedding closed, Bshouty and Hellerstein, [57], show
Lemma 10
[41, 57] Let be a projection and embedding closed class. If is adaptively learnable in time with queries, then is adaptively learnable in time with
[TABLE]
queries.
In particular, if is learnable then is optimally learnable in .
For the randomized algorithm, we have
Lemma 11
[57] Let be a projection and embedding closed class. If is adaptively learnable with a Monte Carlo algorithm in time with queries then is adaptively learnable with a Monte Carlo algorithm in time with
[TABLE]
queries.
3.2 Reductions for Strong Adaptive Algorithms
In some cases, one can achieve strong learning. The following result is from [62]. See the definition of strong learning in Subsection 1.6.
Lemma 12
[62] Let be a class that is projection closed. If is adaptively learnable in time with queries and there is a constant testing set for of size that can be constructed in time then is adaptively strongly-attribute learnable in time with
[TABLE]
queries where is the number of relevant variables of the target.
In particular, if (when the other parameters of the class are constants), then is strongly attribute-optimally learnable in .
Since (the all zero and all one assignments) is a constant testing set for all the monotone functions, , in particular, we have
Lemma 13
[62] Let be a projection closed class that contains monotone functions. If is adaptively learnable in time with queries, then is adaptively strongly attribute learnable in time with
[TABLE]
queries.
In particular, is strongly attribute-optimally learnable in .
So with the above results, one can change algorithms that learn to algorithms that optimally and strongly attribute-optimally learn in .
3.3 Reductions for Non-Adaptive Algorithms
In this subsection, we give a reduction from one exact non-adaptive learning algorithm to another one that changes the query complexity to be optimal in some of the parameters.
In [13], Abasi et al. gave the following reduction. We sketch the proof for completeness.
Lemma 14
Let be an embedding closed class such that . Let be a class of Boolean functions and suppose there is an algorithm that, for input , finds the relevant variables of in time .
If is non-adaptively learnable from (the output hypothesis is from ) in time with membership queries then is non-adaptively learnable from in time with
[TABLE]
membership queries where is any integer.
In particular, is non-adaptively optimally learnable in .
Proof
We use a perfect hash family that maps the variables to a new set of variables where . This family contains hash functions and ensures that for almost all the hash functions, different relevant variables of the target are mapped to different variables in . It also ensures that for every non-relevant variable , almost all the hash functions map the relevant variables and to different variables in .
We learn for each hash function with membership queries (when possible) and use the majority rule to find the relevant variables and to recover the target function.∎
See also [49] for a reduction for the randomized non-adaptive learning.
3.4 Reductions from the Exact Learning Model
In this subsection and the next one, we give two other models and show some conditions in which learning in those models can be reduced to learning from membership queries only.
The first learning model is the exact learning model from membership and equivalence queries. In this model, the goal is to learn the target function exactly with membership queries and equivalence queries. In the Equivalence Query (EQ) model, [1], the learning algorithm sends the teacher a hypothesis from some class of hypotheses . The teacher answers “YES” if is equivalent to the target ; otherwise, it provides the learner with a counterexample, i.e., an assignment where . We say that a class is exactly learnable from from membership and equivalence queries in time , membership queries and equivalence queries if there is an algorithm that runs in time , asks at most membership queries and at most equivalence queries and outputs that is equivalent to .
There are several polynomial time exact learning algorithms available in the literature that learn from membership and equivalence queries for classes mentioned in this survey and others. These include classes such as Monotone DNF [2], DFA [3], Conjunction of Horn Clauses [17], -term DNF [33, 34, 74, 182], read-twice DNF [26], CDNF [33], decision trees [33, 44], Boolean multivariate polynomial [251, 44], multiplicity automata [44], read-once formula [58] and geometric objects [35]. See also references therein and [253]. Some of the algorithms are proper (i.e., ) and others are non-proper.
The reduction from this model to the model of learning from membership queries only is done as follows.
Lemma 15
Let and be two classes of functions. Suppose is learnable from from membership and equivalence queries in time , membership queries and equivalence queries. Then
If for every an identity testing set for with respect to can be constructed in time and then is learnable in time with membership queries.
If an equivalent test set for can be constructed in time and then is learnable in time with membership queries.
Proof
In 1, the algorithm replaces each equivalence query for with membership queries to all the assignments in . If is consistent with on then and the algorithm outputs . Otherwise, there is such that and can be used as an answer to the equivalence query.
In 2, the algorithm first constructs and asks membership queries to all the assignments in . Then for each equivalence query with , it finds such that and returns the answer to the algorithm. If no counterexample exists, then it outputs .∎
A hardness result in learning a class from equivalence and membership queries does not imply hardness in learning from membership queries only. For example, the hardness result of learning read-thrice DNF, , in [21] cannot be used as a hardness result for this class in the exact learning model from membership queries. This is because, in our definition of efficiency, the query complexity is allowed to be polynomial in which might be exponential in . For example, it is easy to see that read-thrice DNF (even for one term) so this class is optimally learnable (by asking all the queries ). So hardness in learning in this model does not imply hardness in learning from membership queries. On the other hand, hardness results for proper learning, [10], do give hardness results for proper learning from membership queries.
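Case 2 of the proof can be illustrated on a toy class. The sketch below is not taken from the cited works: it assumes the class of all monotone clauses over n variables, uses the n unit vectors as an equivalent test set (two distinct monotone clauses differ on some unit vector), and plugs in a simple equivalence-query learner that starts from the clause over all variables and drops variables on each counterexample. All function names are hypothetical.

```python
def eq_from_test_set(test_set, membership_query):
    """Simulate an equivalence oracle using membership queries on a fixed test set."""
    # All membership queries are asked up front, as in case 2 of the proof.
    answers = {a: membership_query(a) for a in test_set}

    def equivalence_query(h):
        for a, value in answers.items():
            if h(a) != value:
                return a  # counterexample
        return None  # plays the role of the answer "YES"

    return equivalence_query


def learn_monotone_clause_with_eq(n, equivalence_query):
    """Toy EQ learner for monotone clauses; returns the set of variable indices."""
    variables = set(range(n))  # hypothesis: the OR of all variables
    while True:
        h = lambda a, vs=frozenset(variables): int(any(a[i] for i in vs))
        cex = equivalence_query(h)
        if cex is None:
            return variables
        # The hypothesis always contains the target's variables, so cex satisfies
        # h(cex) = 1 and f(cex) = 0; every variable set to 1 in cex is not in the target.
        variables -= {i for i in range(n) if cex[i] == 1}


# Usage: target x0 OR x2 over four variables.
n = 4
target = lambda a: int(a[0] or a[2])
unit_vectors = [tuple(1 if j == i else 0 for j in range(n)) for i in range(n)]
eq = eq_from_test_set(unit_vectors, target)
print(learn_monotone_clause_with_eq(n, eq))
```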
3.5 Reductions from the PAC Learning Model
In this subsection, we provide a reduction from PAC-learning with membership queries to learning from membership queries only.
In the probably approximately correct learning model (PAC learning model), [268], with membership queries the teacher has a function from some class . The learner can ask the teacher membership queries and is required to learn a function that is -close (defined below) to the target with high probability. Let be a distribution on the domain . We say that an algorithm PAC-learns from with membership queries according to the distribution if the algorithm , for the input and , with probability at least , outputs a function that is -close to , i.e., .
There are many PAC learning algorithms with membership queries in the literature for classes mentioned in this survey and others. These include classes such as decision trees and multivariate polynomials under distributions that support small terms, [68], DNF under the uniform distribution, [130, 172], constant depth circuits under the uniform distribution, [201] and intersections of halfspaces, [189]. If a class is learnable from equivalence and membership queries, then it is PAC-learnable with membership queries according to any distribution, [2].
We say that is a class of distance with respect to the distribution if for every , we have . The following lemma shows how to change a PAC-learning algorithm to a learning algorithm with membership queries.
Lemma 16
Let and be two classes of functions. Suppose is PAC-learnable with membership queries from according to the distribution in time and membership queries. If is a class of distance with respect to the distribution , then is randomized Monte Carlo learnable with membership queries in time .
Proof
Run the algorithm with . Let be the output. Then with probability at least we have . Since we also have and since is a class of distance we must have .∎
4 Learning -MClause and Group Testing
In this section, we give an example of a class that has been extensively studied in the literature. Consider the class -MClause. That is, the class of monotone clauses with at most variables.
4.1 Group Testing and Applications
In group testing (or pooling design), the task is to determine the positive members , , of a set of objects by asking queries of the form “does the subset contain a positive object?” That is, “does ?”. A negative answer to this question informs the learner that all the items belonging to are non-positive. The aim of group testing is to identify the unknown subset using as few queries as possible.
Group testing was originally introduced as a potential approach to economical mass blood testing [111]. However, it has proven to be applicable in a variety of problems, including quality control in product testing [248], searching files in storage systems [190], sequential screening of experimental variables [197], efficient contention resolution algorithms for multiple-access communication [190, 274], data compression [166], and computation in the data stream model [104]. See a brief history and other applications in [85, 113, 114, 217] and references therein.
Group testing is equivalent to learning the class -MClause from membership queries. We have ; the target is and each query “does ?” is equivalent to a membership query with the assignment where if and only if .
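The correspondence between pool queries and membership queries can be made concrete with a short sketch. The helper names `pool_to_assignment`, `monotone_clause` and `group_test` are hypothetical, and items are indexed from 0 for convenience.

```python
def pool_to_assignment(pool, n):
    """Characteristic vector of the pool: entry i is 1 exactly when item i is in the pool."""
    return tuple(1 if i in pool else 0 for i in range(n))


def monotone_clause(positives):
    """The monotone clause (OR) over the variables indexed by `positives`."""
    return lambda a: int(any(a[i] for i in positives))


def group_test(pool, positives):
    """Direct group-testing answer: does the pool contain a positive item?"""
    return int(len(set(pool) & set(positives)) > 0)


# Usage: items {0, ..., 5}, positives {1, 4}; a pool query equals a membership query.
n, positives = 6, {1, 4}
f = monotone_clause(positives)
pool = {0, 4, 5}
print(group_test(pool, positives), f(pool_to_assignment(pool, n)))
```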
4.2 Known Results for Learning -MClause
The following table summarizes the results known for the asymptotic number of membership queries for learning the class -MClause. We will assume that .
[TABLE]
There is a (folklore) deterministic adaptive algorithm that runs in polynomial time and asks queries. See Subsection 4.5. This implies all the results for the upper bounds in the fourth column of the above table. The number of functions in -MClause is
[TABLE]
and therefore by Lemma 2, any Monte Carlo learning algorithm for -MClause must ask at least queries. This implies all the lower bounds in the table except for the non-adaptive lower bound for the deterministic and Las Vegas algorithm that follows from [120]. See Subsection 4.4. The upper bound of
[TABLE]
for the Monte Carlo non-adaptive algorithm follows from a simple randomized argument. See Subsection 4.4. Porat and Rothschild, [226], gave the first polynomial time deterministic non-adaptive learning algorithm that asks queries. The deterministic upper bound for non-adaptive learning follows from a simple probabilistic argument. See also [123, 124]. The last two results are the deterministic two-round algorithms with queries and the polynomial time deterministic two-round learning algorithm that asks queries. This follows from [56, 81, 170]. See Subsection 4.6.
In particular,
Theorem 4.1
We have
The class -MClause is adaptively optimally learnable.
The class -MClause is non-adaptively almost optimally learnable and optimally learnable in .
The class -MClause is MC randomized adaptively optimally learnable.
The class -MClause is two-round efficiently learnable, two-round optimally learnable in and two-round optimally learnable in .
4.3 Bounds for OPT(-MClause)
Consider a non-adaptive algorithm for learning -MClause and let be the queries asked by the algorithm. Consider the matrix such that its th row is . Let be the th column of . If the target function is , then the vector of all the answers to the queries is (bitwise or). Let be the set of all , for . That is, the set of all possible answers to the queries for all possible target functions. Since each vector in uniquely determines the target function, the matrix must satisfy the following property: For every , where and , we have
[TABLE]
A matrix that satisfies this property is called a -separable matrix. Therefore, -MClause is equal to the minimum such that a -separable matrix exists.
A matrix is called -disjunct if for every distinct columns there is a row such that and for all . It is easy to show that, [190],
[TABLE]
Therefore, it is enough to construct a -disjunct matrix. Using a probabilistic method, [28], it is easy to show that a -disjunct matrix exists where . Just take each entry in the matrix to be 1 with probability and 0 with probability and show that for the probability that the matrix is not -disjunct is less than . This implies
[TABLE]
See also [123, 124, 219]. There is also an almost tight lower bound for [120, 131, 230]
[TABLE]
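For small matrices, the disjunctness property defined above can be verified by brute force. The sketch below reads the definition in the standard way: for every column j and every d other columns, some row has a 1 in column j and 0 in all d others. The function name `is_disjunct` is hypothetical, and the check takes time exponential in d.

```python
from itertools import combinations


def is_disjunct(M, d):
    """Brute-force check that the 0/1 matrix M (list of rows) is d-disjunct."""
    n = len(M[0])
    for j in range(n):
        others = [c for c in range(n) if c != j]
        for group in combinations(others, d):
            # Need a row with 1 in column j and 0 in every column of the group.
            if not any(row[j] == 1 and all(row[c] == 0 for c in group) for row in M):
                return False
    return True


# Usage: columns are the six weight-2 vectors of length 4. No column is
# contained in another, so the matrix is 1-disjunct, but it is not 2-disjunct.
pairs = list(combinations(range(4), 2))
M = [[1 if r in p else 0 for p in pairs] for r in range(4)]
print(is_disjunct(M, 1), is_disjunct(M, 2))
```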
4.4 Non-Adaptive Learning -MClause
We now show that the class -MClause is non-adaptively learnable in polynomial time with queries. In [226], Porat and Rothschild gave the first polynomial time algorithm for constructing a -disjunct matrix of size where . Now the learning algorithm is as follows. We first use the Porat and Rothschild [226] algorithm to construct a -disjunct matrix of size in polynomial time. Set . Then for every query (row in ) if then for every where , we remove from . The remaining variables in are the variables that appear in the target. This is because if and then the variable will be removed from by the row that assigns 0 to all , and one to . Such a row exists since is -disjunct. On the other hand, for every row where for some we have and, therefore, no variable in the target is removed from . This gives a polynomial time algorithm that asks queries. Since the lower bound for the query complexity of non-adaptively learning the class -MClause is , the class -MClause is non-adaptively almost optimally learnable and non-adaptively optimally learnable in .
Closing the gap between the lower bound and upper bound is one of the longstanding open problems in group testing. Bshouty proved in [38, 39] that a lower bound of implies that for a power of prime one cannot simulate a black-box multiplication of elements in the finite field with black-box multiplications in . This is one of the hard problems in algebraic complexity.
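The elimination step of this non-adaptive algorithm can be sketched as follows. The sketch does not implement the Porat and Rothschild construction; as a stand-in it uses a tiny hand-built 1-disjunct matrix (columns are the weight-2 vectors of length 4), which is an assumption made only for illustration. The function name `decode_nonadaptive` is hypothetical.

```python
from itertools import combinations


def decode_nonadaptive(M, answers):
    """Remove every variable that is set to 1 in some query answered 0."""
    n = len(M[0])
    candidates = set(range(n))
    for row, answer in zip(M, answers):
        if answer == 0:
            candidates -= {j for j in range(n) if row[j] == 1}
    return candidates


# Stand-in 1-disjunct matrix with 4 rows (queries) and 6 columns (variables).
pairs = list(combinations(range(4), 2))
M = [[1 if r in p else 0 for p in pairs] for r in range(4)]

# Usage: the target is the single-variable clause x_3; all queries are fixed in advance.
target = lambda a: int(a[3])
answers = [target(row) for row in M]
print(decode_nonadaptive(M, answers))
```

With a d-disjunct matrix and a target of at most d variables, the surviving set is exactly the set of target variables, as argued in the paragraph above.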
For a randomized non-adaptive learning algorithm, just randomly choose assignments in where each is one with probability and zero with probability . Define . Then for each assignment that satisfies remove all from for which . It is easy to show that with probability at least , the variables that remain in are the variables of the target.
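A minimal sketch of this randomized procedure follows. The probability of setting an entry to one is not recoverable from the text here, so the value 1/d used below is an assumption, as are the function name `random_nonadaptive_learn`, the `seed` parameter and the number of queries in the usage example.

```python
import random


def random_nonadaptive_learn(n, d, num_queries, membership_query, seed=0):
    """Random non-adaptive learner for monotone clauses with at most d variables."""
    rng = random.Random(seed)
    p = 1.0 / d  # assumed probability that an entry is one
    # Non-adaptive: all queries are drawn before any answer is seen.
    queries = [tuple(int(rng.random() < p) for _ in range(n)) for _ in range(num_queries)]
    candidates = set(range(n))
    for a in queries:
        if membership_query(a) == 0:
            # A negative answer rules out every variable set to one in a.
            candidates -= {i for i in range(n) if a[i] == 1}
    return candidates


# Usage: target x2 OR x11 over 20 variables.
target_vars = {2, 11}
f = lambda a: int(any(a[i] for i in target_vars))
print(random_nonadaptive_learn(20, 2, 200, f))
```

The returned set always contains the target variables; it equals them only with high probability, which is why the algorithm is of Monte Carlo type.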
The problem of strongly attribute learnability of MClause, which is equivalent to the problem of group testing when is not known to the learner, was studied by Damaschke and Muhammad, [115, 116]. They show that for deterministic non-adaptive algorithms, determining the exact number of the relevant variables is as difficult as learning the target function. For randomized non-adaptive learning algorithms, they gave the upper bound of to approximate and the lower bound (with some constraints) of .
4.5 Adaptive Learning -MClause
In this subsection, we present the folklore algorithm for adaptively learning the class -MClause. The algorithm runs in polynomial time and has a query complexity that matches the lower bound and therefore -MClause is optimally learnable.
We first give the lower bound
Lemma 17
Any deterministic (or even randomized) algorithm for -MClause must ask at least queries.
Proof
Follows from Lemma 2 and the fact that .∎
We now give the folklore algorithm. Let be the target function. For a subset , define the assignment that is one in the entries that are in and 0 in the other entries. At the first stage, the algorithm defines a set . At stage , the algorithm has disjoint sets where for all . The algorithm at stage partitions each set , into two (almost) equal disjoint sets and asks two queries and . The sets that will survive to the following stage, , are the sets for which . Those will be assigned to . The algorithm stops when the sizes of those sets are . Then each will be holding an index of a variable in the target.
Obviously, throughout the algorithm we have and for all . The algorithm has at most stages. At each stage, it asks at most queries, and therefore, the total number of queries is
[TABLE]
A more precise analysis gives the upper bound
See also the algorithms in [61, 90, 119, 125, 254, 266] and references therein.
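The folklore splitting algorithm described above can be sketched as follows. The sketch assumes the algorithm starts from the set of all variable indices, and it adds one initial query on that full set to handle the empty clause; that extra query and the function name `learn_clause_adaptively` are additions of this sketch rather than part of the description.

```python
def learn_clause_adaptively(n, membership_query):
    """Return (indices of the target's variables, number of queries asked)."""
    queries = 0

    def ask(indices):
        # Query the assignment that is one exactly on `indices`.
        nonlocal queries
        queries += 1
        chosen = set(indices)
        return membership_query(tuple(1 if i in chosen else 0 for i in range(n)))

    # Extra initial query (an addition of this sketch) to detect the empty clause.
    if ask(range(n)) == 0:
        return set(), queries

    active = [list(range(n))]  # disjoint sets, each containing a target variable
    while any(len(S) > 1 for S in active):
        survivors = []
        for S in active:
            if len(S) == 1:
                survivors.append(S)
                continue
            half = len(S) // 2
            for part in (S[:half], S[half:]):  # two queries per set per stage
                if ask(part) == 1:
                    survivors.append(part)
        active = survivors
    return {S[0] for S in active}, queries


# Usage: target x3 OR x7 OR x12 over 16 variables; d is not given to the learner.
f = lambda a: int(a[3] or a[7] or a[12])
print(learn_clause_adaptively(16, f))
```

Since the active sets are disjoint and each contains a target variable, at most d sets are active at any stage, which gives the query bound discussed above.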
The above adaptive algorithm runs even if is unknown to the learner. Therefore, the class -MClause is adaptively strongly attribute-optimally learnable. This implies
Theorem 4.2
[160] The class -MClause is strongly attribute-optimally learnable with queries.
For randomized adaptive algorithms see [117] and references therein. When is unknown, Cheng [82] shows that there is a randomized adaptive learning algorithm that asks queries and finds with probability at least .
4.6 Two-Round Learning
In [56], De Bonis et al. show that there is a two-round adaptive algorithm for learning -MClause that asks queries. See also [88, 126]. This is asymptotically as efficient as the best fully adaptive learning algorithms. Therefore
[TABLE]
The algorithm uses a -selector. A -selector is a Boolean matrix such that any columns contain at least distinct rows of Hamming weight . It is known that there is a -selector of size . This follows from the following simple probabilistic argument: randomly choose a matrix where each entry is 1 with probability and 0 with probability . Then show that the probability that the matrix is not a -selector is less than one.
Given a -selector, the algorithm is as follows. Let , be the target function. At the first round, the algorithm asks queries that are the rows of a -selector. Let . The algorithm then eliminates all the variables in where there is a query for which and . At the second round, for each variable (that was not eliminated in the first round) the algorithm asks the query where and for all . Then in the target if and only if .
Now we show that the number of variables that are not eliminated in the first round is at most . Suppose, to the contrary, that there are variables that are not eliminated in the first round. By the same argument as in Subsection 4.4, . By the property of -selectors and since there is an assignment where and for some . This implies that and and the variable was eliminated in the first round. This is a contradiction.
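The two-round structure can be sketched as follows. The sketch takes an arbitrary list of first-round queries instead of a verified selector, so it is always correct but gives no guarantee on the number of second-round queries. The second-round query for a surviving variable is read here as the assignment that is one on that variable and zero elsewhere; that reading, the random first round in the usage example, and the name `two_round_learn` are assumptions.

```python
import random


def two_round_learn(n, first_round, membership_query):
    """Two-round learner for monotone clauses; returns (target variables, total queries)."""
    # Round 1: all queries are fixed in advance; eliminate variables seen in negative queries.
    answers = [membership_query(a) for a in first_round]
    survivors = set(range(n))
    for a, answer in zip(first_round, answers):
        if answer == 0:
            survivors -= {i for i in range(n) if a[i] == 1}

    # Round 2: one individual query per surviving variable.
    second_round = {i: tuple(1 if j == i else 0 for j in range(n)) for i in sorted(survivors)}
    target_vars = {i for i, b in second_round.items() if membership_query(b) == 1}
    return target_vars, len(first_round) + len(second_round)


# Usage: random first round (a stand-in, not a verified selector), target x1 OR x5.
rng = random.Random(1)
n = 12
first_round = [tuple(int(rng.random() < 0.5) for _ in range(n)) for _ in range(8)]
f = lambda a: int(a[1] or a[5])
print(two_round_learn(n, first_round, f))
```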
Indyk shows in [170] how to construct an explicit -selector of size . This construction gives a polynomial time learning algorithm for -MClause with queries. Therefore the class -MClause is two-round efficiently learnable and two-round optimally learnable in . Cheraghchi, [81], used recent results in extractors to prove that -MClause is two-round almost optimally learnable in . His algorithm asks queries.
4.7 Other Related Problems
The group testing with inhibitors (GTI) model was introduced in [133]. In this model, in addition to positive items and regular items, there is also a category of items called inhibitors. The inhibitors are the items that interfere with the test by hiding the presence of positive items. As a consequence, a test yields a positive feedback if and only if the tested pool contains one or more positives and no inhibitors. This problem is equivalent to learning functions of the form
[TABLE]
This problem is studied in [76, 94, 112, 133, 162, 167].
See other related problems in [86, 217] and references therein.
5 Learning -Term -Monotone DNF
Consider the class -term -MDNF. That is, the class of monotone DNF with monotone terms (monomials) where each term is of size at most . Torney, [262], first introduced the problem and gave some applications in molecular biology. In this section, we present some results known from the literature for learning this class.
5.1 Learning a Hypergraph and its Applications
A hypergraph is where is the set of vertices, and is the set of edges. The dimension or rank of the hypergraph is the cardinality of the largest set in . A hypergraph is called a Sperner hypergraph if no edge is a subset of another. For a set , the edge-detecting query is answered “Yes” or “No”, indicating whether contains all the vertices of at least one edge of . Learning the class -term -MDNF is equivalent to learning a hidden Sperner hypergraph of dimension at most with at most edges using edge-detecting queries [14].
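The correspondence between edge-detecting queries and membership queries to a monotone DNF can be made concrete with a short sketch. The representation (edges as Python sets of vertex indices) and the function names are our own choices for illustration.

```python
def edge_detecting_query(edges, S):
    """Answer True iff S contains all the vertices of at least one edge."""
    S = set(S)
    return any(set(e) <= S for e in edges)

def mdnf_value(terms, a):
    """Evaluate the monotone DNF whose terms are the index sets in `terms`."""
    return int(any(all(a[i] == 1 for i in t) for t in terms))

def is_sperner(edges):
    """True iff no edge is a proper subset of another edge."""
    es = [frozenset(e) for e in edges]
    return not any(e < g for e in es for g in es)
```

Querying the set `S` is the same as evaluating the DNF on the characteristic vector of `S`, so each edge plays the role of one monotone term.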
This problem has many applications in chemical reactions, molecular biology, and genome sequencing. In chemical reactions, we are given a set of chemicals, some of which react and some which do not. When multiple chemicals are combined in one test tube, a reaction is detectable if and only if at least one set of the chemicals in the tube reacts. The goal is to identify which sets react using as few experiments as possible.
See [6, 11, 12, 14, 15, 42, 55, 87, 91, 94, 100, 114, 115, 122, 139, 144, 211, 212, 235, 262] for more details on the problem, learnability of subclasses of -term -MDNF and other applications. This problem is also called, “sets of positive subsets” [262] “complex group testing” [114, 211] and “group testing in hypergraph” [139].
In all of the above applications, the size of the terms is much smaller than the number of terms and both are much smaller than the number of vertices . Therefore, all the results in the literature, except [13], assume that , although they do not mention this constraint explicitly. For ease of presentation, we will also adopt this constraint throughout this section.
5.2 Cover Free Families
One of the tools used in the literature for learning -term -MDNF is cover-free families (CFF). A -cover free family (-CFF), [190], is a set such that for every where and every of size there is such that for all and for all . Denote by the minimum size of such set. The lower bound in [123, 215, 260] is
[TABLE]
where
[TABLE]
It is known, [12], that a set of
[TABLE]
random vectors , where each is with probability , is a -CFF with probability at least .
It follows from [38, 41, 52, 134] that there is a polynomial time (in the size of the CFF) deterministic construction of -CFF of size
[TABLE]
where the is with respect to . When , the construction can be done in linear time [41, 52].
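For small parameters, the cover-free property can be checked by brute force. The sketch below assumes the following reading of the definition: for every choice of `s + r` coordinates and every `s` of them, some vector in the family is 0 on the `s` chosen coordinates and 1 on the remaining `r`. The function name and the tuple representation are illustrative, and the running time is exponential, so this is only useful for sanity checks.

```python
from itertools import combinations

def is_cff(A, n, s, r):
    """Brute-force check that A (0/1 tuples of length n) is an (s, r)-CFF
    under the reading stated above."""
    d = s + r
    for idx in combinations(range(n), d):
        for zeros in combinations(idx, s):
            zset = set(zeros)
            # Look for a vector that is 0 on `zeros` and 1 on the rest of idx.
            if not any(all(a[i] == (0 if i in zset else 1) for i in idx) for a in A):
                return False
    return True
```

For example, the unit vectors of length 3 pass the check with `s = 2, r = 1` but fail it with `s = 1, r = 2`, since no unit vector has two ones.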
5.3 Non-Adaptive Learning -Term -MDNF
In this section, we give a non-adaptive learning algorithm for the class of -term -MDNF.
We first give a lower bound.
Theorem 5.1
[13, 114] Let . Any equivalent test set for -term -MDNF is -CFF and -CFF. Therefore, any non-adaptive algorithm for learning -term -MDNF must ask at least
[TABLE]
queries.
In particular, when is constant, the number of queries is at least
[TABLE]
Proof
Consider any distinct . To be able to distinguish between the two functions and we must have an assignment that satisfies and . Therefore is -CFF.
To be able to distinguish between the two functions and we must have an assignment that satisfies and . Therefore is -CFF.∎
We now give a simple upper bound.
Theorem 5.2
Any -CFF is an equivalent test set for -term -MDNF. Therefore, there is a non-adaptive learning algorithm for -term -MDNF that asks
[TABLE]
queries. In particular, when is constant,
[TABLE]
Proof
Let be a -CFF. Let be any two non-equivalent -term -MDNF. Suppose and where . Let be an assignment such that (w.l.o.g.) and . Then for all and for some . Let , . Then for every , , there is a variable in where and for all the variables in we have .
Now take such that , and . Such exists since is -CFF. Then we have and . This completes the proof.∎
The first explicit non-adaptive learning algorithm for -term -MDNF was given by Gao et al., [139]. They show that this class can be learned with -CFF. Given such a -CFF, the algorithm simply takes all the monomials of size at most that satisfy . It is easy to see that the disjunction of all such monomials is equivalent to the target function. Assuming a set of -CFF of size can be constructed in time , the above algorithm learns -term -MDNF with queries in time . This with (12) gives
Theorem 5.3
There is a non-adaptive learning algorithm for -term -MDNF that asks
[TABLE]
queries and runs in time.
When is constant, the algorithm asks
[TABLE]
queries and runs in time.
In particular, for constant , the class -term -MDNF is non-adaptively almost optimally learnable and optimally learnable in .
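The non-adaptive algorithm of Gao et al. described before Theorem 5.3 can be sketched as follows. The sketch assumes that the condition on a monomial is "no queried assignment satisfies the monomial while the target answers 0", which is our reading of the description; the helper names are hypothetical. The full Boolean cube is used as a toy query set because it is trivially cover-free, whereas a real instance would use a much smaller CFF.

```python
from itertools import combinations, product

def mdnf(terms):
    """Query oracle for the monotone DNF with the given terms (index sets)."""
    ts = [frozenset(t) for t in terms]
    return lambda a: int(any(all(a[i] for i in t) for t in ts))

def learn_mdnf_nonadaptive(query, A, n, r):
    answers = {a: query(a) for a in A}  # all queries are fixed in advance
    kept = []
    for size in range(1, r + 1):
        for M in combinations(range(n), size):
            # Keep M if every queried assignment satisfying M is answered 1.
            if all(ans == 1 for a, ans in answers.items() if all(a[i] for i in M)):
                kept.append(frozenset(M))
    # Drop monomials that are implied by a smaller kept monomial.
    return {M for M in kept if not any(N < M for N in kept)}

# Usage sketch with the full cube as a toy query set.
A = list(product((0, 1), repeat=5))
target = mdnf([{0, 1}, {2}])
print(learn_mdnf_nonadaptive(target, A, n=5, r=2))
```

The enumeration over all monomials of size at most `r` is what drives the running time of this approach; the final pruning step only removes redundant terms and does not change the function computed.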
When we can use Lemma 14 to prove
Theorem 5.4
Let . There is a non-adaptive learning algorithm for -term -MDNF that asks
[TABLE]
queries and runs in time.
In particular, for for some , the class -term -MDNF is non-adaptively almost optimally learnable and optimally learnable in .
Proof
Follows from Lemma 14 and Theorem 5.3 and the fact that any -term -MDNF has at most relevant variables.∎
One can now use (11) in a straightforward manner to get a randomized non-adaptive algorithm with better time and query complexity. Recently, Abasi et al. [13], gave an almost optimal learning algorithm for all and .
5.4 Adaptive Learning -Term -MDNF
In this section, we give results on adaptive algorithms for learning -term -MDNF.
Adaptive algorithms for learning -term -MDNF are studied in [14, 15] and [12]. The information theoretic lower bound for this class is . Angluin and Chen gave in [15] the lower bound when and Abasi et al. gave in [12] the lower bound when . Angluin and Chen gave a polynomial time adaptive algorithm for learning -term -MDNF that asks queries. Therefore, the class -term -MDNF is adaptively optimally learnable. In [12] Abasi et al. gave a polynomial time learning algorithm for -term -MDNF that asks queries when and queries when . They also gave some randomized algorithms.
The following table summarizes the latest results: Det. and Rand. stand for deterministic algorithm and randomized algorithm, respectively.
[TABLE]
5.5 Learning Subclasses of -term -MDNF
Learning subclasses of graphs and hypergraphs from edge-detecting queries has received considerable attention in the literature due to its diverse applications [14, 15, 42, 87]. This is equivalent to learning subclasses of -term -MDNF and -term -MDNF, respectively. Subclasses include graphs of bounded degree [15], Hamiltonian cycles [87, 143, 144], matchings [11, 42, 87], stars [6, 87], cliques [6, 87], families of graphs that are closed under isomorphism [11], and -uniform hypergraphs and almost uniform hypergraphs [15]. Learning the class of Read-Once -MDNF is equivalent to learning matchings [11].
6 Learning Decision Tree
In this section we study the learnability of the class of Depth -DT (), i.e., the class of all decision trees of depth at most .
6.1 Bounds on
We say that a set of assignments is -universal set if for every and every there is an such that for all .
It is known that any -universal set is of size [191, 246]. The probabilistic method with the union bound gives the upper bound . The best known polynomial time, , construction gives an -universal set of size [218]. It is easy to show that a random uniform set of size is -universal set with probability at least . For , an -universal set of size can be constructed in polynomial time [218].
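The universal set property is easy to verify by brute force for small parameters. The sketch below assumes the definition given above in its usual form: every choice of `d` coordinates must exhibit all `2^d` patterns across the set. The function name is illustrative.

```python
from itertools import combinations

def is_universal(A, n, d):
    """Check that the 0/1 tuples in A realize every pattern on every d coordinates."""
    for idx in combinations(range(n), d):
        seen = {tuple(a[i] for i in idx) for a in A}
        if len(seen) < 2 ** d:
            return False
    return True
```

As a small example, the four even-weight vectors of length 3 realize every pattern on any two coordinates but not on all three.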
We now prove
Theorem 6.1
A set of assignments is a hitting set for if and only if is an -universal set. In particular
[TABLE]
Proof
Let be an -universal set. Let be a decision tree of depth at most . Consider a path from the root of to a leaf that is labeled with . Let , where is the leaf that is labeled with , is labeled with and the edge is labeled with . If is -universal set then there is such that for all and . Therefore, is a hitting set for .
The other direction follows from the fact that any term where and is a decision tree of depth . Recall that if and if . To hit we need an assignment that satisfies .∎
For equivalent test set, adaptive and non-adaptive learning, we prove
Theorem 6.2
We have
[TABLE]
Proof
Let be an adaptive algorithm that learns DTd. We run and answer [math] for all the membership queries. The algorithm must stop and return the function [math] as the target function. Let be the set of all the assignments asked in the membership queries. If is not -universal set, then there is and such that . Then the term is zero on all the assignments of , and we get a contradiction.
We now prove that and then the other results follow from 3 in Lemma 5.
For any two functions and the corresponding decision trees and , one can construct a tree for from and as follows. First is equivalent to the tree that is the same as where the labels in the leaves are flipped from [math] to and from to [math]. Now in the tree , replace each leaf that is labeled with [math] with the tree and each leaf that is labeled with with the tree . It is easy to show that this tree computes , and its depth is at most .∎
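The tree composition used in this proof can be sketched in a few lines. We assume the combined function is the sum modulo 2 of the two functions, and we use a hypothetical tuple encoding of trees: a leaf is `0` or `1`, and an internal node is `(i, low, high)`, meaning "read variable `i`, go to `low` if it is 0 and to `high` if it is 1".

```python
def evaluate(tree, a):
    """Follow the path selected by assignment a down to a leaf."""
    while not isinstance(tree, int):
        i, low, high = tree
        tree = high if a[i] else low
    return tree

def depth(tree):
    if isinstance(tree, int):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))

def flip(tree):
    """Same tree with every leaf label complemented."""
    if isinstance(tree, int):
        return 1 - tree
    i, low, high = tree
    return (i, flip(low), flip(high))

def xor_trees(t1, t2):
    """Hang t2 under every 0-leaf of t1 and flip(t2) under every 1-leaf."""
    if isinstance(t1, int):
        return t2 if t1 == 0 else flip(t2)
    i, low, high = t1
    return (i, xor_trees(low, t2), xor_trees(high, t2))
```

Every root-to-leaf path in the result is a path of the first tree followed by a path of the second, which is why the depth is at most the sum of the two depths.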
6.2 Adaptive Learning Decision Tree
The adaptive learnability of decision trees of depth follows from many papers [38, 39, 65, 68, 130, 172, 187, 251]. One of the powerful techniques used in the literature is the discrete Fourier transform (DFT). In DFT, one regards the Boolean function as a real function in , represents it as a linear combination of orthonormal basis functions, and then learns the coefficients.
In [187], Kushilevitz and Mansour used this technique for learning the class of decision trees as follows. Consider the set where . It is easy to see that is an orthonormal basis for the set of functions . Therefore, every function can be represented as
[TABLE]
where . This representation is called the Fourier representation of and is called the Fourier coefficient of . It is easy to see that the Fourier coefficient of is where , and the expectation is over the uniform distribution on . So every coefficient can be estimated using the Chernoff bound. It remains to show that for a decision tree of depth the number of nonzero Fourier coefficients is small, and they can be found exactly and efficiently. (Footnote: An efficient learning algorithm for decision trees of depth is one that asks queries and runs in time; see Section 1.5.) We demonstrate the algorithm with the help of the following simple example.
[TABLE]
[TABLE]
[TABLE]
Consider the decision tree in Figure 4: . In this example, the depth of is , and is a sum of terms of size . First notice that since the terms are disjoint (no two terms are equal to for the same assignment), the “+” operation can be replaced by the arithmetic “+” operation in . To change the values of the function to values, we take . In general, every decision tree of size and depth can be written as a sum (in ) of terms of size at most . Now take any term, say . Over the real numbers , we can express as and as . Then the term can be expressed as
[TABLE]
In general, every term of size has a Fourier representation that contains non-zero coefficients, each of which is and where is the Hamming weight of , i.e., the number of ones in . Therefore, every decision tree of size and depth has a Fourier representation that contains at most non-zero Fourier coefficients, each of which has one of the values in . In fact, using Parseval’s identity, one can prove that and, therefore, the Fourier coefficients have values from . Also, each non-zero coefficient satisfies . Now, using the Chernoff bound, for each assignment of weight at most , one can find each coefficient exactly with queries. The problem with this algorithm is that, since the number of assignments that satisfy is , the time complexity is exponential .
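The claim about the Fourier representation of a single term can be checked numerically on a small example. The sketch below assumes the standard basis in which the basis function indexed by `a` maps `x` to `(-1)^(a·x)`, and computes coefficients exactly by enumerating the whole cube (so it is exponential in `n` and meant only as a sanity check). The function name is illustrative.

```python
from fractions import Fraction
from itertools import product

def fourier_coefficients(f, n):
    """Exact non-zero coefficients of f: {0,1}^n -> R in the basis
    chi_a(x) = (-1)^(a.x), computed as E_x[f(x) * chi_a(x)]."""
    points = list(product((0, 1), repeat=n))
    coeffs = {}
    for a in points:
        total = sum(f(x) * (-1) ** sum(ai * xi for ai, xi in zip(a, x))
                    for x in points)
        c = Fraction(total, 2 ** n)
        if c != 0:
            coeffs[a] = c
    return coeffs

# A term of size 3 over 4 variables: x0 AND (NOT x1) AND x2.
term = lambda x: int(x[0] == 1 and x[1] == 0 and x[2] == 1)
coeffs = fourier_coefficients(term, 4)
```

Here `coeffs` has one entry for each subset of the three variables in the term, all of the same magnitude, and none that involve the irrelevant variable `x3`. Applying the same function to `1 - 2 * term(x)` gives coefficients whose squares sum to one, in line with Parseval's identity.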
Kushilevitz and Mansour in [187] and Goldreich and Levin in [149] gave an adaptive algorithm that finds the non-zero coefficients in time and queries. Kushilevitz and Mansour algorithm (KM-algorithm) is based on the fact that for any we have
[TABLE]
where for and , . Notice that can be computed exactly (with high probability) with Chernoff bound. Now KM-algorithm uses divide and conquer technique with the above identity to find the non-zero coefficients in time. Notice that
[TABLE]
and therefore . The algorithm first computes and and lets . At some stage, it holds a set where for all and for all . Now for each it computes and . Since , at least one of them is not zero. Then it defines . Since the number of the non-zero Fourier coefficients of a decision tree of depth at most is less than , the number of elements in is less than . Notice that for
[TABLE]
and therefore, contains all the assignments for which . It is easy to see that this algorithm runs in time . Therefore
Theorem 6.3
[187] There is an adaptive Monte Carlo learning algorithm that learns in time and membership queries.
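The prefix-extension idea behind the KM-algorithm can be illustrated with a small sketch. To keep it deterministic, the bucket weights are computed exactly by enumeration rather than estimated by sampling, so this version runs in time exponential in `n` and only demonstrates the pruning step; it is not the algorithm of [187] as stated. The function names and the convention that a prefix fixes the first coordinates are our own assumptions.

```python
from fractions import Fraction
from itertools import product

def chi(a, x):
    """Basis function (-1)^(a.x)."""
    return -1 if sum(ai * xi for ai, xi in zip(a, x)) % 2 else 1

def bucket_weight(f, n, alpha):
    """Sum of squared Fourier coefficients over all indices with prefix alpha,
    computed as E_x[(E_y[f(y, x) * chi_alpha(y)])^2]."""
    k = len(alpha)
    total = Fraction(0)
    for x in product((0, 1), repeat=n - k):
        inner = sum(f(y + x) * chi(alpha, y) for y in product((0, 1), repeat=k))
        total += Fraction(inner, 2 ** k) ** 2
    return total / 2 ** (n - k)

def km_nonzero_indices(f, n):
    """Grow prefixes one bit at a time, discarding buckets of zero weight."""
    prefixes = [()]
    for _ in range(n):
        prefixes = [p + (b,) for p in prefixes for b in (0, 1)
                    if bucket_weight(f, n, p + (b,)) != 0]
    return prefixes
```

Because a bucket is discarded only when its total weight is zero, every index with a non-zero coefficient survives all `n` rounds, and the number of live prefixes never exceeds the number of non-zero coefficients.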
Kushilevitz and Mansour use a derandomization technique to change the algorithm to deterministic. They prove
Theorem 6.4
[187] There is an adaptive deterministic learning algorithm that learns in time and membership queries.
By Lemma 9, we have
Theorem 6.5
There is an adaptive deterministic learning algorithm that learns in time and membership queries.
In particular, is efficiently adaptively learnable and optimally adaptively learnable in .
Proof
We use Lemma 9. Since decision trees of depth have at most relevant variables, we can set . By Theorem 6.4, . An -universal set is a constant testing set for . See the proof of Theorem 6.2. The best known polynomial time, , construction gives an -universal set of size [218]. Therefore . Then the reduction in Lemma 9 gives a polynomial time adaptive learning algorithm that asks
[TABLE]
membership queries.∎
See other randomized algorithms in [68, 251] that use different techniques. The algorithm in [251] uses membership and equivalence queries, and it is easy to see that every equivalence query can be simulated by randomized membership queries.
6.3 Non-Adaptive Learning Decision Tree
In this subsection, we give a sketch of the results in [65, 149] and then of [130] that gave the first polynomial time Monte Carlo non-adaptive learning algorithm for .
The following is the result of Hofmeister in [159]
Lemma 18
[159] There is a polynomial time deterministic non-adaptive algorithm for that asks membership queries.
In particular, there is a set of assignments of size that can be constructed in polynomial time and an algorithm such that: Given for some of weight at most , the algorithm finds the assignment in polynomial time.
The main idea of the learning algorithm of is to use pairwise independent assignments for estimating , rather than totally independent assignments. Since pairwise independent assignments can be generated with a small number of random bits, the problem is reduced to finding the Fourier coefficients of a function that depends on a small number of variables. Using those coefficients, one can recover the assignments with large Fourier coefficients in . We now give a sketch of the algorithm and its correctness.
It is easy to see that for the function we have
[TABLE]
Therefore
[TABLE]
where if and otherwise. Therefore
[TABLE]
Assuming we know , to compute for some we only need to know the sign of . Notice that . Now assuming is not zero (and therefore ), to compute the sign of it is enough to use Bienayme-Chebyshev bound rather than Chernoff bound. That is, to estimate on pairwise independent assignments rather than totally independent assignments. To generate such assignments, consider a random uniform matrix over the binary field where . We will determine later. Then the assignments in the set are pairwise independent. Combining all the above ideas with Bienayme-Chebyshev bound we prove
Lemma 19
With probability at least we have
[TABLE]
where .
Proof
Let . Consider a random uniform and the random variable . We have and . Since are pairwise independent, by Bienayme-Chebyshev bound we get
[TABLE]
Therefore, with probability at least we have
[TABLE]
By (13), with probability at least we have
[TABLE]
Since
[TABLE]
The result follows.∎
Notice that is a function in variables. This is the key lemma. It shows that if , there is a positive probability that can be computed (modulo the ) using the sign of some Fourier coefficient of . Since depends on a small number of variables and a membership query to can be simulated by a membership query to (since ), all its Fourier coefficients can be easily found. Now if is computed for all , where is the set in Lemma 18, then can be found with Hofmeister’s algorithm. In the following, we give more details.
By Lemma 19 and using the union bound we have
Lemma 20
Let be as in Lemma 18 and . Let . For any , if , there is and such that with probability at least we have
[TABLE]
where .
Now since depends on variables we can find all its Fourier coefficients in time and membership queries. Therefore, in time and membership queries we can find
[TABLE]
for all and . If is not zero then and then by Lemma 20, with probability at least , some and satisfy . Then by Lemma 18, can be recovered in polynomial time. So from all and using the algorithm in Lemma 18 we find a set of assignments such that: if is not zero then with probability at least . This implies that on average, contains half of the assignments that correspond to the non-zero Fourier coefficients of . The size of is at most . Then we find the Fourier coefficient for all using the Chernoff bound and the union bound with additional membership queries. We can repeat the above times to find all the non-zero Fourier coefficients of with probability at least .
Putting all the above ideas together, it follows that
Lemma 21
[130] There is a non-adaptive Monte Carlo learning algorithm that learns in polynomial time and membership queries.
By Lemma 14, we get
Lemma 22
There is a non-adaptive Monte Carlo learning algorithm that learns in polynomial time and membership queries.
In particular, is MC efficiently non-adaptively learnable and MC optimally non-adaptively learnable in .
Proof
We use Lemma 14. Since a decision tree of depth at most contains at most relevant variables, we can take . We take . By Lemma 14, . Then the number of membership queries is
[TABLE]
∎
A better query complexity can be obtained from the reduction in [49]. See the following Table.
The outputs of the above algorithms are the Fourier representation of the decision tree and, therefore, they are non-proper learning algorithms.
The following table summarizes the current state-of-the-art results in learning
[TABLE]
7 Other Results
In this section, we give some results for learning other Boolean classes and arithmetic classes.
7.1 Other Boolean Classes
-MTerm: This class is the dual of -MClause. That is,
[TABLE]
Any algorithm that learns a class can be converted to an algorithm that learns with the same query complexity. This can be done as follows: Algorithm runs algorithm and for each query that asks, algorithm asks the query . For each answer received by the teacher, algorithm returns the answer to . If algorithm outputs then algorithm outputs
[TABLE]
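The reduction from a class to its dual can be written as a thin wrapper around any learner. The sketch assumes the usual notion of duality (the dual of a function negates both the input and the output) and uses a toy learner for monotone clauses as the underlying algorithm; all names are hypothetical.

```python
def complement(a):
    return tuple(1 - ai for ai in a)

def learn_dual(learner_A, query_f, n):
    """Learn a target f from the dual class using a learner for the class itself."""
    def query_g(a):
        # The learner believes it is talking to g(x) = NOT f(NOT x).
        return 1 - query_f(complement(a))
    h = learner_A(query_g, n)
    # Dualize the hypothesis to obtain a hypothesis for f.
    return lambda x: 1 - h(complement(x))

def learn_monotone_clause(query, n):
    """Toy learner for monotone clauses: test each unit vector."""
    rel = [i for i in range(n) if query(tuple(int(j == i) for j in range(n)))]
    return lambda x: int(any(x[i] for i in rel))
```

Wrapping `learn_monotone_clause` this way yields a learner for monotone terms that asks exactly the same number of queries, since each query of the inner learner is translated into one query to the teacher.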
-Term (the dual of -Clause): We first recall the definition of -universal set and then show how to use it for learning -Term.
A -restriction problem [24, 38, 218] is a problem of the following form: Given , a length and a set of assignments. Find a set of small size such that: For any and there is such that .
When then is called -universal set. The lower bound for the size of -universal set is [191, 246]
[TABLE]
Using a simple probabilistic method, one can get the upper bound
[TABLE]
Also, a random uniform set of assignments in is, with probability at least , -universal set. The best known polynomial time (poly) construction for -universal set is of size
[TABLE]
[218]. For , a -universal set of size can be constructed in polynomial time [218].
Now consider the class of -Term. Let be an adaptive algorithm that learns this class. Suppose the target function is the zero term and let be the set of queries that the algorithm asks with this target. Then must satisfy the following property: For every and every there is such that . Otherwise, the algorithm cannot distinguish between the zero term and the term where and . This is because is also zero on all the assignments in . Therefore, must be an -universal set and then the query complexity of the algorithm is at least .
Now it is easy to see that any -universal set can be used to learn non-adaptively the class -Term. Just take all the positive assignments, i.e., the assignments such that , and find the entries that have the same value in all of them. This uniquely determines the term. Therefore
[TABLE]
This also gives a non-adaptive learning algorithm that asks queries and runs in time. Therefore the class -Term is non-adaptively almost optimally learnable.
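The non-adaptive procedure for terms described above amounts to intersecting the positive assignments. In this sketch the query set `A` is passed in and is assumed to be universal with the parameter required by the text; the function name and the dictionary output format (variable index mapped to its required value) are illustrative.

```python
def learn_term(query, A, n):
    """Return {index: required value} for the target term, or None for the zero term."""
    positives = [a for a in A if query(a) == 1]
    if not positives:
        return None  # no positive assignment was seen: the zero term
    first = positives[0]
    # Keep the coordinates on which all positive assignments agree.
    return {i: first[i] for i in range(n)
            if all(a[i] == first[i] for a in positives)}
```

A coordinate that is not in the term takes both values among the positive assignments of a sufficiently universal set, so only the literals of the term survive the intersection.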
XOR: The class XOR is of size and therefore, by Lemma 2, any adaptive learning algorithm for XOR must ask at least queries. Now the trivial algorithm that asks the queries , where is the assignment that is in entry and zero elsewhere, learns XOR. Therefore, the class XOR is optimally learnable.
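The trivial algorithm for XOR is short enough to state in full. The sketch represents the learned function by the set of indices of its variables; the function name is illustrative.

```python
def learn_xor(query, n):
    """Ask the n unit vectors; variable i is in the XOR iff the i-th answer is 1."""
    return {i for i in range(n)
            if query(tuple(int(j == i) for j in range(n))) == 1}
```

Each unit vector isolates one variable, so the answer to the query is exactly the indicator of whether that variable appears in the target.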
-XOR: Since
[TABLE]
by Lemma 2, the lower bound for the number of queries for any randomized algorithm that learns this class is . Uehara et al. give in [266] an adaptive algorithm that learns -XOR in queries. Therefore -XOR is adaptively optimally learnable. Hofmeister gives in [159] a non-adaptive algorithm that learns -XOR in queries. Therefore -XOR is also non-adaptively optimally learnable.
-Junta: The class of -Juntas is studied by Damaschke in [107, 108, 109] and Bshouty and Costa in [49]. In [108], Damaschke shows that
[TABLE]
He then shows that -Junta is almost optimally learnable in and efficiently learnable in [107, 109]. Using Lemma 14 with this result, we get an algorithm that asks queries and runs in time . Therefore the class -Junta is almost optimally learnable. Bshouty and Costa, [49], closed the above gap and showed that
[TABLE]
They also showed that randomness does not help improving the query complexity. See also other results for randomized algorithms in [49, 107], optimal algorithms for small with a constant number of rounds and bounds for the number of rounds in [49, 109].
The following is a simple adaptive learning algorithm [108]. First ask the queries of an -universal set. Then take any two assignments and such that . Then find a relevant variable by a binary search on the bits that differ between and . Let be the subset of the relevant variables found so far. To find another relevant variable, we search for two assignments and that give the same values for the variables in and . If no such assignments exist, then is the set of all the relevant variables, and we just learn the truth table over . Otherwise, the binary search between and gives a new relevant variable. It is easy to see that the query complexity of this algorithm is where is the size of the -universal set. This shows that
[TABLE]
and, therefore, the class -Junta is almost optimally adaptively learnable.
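The binary search step of the adaptive junta algorithm can be sketched as follows. Given two assignments on which the target differs, it walks from one toward the other, halving the set of differing coordinates with each query. The function name and the returned query count are our own additions for illustration.

```python
def find_relevant_variable(query, a, b):
    """Given query(a) != query(b), return (index of a relevant variable, #queries)."""
    a, b = list(a), list(b)
    fa = query(tuple(a))
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    queries = 1
    # Invariant: query(a) == fa, query(b) != fa, and a, b differ exactly on diff.
    while len(diff) > 1:
        half = diff[: len(diff) // 2]
        c = list(a)
        for i in half:
            c[i] = b[i]
        queries += 1
        if query(tuple(c)) != fa:
            b, diff = c, half                      # the value changes within `half`
        else:
            a, diff = c, diff[len(diff) // 2:]     # no change yet: advance a
    return diff[0], queries
```

When the loop ends, the two assignments differ in a single coordinate and the target takes different values on them, so that coordinate is relevant. The number of queries is logarithmic in the number of coordinates on which the two starting assignments differ.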
-MJunta: The results in [49, 108, 109, 220] show that
[TABLE]
and
[TABLE]
Using Lemma 14 with the result of Damaschke in [109], we get a non-adaptive learning algorithm for -MJunta that asks queries and runs in time . Therefore the class -MJunta is almost optimally non-adaptively learnable.
The class of -MJunta is studied in [154, 180] where the exact value
[TABLE]
was found. Now, by Lemma 13, -MJunta is adaptively learnable in time and queries. Thus, the class -MJunta is adaptively almost optimally learnable.
Decision Trees (DTd). See Section 6.
DNF: This class and its subclasses are not studied in the literature for the model of exact learning from membership queries only.
Monotone DNF: See Section 5.
CNF: The dual class of DNF.
CDNF: This class is not studied in the literature for the model of exact learning from membership queries only. Some non-optimal results can be achieved using the algorithm in [33] and the reductions in Subsection 3.4.
Monotone CDNF: The learnability of monotone CDNF is studied in [47, 110, 118]. Domingo, [110], shows that the class of monotone CDNF is learnable with a polynomial number of queries in time where is the size of the monotone CDNF, that is, the size of the MDNF and MCNF of the target. In [118] Domingo et al. study the learnability of the class Read -MCDNF. This is the class of monotone CDNF functions where each variable appears at most times in its MDNF representation and any number of times in its MCNF representation. See also [110] for other subclasses of monotone CDNF that are learnable from membership queries. Bshouty et al., [47], show that the classes MCDNF and -CDNF are learnable from membership queries and the NP-oracle.
Boolean Multivariate Polynomial: The efficient randomized learnability of multivariate polynomial follows from [68]. All the other algorithms in the literature require asking membership queries from an extension field. See for example [147].
XT, DFA, BMAF, ROF, BC, BF. No results are known for exact learning of those classes from membership queries only, except for the trivial result that when all the variables are relevant then .
Boolean Halfspace (BHS): Hegedüs, [156], shows that BHS (with zero-one weights) are adaptively learnable in polynomial time with queries. He also gives a lower bound for the number of queries. Therefore, BHS is adaptively optimally learnable. See also [266]. Hegedüs and Indyk, [163], give a non-adaptive polynomial time learning algorithm for BHS that asks queries.
Abboud et al., [8], show that BHS (Boolean Halfspaces with weights in ) is constant-round learnable in time and queries. They also gave the lower bound . Abasi et al. [7] give a non-adaptive algorithm for BHS that asks and a two-round algorithm that asks queries and runs in time . Therefore, the class BHS is adaptively efficiently learnable.
Abboud et al. [8] give a lower bound for BH (Boolean Halfspaces with weights ). Therefore, BH is non-adaptive almost optimally learnable. Just ask all the queries.
Uehara et al. study some restricted classes of BHS, [266].
Shevchenko and Zolotykh [261] studied halfspace functions over the domain when is fixed and there are no constraints on the coefficients. They gave the lower bound for learning this class from membership queries. Hegedüs [158] proves the upper bound . For fixed , Shevchenko and Zolotykh [276] gave a polynomial time algorithm (in ) for this class. Applying Theorem 3 in [158], the upper bound for the teaching dimension of a halfspace, [106], gives the upper bound .
MROF. A monotone Boolean read-once formula is a monotone formula such that every input variable appears in at most one input gate. Angluin et al. gave a polynomial time algorithm that learns MROF with queries [20, 164]. The best lower bound for the number of queries is the information theoretic lower bound that follows from Lemma 2.
Bshouty shows in [36] that MROF cannot be learned efficiently in parallel (poly time).
Other Classes: See classes of discrete functions and other classes in [34, 60, 146, 150, 153, 163, 261].
7.2 Classes of Arithmetic Functions
In this section, we give a few results from the literature on learning arithmetic classes.
-Linear Functions (-LF).
The problem of learning LF is studied in [1, 77, 101, 127, 198, 199, 207, 250]. Many authors independently proved that it is optimally learnable with
[TABLE]
queries. They do not address the time complexity, although one can show that the constructions also give simple algorithms that run in polynomial time.
The class -LF is studied in [37, 78, 79, 113, 145, 200, 204, 264, 266]. It is shown that
[TABLE]
Note here that in the literature they use to mean . In [37], Bshouty shows that it is optimally adaptively learnable. The problem is still open for the non-adaptive learning.
The problem of learning -LF is studied in [54, 69, 70, 71, 83, 97, 140, 171, 179]. Bshouty and Mazzawi, [70], show that
[TABLE]
The results are derived from non-constructive probabilistic proofs. All the learning algorithms for this class are either for restricted subclasses or randomized algorithms with success probability that depend on or non-optimal.
See other subclasses in [37, 145, 225, 236]. Similar problems are studied in other areas such as coding theory [229], compressed sensing [183], multiple access channels [50] (e.g., adder channels [98]), and combinatorial group testing [113, 114] (e.g., the coin weighing problem [37]).
-Quadratic Functions (-QF).
This problem is equivalent to learning a weighted graph from additive queries, [144], where, for an additive query, one chooses a set of vertices and asks the sum of the weights of edges with both ends in the set.
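The equivalence between additive queries and evaluating a quadratic function on a 0/1 vector can be seen directly in code. The sketch assumes a graph without self-loops, stored as a hypothetical dictionary from vertex pairs to weights.

```python
def additive_query(weights, S):
    """Sum of the weights of the edges with both endpoints in S."""
    S = set(S)
    return sum(w for (u, v), w in weights.items() if u in S and v in S)

def quadratic_value(weights, x):
    """Value of the quadratic function sum_{(u,v)} w_uv * x_u * x_v at x."""
    return sum(w * x[u] * x[v] for (u, v), w in weights.items())
```

An additive query on `S` returns the value of the quadratic function at the characteristic vector of `S`; in particular, querying a two-element set reveals the weight of a single edge.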
The -QF was studied in [1, 96, 97, 136, 144, 145, 205, 235]. The -QF for different was studied in [70, 71, 72, 83, 97]. Bshouty and Mazzawi, [70], proved that
[TABLE]
The results are derived from non-constructive probabilistic proofs. For the positive real numbers , Bshouty and Mazzawi gave in [72] a polynomial time algorithm that adaptively learns the class with queries. This is the only known deterministic adaptive algorithm that runs in polynomial time. Choi, [83], gave a polynomial time randomized adaptive learning algorithm for that asks queries.
Bshouty and Mazzawi extended some of the above results to multilinear forms of constant degree [69].
Multivariate Polynomial: This class has been extensively studied in the literature. Ben-Or and Tiwari [75] gave the first deterministic non-adaptive polynomial time learning algorithm for sparse multivariate polynomial over a large field with an optimal number of queries. See also [141, 147, 148, 185, 193].
For identity testing and zero testing of sparse multivariate polynomials see [32, 64, 148, 155, 192, 193, 272] and references therein.
Multiplicity Automata Function: This class was first defined and studied in [44]. It is efficiently learnable from queries with a randomized MC algorithm [44].
Arithmetic Circuit and Arithmetic Formula: In [267] Valiant suggests an algebraic analog of P vs. NP, the VP vs. VNP problem. A multivariate polynomial family is in VP if there exists a constant such that for all , deg and has a circuit of size bounded by . Polynomial family is in VNP if there exists a family VP such that for every
[TABLE]
Valiant shows in [267] that the permanent is complete for VNP, i.e., for every polynomial family in VNP, there is a constant such that for every , can be expressed as the permanent of a matrix of size . It is believed that VP $\neq$ VNP. This remains an outstanding open problem.
In [31], Agrawal and Vinay show that if there exists a deterministic polynomial time zero testing for arithmetic circuits of degree and depth then there exists a polynomial family , computable in exponential time, that is not in VP. So an efficient deterministic zero testing for such circuits leads to a proof of circuit subexponential lower bounds that may be beyond our proof techniques.
Kabanets and Impagliazzo show in [184] that, even if the zero testing algorithm gets the arithmetic circuit as an input (white box), the existence of a deterministic polynomial time zero testing algorithm for VP implies that either NEXP $\not\subseteq$ P/poly or VP $\neq$ VNP. Therefore, any such deterministic algorithm would resolve outstanding open problems in complexity. See [29, 255] for other negative results.
On the other hand, the following Schwartz-Zippel lemma, [241, 275], gives a very simple MC randomized optimal zero testing algorithm for any arithmetic circuit with a bounded degree
Lemma 23
(Schwartz-Zippel) Let be any non-zero polynomial of degree and . Then for selected uniformly at random from we have
[TABLE]
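The randomized zero test that follows from the Schwartz-Zippel lemma can be sketched as below. The sketch works over the prime field of size 2^31 - 1 (our choice), treats the polynomial as a black box given by a Python callable, and tests whether it vanishes identically over that field; the function name, the default number of trials, and the seed parameter are illustrative.

```python
import random

def probably_zero(poly, n_vars, p=2_147_483_647, trials=3, seed=None):
    """Monte Carlo zero test for a black-box polynomial over Z_p (p prime).

    Returns False as soon as a non-zero value is found, which certifies that
    the polynomial is not identically zero. Returns True otherwise; for a
    non-zero polynomial of degree d this happens with probability at most
    (d / p) ** trials, by the Schwartz-Zippel lemma.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        point = [rng.randrange(p) for _ in range(n_vars)]
        if poly(point) % p != 0:
            return False
    return True
```

The error is one-sided: a "not zero" answer is always correct, and the failure probability of a "zero" answer shrinks with both the field size and the number of independent trials.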
For the deterministic identity testing of arithmetic circuits of depth , restricted depth circuits, circuits that compute sparse polynomials and other restricted circuits see the results in [25, 29, 30, 63, 89, 99, 188, 194, 196, 238, 239, 240, 252, 255, 256, 258, 270] and references therein. Some other results in the literature investigate the problem of minimizing the number of random bits used for identity testing. See for example [32, 64, 193].
Arithmetic Read-Once Formulas (AROF): An Arithmetic Read-Once Formula is a formula in which each variable appears at most once. In [59] Bshouty et al. gave an MC randomized polynomial time algorithm for AROF (with the division operation) over a large enough field . In [48] Bshouty and Cleve gave a polynomial time (poly(log)) randomized parallel algorithm for this class. In [43], Bshouty and Bshouty extended the result of [59] to include the exponentiation operation. Shpilka and Volkovich in [257] gave a deterministic algorithm for learning depth AROF in time . In [256] Shpilka and Volkovich gave a deterministic learning algorithm for AROF that asks queries. They also studied the class of sums of AROFs. Recently, Volkovich gave in [214] a polynomial time algorithm for learning any AROF.
Other Classes: See other results and other classes in [1, 29, 99, 196, 232, 239, 238, 240, 256, 252, 255, 257, 258] and references therein.
8 Non-Honest Teacher
Although the aim of this survey is to summarize the results on learning from an honest teacher, we also give here some models of a non-honest teacher and some results for them.
8.1 Models of Non-Honest Teacher
In this survey, the teacher model is the honest teacher model where with a query , the teacher answers .
For a non-honest teacher, there are many models. One can consider a persistent teacher [27, 223] or a non-persistent teacher. For a persistent teacher (or permanently faulty teacher [223]), if the answer to the query is then no matter how many times the learner asks the same query the answer will be . A non-persistent teacher is a teacher that is not persistent. In the literature the following non-honest teacher models are considered (each one can be either persistent or non-persistent):
1. Incomplete Model [27]: The incomplete teacher, with a query , answers with probability and answers “?” (I DON’T KNOW) with probability . In the persistent model, repeated queries to will give the same answer with probability . In the non-adaptive model, the learner knows or some upper bound for .
2. Malicious Model [186, 227, 237, 269]: (Also called random error [227] and classification noise [172]) The malicious teacher, with a query , answers with probability and gives an arbitrary/random wrong answer with probability . The learner knows or some upper bound for .
3. Limited Incomplete Model [22]: The limited incomplete teacher gives answers “?” (I DON’T KNOW) to at most queries of its choice. In the non-adaptive model, the learner knows or some upper bound for .
4. Limited Malicious Model [22, 265]: (Also called the constant number of error model [16, 231]) The limited malicious teacher gives arbitrary/random wrong answers to at most queries of its choice. The learner knows or some upper bound for .
5. Prefix-Bounded Error Fraction Model [222]: (Also called linearly bounded model [16]) In the adaptive model, the teacher after queries can give at most wrong answers. In the -round model, at each round with queries and for any , the teacher can give wrong answers to the first queries in this round. The learner knows or some upper bound for .
6. Globally Bounded Error Fraction Model [222]: In the adaptive model, if the algorithm asks queries then the teacher can give at most wrong answers. In the -round model, at each round with queries, the teacher can give at most wrong answers. The learner knows or some upper bound for .
   Notice that in the globally bounded error fraction model the first queries can be all wrong while in the prefix-bounded error fraction model only queries of the first queries can be wrong.
7. Incomplete Prefix-Bounded Error Fraction Model: In the adaptive model, the teacher after queries can give at most “?” answers. In the -round model, at each round with queries and for any , the teacher can give “?” answers to the first queries in this round. In the non-adaptive model, the learner knows or some upper bound for .
8. Incomplete Globally Bounded Error Fraction Model [37]: In the adaptive model, if the algorithm asks queries then the teacher can give at most “?” answers. In the -round model, at each round with queries, the teacher can give at most “?” answers. In the non-adaptive model, the learner knows or some upper bound for .
9. -Sided Error Models: (Also called half-error [224], or one-sided error [231], for Boolean functions) Can be defined for any one of the above models, where the wrong or “?” answers are applied only when is in some set .
For the persistent model we define the output hypothesis to be equivalent to the target function if it agrees with the target function on all the elements of the domain except the ones for which the teacher answers “?” or gives a wrong answer.
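To make the difference between the persistent and non-persistent variants concrete, here is a small simulation sketch of the incomplete and malicious models for Boolean targets. It is only an illustration under stated assumptions: the class name `NoisyTeacher`, the fault probability `p`, and the `mode` and `persistent` parameters are names chosen here, queries are assumed to be hashable (for example tuples of bits), and the malicious teacher is modeled as simply flipping the Boolean answer.

```python
import random


class NoisyTeacher:
    """Simulated non-honest teacher for a Boolean target (illustrative only)."""

    def __init__(self, target, p, mode="malicious", persistent=True, seed=None):
        if mode not in ("malicious", "incomplete"):
            raise ValueError("mode must be 'malicious' or 'incomplete'")
        self.target = target          # honest function: query -> 0 or 1
        self.p = p                    # probability of a faulty answer
        self.mode = mode
        self.persistent = persistent
        self.rng = random.Random(seed)
        self._memo = {}               # remembered answers (persistent case)

    def query(self, x):
        # A persistent teacher repeats whatever it answered the first time.
        if self.persistent and x in self._memo:
            return self._memo[x]
        truth = self.target(x)
        if self.rng.random() < self.p:
            answer = "?" if self.mode == "incomplete" else 1 - truth
        else:
            answer = truth
        if self.persistent:
            self._memo[x] = answer
        return answer


def conjunction(x):
    # Hypothetical target used for the demonstration: x[0] AND x[1].
    return x[0] & x[1]
```

With a non-persistent teacher, repeating a query and taking a majority vote reduces the error; with a persistent teacher, repetition gains nothing, which is why the persistent model needs the relaxed notion of equivalence defined above.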
8.2 Some Results in Learning with Non-honest Teacher
In this subsection, we give some results of learning with a non-honest teacher.
Adaptively learning Var in the non-honest teacher model is equivalent to the problem of “searching with lies” [265]. Ulam [265] proposed the following game. Someone thinks of a number between one and one million (which is just less than $2^{20}$). Another person is allowed to ask up to twenty questions, to each of which the first person is supposed to answer only yes or no. Obviously, the number can be guessed by asking first: is the number in the first half-million? Then again reduce the reservoir of numbers in the next question by one-half, and so on. Finally, the number is obtained in less than $\log_2(1000000)$ questions. The number corresponds to the target variable in the class Var and each question “Is ?” corresponds to the query where if and only if .
Ulam asked the following question: Now suppose one were allowed to lie once or twice, then how many questions would one need to get the right answer? This problem is equivalent to learning the class Var in the limited malicious model. Rényi [227] asked a similar question, and therefore the game is called the Rényi-Ulam game.
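The halving strategy in Ulam's game, and a deliberately naive way to survive a single lie, can be sketched as follows. This is an illustration only and not one of the optimal strategies from the literature: the lie-tolerant version simply asks each question twice while no lie has been detected and asks a third time on disagreement, so it uses up to 41 questions. The function names and the `make_liar` helper are hypothetical.

```python
def guess_number(answer, lo=1, hi=1_000_000):
    """Honest game: answer(t) truthfully reports whether secret <= t."""
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if answer(mid):
            hi = mid          # secret lies in the lower half
        else:
            lo = mid + 1      # secret lies in the upper half
    return lo, questions


def guess_number_one_lie(answer, lo=1, hi=1_000_000):
    """Naive strategy that tolerates at most one lie in the whole game."""
    questions = 0
    lie_used = False
    while lo < hi:
        mid = (lo + hi) // 2
        reply = answer(mid)
        questions += 1
        if not lie_used:
            second = answer(mid)
            questions += 1
            if reply != second:
                # One of the two replies was the lie, so a third is truthful.
                reply = answer(mid)
                questions += 1
                lie_used = True
        if reply:
            hi = mid
        else:
            lo = mid + 1
    return lo, questions


def make_liar(secret, lie_at):
    """Responder that lies exactly on its lie_at-th answer (if reached)."""
    count = 0

    def answer(t):
        nonlocal count
        count += 1
        truth = secret <= t
        return (not truth) if count == lie_at else truth

    return answer
```

The honest search never needs more than twenty questions because each question at least halves the remaining range and one million is below $2^{20}$.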
This problem is completely solved with an asymptotically optimal number of queries in the limited malicious model [1, 137, 221, 245]. See also the references in [224] for results when the number of lies is small. Learning this class in two rounds is studied in [102, 103, 105].
The problem is solved with an asymptotically optimal number of queries in the linearly bounded model [16, 222, 259]. It is also noted by several authors that finding a non-adaptive algorithm in this model is equivalent to constructing a -error correcting code [224].
See the survey in [224] for results in other models of non-honest teacher.
For learning -MClause and -term -MDNF with a non-honest teacher see [5, 16, 80, 81, 88, 94, 95, 113, 203, 231, 271] and references therein.
9 Problems and Open Problems
In this section, we give some problems and open problems.
[TABLE]
Section 1
- 1.1. In real life problems, the target function may change in time. Define a realistic learning model for learning functions that change in time.
- 1.2. In the results of this survey and almost all papers in the literature, the space complexities of learning algorithms are polynomial in which, for many classes , is exponential in and/or other parameters that depend on the class. It is interesting to investigate learning algorithms that use small space complexity.
- 1.3. It is interesting to minimize the number of random bits used in randomized learning algorithms. See for example item 4.8.
- 1.4. It is interesting to study the exact learnability of a random function in a class from membership queries. See, for example, some models in [173, 174, 263].
- 1.5. An LV randomized non-adaptive algorithm with query complexity of complexity is an algorithm that asks at most queries and runs in expected time . So any LV randomized non-adaptive algorithm is deterministic in choosing the queries. We suggest the following definition that allows expected query complexity in non-adaptive learning algorithms: A weak LV randomized non-adaptive algorithm with complexity is a non-adaptive algorithm that (1) generates queries that are independent of the answers to the previous queries; (2) finds the target function with probability ; (3) the expected number of queries is and the expected time is .
- 1.6. In this survey, we have shown some results for the testing problems. Some of those results are not true for LV/MC randomized algorithms. For example, in deterministic algorithms, the query complexity of non-adaptive learning is equal to the minimum size equivalent test. For randomized algorithms, one can non-adaptively equivalent test the class XOR with random queries whereas learning XOR by a randomized algorithm takes at least queries. It is interesting to study MC and LV randomized equivalent test and other types of tests in the adaptive and non-adaptive model.
- 1.7. Investigate testing in the deterministic/randomized -round model.
- 1.8. There are very few results in the literature on parallel learning from membership queries. That is, learning in time. Study parallel learning.
- 1.9. To the best of my knowledge, all the Monte Carlo learning algorithms in the literature ignore minimizing the effect of the success probability in computing the number of queries. Some of the results even ignore by assuming that it is constant. It is interesting to investigate the role of in the query complexity.
- 1.10. We say that a non-adaptive algorithm is strongly non-adaptive if the queries are constructed by different learners (one query for each learner) without any communication between them. It is interesting to study this model or any model with minimum communication between the learners.
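The randomized equivalence test for XOR mentioned in item 1.6 can be sketched as follows. It relies on the standard fact that two distinct XOR (parity) functions disagree on exactly half of the points of $\{0,1\}^n$, so each uniformly random query exposes a difference with probability $1/2$. The function names and the `trials` parameter are hypothetical, and the sketch assumes that both the target and the hypothesis are XOR functions.

```python
import random


def xor_function(indices):
    """Return the parity of the variables with the given indices."""
    idx = tuple(indices)
    return lambda x: sum(x[i] for i in idx) % 2


def xor_equivalence_test(target, hypothesis, n, trials=40, rng=None):
    """Non-adaptive Monte Carlo test: all queries are drawn before any answer.

    Returns False only if a disagreement is found (always correct).
    Returns True otherwise; if the two XOR functions differ, this is
    wrong with probability 2 ** -trials.
    """
    rng = rng or random.Random()
    queries = [tuple(rng.randrange(2) for _ in range(n)) for _ in range(trials)]
    return all(target(x) == hypothesis(x) for x in queries)
```

The number of queries depends only on the desired error probability and not on the number of variables, which is what separates testing from learning for this class.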
Section 2
- 2.1. In the bound
[TABLE]
find some conditions on for which tighter bounds can be obtained.
- 2.2. Many lower bounds in the literature for are based on finding a subset of functions such that for each membership query there is an answer that eliminates at most a small fraction of the functions. The best possible bound that one can get using this technique is denoted by . In [67] Bshouty and Makhoul show that . Find a new combinatorial measure that is a lower bound for and exceeds .
- 2.3. The algorithm in (6) runs in time where is the depth of the tree. Find an algorithm with a better exponential complexity.
- 2.4. Find a non-adaptive learning algorithm that runs in time and learns using at most queries for some .
- 2.5. Study the above bounds and find approximation algorithms for randomized adaptive learning.
- 2.6. Study bounds and find approximation algorithms for adaptive learning of classes with small VC-dimension.
- 2.7. Study the above bounds and find approximation algorithms for -round learning.
- 2.8. Is an NP-oracle enough for deterministic/randomized optimal learnability? What other oracle gives learning with a minimum number of membership queries?
- 2.9. In [53] some techniques were used in the model of exact learning from membership and equivalence queries to minimize the number of equivalence queries. Can those be used to find more query-efficient algorithms?
- 2.10. Study the above bounds and find approximation algorithms for classes with small extended teaching dimension.
- 2.11. Study bounds for LV and MC randomized algorithms.
Section 3
- 3.1. The reductions in subsection 3.1 are for adaptive and non-adaptive learning. It is interesting to find reduction results for -round deterministic and randomized algorithms.
- 3.2. The reductions in subsection 3.1 are for the number of relevant variables. Find reductions for other parameters, for example, the number of terms (e.g. for MP or MDNF).
- 3.3. Find reductions that give algorithms that are optimally learnable or almost optimally learnable from .
- 3.4. Lemma 15 is implicitly used for some of the results in the literature for learning some classes. For example, the Halving algorithm is an algorithm that asks an equivalence query with “Majority” at each stage, where are the functions in that are consistent with the counterexamples seen so far. Lemma 3 is just a reduction from the Halving algorithm. It is interesting to study learnability of the classes mentioned in this survey with this technique.
- 3.5. There are many polynomial time exact learning algorithms from membership and equivalence queries in the literature for classes mentioned in this survey and others. See [2, 3, 17, 26, 33, 34, 35, 44, 58, 74, 182, 251, 253]. It is interesting to study the reduction of those algorithms to learning from membership queries only when some of the parameters of the class are restricted. For example, can the Angluin-Frazier-Pitt learning algorithm for conjunctions of Horn clauses, [17], be changed to learning from membership queries when the number of terms is bounded by or/and the size of each clause is bounded by .
- 3.6. Let be a family of functions . For we say that is an -perfect hash family (-PHF) [24] if for every subset of size there is a hash function such that is injective (one-to-one) on , i.e., . In [41] it is shown that for , there is a -PHF of size that can be constructed in time . This construction is used for many reductions in learning. It is known that there is a -PHF of size . Finding a polynomial time construction for -PHF of such size improves the query complexity of many reductions.
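The perfect hash family property in item 3.6 can be checked by brute force on small instances. The sketch below is only illustrative: it assumes the domain is `range(n)`, that each hash function is a Python callable, and that `d` is the subset size for which injectivity is required; these names are chosen here. It enumerates all subsets of size `d`, so it is exponential in `d` and is not a construction.

```python
from itertools import combinations


def is_perfect_hash_family(family, n, d):
    """Check that every d-subset of range(n) is mapped injectively
    by at least one function in family."""
    for subset in combinations(range(n), d):
        if not any(len({h(i) for i in subset}) == d for h in family):
            return False
    return True


def bit(j):
    """Hash function mapping i to its j-th binary digit (range size 2)."""
    return lambda i: (i >> j) & 1
```

For example, over a domain of four elements the two bit functions separate every pair, since two distinct numbers below four differ in at least one of their two binary digits, while a single bit function does not.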
Section 4
- 4.1. Non-adaptive randomized algorithms have been proposed in [46, 66, 114, 127, 161, 165]. The following models are studied in the literature for constructing the random test matrix :
  – Random incidence design (RID algorithms). The entries in are chosen randomly and independently to be with probability and [math] with probability .
  – Random $r$-size design (RrSD algorithms). The rows in are chosen randomly and independently from the set of all vectors of weight .
  – Random $k$-set design (RkSD algorithms). The columns in are chosen randomly and independently from the set of all vectors of weight .
  Find lower and upper bounds for the constant in of the number of membership queries for the above non-adaptive learning algorithms.
- 4.2. Find a polynomial time -round algorithm for learning -MClause that asks queries.
- 4.3. Find a deterministic non-adaptive learning algorithm for -MClause that asks queries.
- 4.4. A construction of a -disjunct matrix is called globally explicit construction if it is deterministic polynomial time in the size of the construction. A locally explicit construction is a construction where one can find any entry in the construction in deterministic poly-log time in the size of the construction. In particular, a locally explicit construction is also globally explicit. The constructions in the literature for -disjunct matrices are globally explicit constructions. Find a locally explicit construction of -disjunct matrix of size .
- 4.5. There are few results in the literature about learning -MClause when unknown to the learner. It is interesting to study this problem.
- 4.6. Let be a set of functions . Define -MClause() the set of all functions where and for all . Study the learnability of the class -MClause().
- 4.7. Study the learnability of the class of monotone clauses with constant number of negated variables.
- 4.8. Any deterministic algorithm for non-adaptive learning -MClause has query complexity while there is a Monte Carlo non-adaptive learning algorithm that asks queries only and uses random bits. What is the tradeoff between the number of random bits and the query complexity?
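The three random designs listed in item 4.1 can be generated as follows. This is a sketch under assumptions: `m` is the number of queries (rows), `n` the number of variables (columns), `p` the entry probability in the RID model, `r` the row weight in the RrSD model and `k` the column weight in the RkSD model; these parameter names are chosen here. The last two functions show how a monotone clause answers the rows of a matrix and a naive elimination step that keeps every variable not ruled out by a zero answer; they are illustrative and not an algorithm from the survey.

```python
import random


def rid_matrix(m, n, p, rng):
    """Random incidence design: each entry is 1 independently with prob. p."""
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]


def rrsd_matrix(m, n, r, rng):
    """Random r-size design: each row is a uniformly random weight-r vector."""
    rows = []
    for _ in range(m):
        ones = set(rng.sample(range(n), r))
        rows.append([1 if j in ones else 0 for j in range(n)])
    return rows


def rksd_matrix(m, n, k, rng):
    """Random k-set design: each column is a uniformly random weight-k vector."""
    matrix = [[0] * n for _ in range(m)]
    for j in range(n):
        for i in rng.sample(range(m), k):
            matrix[i][j] = 1
    return matrix


def clause_answers(matrix, relevant):
    """Answers of a monotone clause (OR of the relevant variables) per row."""
    return [int(any(row[j] for j in relevant)) for row in matrix]


def candidate_variables(matrix, answers):
    """Variables not appearing in any row answered 0 (superset of the clause)."""
    n = len(matrix[0])
    ruled_out = set()
    for row, ans in zip(matrix, answers):
        if ans == 0:
            ruled_out.update(j for j in range(n) if row[j])
    return set(range(n)) - ruled_out
```

The elimination step always returns a superset of the relevant variables; how many rows are needed before that superset is exact with high probability is precisely the kind of constant the item asks about.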
Section 5
- 5.1. Find strong learning algorithms for -term -MDNF with the parameter or/and .
- 5.2. Many results in the literature for learning sub-classes of -term -MDNF are query-efficient, but are not time-efficient. It is interesting to find polynomial time learning algorithms for those classes.
- 5.3. Find a non-adaptive efficient learning algorithm for the class -term -MDNF when .
- 5.4. Angluin and Chen gave in [15] a polynomial time -round Las Vegas algorithm for learning -term -MDNF that asks queries. Can this class be learned in -round with queries?
- 5.5. Find ${\rm OPT}_{R\text{-}{\rm RAD}}(s\text{-term }2\text{-MDNF})$ for .
- 5.6. Give an optimal learning algorithm for -term -MDNF for constant .
- 5.7. The class of Read-Once -MDNF is equivalent to learning matchings [11]. Alon et al. gave bounds for deterministic, randomized and -round learning of this class. Extend the results to other related classes such as Read-Once -MDNF, Read-Twice -MDNF and Read-Once -DNF.
Section 6
- 6.1. We show that
[TABLE]
Close the gap between the lower and upper bound.
- 6.2. What are the query complexities of the randomized learning algorithms for in [68, 251]?
- 6.3. The deterministic adaptive algorithm of Kushilevitz-Mansour [187] asks queries. Find a more query-efficient algorithm.
- 6.4. Find a proper learning algorithm for . Can be learned from ?
- 6.5. Let be a finite set and be any set. One of the important representations of functions is the decision tree over the alphabet with output . A decision tree over with output is defined as follows: The constant functions are decision trees. If is a decision trees for and is a partition of then, for all ,
[TABLE]
is a decision tree (can also be expressed as ). Here if and [math] if . Every decision tree can be represented as a tree . If for some then is a node labeled with . If is as in (6.5.), then has a root labeled with and has outgoing edges. The th edge is labeled with and is pointing to the root of . See for example the decision tree of tastes preference in Figure 1.
Find an efficient learning algorithm for decision trees over large alphabet.
- 6.6. Find an efficient deterministic non-adaptive learning algorithm for .
- 6.7. Study the learnability of , MDT$_{d,s}$ and DL.
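The tree representation described in item 6.5 (a root labeled with a variable, one outgoing edge per part of a partition of the alphabet, constants at the leaves) can be sketched as a small data structure with an evaluation routine. The class names `Leaf` and `Node`, the toy alphabet `{0, 1, 2}` and the example tree are all hypothetical and serve only to illustrate the definition.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class Leaf:
    value: Any                      # constant output of the tree


@dataclass
class Node:
    var: int                        # index of the variable tested at this node
    branches: Tuple                 # pairs (set_of_symbols, subtree)


def is_partition(node, alphabet):
    """Check that the edge labels of a node partition the alphabet."""
    parts = [part for part, _ in node.branches]
    covered = set().union(*parts)
    return covered == set(alphabet) and sum(len(p) for p in parts) == len(covered)


def evaluate(tree, x):
    """Follow, at each node, the edge whose label contains x[node.var]."""
    while isinstance(tree, Node):
        symbol = x[tree.var]
        for part, child in tree.branches:
            if symbol in part:
                tree = child
                break
        else:
            raise ValueError("edge labels do not cover symbol %r" % (symbol,))
    return tree.value


# Toy tree over the alphabet {0, 1, 2} with outputs "A", "B", "C".
EXAMPLE_TREE = Node(0, (
    ({0}, Leaf("A")),
    ({1, 2}, Node(1, (
        ({0, 1}, Leaf("B")),
        ({2}, Leaf("C")),
    ))),
))
```

Each evaluation reads at most one symbol per level, so the cost of a single evaluation is bounded by the depth of the tree times the number of edges per node.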
Section 7
- 7.1. The randomized MC query complexity of -Term is less than the deterministic query complexity. It is interesting to study -round LV randomized algorithms for this class.
- 7.2. Close the gap between the upper bound and the lower bound of -MJunta.
- 7.3. Study the learnability of the subclasses of DNF and CDNF defined in this survey.
- 7.4. Find and for adaptive and non-adaptive algorithms.
- 7.5. Study the learnability of the classes XT, DFA and BMAF.
- 7.6. Find . The current upper bound is , and the lower bound is .
- 7.7. Find MROF.
- 7.8. Study the learnability of the conjunction and disjunction of two MROF.
- 7.9. Find a non-adaptive algorithm for -LF (resp. -QF) with queries.
- 7.10. Find a randomized algorithm for -LF (resp. -QF) with an optimal number of queries with success probability .
- 7.11. Find a deterministic efficient learning algorithm for multiplicity automata functions.
References
- [1] M. Aigner. Combinatorial search. Wiley Teubner Series on Applicable Theory in Computer Science. Teubner, Stuttgart. (1988).
- [2] D. Angluin. Queries and concept learning. Machine Learning. 2(4). pp. 319–342. (1988).
- [3] D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation. 75. pp. 87–106. (1987).
- [4] D. Angluin. Queries revisited. ALT 2001. pp. 12–31. (2001).
- [5] R. Ahlswede, H. K. Aydinian. New construction of error-tolerant pooling designs. Information Theory, Combinatorics, and Search Theory. pp. 534–542. (2013).
- [6] N. Alon, V. Asodi. Learning a hidden subgraph. SIAM J. Discrete Math. 18(4). pp. 697–712. (2005).
- [7] H. Abasi, A. Z. Abdi, N. H. Bshouty. Learning Boolean halfspaces with small weights from membership queries. ALT 2014. pp. 96–110. (2014).
- [8] E. Abboud, N. Agha, N. H. Bshouty, N. Radwan, F. Saleh. Learning Threshold functions with small weights using membership queries. COLT 1999. pp. 318–322. (1999).
