Exact Learning from an Honest Teacher That Answers Membership Queries

Nader H. Bshouty

arXiv:1706.03935·cs.LG·June 14, 2017


TL;DR

This paper surveys methods for exactly learning functions from an honest teacher through membership queries, highlighting known results, techniques, and open challenges in the field.

Contribution

It provides a comprehensive overview of existing literature, techniques, and open problems in exact learning from membership queries.

Findings

01

Summarizes key results in exact learning from membership queries.

02

Discusses various techniques used in the literature.

03

Identifies open problems and future research directions.

Abstract

Consider a teacher that holds a function $f:X\to R$ from some class of functions $C$. The teacher can receive from the learner an element $d$ in the domain $X$ (a query) and returns the value of the function at $d$, $f(d)\in R$. The learner’s goal is to find $f$ with a minimum number of queries, optimal time complexity, and optimal resources. In this survey, we present some of the results known from the literature, different techniques used, some new problems, and open problems.

Equations

$C^{D}=\{\overline{f(\bar{x}_{1},\ldots,\bar{x}_{n})}\ |\ f\in C\}.$

$f^{\prime}=[\mbox{if}\ x_{i}=0\ \mbox{then}\ f_{0}\ \mbox{else}\ f_{1}]$

$f=\bigvee_{i=1}^{s}T_{i}$

$f=M_{1}+M_{2}+\cdots+M_{s}$

$f(x_{1},\ldots,x_{n})=A_{1}(x_{1})A_{2}(x_{2})\cdots A_{n}(x_{n})$

$f(x_{1},\ldots,x_{n})=\left\{\begin{array}{ll}1&\mbox{if}\ w_{1}x_{1}+w_{2}x_{2}+\cdots+w_{n}x_{n}\geq u\\ 0&\mbox{otherwise}\end{array}\right.$

$f=\sum_{i\in I}a_{i}x_{1}^{i_{1}}\cdots x_{n}^{i_{n}}$

$\begin{array}{rcccl}&&{\rm AF}&\rightarrow&{\rm AC}\\ &\nearrow&&\nearrow&\\ {\rm MP}&\rightarrow&{\rm MAF}&&\end{array}$

${\rm OPT}_{\rm NAD}^{\rm ET}(C)=2\cdot{\rm OPT}_{\rm NAD}(C).$

${\rm OPT}_{\rm AD}^{\rm ET}(C)=2\cdot{\rm OPT}_{\rm AD}(C).$

${\rm TD}(H,C)=\max_{h\in H}{\rm OPT}^{\rm HS}(C-h)$

${\rm OPT}_{\rm NAD}(C)={\rm OPT}^{\rm HS}(C-C)$

${\rm OPT}_{\rm AD}(C)\geq\log|C|.$

${\rm OPT}_{\rm AD,LV}(C)\geq{\rm OPT}_{\rm AD,MC}(C)\geq\log|C|-1.$

$|\{f\in C\ |\ (\forall x\in T_{h})\ h(x)=f(x)\}|\leq 1.$

${\rm ETD}(C)=\max_{h\in 2^{X}}T(C,h).$

${\rm ETD}(C)\leq|C|.$

${\rm ETD}(C)\leq{\rm TD}(2^{X},C)\leq{\rm ETD}(C)+1.$

${\rm OPT}_{\rm AD}(C)\leq\frac{{\rm ETD}(C)}{\log{\rm ETD}(C)}\log|C|+{\rm ETD}(C)\leq\frac{2\cdot{\rm ETD}(C)}{\log{\rm ETD}(C)}\log|C|$

${\rm OPT}_{\rm AD}(C)\geq\max({\rm ETD}(C),\log|C|).$

$\epsilon|C_{i}|\leq|\{f\in C_{i}\ |\ f(a)=0\}|\leq(1-\epsilon)|C_{i}|.$

$|C_{i+1}|\leq(1-\epsilon)|C_{i}|.$

$|C_{i+1}|\leq\epsilon|C_{i}|.$

$\frac{\log|C|-k\log(1/\epsilon)}{\log(1/(1-\epsilon))}$

$(1+o(1))\frac{{\rm ETD}(C)}{\log{\rm ETD}(C)}\log|C|.$

${\rm OPT}_{\rm AD}(C_{t,\ell})=\Omega\left(\frac{{\rm ETD}(C_{t,\ell})}{\log{\rm ETD}(C_{t,\ell})}\log|C_{t,\ell}|\right).$

${\rm OPT}_{\rm AD}(C)=\min_{x\in X}\max({\rm OPT}_{\rm AD}(C_{x,0}),{\rm OPT}_{\rm AD}(C_{x,1}))$

$\min\left(1+\frac{\log|C|}{\log{\rm ETD}(C)},\frac{{\rm ETD}(C)}{\log|C|}+\frac{{\rm ETD}(C)}{\log{\rm ETD}(C)}\right)$

Full text

Technion, Haifa, Israel

[email protected]

Exact Learning from an Honest Teacher

That Answers Membership Queries

Nader H. Bshouty


1 Introduction
1.1 The Learning Model
1.2 Domain and Range
1.3 Classes of Functions
1.4 Learning Algorithms and Complexity
1.5 Polynomially, Efficiently and Optimally Learnable
1.6 Strongly Polynomially, Efficiently and Optimally Learnable
1.7 Testing Problems
2 Bounds on OPT for Boolean Functions and Algorithms
2.1 OPT for Adaptive Algorithms
2.2 Constructing Adaptive Algorithms
2.3 OPT for Non-Adaptive Algorithms
2.4 OPT for Classes of Small VC-dimension
3 Reductions
3.1 Reductions for Adaptive Algorithms
3.2 Reductions for Strong Adaptive Algorithms
3.3 Reductions for Non-Adaptive Algorithms
3.4 Reductions from the Exact Learning Model
3.5 Reductions from the PAC Learning Model
4 Learning $d$-MClause and Group Testing
4.1 Group Testing and Applications
4.2 Known Results for Learning $d$-MClause
4.3 Bounds for OPT($d$-MClause)
4.4 Non-Adaptive Learning $d$-MClause
4.5 Adaptive Learning $d$-MClause
4.6 Two-Round Learning
4.7 Other Related Problems
5 Learning $s$-Term $r$-Monotone DNF
5.1 Learning a Hypergraph and its Applications
5.2 Cover Free Families
5.3 Non-Adaptive Learning $s$-Term $r$-MDNF
5.4 Adaptive Learning $s$-Term $r$-MDNF
5.5 Learning Subclasses of $s$-term $r$-MDNF
6 Learning Decision Tree
6.1 Bounds on ${\rm OPT}({\rm DT}_{d})$
6.2 Adaptive Learning Decision Tree
6.3 Non-Adaptive Learning Decision Tree
7 Other Results
7.1 Other Boolean Classes
7.2 Classes of Arithmetic Functions
8 Non-Honest Teacher
8.1 Models of Non-Honest Teacher
8.2 Some Results in Learning with Non-honest Teacher
9 Problems and Open Problems

1 Introduction

Robert Dorfman’s 1943 paper introduced the field of group testing. The motivation arose during the Second World War, when the United States Public Health Service and the Selective Service embarked upon a large-scale project to weed out all syphilitic men called up for induction. Syphilis testing was expensive at the time, and testing every soldier individually would have been costly and inefficient. A basic test proceeds as follows: draw a sample from a given individual, perform the required tests, and determine the presence or absence of syphilis. With $n$ soldiers, this method requires $n$ tests. Our goal is effective testing in a scenario where it does not make sense to test $100,000$ people to find (say) $10$ positives. The feasibility of a more effective testing scheme hinges on the following property: we can combine blood samples and test the combined sample to check whether at least one soldier has syphilis [277].

Let $S$ be the set of the $n$ soldiers and let $I\subseteq S$ be the set of the sick soldiers. Suppose we know that the number of sick soldiers, $|I|$, is bounded by some integer $d$. If $T$ is the set of soldiers whose blood samples are combined, then the test is positive if and only if $I\cap T$ is not empty. Thus, we can regard the set of sick soldiers $I$ as a Boolean function $f_{I}:2^{S}\to\{0,1\}$ and the answer to the test “Is $I\cap T$ nonempty?” as $f_{I}(T)=1$ if and only if $I\cap T\neq\emptyset$. The goal is to identify the function $f_{I}$ (and therefore the sick soldiers) with a minimal number of substitutions (tests) and in optimal time. We can also identify the set of soldiers with the set $[n]:=\{1,2,\ldots,n\}$ and regard each test as an assignment $a\in\{0,1\}^{n}$, where $a_{i}=1$ if and only if the blood of the $i$th soldier is in the test. With this encoding, the set $S=\{0,1\}^{n}$ is the set of all possible tests. The set of sick soldiers $I\subseteq[n]$ corresponds to a Boolean function $f^{\prime}_{I}:S\to\{0,1\}$ where $f^{\prime}_{I}(x_{1},\ldots,x_{n})=\bigvee_{i\in I}x_{i}$ and $\vee$ is the Boolean or (disjunction). So this problem is also equivalent to the problem of identifying a hidden Boolean disjunction of up to $d$ variables with a minimal number of substitutions and in optimal time.

Another interesting problem is learning a decision tree with a minimal number of queries. Suppose someone owns a restaurant and wants to learn each customer’s taste preferences in food. For every customer, she offers a sample of a meal that the customer has never ordered before and then receives some feedback. The customer’s taste preferences depend on some attributes of the food, for example, “sweet”, “sour”, “salty”, “umami”, “bitter”, “greasy”, “hot”, etc. The goal is to learn the customer’s taste preferences from a minimal number of samples. Each sample can be regarded as a set of attributes. The customer’s taste preference is the objective function: it depends on the attributes, and its value is the customer’s feedback. In many cases, the target function can be described as a decision tree. See the example in Figure 1.

In the following subsection, we give a framework for the above problems and many other similar problems.

1.1 The Learning Model

Let the domain (instance space) be the set $X_{n}\in\{X_{j}\}_{j\geq 1}$ and the range be the set $R_{n}\in\{R_{j}\}_{j\geq 1}$. Let $C_{n}$ be a class of representations of functions $f:X_{n}\to R_{n}$ (target class, concept class). We are given a teacher (black box, opponent player, responder) that holds a (target) function (concept) $f$ from the class $C_{n}$. The learner (player, questioner) can ask the teacher membership queries (for Boolean functions, i.e., $R_{n}=\{0,1\}$) or substitution queries (for non-Boolean functions); that is, it can send the teacher an element $d$ of the domain $X_{n}$, and the teacher returns $f(d)$. The learner knows $\{C_{j},X_{j},R_{j}\}_{j\geq 1}$. Our (the learner’s) ultimate goal is to write an (exact) learning algorithm that learns $C=\cup_{j\geq 1}C_{j}$ with a minimum number of queries and optimal resources. That is,

1. Input: The learning algorithm receives the input $n$ and has access to an oracle ${\rm MQ}_{f}$ that answers membership/substitution queries for the target function $f\in C_{n}$.

2. Query complexity: It asks the teacher a minimum number of membership/substitution queries.

3. Exact learning: It either learns (finds, outputs) $g\in C_{n}$ such that $g$ is logically equivalent to $f$, $g=f$ (proper learning), or learns $h\in H_{n}\supseteq C_{n}$ such that $h=f$ (non-proper learning from $H_{n}$).

4. Resource complexity: It runs in linear/polynomial/optimal time complexity, with optimal space complexity, an optimal number of random bits, and/or other optimal resources.
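For a small finite class, the requirements above can be met by brute force. The sketch below is an illustration under stated assumptions: the class is given explicitly as a dictionary of named Python functions, the target is guaranteed to be in the class, and the query point is chosen by a greedy "most balanced split" heuristic, which is not claimed to be the optimal strategy. The name `exact_learn` and the example class are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[int, ...]


def exact_learn(mq: Callable[[Point], int],
                hypotheses: Dict[str, Callable[[Point], int]],
                domain: Sequence[Point]) -> Tuple[str, int]:
    """Proper exact learning of a finite Boolean class from membership queries.

    Keeps the set of hypotheses consistent with all answers so far and stops
    when the survivors agree on every point of the domain.
    """
    alive = list(hypotheses)  # names of hypotheses still consistent
    queries = 0
    while True:
        best, best_split = None, 0
        for x in domain:
            ones = sum(hypotheses[name](x) for name in alive)
            split = min(ones, len(alive) - ones)  # hypotheses removed in the worst case
            if split > best_split:
                best, best_split = x, split
        if best is None:  # no point separates the survivors: they are all equal to f
            return alive[0], queries
        answer = mq(best)  # one membership query to the teacher
        queries += 1
        alive = [name for name in alive if hypotheses[name](best) == answer]


n = 8
domain = list(product((0, 1), repeat=n))
# Hypothetical example class: the n projection functions x_1, ..., x_n.
var_class = {f"x{i + 1}": (lambda a, i=i: a[i]) for i in range(n)}
learned, used = exact_learn(var_class["x6"], var_class, domain)
print(learned, used)
```

The sketch scans the whole domain at every step, so it addresses only the query-complexity requirement; its running time is exponential in $n$.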

The following decision problems are also considered in the literature:

1. Equivalent test: Given two teachers, each holding a function from $C_{n}$, test whether the two functions are equivalent.

2. Identity test from $H_{n}$: Given a teacher that holds a function $f$ from $C_{n}$ and a function $h\in H_{n}$, test whether $f=h$.

3. Zero test: Given a teacher that holds a function $f$ from $C_{n}$, test whether $f=0$.
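A zero test for a small finite class can be sketched with a set of query points on which every nonzero function of the class takes a nonzero value at least once. The code below is a brute-force illustration, assuming the class is listed explicitly; the greedy selection and the names `greedy_hitting_set` and `zero_test` are hypothetical and not taken from the survey.

```python
from itertools import combinations, product
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, ...]
Func = Callable[[Point], int]


def greedy_hitting_set(nonzero_functions: Sequence[Func], domain: Sequence[Point]) -> List[Point]:
    """Greedily pick points until every given function is nonzero on some chosen point."""
    unhit = list(nonzero_functions)
    chosen: List[Point] = []
    while unhit:
        best = max(domain, key=lambda x: sum(1 for f in unhit if f(x) != 0))
        if all(f(best) == 0 for f in unhit):
            raise ValueError("a listed function is identically zero on the domain")
        chosen.append(best)
        unhit = [f for f in unhit if f(best) == 0]
    return chosen


def zero_test(mq: Func, points: Sequence[Point]) -> bool:
    """Answer 'f = 0' iff the teacher returns 0 on every point of the set."""
    return all(mq(x) == 0 for x in points)


n = 4
domain = list(product((0, 1), repeat=n))
# Hypothetical class: sums over F_2 of one or two variables.
xor_class = [
    (lambda a, S=S: sum(a[i] for i in S) % 2)
    for k in (1, 2)
    for S in combinations(range(n), k)
]
points = greedy_hitting_set(xor_class, domain)
print(len(points), "queries suffice for this class")
```

The test is non-adaptive: the query points depend only on the class, not on the teacher’s answers.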

The number of queries (query complexity) and the resources complexities are expressed as functions in $n$ and some other parameters that depend on the class being learned. In the literature, there are many other variations of the above problems, and we will mention some of them in this survey.

This problem has different names in different areas: Conditional and Unconditional Tests [208], Combinatorial Search [177], Interpolation [75], Combinatorial Group Testing [113], Exact Learning from Membership Queries [2], Inferring [152], Identifying [146], Test Recognition [138], Active Learning [243], Reconstruction [195] and Guessing Game [273]. The decision problems are also called Testing, Functional Verification, Teaching, and Hitting Set, and when $f$ is a polynomial, the problem is called Black Box polynomial identity testing (PIT) [239, 255].

There are many other learning models, but, throughout this survey, when we say exact learning or learning we mean exact learning from membership queries or substitution queries only.

In this survey, we present some of the results known from the literature, different techniques used and some open problems.

1.2 Domain and Range

Throughout this survey, we will omit the subscript $n$ from $C_{n},X_{n}$ and $R_{n}$ . In principle, the domain $X$ and the range $R$ can be any two sets, but since mathematical models can explain many natural phenomena, most of the sets considered in the literature are either finite or have some algebraic structure such as rings, fields, integers and real numbers.

Therefore, the domains and ranges considered in the literature are the following. The Boolean set can be either $\{0,1\}$, $\{-1,+1\}$, $\{+,-\}$ or the binary field $F_{2}$. The finite discrete set can be any finite set or a finite set with some algebraic structure such as the ring $Z_{n}$ of integers modulo $n$, or the finite field $F_{q}$ with $q$ elements (where $q$ is a prime power). The infinite discrete set can be any countably infinite set such as the set of integers $Z$ or the set of rational numbers $Q$. The infinite (uncountable) set can be any set with some algebraic structure such as the real numbers $\Re$ or the complex numbers ${\cal C}$. The Cartesian product of any finite number of the above sets is also considered in the literature.

1.3 Classes of Functions

In this section, we will list the most studied classes in the literature, in different fields of computer science.

Boolean Function Classes: When the range of the function is $R_{n}=\{0,1\}$, we call the function a Boolean function. Here we will consider classes $C$ of Boolean functions where the domain is $X_{n}=\{0,1\}^{n}$. For any class defined below, when we say that $f$ is $C$, we mean that $f\in C$. Abusing the terminology, every $f\in C$ is regarded both as a representation of the function (a formula) and as the function itself, and we will use the two interchangeably.

The most studied classes in the literature are:

1. Variable (Var): The class Var is the class of functions $\{x_{1},\ldots,x_{n}\}$, where for $a\in\{0,1\}^{n}$, $x_{i}(a)=a_{i}$. We also define Lit $=\{x_{1},\ldots,x_{n}\}\cup\{\bar{x}_{1},\ldots,\bar{x}_{n}\}$, the class of literals. Here $\bar{x}$ is the logical negation of $x$.

Learning the class Var is equivalent to playing the Rényi-Ulam game [224, 227, 265].

2. $d$-Monotone Clause ($d$-MClause) and MClause: The class $d$-MClause is the class of all functions $f_{S}:\{0,1\}^{n}\to\{0,1\}$ where $S\subseteq[n]:=\{1,2,\ldots,n\}$ and $|S|\leq d$ such that $f_{S}(x_{1},\ldots,x_{n})=1$ if and only if $x_{i}=1$ for some $i\in S$. When $S=\emptyset$, $f_{\emptyset}=0$. Such a function can also be expressed as a logic monotone clause $f_{S}=x_{i_{1}}\vee\cdots\vee x_{i_{k}}$ where $S=\{i_{1},\ldots,i_{k}\}$, $k\leq d$ and $\vee$ is the logic “or” function (disjunction). We denote $n$-MClause by MClause.

Learning $d$-MClause is equivalent to group testing [111, 113, 114]. See many other equivalent problems in [226] and the references therein.

3. $d$-Clause and Clause: The class $d$-Clause is the class of all functions $f_{S,R}:\{0,1\}^{n}\to\{0,1\}$ where $S\cap R=\emptyset$, $S\cup R\subseteq[n]$ and $|S\cup R|\leq d$ such that $f_{S,R}(x_{1},\ldots,x_{n})=1$ if and only if $x_{i}=1$ for some $i\in S$ or $x_{j}=0$ for some $j\in R$. Such a function can be expressed as a logic clause $f_{S,R}=x_{i_{1}}\vee\cdots\vee x_{i_{k}}\vee\bar{x}_{j_{1}}\vee\cdots\vee\bar{x}_{j_{r}}$ where $S=\{i_{1},\ldots,i_{k}\}$, $R=\{j_{1},\ldots,j_{r}\}$, and $r+k\leq d$. We denote $n$-Clause by Clause.

4. $d$-Monotone Term ($d$-MTerm), $d$-Term, MTerm and Term: The same as the above classes, but with $\vee$ replaced by the logic “and” function $\wedge$ (conjunction). The functions in MTerm are sometimes called monomials, and the class MTerm is also denoted by Monomial. That is, a monomial is a conjunction of variables, i.e., $x_{j_{1}}\wedge x_{j_{2}}\wedge\cdots\wedge x_{j_{r}}$ where $1\leq j_{1}<j_{2}<\cdots<j_{r}\leq n$. Here we will sometimes use the arithmetic $\times$ of the field $F_{2}$ for $\wedge$ and write $x_{j_{1}}\wedge x_{j_{2}}\wedge\cdots\wedge x_{j_{r}}$ as $x_{j_{1}}x_{j_{2}}\cdots x_{j_{r}}$.

For a class $C$, the dual class of $C$ is the class

$$C^{D}=\{\overline{f(\bar{x}_{1},\ldots,\bar{x}_{n})}\ |\ f\in C\}.$$

Obviously, $(C^{D})^{D}=C$, $d$-Clause$^{D}=d$-Term and $d$-MClause$^{D}=d$-MTerm.

5. $d$-XOR and XOR: The same as the $d$-Term class, but with $\wedge$ replaced by the logic exclusive or function $\oplus$. Here, we will instead use the arithmetic $+$ of the finite field $F_{2}=\{0,1\}$. Since $\bar{x}=x+1$, every function in XOR is of the form $f=x_{i_{1}}+\cdots+x_{i_{k}}+\xi$ where $1\leq i_{1}<i_{2}<\cdots<i_{k}\leq n$ and $\xi\in\{0,1\}$.

6. $d$-Junta: Let $f:\{0,1\}^{n}\to\{0,1\}$. A variable $x_{i}$ is said to be relevant in $f$ if there are two assignments $a,b\in\{0,1\}^{n}$ such that $a_{i}\neq b_{i}$, $a_{j}=b_{j}$ for all $j\neq i$, and $f(a)\neq f(b)$. The class $d$-Junta is the class of all Boolean functions with at most $d$ relevant variables. Such a function can be represented by a truth table of size $2^{d}$ over its relevant variables.

7. $d$-MJunta: For two assignments $a,b\in\{0,1\}^{n}$ we write $a\leq b$ if $a_{i}\leq b_{i}$ for every $i$. A Boolean function $f:\{0,1\}^{n}\to\{0,1\}$ is monotone if for every two assignments $a,b\in\{0,1\}^{n}$, $a\leq b$ implies $f(a)\leq f(b)$. It is easy to see that monotone functions are closed under disjunction and conjunction. That is, if $f$ and $g$ are monotone functions then $f\wedge g$ and $f\vee g$ are monotone functions.

The class $d$-MJunta is the class of all monotone functions in $d$-Junta. That is, the class of all monotone functions with at most $d$ relevant variables.

8. Decision Tree (DT): One of the important representations of Boolean functions $f:\{0,1\}^{n}\to\{0,1\}$ is the decision tree. A decision tree formula is defined as follows: The constant functions $0$ and $1$ are decision trees. If $f_{0}$ and $f_{1}$ are decision trees then, for all $i$,

$$f^{\prime}=[\mbox{if}\ x_{i}=0\ \mbox{then}\ f_{0}\ \mbox{else}\ f_{1}]$$

is a decision tree (it can also be expressed as $f^{\prime}=x_{i}f_{1}\vee\bar{x}_{i}f_{0}$ or $f^{\prime}=x_{i}f_{1}+\bar{x}_{i}f_{0}$). Every decision tree $f^{\prime}$ can be represented as a tree $T(f^{\prime})$. If $f^{\prime}=1$ or $0$ then $T(f^{\prime})$ is a node labeled with $1$ or $0$, respectively. If $f^{\prime}=$ [if $x_{i}=0$ then $f_{0}$ else $f_{1}$], then $T(f^{\prime})$ has a root labeled with $x_{i}$ and has two outgoing edges. The first edge is labeled with $0$ and points to the root of $T(f_{0})$, and the second is labeled with $1$ and points to the root of $T(f_{1})$. See Figure 2.

The depth of the decision tree $f^{\prime}$ is the depth of the tree $T(f^{\prime})$, that is, the number of edges on the longest path from the root to a leaf in the tree. The size of the decision tree $f^{\prime}$ is the number of leaves in $T(f^{\prime})$, that is, the number of nodes in $T(f^{\prime})$ that are labeled with $0$ or $1$.

Every Boolean function $f:\{0,1\}^{n}\to\{0,1\}$ can be represented as a DT. The representation is not unique. The following are subclasses of DT.

(a) Depth $d$ Size $s$ Decision Tree (${\rm DT}_{d,s}$): The class ${\rm DT}_{d,s}$ is the class of all decision trees of depth at most $d$ and size at most $s$.

(b) Depth $d$ Decision Tree (${\rm DT}_{d}$): The class ${\rm DT}_{d}$ is the class of all decision trees of depth at most $d$. That is, ${\rm DT}_{d}={\rm DT}_{d,2^{d}}$.

(c) Monotone DT (${\rm MDT}_{d,s}$, ${\rm MDT}_{d}$): Functions in the above classes that are monotone.

(d) Decision List (DL) [228]: Functions $f\in$ DT where every internal node in $T(f)$ points to at least one leaf.

(e) Depth $d$ Decision List ($d$-DL): The class of decision lists of depth at most $d$.

Learning decision trees is equivalent to solving problems in databases, decision table programming, concrete complexity theory, switching theory, pattern recognition, and taxonomy [206], and computer vision [23].

9. Disjunctive Normal Form (DNF): A DNF is another important representation of a Boolean function $f:\{0,1\}^{n}\to\{0,1\}$. A DNF formula is a formula of the form

$$f=\bigvee_{i=1}^{s}T_{i}$$

where each $T_{i}\in$ Term is a term. The size of $f$ is $s$.

Every Boolean function $f:\{0,1\}^{n}\to\{0,1\}$ can be represented as a DNF. The representation is not unique. It is easy to see that every decision tree of size $s$ can be represented as DNF of size at most $s$ .

The subclasses of DNF considered in the literature are:

(a) $r$-DNF: The class of DNFs with terms from $r$-Term.

(b) $s$-term DNF: The class of DNFs with at most $s$ terms.

(c) $s$-term $r$-DNF: The class of DNFs with at most $s$ terms, each of which is an $r$-Term.

(d) Read-Once $C$: Here $C$ is one of the above classes. Read-Once $C$ is the class of functions $f$ in $C$ where each variable appears at most once in $f$.

(e) Read-Twice, Read-Thrice, Read-$t$ $C$: The class of functions $f$ in $C$ where each variable appears at most twice (resp. three times and $t$ times) in $f$.

10. Monotone DNF (MDNF): The class MDNF is the class of DNFs with monotone terms (i.e., terms in MTerm). Every monotone function (see the definition in item 7) has a monotone DNF representation. This representation is one of the most popular canonical structures for representing Boolean functions. If $f=M_{1}\vee M_{2}\vee\cdots\vee M_{s}$ where each $M_{i}$ is a monomial and no two monomials $M_{i}$ and $M_{j}$, $i\neq j$, satisfy $M_{i}\wedge M_{j}=M_{i}$, then we say that $f$ is a reduced monotone DNF. Every monotone Boolean function $f$ has a unique representation as a reduced monotone DNF [1]. This representation is uniquely determined by the minterms of the function, that is, the assignments $a\in\{0,1\}^{n}$ where $f(a)=1$ and flipping any entry that is $1$ in $a$ to $0$ changes the value of the function to zero. Each minterm $a$ of $f$ corresponds, one-to-one, to a monomial $M=\wedge_{a_{i}=1}x_{i}$ in the reduced monotone DNF representation of $f$. The following are subclasses of MDNF.

(a) $r$-MDNF: The class of MDNFs with monomials of size at most $r$, that is, with terms from $r$-MTerm.

(b) MDNF: The class MDNF is $n$-MDNF.

(c) $s$-term MDNF: The class of MDNFs with at most $s$ monomials.

(d) $s$-term $r$-MDNF: The class of MDNFs with at most $s$ monomials of size at most $r$.

(e) Read-Once, Read-Twice, Read-Thrice, Read-$t$ $C$, where $C$ is one of the above classes: The class of functions $f$ in $C$ where each variable appears at most once (resp. twice, three times and $t$ times) in $f$.

Learning Monotone DNF and subclasses of Monotone DNF is equivalent to problems in computational biology that arise in whole-genome shotgun sequencing [11] and DNA physical mapping [144].

11. Conjunctive Normal Form (CNF): The class CNF is the dual class (see the definition in item 4) of DNF (where $\wedge$ is replaced with $\vee$ and vice versa). In a similar way as above we define the classes $r$-MCNF, MCNF, $s$-clause CNF, $s$-clause MCNF, $s$-clause $r$-CNF and $s$-clause $r$-MCNF.

12. CDNF: The class CDNF is the class of formulas of the form $(f,g)$ where $f$ is a DNF, $g$ is a CNF and $f=g$. The size of $(f,g)$ is $s+t$ where $f$ is an $s$-term DNF and $g$ is a $t$-clause CNF.

The following are subclasses of CDNF.

(a) CDNF$_{s,t}$: The class of all $(f,g)\in$ CDNF where $f$ is a DNF of size at most $s$ and $g$ is a CNF of size at most $t$.

(b) $r$-CDNF$_{s,t}$: The class of $(f,g)\in$ CDNF$_{s,t}$ where $f$ is an $r$-DNF of size at most $s$ and $g$ is an $r$-CNF of size at most $t$.

(c) $r$-CDNF: The class $\cup_{s,t}$ $r$-CDNF$_{s,t}$.

(d) MCDNF: The class of Monotone CDNF.

(e) MCDNF$_{s,t}$: The class of Monotone CDNF$_{s,t}$.

(f) $r$-MCDNF: The class of Monotone $r$-CDNF.

Learning CDNF is equivalent to problems in data mining, graph theory, and reasoning and knowledge representation [118].

13. Boolean Multivariate Polynomial (BMP): The class BMP is the class of multivariate polynomials over the binary field $F_{2}$. That is, a function $f:F_{2}^{n}\to F_{2}$ of the form

$$f=M_{1}+M_{2}+\cdots+M_{s}$$

where each $M_{i}$ is a monomial. The size of $f$ is $s$.

Every Boolean function $f:F_{2}^{n}\to F_{2}$ can be represented as a BMP. The representation is unique. It is easy to see that every decision tree of size $s$ and depth $t$ can be represented as BMP of size at most $2^{t}s$ .

(a) $r$-BMP: The class of BMPs with monomials of size at most $r$, i.e., in $r$-MTerm. This class is also called the class of multivariate polynomials of degree $r$ over $F_{2}$.

(b) $s$-monomial BMP: The class of BMPs with at most $s$ monomials. This class is also called the class of sparse multivariate polynomials over $F_{2}$.

(c) $s$-monomial $r$-BMP: The class of BMPs with at most $s$ monomials of size at most $r$. This class is also called the class of sparse multivariate polynomials of degree $r$ over $F_{2}$.

14. XOR of Terms (XT): The class XT is the class of XORs of terms, $T_{1}+T_{2}+\cdots+T_{s}$ where $T_{i}\in$ Term.

(a) $r$-XT: The class of XTs with terms of size at most $r$.

(b) $s$-term XT: The class of XTs with at most $s$ terms.

(c) $s$-term $r$-XT: The class of XTs with at most $s$ terms of size at most $r$.

Notice that XT with terms from MTerm is BMP. Since every term of size $r$ can be represented as a $2^{r}$-monomial $r$-BMP, every $s$-term $r$-XT is a $(2^{r}s)$-monomial $r$-BMP.

15. Deterministic Finite Automaton (DFA) [210]: A DFA is a $5$-tuple $A=(Q,\Sigma,\delta,q_{0},F)$ that can also be represented as a directed graph $G=(V,E)$ with labeled edges, where $V=Q$ is a finite set of states (the vertices) and $q_{0}\in Q$ is the start state. $\Sigma$ is a finite set of symbols called the alphabet, and $\delta:Q\times\Sigma\to Q$ is the transition function. The edge $(v,u)\in E$ in $G$ is labeled with $\sigma\in\Sigma$ if and only if $\delta(v,\sigma)=u$. This transition function defines, for every string $s\in\cup_{i\geq 0}\Sigma^{i}$, a unique path $q_{0},q_{1},\ldots,q_{|s|}$ in the graph (here, $|s|$ is the number of symbols in $s$) that starts from $q_{0}$ and satisfies $\delta(q_{i-1},s_{i})=q_{i}$ for every $1\leq i\leq|s|$. We denote the final state $q_{|s|}$ of this path by $\delta(q_{0},s)$. The set $F\subset Q$ is the set of accept states.

Every DFA $A$ defines a Boolean function $f:\cup_{i\geq 0}\Sigma^{i}\to\{0,1\}$ where $f(s)=1$ if and only if $\delta(q_{0},s)\in F$. When $\Sigma=\{0,1\}$, a DFA for the Boolean function $f:\{0,1\}^{n}\to\{0,1\}$ is a DFA such that for every $a\in\{0,1\}^{n}$ we have $f(a)=1$ if and only if $\delta(q_{0},a)\in F$.

16. Boolean Multiplicity Automata Function (BMAF) [244]: A Boolean Multiplicity Automata Function is a function of the form

$$f(x_{1},\ldots,x_{n})=A_{1}(x_{1})A_{2}(x_{2})\cdots A_{n}(x_{n})$$

where each $A_{i}(x_{i})$ is an $s_{i}\times s_{i+1}$ matrix whose entries are Boolean univariate polynomials in $x_{i}$ over $F_{2}$, i.e., $ax_{i}+b$ for $a,b\in F_{2}$, and $s_{1}=s_{n+1}=1$. The size of a BMAF is defined as $\max_{i}s_{i}$.

See [44] for other ways to represent this class.

17. Boolean Halfspace (Perceptron, Threshold) (BHS): A Boolean Halfspace is a function $f:\{0,1\}^{n}\to\{0,1\}$ of the following form:

$$f(x_{1},\ldots,x_{n})=\left\{\begin{array}{ll}1&\mbox{if}\ w_{1}x_{1}+w_{2}x_{2}+\cdots+w_{n}x_{n}\geq u\\ 0&\mbox{otherwise}\end{array}\right.$$

where $w_{1},\ldots,w_{n},u$ are real numbers. The constants $w_{1},\ldots,w_{n}$ are called the weights of the Halfspace, and $u$ is called the threshold. For $W\subseteq\Re$ we define

(a) BHS$(W)$: The class of Boolean Halfspaces with weights $w_{i}\in W$.

(b) $d$-BHS$(W)$: The class of functions in BHS$(W)$ with at most $d$ relevant variables.

18. Boolean Circuit (BC) and Boolean Formula (BF): A Boolean circuit over the set of variables $x_{1},\ldots,x_{n}$ is a directed acyclic graph where every node with indegree zero is called an input gate and is labeled by either a variable $x_{i}$ or a Boolean constant in $\{0,1\}$. Every other gate is either a node with indegree one labeled $\neg$ (unary NOT) or a node with indegree two labeled by either $\wedge$ (binary AND) or $\vee$ (binary OR). A Boolean formula is a circuit in which every gate has outdegree one.

The size of a Boolean circuit is the number of gates in it, and its depth is the length of the longest directed path in it.

(a) Monotone Boolean Circuit (MBC) and Monotone Boolean Formula (MBF): MBC and MBF are Boolean circuits and Boolean formulas, respectively, with no $\neg$ gate.

(b) Read Once Formula (ROF): The class of Boolean read-once formulas. A Boolean read-once formula is a formula such that every input variable $x_{i}$ appears in at most one input gate.

(c) Monotone Read Once Formula (MROF): The class of monotone read-once formulas.

(d) Read-Once, Read-Twice, Read-Thrice, Read-$t$ $C$, where $C$ is one of the above classes: The class of functions $f$ in $C$ where each variable appears at most once (resp. twice, three times and $t$ times) in $f$.

See other classes in [1, 2, 17, 26, 33, 34, 58, 73, 92, 93, 110, 118, 132, 158, 251].
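Several of the definitions above (monotone clauses and terms, the dual class, relevant variables, and minterms of a monotone function) can be checked by brute force on small $n$. The following is a minimal sketch; the helper names are hypothetical, variables are indexed from $0$ rather than $1$, and every check enumerates all $2^{n}$ assignments, so it is meant only as an illustration.

```python
from itertools import product
from typing import Callable, Iterable, List, Set, Tuple

Point = Tuple[int, ...]
Func = Callable[[Point], int]


def mclause(S: Iterable[int]) -> Func:
    """Monotone clause: OR of the variables indexed by S (the constant 0 for empty S)."""
    S = tuple(S)
    return lambda a: int(any(a[i] for i in S))


def mterm(S: Iterable[int]) -> Func:
    """Monotone term: AND of the variables indexed by S."""
    S = tuple(S)
    return lambda a: int(all(a[i] for i in S))


def dual(f: Func) -> Func:
    """Dual function: negate every input and the output."""
    return lambda a: 1 - f(tuple(1 - b for b in a))


def relevant_variables(f: Func, n: int) -> Set[int]:
    """Indices i for which flipping a_i changes f(a) for some assignment a."""
    rel = set()
    for a in product((0, 1), repeat=n):
        for i in range(n):
            b = a[:i] + (1 - a[i],) + a[i + 1:]
            if f(a) != f(b):
                rel.add(i)
    return rel


def minterms(f: Func, n: int) -> List[Point]:
    """Assignments with f(a) = 1 such that flipping any 1-entry to 0 gives f = 0."""
    return [
        a for a in product((0, 1), repeat=n)
        if f(a) == 1
        and all(f(a[:i] + (0,) + a[i + 1:]) == 0 for i in range(n) if a[i] == 1)
    ]


n, S = 4, (0, 2)
cube = list(product((0, 1), repeat=n))
print(all(dual(mclause(S))(a) == mterm(S)(a) for a in cube))  # dual of a clause is a term
print(relevant_variables(mclause(S), n))

f = lambda a: int((a[0] and a[1]) or a[2])  # monotone DNF x0 x1 OR x2
print(minterms(f, n))
```

Each minterm returned for `f` lists the variables of one monomial of its reduced monotone DNF.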

Here are relations between some of the classes mentioned above.

[TABLE]

For two classes $C_{1}$ and $C_{2}$ we write $C_{1}\subseteq C_{2}$ (written as $C_{1}\to C_{2}$ in the above diagram) if every function in $C_{1}$ of size $s$ is equivalent to a function in $C_{2}$ of size $O(s)$ .
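As one concrete instance of such a relation, a decision tree of size $s$ can be rewritten as a DNF with at most $s$ terms by collecting the root-to-leaf paths that end in a leaf labeled $1$. The sketch below assumes a hypothetical encoding of trees as nested tuples `(i, f0, f1)` with integer leaves, $0$-based variable indices, and no variable tested twice on the same path.

```python
from itertools import product
from typing import Dict, List, Tuple, Union

Tree = Union[int, Tuple[int, "Tree", "Tree"]]  # leaf 0/1, or (i, f0, f1)
Term = Dict[int, int]                          # variable index -> required value


def evaluate(tree: Tree, a: Tuple[int, ...]) -> int:
    """Follow the path selected by the assignment a down to a leaf."""
    while not isinstance(tree, int):
        i, f0, f1 = tree
        tree = f1 if a[i] else f0
    return tree


def size(tree: Tree) -> int:
    """Number of leaves."""
    if isinstance(tree, int):
        return 1
    _, f0, f1 = tree
    return size(f0) + size(f1)


def depth(tree: Tree) -> int:
    """Number of edges on the longest root-to-leaf path."""
    if isinstance(tree, int):
        return 0
    _, f0, f1 = tree
    return 1 + max(depth(f0), depth(f1))


def to_dnf(tree: Tree, path: Tuple[Tuple[int, int], ...] = ()) -> List[Term]:
    """One term per leaf labeled 1; assumes each variable occurs at most once per path."""
    if isinstance(tree, int):
        return [dict(path)] if tree == 1 else []
    i, f0, f1 = tree
    return to_dnf(f0, path + ((i, 0),)) + to_dnf(f1, path + ((i, 1),))


def eval_dnf(terms: List[Term], a: Tuple[int, ...]) -> int:
    return int(any(all(a[i] == v for i, v in t.items()) for t in terms))


# [if x0 = 0 then [if x1 = 0 then 0 else 1] else [if x2 = 0 then 1 else 0]]
tree: Tree = (0, (1, 0, 1), (2, 1, 0))
dnf = to_dnf(tree)
print(size(tree), depth(tree), dnf)
```

The number of terms equals the number of leaves labeled $1$, so it never exceeds the size of the tree.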

As for functions that are not Boolean, few works in the literature study the exact learnability, from membership queries only, of classes of functions with a finite discrete domain and/or range. On the other hand, there is a substantial body of literature on learning and testing arithmetic classes.

We now give some of the arithmetic classes defined in the literature.

Arithmetic Classes: Arithmetic classes represent functions $f:X\to R$ where $R$ is an algebraic structure such as a field or a ring. For exact learning, the most investigated arithmetic classes in the literature are

$(r,V)$ -Linear Functions ( $(r,V)$ -LF), where $r$ is an integer, $V\subset\Re$ and $\Re$ is the set of real numbers. An $(r,V)$ -LF is a function $f:\{0,1\}^{n}\to\Re$ of the form $v_{1}x_{i_{1}}+\cdots+v_{r^{\prime}}x_{i_{r^{\prime}}}$ where $r^{\prime}\leq r$ and $v_{i}\in V$ for all $i=1,\ldots,r^{\prime}$ . The class $r$ -LF is the class $(r,\{0,1\})$ -LF and LF is the class $n$ -LF.

Learning $(r,V)$ -LF is equivalent to the coin weighing problem [37] and the signature coding problem [50].

$(r,V)$ -Quadratic Functions ( $(r,V)$ -QF), where $r$ is an integer and $V\subset\Re$ . An $(r,V)$ -QF is a function $f:\{0,1\}^{n}\to\Re$ of the form $x^{T}Ax$ where $x\in\{0,1\}^{n}$ and $A$ is a symmetric $n\times n$ matrix with at most $r$ non-zero entries from $V$ . The class $r$ -QF is the class $(r,\{0,1\})$ -QF.

Learning $(r,V)$ -QF is equivalent to problems in molecular biology [55].

Multivariate Polynomial (MP): Let $F$ be a field. A multivariate polynomial over $F$ is a function $f:F^{n}\to F$ of the form

$f=\sum_{i\in I}a_{i}x_{1}^{i_{1}}\cdots x_{n}^{i_{n}}$

where $I\subseteq N^{n}$ , $N=\{0,1,2,\cdots\}$ and $a_{i}\in F$ . The size of $f$ is $|f|:=|I|$ . The term $x_{1}^{i_{1}}\cdots x_{n}^{i_{n}}$ is called a monomial. The monomial is called a $t$ -monomial if $|\{j\ |\ i_{j}\not=0\}|\leq t$ . The multivariate polynomial is said to be of degree $d$ if $i_{1}+\cdots+i_{n}\leq d$ for all $i\in I$ , $s$ -sparse if $|I|\leq s$ , and with $t$ -monomials if all its monomials are $t$ -monomials.

When the field $F$ is finite then every function $f:F^{n}\to F$ can be represented as a multivariate polynomial. This fact is not true for infinite fields.

Multiplicity Automata Function: A Multiplicity Automata Function (MAF) over the field $F$ is a function of the form

$f(x_{1},\ldots,x_{n})=A_{1}(x_{1})A_{2}(x_{2})\cdots A_{n}(x_{n})$

where each $A_{i}(x_{i})$ is an $s_{i}\times s_{i+1}$ matrix whose entries are linear functions in $(x_{1},\ldots,x_{n})$ (i.e., $\sum_{i}a_{i}x_{i}+b$ where $a_{i},b\in\Re$ ) and $s_{1}=s_{n+1}=1$ . The size of a MAF $f$ is $\max_{i}s_{i}$ .

This class contains the class MP in the sense that every MP of size $s$ has a MAF of size $s$ .

See [44] for other representations of MAF.

Arithmetic Circuit (AC) and Arithmetic Formula (AF). An arithmetic circuit over the field $F$ and the set of variables $x_{1},\ldots,x_{n}$ is a directed acyclic graph in which every node with indegree zero is called an input gate and is labeled by either a variable $x_{i}$ or a field element. Every other gate is labeled by either $+$ or $\times$ ; in the first case it is a sum gate, and in the second a product gate. An arithmetic formula is a circuit in which every gate has outdegree one.

The size of a circuit is the number of gates in it, and its depth is the length of the longest directed path in it. The degree of a circuit is equal to the degree of the polynomial output by the circuit.

Arithmetic Read-Once Formula (AROF). An arithmetic read-once formula is a formula such that every input variable $x_{i}$ appears in at most one input gate.
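The definitions of arithmetic circuit, formula, read-once formula, size and depth given above can be made concrete with a small Python sketch. The dictionary encoding of gates and the helper names (`evaluate`, `size`, `depth`, `is_formula`, `is_read_once`) are illustrative assumptions and not notation from the literature; arithmetic is done over the integers for simplicity, and "outdegree one" is checked as "outdegree at most one" so that the output gate is allowed.

```python
from collections import Counter

# Assumed encoding: gate id -> ("var", name) | ("const", value)
#                           | ("+", [child ids]) | ("*", [child ids]).
# Example circuit computing (x1 + x2) * x3, with output gate "g2".
CIRCUIT = {
    "a": ("var", "x1"),
    "b": ("var", "x2"),
    "c": ("var", "x3"),
    "g1": ("+", ["a", "b"]),
    "g2": ("*", ["g1", "c"]),
}

def evaluate(circuit, out, assignment):
    """Value of gate `out` under a variable assignment."""
    kind, arg = circuit[out]
    if kind == "var":
        return assignment[arg]
    if kind == "const":
        return arg
    values = [evaluate(circuit, child, assignment) for child in arg]
    if kind == "+":
        return sum(values)
    product = 1
    for v in values:
        product *= v
    return product

def size(circuit):
    """Number of gates, input gates included."""
    return len(circuit)

def depth(circuit, out):
    """Length of the longest directed path ending at gate `out`."""
    kind, arg = circuit[out]
    if kind in ("var", "const"):
        return 0
    return 1 + max(depth(circuit, child) for child in arg)

def out_degrees(circuit):
    """How many times each gate is used as an input of another gate."""
    deg = Counter()
    for kind, arg in circuit.values():
        if kind in ("+", "*"):
            deg.update(arg)
    return deg

def is_formula(circuit):
    """True if no gate feeds more than one other gate."""
    return all(d <= 1 for d in out_degrees(circuit).values())

def is_read_once(circuit):
    """True if the circuit is a formula and each variable labels at most one input gate."""
    labels = Counter(arg for kind, arg in circuit.values() if kind == "var")
    return is_formula(circuit) and all(c <= 1 for c in labels.values())
```

In this encoding the example circuit has size 5 and depth 2, and it is a read-once formula because each of $x_{1},x_{2},x_{3}$ labels exactly one input gate.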

Here are relations between some of the classes defined above.

[TABLE]

See other classes in [43, 239, 240, 255] and references therein.
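The remark above that every MP of size $s$ has a MAF of size $s$ can be illustrated with one standard construction: use a $1\times s$ row vector for $x_{1}$, $s\times s$ diagonal matrices for $x_{2},\ldots,x_{n-1}$, and an $s\times 1$ column vector for $x_{n}$, with coordinate $j$ tracking the $j$-th monomial. The Python sketch below is restricted, as an assumption, to multilinear polynomials with $n\geq 2$ so that every matrix entry is a constant or a scalar multiple of a single variable. The function names and the representation of a polynomial as a list of (coefficient, variable set) pairs are hypothetical, and the matrices are built already evaluated at a point $x$.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def maf_matrices(terms, n, x):
    """Matrices A_1(x_1), ..., A_n(x_n) for a multilinear polynomial.

    terms: list of (coefficient, set of 1-based variable indices), one per monomial.
    x:     list of n values at which the matrices are evaluated (n >= 2).
    """
    s = len(terms)

    def factor(j, k):
        # Contribution of variable x_k to monomial j: x_k if present, else 1.
        return x[k - 1] if k in terms[j][1] else 1

    first = [[terms[j][0] * factor(j, 1) for j in range(s)]]        # 1 x s
    middle = [
        [[factor(j, k) if i == j else 0 for j in range(s)] for i in range(s)]
        for k in range(2, n)                                         # s x s diagonal
    ]
    last = [[factor(j, n)] for j in range(s)]                        # s x 1
    return [first] + middle + [last]

def maf_value(terms, n, x):
    """Evaluate the product A_1(x_1) A_2(x_2) ... A_n(x_n)."""
    mats = maf_matrices(terms, n, x)
    result = mats[0]
    for M in mats[1:]:
        result = matmul(result, M)
    return result[0][0]

def poly_value(terms, x):
    """Evaluate the polynomial directly, monomial by monomial."""
    total = 0
    for coeff, variables in terms:
        prod = coeff
        for k in variables:
            prod *= x[k - 1]
        total += prod
    return total

# f = 3*x1*x2 + 5*x2*x3 - x1, a 3-sparse polynomial in n = 3 variables.
TERMS = [(3, {1, 2}), (5, {2, 3}), (-1, {1})]
```

For `TERMS`, every matrix has dimension at most 3, which matches the number of monomials, and `maf_value(TERMS, 3, x)` agrees with `poly_value(TERMS, x)` at each point where both are evaluated.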

1.4 Learning Algorithms and Complexity

The learning algorithm can be sequential or parallel, deterministic or randomized and adaptive (AD), $r$ -round ( $r$ -RAD) or non-adaptive (NAD).

In an adaptive algorithm, the queries can depend on the answers to the previous ones. In a non-adaptive algorithm, the queries are independent of the answers to the previous ones; therefore, one can ask all the queries in one parallel step. We say that an adaptive algorithm is $r$ -round adaptive ( $r$ -RAD) if it runs in $r$ stages where each stage is non-adaptive. That is, the queries may depend on the answers to the queries in the previous stages but are independent of the answers to the queries of the current stage.
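The difference between adaptive and non-adaptive querying can be seen on a toy target. The Python sketch below assumes the teacher holds $f(x)=x_{i}$ for one unknown index $i$ among $n$ variables; the `Teacher` class and both learner functions are hypothetical names introduced for illustration only. The adaptive learner chooses each query from the previous answer (binary search), while the non-adaptive learner fixes all its queries in advance and could send them in one parallel step.

```python
class Teacher:
    """Honest teacher holding f(x) = x[hidden]; counts membership queries."""

    def __init__(self, n, hidden):
        self.n = n
        self._hidden = hidden
        self.queries = 0

    def query(self, x):
        self.queries += 1
        return x[self._hidden]

def learn_adaptive(teacher):
    """Binary search: each query depends on the answers seen so far."""
    lo, hi = 0, teacher.n  # invariant: the hidden index lies in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        x = [1 if lo <= j < mid else 0 for j in range(teacher.n)]
        if teacher.query(x):
            hi = mid
        else:
            lo = mid
    return lo

def learn_nonadaptive(teacher):
    """All queries are fixed before any answer is seen."""
    n = teacher.n
    bits = (n - 1).bit_length()
    # Query b sets x_j = 1 exactly when bit b of the index j is 1.
    queries = [[(j >> b) & 1 for j in range(n)] for b in range(bits)]
    answers = [teacher.query(x) for x in queries]
    # Answer b equals bit b of the hidden index.
    return sum(a << b for b, a in enumerate(answers))
```

For this particular target both learners use at most $\lceil\log_{2}n\rceil$ queries; for other classes the adaptive and non-adaptive query complexities can differ.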

The randomized algorithm can be either Monte Carlo (MC) or Las Vegas (LV). A Monte Carlo algorithm is a randomized algorithm whose running time is deterministic, but whose output may be incorrect with probability at most $\delta$ . A Las Vegas algorithm is a randomized algorithm that always gives a correct hypothesis. That is, it always produces a hypothesis that is equivalent to the target function. The complexity of a Las Vegas algorithm is measured by the expected running time, the expected number of queries and the expected number of rounds.

The goal is to ask the minimum number of queries and minimize the running time and space complexity of the algorithm and/or other resources such as the number of processors (for parallel algorithms) or the number of random bits (for randomized algorithms).

1.5 Polynomially, Efficiently and Optimally Learnable

In this subsection and the next, we try to unify the different definitions used in the literature for the efficiency of the query complexity and time complexity of exact learning algorithms. We will use the following new terms, defined below: “learnable”, “polynomially learnable”, “efficiently learnable”, “almost optimally learnable” and “optimally learnable”.

Let $C$ be a class of functions. Let ${\rm OPT}_{A}(C)$ be the minimum number of membership queries that a learner, with unlimited computational power, needs to learn $C$ with algorithms of type $A$ . The algorithm type, $A$ , can be adaptive (AD), non-adaptive (NAD) or $r$ -round ( $r$ -RAD). For example, we will use ${\rm OPT}_{\rm AD}$ for the adaptive algorithm and ${\rm OPT}_{\rm NAD}$ for the non-adaptive algorithm. When the algorithm is randomized we also add, as a subscript, MC for Monte Carlo algorithms and LV for Las Vegas algorithms.

In complexity theory, a polynomial time algorithm is an algorithm that runs in polynomial time in the input size. In the exact learning model, the time complexity of learning the class $C$ is at least the query complexity, ${\rm OPT}_{A}(C)$ , which can be exponential in the target function size. Therefore, a polynomial time learning algorithm for $C$ will be defined as a learning algorithm that asks $poly({\rm OPT}_{A}(C),n)$ queries and runs in time $poly({\rm OPT}_{A}(C),n)$ , where $n$ is the size of the elements in the domain $X$ . Classes that have such an algorithm are called polynomially learnable or just learnable classes. This is the definition used in the literature for learnability of classes.

Since the time complexity of any learning algorithm for $C$ is at least $n\cdot{\rm OPT}_{A}(C)$ , we may say that learning algorithms that run in time $poly({\rm OPT}_{A}(C),n)$ are “efficient algorithms” in time. However, this is not true for the query complexity. We will argue here, by the following example, that the above bound of $poly({\rm OPT}_{A}(C),n)$ on the query complexity is not the best definition of query-efficiency for exact learning from membership queries.

Take for example the class $C=d$ -MClause. We will show in Subsection 4.5 that ${\rm OPT}_{{\rm AD}}(C)=\Theta(d\log n)$ . Therefore, one would expect that a query-efficient learning algorithm for this class asks $poly(d,\log n)$ queries and not $poly(d\log n,n)=poly(n)$ queries as defined above. The time complexity cannot be less than $n\cdot{\rm OPT}_{A}(C)$ , so the bound of $poly({\rm OPT}_{A}(C),n)$ on the time complexity is acceptable.
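To get a feeling for the gap in this example, one can plug in concrete values. The numbers below are illustrative assumptions: they ignore the constants hidden in the $\Theta$ notation, take logarithms to base 2, and use $(d\log n)^{2}$ merely as one sample polynomial in ${\rm OPT}_{{\rm AD}}(C)$.

```latex
\begin{align*}
n &= 2^{20}, \quad d = 4, \\
d\log_{2} n &= 4\cdot 20 = 80, \\
(d\log_{2} n)^{2} &= 6400 \;\ll\; n = 1{,}048{,}576 .
\end{align*}
```

An algorithm that asks about $n$ queries is within $poly({\rm OPT}_{A}(C),n)$, yet for these values it asks more than a hundred times as many queries as one that asks $(d\log_{2}n)^{2}$ queries.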

Therefore, we suggest the following definition for efficient learning. If the algorithm for learning $C$ asks $poly({\rm OPT}_{A}(C))$ queries (rather than $poly({\rm OPT}_{A}(C),n)$ ) and runs in time $poly({\rm OPT}_{A}(C),n)$ , then we call the class efficiently learnable. (We do not use the term “polynomially learnable” for this case, to avoid confusion with the definition in the literature.)

Another concern with this new definition is that in many areas (such as combinatorial group testing and game theory) a membership query is considered to be very costly. Therefore, one must find polynomial time learning algorithms that ask a minimum number of queries. We therefore introduce two other definitions. If there is a learning algorithm for $C$ that asks ${\rm OPT}_{A}(C)^{1+o(1)}$ queries and runs in time $poly({\rm OPT}_{A}(C),n)$ , then we call the class almost optimally learnable. If there is a learning algorithm for $C$ that asks $O({\rm OPT}_{A}(C))$ queries and runs in time $poly({\rm OPT}_{A}(C),n)$ , then we call the class optimally learnable.

In many cases, the query complexity ${\rm OPT}_{A}(C)$ is a function of several parameters that are related to the class $C$ . For example, the query complexity $\Theta(d\log n)$ of $d$ -MClause depends on $d$ as well as on $n$ . We say that the query complexity of a learning algorithm is optimal (resp. almost optimal, efficient or polynomial) in some parameter if, assuming the other parameters are constant, the query complexity of the algorithm is optimal (resp. almost optimal, efficient or polynomial). So a learning algorithm for $d$ -MClause that asks $d\cdot poly(\log n)$ queries is efficient, optimal in $d$ and efficient in $n$ .

We say that the class $C$ is query-polynomially (resp. query-efficiently, almost query-optimally or query-optimally) learnable in time $T$ if the number of queries is as above (for polynomially, efficiently, almost optimally and optimally, respectively) but the time complexity is $T$ .

We summarize all the above definitions in the following table:

Terminology

Query Complexity

Time Complexity

Polynomially Learnable

or Learnable