ID3 Learns Juntas for Smoothed Product Distributions

Alon Brutzkus; Amit Daniely; Eran Malach

arXiv:1906.08654·cs.LG·June 21, 2019


TL;DR

This paper proves that the ID3 algorithm can efficiently learn k-Junta functions under smoothed analysis, demonstrating its effectiveness in noisy environments for functions depending on a logarithmic number of variables.

Contribution

It provides the first theoretical analysis showing ID3 learns k-Juntas in polynomial time when k = log n under smoothed analysis.

Findings

01

ID3 learns k-Juntas in polynomial time for k = log n

02

The analysis applies to noisy, smoothed distributions

03

Supports practical effectiveness of ID3 in noisy settings

Abstract

In recent years, there have been many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k=\log n$, the ID3 algorithm learns $k$-Juntas in polynomial time, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.

Equations

$$\text{Gain}(S,i)=C\left(\mathbb{P}_{S}[y=1]\right)-\Big(\mathbb{P}_{S}[x_{i}=1]\,C\left(\mathbb{P}_{S}[y=1\mid x_{i}=1]\right)+\mathbb{P}_{S}[x_{i}=0]\,C\left(\mathbb{P}_{S}[y=1\mid x_{i}=0]\right)\Big)$$

$$\text{supp}(w)=\{i\in[n]:w_{i}\neq *\}$$

$$\mathcal{X}_{w}=\{x\in\mathcal{X}:x_{i}=w_{i}\text{ for any }i\in\text{supp}(w)\}$$

$$S_{w}=\{(x,y)\in S:x\in\mathcal{X}_{w}\}$$

$$\text{Gain}(S_{w},j)<2\gamma\epsilon$$

$$\text{Gain}(S_{w},i)\geq\frac{\beta\epsilon^{2}}{8}$$

$$\left|\mathcal{I}(S_{w},i)-\mathcal{I}(\mathcal{D}_{w},i)\right|<\epsilon$$

$$\mathbb{E}_{S_{w}}[y]=\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}=\frac{\frac{1}{p_{w}m}\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{\frac{1}{p_{w}m}\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}$$

$$p_{w}\,\mathbb{E}_{\mathcal{D}_{w}}y-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{m}\lesssim\epsilon\alpha^{k}\quad\text{and}\quad p_{w}-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}{m}\lesssim\epsilon\alpha^{k}$$

$$\mathbb{E}_{\mathcal{D}_{w}}y-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{p_{w}m}\lesssim\frac{\epsilon\alpha^{k}}{p_{w}}\lesssim\epsilon\quad\text{and}\quad 1-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}{p_{w}m}\lesssim\frac{\epsilon\alpha^{k}}{p_{w}}\lesssim\epsilon$$

$$\left|\mathbb{E}_{S_{w}}[y]-\mathbb{E}_{\mathcal{D}_{w}}y\right|\lesssim\epsilon$$

$$\left|\mathbb{E}_{S_{w}}[x_{i}]-\mathbb{E}_{\mathcal{D}_{w}}x_{i}\right|\lesssim\epsilon\quad\text{and}\quad\left|\mathbb{E}_{S_{w}}[yx_{i}]-\mathbb{E}_{\mathcal{D}_{w}}yx_{i}\right|\lesssim\epsilon$$

$$\left|\mathcal{I}(S_{w},i)\right|=\left|\mathcal{I}(S_{w},i)-\mathcal{I}(\mathcal{D}_{w},i)\right|<\epsilon$$

$$\left|\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1]\right|=\frac{|\mathcal{I}(S_{w},i)|}{\bar{p_{i}}}<\frac{\epsilon}{\bar{p_{i}}}$$

$$\left|\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]-\mathbb{P}_{S_{w}}[y=1]\right|=\frac{|\mathcal{I}(S_{w},i)|}{1-\bar{p_{i}}}<\frac{\epsilon}{1-\bar{p_{i}}}$$

$$\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|\leq\gamma\left|\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1]\right|<\frac{\gamma\epsilon}{\bar{p_{i}}}$$

$$\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|<\frac{\gamma\epsilon}{1-\bar{p_{i}}}$$

$$\begin{aligned}\left|\text{Gain}(S_{w},i)\right|&=\Big|C\left(\mathbb{P}_{S_{w}}[y=1]\right)-\Big(\mathbb{P}_{S_{w}}[x_{i}=1]\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)+\mathbb{P}_{S_{w}}[x_{i}=0]\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)\Big)\Big|\\&\leq\mathbb{P}_{S_{w}}[x_{i}=1]\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|+\mathbb{P}_{S_{w}}[x_{i}=0]\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|<2\gamma\epsilon\end{aligned}$$

$$\left|\mathcal{I}(S_{w},i)-\mathcal{I}(\mathcal{D}_{w},i)\right|\leq\frac{\epsilon}{2}$$

$$\begin{aligned}&\mathbb{P}_{S_{w}}[x_{i}=1]\,\mathbb{P}_{S_{w}}[x_{i}=0]\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)\\&=\mathbb{P}_{S_{w}}[x_{i}=0]\,\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]-\mathbb{P}_{S_{w}}[x_{i}=1]\,\mathbb{P}_{S_{w}}[x_{i}=0\land y=1]\\&=\mathbb{P}_{S_{w}}[x_{i}=0]\,\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]-\mathbb{P}_{S_{w}}[x_{i}=1]\left(\mathbb{P}_{S_{w}}[y=1]-\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]\right)\\&=\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]-\mathbb{P}_{S_{w}}[x_{i}=1]\,\mathbb{P}_{S_{w}}[y=1]=-\mathcal{I}(S_{w},i)\end{aligned}$$

$$\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]=-\frac{\mathcal{I}(S_{w},i)}{\bar{p_{i}}(1-\bar{p_{i}})}$$

$$C(ta+(1-t)b)\geq tC(a)+(1-t)C(b)+\frac{\beta}{2}t(1-t)(a-b)^{2}$$

$$\begin{aligned}&\bar{p_{i}}\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)+(1-\bar{p_{i}})\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)\\&\leq C\left(\mathbb{P}_{S_{w}}[y=1]\right)-\frac{\beta}{2}\bar{p_{i}}(1-\bar{p_{i}})\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)^{2}\\&=C\left(\mathbb{P}_{S_{w}}[y=1]\right)-\frac{\beta}{2}\cdot\frac{\mathcal{I}(S_{w},i)^{2}}{\bar{p_{i}}(1-\bar{p_{i}})}\end{aligned}$$

$$\text{Gain}(S_{w},i)\geq\frac{\beta}{2}\cdot\frac{\mathcal{I}(S_{w},i)^{2}}{\bar{p_{i}}(1-\bar{p_{i}})}\geq\frac{\beta}{2}\mathcal{I}(S_{w},i)^{2}$$

$$\left|\mathcal{I}(\mathcal{D}_{w},j)\right|>\alpha^{2}(2c)^{k-1}$$

Full text

ID3 Learns Juntas for Smoothed Product Distributions

Alon Brutzkus (The Blavatnik School of Computer Science, Tel Aviv University, Israel), Amit Daniely (School of Computer Science, The Hebrew University, Israel), Eran Malach (School of Computer Science, The Hebrew University, Israel)

Abstract

In recent years, there have been many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k=\log n$, the ID3 algorithm learns $k$-Juntas in polynomial time, in the smoothed analysis model of [20]. That is, we show a learnability result when the observed distribution is a “noisy” variant of the original distribution.

1 Introduction

In recent years there has been a growing interest in analyzing machine learning algorithms that are commonly used in practice. A primary example is the gradient-descent algorithm for learning neural networks, which achieves remarkable performance in practice but has very few formal guarantees. A main approach in studying such algorithms is proving that they are able to learn models that are known to be learnable. For example, it has been shown that SGD can learn neural networks when the target function is linear, or belongs to a certain kernel space [7, 34, 12, 13, 27, 1, 2, 3, 28, 25, 24].

In this paper we take a similar approach, aiming to give theoretical guarantees for the ID3 algorithm [29], a popular algorithm for learning decision trees. We analyze the behavior of this algorithm when the target function is a $k$-Junta, a function that depends only on $k$ bits from the input, and the underlying distribution is a product distribution, where the bits in the input examples are independent. While we cannot guarantee that the ID3 algorithm learns under any such distribution, as there are distributions on which the algorithm fails, we show that the algorithm can learn “most” such distributions. That is, we show that for any product distribution and a $k$-Junta, the ID3 algorithm learns the junta over a “noisy” variant of the original distribution. Such a result is in the spirit of smoothed analysis [33], which is often used to give results when a worst-case analysis is not satisfactory.

Related Work

There are a number of works studying the learnability of decision trees [30, 23, 4, 14, 9, 8, 10]. We next elaborate on papers that analyze decision trees under product distributions, as we do. The work of [20] gives learnability results of decision trees for product distributions with smoothed analysis, in a problem setting similar to ours. Their work analyzes an algorithm that estimates the Fourier coefficients of the target function in order to learn the decision tree. Another work [26] proves learnability of decision trees implementing monotone Boolean functions under the uniform distribution. Other algorithms for learning decision trees under the uniform distribution are given in [19, 18], again relying on Fourier analysis of the target function. Another work [11] gives an algorithm for learning stochastic decision trees under the uniform distribution. The work of [5] gives negative results on learning polynomial size decision trees under the uniform distribution in the statistical query setting.

While the above works study learnability of decision trees under various distributional assumptions, they all consider algorithms that are very different from algorithms used in practice. Our work, on the contrary, gives guarantees for algorithms that enjoy empirical success. In the current literature there are very few works that analyze such algorithms. Notably, the work of [17] studies the class of impurity-based algorithms, which contains the ID3 algorithm. This work shows that unate functions, like linear threshold functions or read-once DNFs, are learnable under the uniform distribution, using impurity-based algorithms. Our work, on the other hand, considers a different choice of target functions (Juntas), and shows learnability under “most” distributions, and not only for a fixed distribution. Another work that studies an algorithm used in practice [22] shows that the CART and C4.5 algorithms can leverage a weak approximation of the target function, and thus can perform boosting. However, it is not clear whether such a weak approximation typically exists, and in what cases this result can be applied. In contrast, our results apply to a concrete family of functions and distributions.

1.1 Problem Setting

The ID3 Algorithm

Let $\mathcal{X}=\{0,1\}^{n}$ be the domain set and let $\mathcal{Y}=\{0,1\}$ be the label set. We next describe the ID3 algorithm, following the presentation in [31]. Define an impurity function $C$ to be any concave function $C:[0,1]\to\mathbb{R}$ , satisfying that $C(x)=C(1-x)$ and $C(0)=C(1)=0$ . Given an impurity function $C$ , a sample $S\subset\mathcal{X}\times\mathcal{Y}$ and an index $i\in[n]$ , we define the gain measure to be as follows:

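$$\text{Gain}(S,i)=C\left(\mathbb{P}_{S}[y=1]\right)-\Big(\mathbb{P}_{S}[x_{i}=1]\,C\left(\mathbb{P}_{S}[y=1\mid x_{i}=1]\right)+\mathbb{P}_{S}[x_{i}=0]\,C\left(\mathbb{P}_{S}[y=1\mid x_{i}=0]\right)\Big)$$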

Given a sample $S\subseteq\mathcal{X}\times\mathcal{Y}$ , the ID3 algorithm generates a decision tree in a recursive manner. At each step of the recursion, the algorithm chooses the feature $x_{j}$ to be assigned to a current node. The algorithm iterates over all the unused features, and calculates the gain measure with respect to the examples that reach the current node. Then, it chooses the feature that maximizes the gain. This algorithm is described formally in algorithm 1. The output of the algorithm is given by the initial call to $\text{ID3}(S,[n])$ .
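The recursion described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's Algorithm 1: the Gini impurity $C(p)=2p(1-p)$ is one assumed choice of impurity function, the tuple-based tree representation, the majority-label fallback, and the first-index tie-breaking are all hypothetical choices made for the example, and coordinates are 0-indexed.

```python
from collections import Counter

def gini(p):
    """Gini impurity C(p) = 2p(1-p): concave, symmetric, C(0) = C(1) = 0."""
    return 2.0 * p * (1.0 - p)

def frac_pos(sample):
    """Empirical P_S[y = 1]; returns 0.0 on an empty sample."""
    return sum(y for _, y in sample) / len(sample) if sample else 0.0

def gain(sample, i, impurity=gini):
    """Gain(S, i) = C(P[y=1]) - sum_b P[x_i=b] * C(P[y=1 | x_i=b])."""
    total = impurity(frac_pos(sample))
    for b in (0, 1):
        part = [(x, y) for x, y in sample if x[i] == b]
        total -= len(part) / len(sample) * impurity(frac_pos(part))
    return total

def id3(sample, features, impurity=gini):
    """Return a tree: a leaf label (int) or a tuple (i, subtree_0, subtree_1)."""
    labels = {y for _, y in sample}
    if len(labels) <= 1 or not features:
        # Pure (or empty) node, or no features left: predict the majority label.
        return Counter(y for _, y in sample).most_common(1)[0][0] if sample else 0
    # Choose the unused feature with maximal gain (ties go to the first index).
    best = max(features, key=lambda i: gain(sample, i, impurity))
    rest = [i for i in features if i != best]
    children = [
        id3([(x, y) for x, y in sample if x[best] == b], rest, impurity)
        for b in (0, 1)
    ]
    return (best, children[0], children[1])

def predict(tree, x):
    """Follow the root-to-leaf path determined by x."""
    while isinstance(tree, tuple):
        i, t0, t1 = tree
        tree = t1 if x[i] else t0
    return tree

# Toy sample: the full cube {0,1}^3 labelled by the 2-Junta y = x_0 AND x_1.
cube = [((a, b, c), a & b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
tree = id3(cube, [0, 1, 2])
```

On this toy sample the irrelevant coordinate has zero gain at every node, so the sketch splits only on the two relevant coordinates.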

Learning Juntas

A $k$-Junta is a function $f:\{0,1\}^{n}\to\{0,1\}$ that depends on $k$ coordinates. Namely, there is a set $J=\{i_{1}<i_{2}<\ldots<i_{k}\}\subset[n]$ and a function $\tilde{f}:\{0,1\}^{k}\to\{0,1\}$ such that $f({\bm{x}})=\tilde{f}(x_{i_{1}},\ldots,x_{i_{k}})$. In this case, we will say that $f$ is supported in $J$. Throughout the paper, we assume that the examples are sampled from a product distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$ that is realizable by a $\log(n)$-Junta. Namely, we assume that for $({\bm{x}},y)\sim\mathcal{D}$, ${\bm{x}}\sim\prod_{i=1}^{n}\text{Bernoulli}(p_{i})$ for some $p_{1},\dots,p_{n}\in[0,1]$, and $y=f({\bm{x}})$ for some $\log(n)$-Junta $f$. The main goal of this paper is to show that for “most” product distributions, the ID3 algorithm succeeds in learning $\log(n)$-Juntas in polynomial time. Namely, it will return a tree $T$ whose generalization error, $\mathcal{L}_{\mathcal{D}}(T):=\Pr_{({\bm{x}},y)\sim\mathcal{D}}\left(T({\bm{x}})\neq y\right)$, is small (in fact, zero). We note that the sample complexity of learning $\omega\left(\log(n)\right)$-Juntas is super-polynomial; hence, $\log(n)$-Juntas are the best that we can hope to learn in polynomial time.

1.2 Results

We will show two positive results for learning $\log(n)$-Juntas. The first establishes learnability of parities, while the second is about learnability of general Juntas. Throughout, we assume that the impurity function $C$ is strongly concave and Lipschitz.

Learning Parities

A $k$-parity is a function of the form $\chi_{J}({\bm{x}})=\begin{cases}1&\sum_{i\in J}x_{i}\text{ is odd}\\ 0&\sum_{i\in J}x_{i}\text{ is even}\end{cases}$, where $J\subset[n]$ is a set of $k$ indices. Note that any $k$-parity is a $k$-Junta. We first consider learnability of $\log(n)$-parities by the ID3 algorithm. (Footnote 1: As opposed to general $k$-Juntas, $k$-parities with any $k$ are learnable in polynomial time. Yet, in the context of decision tree algorithms, we cannot hope to learn $k$-parities with $k=\omega(\log(n))$. Indeed, such parities cannot be computed, or even approximated, by a poly-sized tree.)

Learning parity functions is a classical problem in machine learning, for which there exists an efficient algorithm [15, 16]. Still, parities often serve as a hard benchmark, as many common algorithms cannot learn these functions [6, 32]. In the case of the ID3 algorithm, when the underlying distribution is uniform (i.e., when $p_{i}=\frac{1}{2}$ for all $i\in[n]$), the algorithm fails to learn parity functions [21]. We show that the case of the uniform distribution is in some sense unique. That is, we show that for every distribution that is not “too close” to the uniform distribution, the ID3 algorithm succeeds in learning any such parity function. To this end, we say that $\mathcal{D}$ is an $(\alpha,c)$-distribution if $\left|p_{i}-\frac{1}{2}\right|>c$ and $p_{i}\in(\alpha,1-\alpha)$ for any $i\in[n]$.
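The contrast between the uniform distribution and a biased product distribution can be checked by computing the population gain exactly on a small instance. The sketch below is illustrative only: it assumes the Gini impurity, enumerates all of $\{0,1\}^{n}$ (feasible only for tiny $n$), and the function names are hypothetical.

```python
from itertools import product

def gini(q):
    """Gini impurity C(q) = 2q(1-q), an assumed choice of impurity function."""
    return 2.0 * q * (1.0 - q)

def parity(x, J):
    """chi_J(x): 1 if the sum of x over the indices in J is odd, else 0."""
    return sum(x[i] for i in J) % 2

def exact_gain(p, J, i, impurity=gini):
    """Population gain of coordinate i when x ~ prod_j Bernoulli(p_j), y = chi_J(x)."""
    n = len(p)
    mass = {0: 0.0, 1: 0.0}  # P[x_i = b]
    pos = {0: 0.0, 1: 0.0}   # P[x_i = b and y = 1]
    for x in product((0, 1), repeat=n):
        weight = 1.0
        for j in range(n):
            weight *= p[j] if x[j] else 1.0 - p[j]
        mass[x[i]] += weight
        pos[x[i]] += weight * parity(x, J)
    total = impurity(pos[0] + pos[1])
    for b in (0, 1):
        if mass[b] > 0:
            total -= mass[b] * impurity(pos[b] / mass[b])
    return total

J = (0, 1)                                           # parity over the first two of four bits
uniform_gain = exact_gain([0.5] * 4, J, 0)           # relevant bit, uniform distribution
biased_gain = exact_gain([0.7] * 4, J, 0)            # relevant bit, biased distribution
biased_irrelevant_gain = exact_gain([0.7] * 4, J, 2) # irrelevant bit, biased distribution
```

Under the uniform distribution the relevant bit has zero gain, so it cannot be told apart from an irrelevant bit; with $p_{i}=0.7$ the relevant bit has strictly positive gain while the irrelevant bit still has none.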

Theorem 1.

Fix $\alpha,c>0$. There is a polynomial $p$ for which the following holds. (Footnote 2: The polynomial $p$ depends on $\alpha,c$ and the impurity function $C$. See theorem 4 for a detailed dependency.) Suppose that the ID3 algorithm runs on $p\left(n,\log\left(\frac{1}{\delta}\right)\right)$ examples from an $(\alpha,c)$-distribution $\mathcal{D}$ that is realized by a $\log(n)$-parity. Then, w.p. $\geq 1-\delta$, ID3 will output a tree $T$ with $\mathcal{L}_{\mathcal{D}}(T)=0$.

Smoothed Analysis of Learning General Juntas

For general Juntas, instead of standard worst-case analysis, where we require that the algorithm succeeds in learning under any distribution, we will show that the algorithm learns under most distributions. Namely, for every fixed distribution, we show that the algorithm succeeds in learning, with high probability, a “noisy” version of this distribution. Formally, a smoothened $(\alpha,c)$-distribution $\mathcal{D}$ is a random distribution where $p_{i}=\hat{p}_{i}+\Delta_{i}$ for some $\hat{p}_{i}\in\left(\alpha+c,1-\alpha-c\right)$ and $\Delta_{i}\sim Uni([-c,c])$.
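As a concrete illustration of this definition, the following sketch draws the perturbed biases $p_{i}=\hat{p}_{i}+\Delta_{i}$ and then samples labelled examples from the resulting product distribution. The helper names, the majority-of-three junta, and the numeric values of $\alpha$, $c$ and $\hat{p}$ are assumptions made for the example only.

```python
import random

def smoothed_biases(p_hat, c, rng):
    """Draw p_i = p_hat_i + Delta_i with Delta_i ~ Uniform([-c, c])."""
    return [q + rng.uniform(-c, c) for q in p_hat]

def draw_example(p, junta, support, rng):
    """Sample x ~ prod_i Bernoulli(p_i) and label it by the junta on `support`."""
    x = tuple(1 if rng.random() < p_i else 0 for p_i in p)
    return x, junta(tuple(x[i] for i in support))

def majority3(z):
    """A 3-Junta chosen purely for illustration."""
    return int(sum(z) >= 2)

rng = random.Random(0)                   # fixed seed, only for reproducibility
alpha, c = 0.1, 0.05
p_hat = [0.5, 0.3, 0.7, 0.2, 0.8, 0.5]   # each lies in (alpha + c, 1 - alpha - c)
p = smoothed_biases(p_hat, c, rng)
sample = [draw_example(p, majority3, (0, 1, 2), rng) for _ in range(1000)]
```

Because each $\hat{p}_{i}$ lies in $(\alpha+c,1-\alpha-c)$, every perturbed bias stays inside $(\alpha,1-\alpha)$ regardless of the draw of $\Delta$.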

Theorem 2.

Fix $\alpha,c>0$. There is a polynomial $p$ for which the following holds. (Footnote 3: The polynomial $p$ again depends on $\alpha,c$ and the impurity function $C$. See theorem 5 for a detailed dependency.) Suppose that the ID3 algorithm runs on $p\left(n,\frac{1}{\delta}\right)$ examples from a smoothened $(\alpha,c)$-distribution $\mathcal{D}$ that is realized by a $\log(n)$-junta. Then, w.p. $\geq 1-\delta$, ID3 will output a tree $T$ with $\mathcal{L}_{\mathcal{D}}(T)=0$.

1.3 Open Question

We now turn to discussing possible open questions and future directions arising from this work. Our main result applies for the case where the target function is a $k$ -Junta, which can be implemented by a tree of depth $k=\log n$ . An immediate open question is whether a similar learnability result can be shown for general trees of depth $\log n$ . We conjecture that this is indeed the case.

Conjecture 1.

Fix $\alpha,c>0$ . There is a polynomial $p$ for which the following holds. Suppose that the ID3 algorithm runs on $p\left(n,\frac{1}{\delta},\frac{1}{\epsilon}\right)$ examples from a smoothened $(\alpha,c)$ -distribution $\mathcal{D}$ that is realized by a $\log(n)$ -depth-tree. Then, w.p. $\geq 1-\delta$ , ID3 will output a tree $T$ with $\mathcal{L}_{\mathcal{D}}(T)\leq\epsilon$ .

As we previously mentioned, our work could be viewed in a broader context of understanding heuristic learning algorithms that enjoy empirical success. In this field of research, a main challenge of the machine learning community is to understand the behavior of neural-networks learned with gradient-based algorithms. While our analysis is focused on proving results for the ID3 algorithm, we believe that similar techniques could be used to show similar results for learning neural-networks with gradient-descent. Specifically, we raise the following interesting question:

Open Question 1.

Can gradient-descent learn neural-networks when the target function is a $k$ -Junta, in the smoothed analysis setting?

2 Proofs

2.1 General Approach

Throughout, we assume that $\mathcal{D}$ is a distribution that is realized by a Junta $f$ , supported in $J\subset[n]$ , with $|J|=k$ . We assume w.l.o.g. that $J=[k]$ .

To prove our result, we will show that w.h.p., the algorithm chooses only variables from $[k]$ , and furthermore, any root-to-leaf path will contain all the variables from $[k]$ . In this case, the resulting tree will have zero generalization error. To formalize this, we will use the following notation. We define the support of a vector $w\in\{*,0,1\}^{n}$ as

$$\text{supp}(w)=\{i\in[n]:w_{i}\neq *\}$$

and let

$$\mathcal{X}_{w}=\{x\in\mathcal{X}:x_{i}=w_{i}\text{ for any }i\in\text{supp}(w)\}$$

For a sample $S\subseteq\mathcal{X}\times\mathcal{Y}$ , we denote

$$S_{w}=\{(x,y)\in S:x\in\mathcal{X}_{w}\}$$

Finally, for a distribution $\mathcal{D}$ we denote $\mathcal{D}_{w}=\mathcal{D}|_{x\in\mathcal{X}_{w}}$
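The notation above translates directly into code. In this small sketch a partial assignment $w\in\{*,0,1\}^{n}$ is a tuple whose free coordinates hold the string `"*"`; the helper names are hypothetical and coordinates are 0-indexed, unlike the 1-indexed notation of the text.

```python
STAR = "*"  # marks an unassigned coordinate of w

def supp(w):
    """supp(w): the coordinates that w fixes to 0 or 1."""
    return {i for i, w_i in enumerate(w) if w_i != STAR}

def in_X_w(x, w):
    """True if x agrees with w on every coordinate in supp(w)."""
    return all(x[i] == w[i] for i in supp(w))

def restrict(sample, w):
    """S_w: the examples (x, y) of S whose x lies in X_w."""
    return [(x, y) for x, y in sample if in_X_w(x, w)]

w = (1, STAR, 0)
S = [((1, 0, 0), 1), ((1, 1, 0), 0), ((0, 1, 0), 1), ((1, 1, 1), 0)]
S_w = restrict(S, w)
```

Here `S_w` keeps exactly the examples with first bit 1 and third bit 0, which is the sample a tree node reached through those two splits would observe.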

Lemma 1.

Suppose that the sample $S$ is realized by $f$ . Assume that for any $w\in\{0,1,*\}^{n}$ with $\text{supp}(w)\subset J$ we have $S_{w}\neq\emptyset$ and either of the following holds:

• All examples in $S_{w}$ have the same label.

• For all $i\in J\setminus\text{supp}(w)$ and $j\in[n]\setminus J$ we have $\text{Gain}(S_{w},i)>\text{Gain}(S_{w},j)$.

Then, the ID3 algorithm will build a tree with zero loss on $\mathcal{D}$ .

Not surprisingly, the gain of coordinates outside of $J$ is always small. This is formalized in the following lemma.

Lemma 2.

Assume that $C$ is $\gamma$ -Lipschitz. Fix $w\in\{0,1,*\}^{n}$ , with $|\text{supp}(w)|\leq k$ , $j\in[n]\setminus J$ and $\epsilon,\delta\in(0,1)$ . Assume we sample $S\sim\mathcal{D}^{m}$ with $m\gtrsim\epsilon^{-2}\alpha^{-2k}\log(\frac{1}{\delta})$ . Then with probability at least $1-\delta$ we have $S_{w}\neq\emptyset$ and:

$$\text{Gain}(S_{w},j)<2\gamma\epsilon$$

Given lemma 2, in order to apply lemma 1, it remains to show that the gain of the coordinates in $J$ is large. To this end, we will use a measure of dependence between a coordinate $x_{i}$ and the label $y$, which we define next. For a sample $S\subset\mathcal{X}\times\mathcal{Y}$ and an index $i\in[n]$, we let $\mathcal{I}(S,i)=\mathbb{E}_{S}\left[y\right]\mathbb{E}_{S}\left[x_{i}\right]-\mathbb{E}_{S}\left[yx_{i}\right]$. Similarly, for a distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$ we let $\mathcal{I}(\mathcal{D},i)=\mathbb{E}_{\mathcal{D}}\left[y\right]\mathbb{E}_{\mathcal{D}}\left[x_{i}\right]-\mathbb{E}_{\mathcal{D}}\left[yx_{i}\right]$. Note that $x_{i}$ and $y$ are independent if and only if $\mathcal{I}(\mathcal{D},i)=0$. The following lemma connects $\text{Gain}(S_{w},i)$ to $\mathcal{I}(\mathcal{D}_{w},i)$.
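To make the dependence measure concrete, here is a minimal sketch that evaluates $\mathcal{I}(S,i)$ on a toy sample. The sample (the full cube $\{0,1\}^{3}$ labelled by $y=x_{0}\wedge x_{1}$, 0-indexed) and the helper names are assumptions chosen for illustration.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def dependence(sample, i):
    """I(S, i) = E_S[y] * E_S[x_i] - E_S[y * x_i]."""
    e_y = mean(y for _, y in sample)
    e_x = mean(x[i] for x, _ in sample)
    e_yx = mean(y * x[i] for x, y in sample)
    return e_y * e_x - e_yx

# Full cube {0,1}^3 labelled by the 2-Junta y = x_0 AND x_1.
cube = [((a, b, c), a & b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
relevant = dependence(cube, 0)    # (1/4)(1/2) - 1/4 = -1/8
irrelevant = dependence(cube, 2)  # (1/4)(1/2) - 1/8 = 0
```

The relevant coordinate has a nonzero measure, while the coordinate outside the junta's support has measure exactly zero, matching the independence remark above.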

Lemma 3.

Assume $C$ is $\beta$-strongly concave (i.e., $-C$ is $\beta$-strongly convex). Assume for some $w\in\{0,1,*\}^{n}$, with $|\text{supp}(w)|\leq k$, and index $i\in[n]$ we have $|\mathcal{I}(\mathcal{D}_{w},i)|\geq\epsilon>0$. Fix $\delta>0$. Then, if we sample $S\sim\mathcal{D}^{m}$ for $m\gtrsim\epsilon^{-2}\alpha^{-2k}\log(\frac{1}{\delta})$, with probability at least $1-\delta$ we have $S_{w}\neq\emptyset$ and:

$$\text{Gain}(S_{w},i)\geq\frac{\beta\epsilon^{2}}{8}$$

Combining lemmas 1, 2 and 3, we get the following theorem:

Theorem 3.

Assume $C$ is $\beta$ strongly concave and $\gamma$ -Lipschitz. Assume for any $w\in\{0,1,*\}^{n}$ , with $\text{supp}(w)\subset J$ we have $S_{w}\neq\emptyset$ and either of the following holds:

• All examples in $\mathcal{D}_{w}$ have the same label.

• For every index $i\in J\setminus\text{supp}(w)$ we have $|\mathcal{I}(\mathcal{D}_{w},i)|\geq\epsilon>0$.

Fix $\delta>0$ . Then, if we sample $S\sim\mathcal{D}^{m}$ for $m\gtrsim\beta^{-2}\gamma^{2}\epsilon^{-4}\alpha^{-2k}k\log(\frac{n}{\delta})$ , then with probability at least $1-\delta$ the ID3 algorithm will build a tree with zero loss on $\mathcal{D}$

By the above theorem, in order to show that the ID3 algorithm succeeds in learning, it is enough to lower bound $|\mathcal{I}(\mathcal{D}_{w},i)|$ . This is done in the remaining sections, together with the proof of lemmas 1, 2 and 3.

2.2 Proof of the basic lemmas

Proof.

(of lemma 1) At every iteration, the ID3 algorithm assigns a splitting variable for a given node, or otherwise returns a leaf for this node. We will show that for every node that the algorithm iterates on, if the path from the root to this node contains only variables from $J$, then either the algorithm adds a splitting variable from $J$, or the algorithm returns a leaf. Indeed, assume that the path from the root to this node contains only variables from $J$. We can encode the root-to-node path by a vector $w\in\{*,0,1\}^{n}$, where $w_{i}=1$ if the branch $x_{i}=1$ is in the path, $w_{i}=0$ if the branch $x_{i}=0$ is in the path, and $w_{i}=*$ otherwise. Therefore, by our assumption we have $\text{supp}(w)\subseteq J$. Note that in this case, the algorithm observes the sample $S_{w}$, so if all examples in $S_{w}$ have the same label, then the algorithm returns a leaf. Otherwise, by the assumption we get $\operatorname*{arg\,max}_{i\in A}\text{Gain}(S_{w},i)\in J$, where $A$ denotes the set of features not yet used on this path, so the algorithm chooses a splitting variable from $J$.

From the above, the algorithm adds only splitting variables from $J$, so it can build a tree of size at most $2^{k}$ before stopping. This tree has zero loss on the distribution. Indeed, for any $x^{\prime}\in\{0,1\}^{k}$, denote by $w(x^{\prime})\in\{0,1,*\}^{n}$ the vector with $w(x^{\prime})_{i}=x^{\prime}_{i}$ for every $i\in[k]$ and $w(x^{\prime})_{i}=*$ for every $i\notin[k]$. Then, since we assume $S_{w(x^{\prime})}\neq\emptyset$, there exists an example $(x,y)\in S$ such that $x_{i}=x^{\prime}_{i}$ for every $i\in[k]$. By definition, the algorithm returns a tree that correctly labels the example $x$, therefore it returns a function that agrees with $f(x)=\tilde{f}(x^{\prime})$. Since this is true for every choice of $x^{\prime}\in\{0,1\}^{k}$, the function returned by the tree agrees with the Junta defined by $\tilde{f}$, so it gets zero loss. ∎

We next relate the empirical measure $\mathcal{I}(S,i)$ to $\mathcal{I}(\mathcal{D},i)$ .

Lemma 4.

Fix $w\in\{0,1,*\}^{n}$, with $|\text{supp}(w)|\leq k$, $i\in[n]$, $\epsilon,\delta\in(0,1)$. Let $S\sim\mathcal{D}^{m}$ with $m\gtrsim\alpha^{-2k}\epsilon^{-2}\log(\frac{1}{\delta})$. Then with probability at least $1-\delta$ we have $S_{w}\neq\emptyset$ and:

$$\left|\mathcal{I}(S_{w},i)-\mathcal{I}(\mathcal{D}_{w},i)\right|<\epsilon$$

Proof.

Denote $S=\{({\bm{x}}_{1},y_{1}),\dots,({\bm{x}}_{m},y_{m})\}$. Let $\bar{p_{i}}=\mathbb{E}_{S_{w}}\left[x_{i}\right]$ and $p_{w}=\Pr_{x\sim\mathcal{D}}\left(x\in\mathcal{X}_{w}\right)$. We have

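$$\mathbb{E}_{S_{w}}[y]=\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}=\frac{\frac{1}{p_{w}m}\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{\frac{1}{p_{w}m}\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}$$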

By Hoeffding’s bound, with probability $\geq 1-\frac{\delta}{3}$ , we have

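$$p_{w}\,\mathbb{E}_{\mathcal{D}_{w}}y-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{m}\lesssim\epsilon\alpha^{k}\quad\text{and}\quad p_{w}-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}{m}\lesssim\epsilon\alpha^{k}$$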

dividing by $p_{w}$ we get

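$$\mathbb{E}_{\mathcal{D}_{w}}y-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]\,y_{j}}{p_{w}m}\lesssim\frac{\epsilon\alpha^{k}}{p_{w}}\lesssim\epsilon\quad\text{and}\quad 1-\frac{\sum_{j=1}^{m}\mathbf{1}[x_{j}\in\mathcal{X}_{w}]}{p_{w}m}\lesssim\frac{\epsilon\alpha^{k}}{p_{w}}\lesssim\epsilon$$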

Notice that from the above we get that $S_{w}\neq\emptyset$ . It follows that

$$\left|\mathbb{E}_{S_{w}}[y]-\mathbb{E}_{\mathcal{D}_{w}}y\right|\lesssim\epsilon$$

Similarly,

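$$\left|\mathbb{E}_{S_{w}}[x_{i}]-\mathbb{E}_{\mathcal{D}_{w}}x_{i}\right|\lesssim\epsilon\quad\text{and}\quad\left|\mathbb{E}_{S_{w}}[yx_{i}]-\mathbb{E}_{\mathcal{D}_{w}}yx_{i}\right|\lesssim\epsilon$$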

In this case, we have $\left\lvert\mathcal{I}(S_{w},i)-\mathcal{I}(\mathcal{D}_{w},i)\right\rvert<\epsilon$ ∎

We next prove lemmas 2 and 3

Proof.

(of lemma 2) Notice that since $x_{i}$ and $y$ are independent, we have $\mathcal{I}(\mathcal{D}_{w},i)=0$ . By the choice of $m$ , from Lemma 4 we get that with probability $1-\delta$ :

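$$\left|\mathcal{I}(S_{w},i)\right|=\left|\mathcal{I}(S_{w},i)-\mathcal{I}(\mathcal{D}_{w},i)\right|<\epsilon$$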

Denote $\bar{p_{i}}=\mathbb{P}_{S_{w}}\left[x_{i}=1\right]$ . Notice that if $\bar{p_{i}}=0$ or $\bar{p_{i}}=1$ then $\mathcal{I}(S_{w},i)=0$ , and the result trivially holds. We can therefore assume $\bar{p_{i}}\in(0,1)$ . Now, we have the following:

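$$\left|\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1]\right|=\frac{|\mathcal{I}(S_{w},i)|}{\bar{p_{i}}}<\frac{\epsilon}{\bar{p_{i}}}$$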

Similarly, we get:

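$$\left|\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]-\mathbb{P}_{S_{w}}[y=1]\right|=\frac{|\mathcal{I}(S_{w},i)|}{1-\bar{p_{i}}}<\frac{\epsilon}{1-\bar{p_{i}}}$$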

Using the $\gamma$ -Lipschitz property, we get:

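$$\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|\leq\gamma\left|\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1]\right|<\frac{\gamma\epsilon}{\bar{p_{i}}}$$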

And similarly:

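$$\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|<\frac{\gamma\epsilon}{1-\bar{p_{i}}}$$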

Now plugging into the gain definition:

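$$\begin{aligned}\left|\text{Gain}(S_{w},i)\right|&=\Big|C\left(\mathbb{P}_{S_{w}}[y=1]\right)-\Big(\mathbb{P}_{S_{w}}[x_{i}=1]\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)+\mathbb{P}_{S_{w}}[x_{i}=0]\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)\Big)\Big|\\&\leq\mathbb{P}_{S_{w}}[x_{i}=1]\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|+\mathbb{P}_{S_{w}}[x_{i}=0]\left|C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)-C\left(\mathbb{P}_{S_{w}}[y=1]\right)\right|<2\gamma\epsilon\end{aligned}$$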

∎

Proof.

(of lemma 3) By the choice of $m$ , from Lemma 4 we get that with probability $1-\delta$ :

$$\left|\mathcal{I}(S_{w},i)-\mathcal{I}(\mathcal{D}_{w},i)\right|\leq\frac{\epsilon}{2}$$

Since we assume $|\mathcal{I}(\mathcal{D}_{w},i)|\geq\epsilon$ , we get that $|\mathcal{I}(S_{w},i)|\geq\frac{\epsilon}{2}$ . Therefore, we have that $\bar{p_{i}}\in(0,1)$ (again denoting $\bar{p_{i}}=\mathbb{P}_{S_{w}}\left[x_{i}=1\right]$ ). Observe that we have the following:

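$$\begin{aligned}&\mathbb{P}_{S_{w}}[x_{i}=1]\,\mathbb{P}_{S_{w}}[x_{i}=0]\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)\\&=\mathbb{P}_{S_{w}}[x_{i}=0]\,\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]-\mathbb{P}_{S_{w}}[x_{i}=1]\,\mathbb{P}_{S_{w}}[x_{i}=0\land y=1]\\&=\mathbb{P}_{S_{w}}[x_{i}=0]\,\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]-\mathbb{P}_{S_{w}}[x_{i}=1]\left(\mathbb{P}_{S_{w}}[y=1]-\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]\right)\\&=\mathbb{P}_{S_{w}}[x_{i}=1\land y=1]-\mathbb{P}_{S_{w}}[x_{i}=1]\,\mathbb{P}_{S_{w}}[y=1]=-\mathcal{I}(S_{w},i)\end{aligned}$$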

Therefore, we have:

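$$\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]=-\frac{\mathcal{I}(S_{w},i)}{\bar{p_{i}}(1-\bar{p_{i}})}$$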

Since $C$ is $\beta$ strongly concave we get that for all $a,b,t\in[0,1]$ we have:

$$C(ta+(1-t)b)\geq tC(a)+(1-t)C(b)+\frac{\beta}{2}t(1-t)(a-b)^{2}$$

Using this property we get that:

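$$\begin{aligned}&\bar{p_{i}}\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]\right)+(1-\bar{p_{i}})\,C\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)\\&\leq C\left(\mathbb{P}_{S_{w}}[y=1]\right)-\frac{\beta}{2}\bar{p_{i}}(1-\bar{p_{i}})\left(\mathbb{P}_{S_{w}}[y=1\mid x_{i}=1]-\mathbb{P}_{S_{w}}[y=1\mid x_{i}=0]\right)^{2}\\&=C\left(\mathbb{P}_{S_{w}}[y=1]\right)-\frac{\beta}{2}\cdot\frac{\mathcal{I}(S_{w},i)^{2}}{\bar{p_{i}}(1-\bar{p_{i}})}\end{aligned}$$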

Plugging this to the gain equation we get:

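$$\text{Gain}(S_{w},i)\geq\frac{\beta}{2}\cdot\frac{\mathcal{I}(S_{w},i)^{2}}{\bar{p_{i}}(1-\bar{p_{i}})}\geq\frac{\beta}{2}\mathcal{I}(S_{w},i)^{2}$$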

Since $|\mathcal{I}(S_{w},i)|\geq\frac{\epsilon}{2}$ , we get $Gain(S_{w},i)\geq\frac{\beta\epsilon^{2}}{8}$ .

∎
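The chain of inequalities in this proof can be sanity-checked numerically. The sketch below assumes the Gini impurity $C(q)=2q(1-q)$, which is $\beta$-strongly concave with $\beta=4$ (since $C''=-4$); for this particular impurity the strong-concavity inequality holds with equality, so the gain equals $\frac{\beta}{2}\cdot\frac{\mathcal{I}(S,i)^{2}}{\bar{p_{i}}(1-\bar{p_{i}})}$ exactly. The toy sample and helper names are assumptions for illustration.

```python
BETA = 4.0  # strong-concavity constant of the Gini impurity 2q(1-q)

def gini(q):
    return 2.0 * q * (1.0 - q)

def prob(sample, event):
    """Empirical probability of `event(x, y)` over the sample."""
    return sum(1 for x, y in sample if event(x, y)) / len(sample)

def gain(sample, i):
    """Gain(S, i) with the Gini impurity."""
    total = gini(prob(sample, lambda x, y: y == 1))
    for b in (0, 1):
        p_b = prob(sample, lambda x, y: x[i] == b)
        if p_b > 0:
            p_y_given_b = prob(sample, lambda x, y: x[i] == b and y == 1) / p_b
            total -= p_b * gini(p_y_given_b)
    return total

def dependence(sample, i):
    """I(S, i) = E[y] E[x_i] - E[y x_i] for 0/1-valued x_i and y."""
    e_y = prob(sample, lambda x, y: y == 1)
    e_x = prob(sample, lambda x, y: x[i] == 1)
    e_yx = prob(sample, lambda x, y: x[i] == 1 and y == 1)
    return e_y * e_x - e_yx

# Full cube {0,1}^3 labelled by y = x_0 AND x_1 (0-indexed).
cube = [((a, b, c), a & b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
i = 0
p_bar = prob(cube, lambda x, y: x[i] == 1)
I = dependence(cube, i)
lhs = gain(cube, i)
tight_bound = BETA / 2 * I ** 2 / (p_bar * (1 - p_bar))
loose_bound = BETA / 2 * I ** 2
```

For a general strongly concave impurity only the inequality is guaranteed; the equality seen here is specific to the quadratic Gini function.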

2.3 Parities

Lemma 5.

Let $\mathcal{D}$ be a distribution on $\mathcal{X}\times\mathcal{Y}$ labelled by $\chi_{J}$ with $|J|\leq k$ . Assume that for every $j\in J$ we have $p_{j}\in(\alpha,1-\alpha)$ and $|p_{j}-\frac{1}{2}|\geq c$ , for some $c,\alpha>0$ . Fix some $w\in\{0,1,*\}^{k}$ . Then for every $j\in J\setminus\text{supp}(w)$ we have:

$$\left|\mathcal{I}(\mathcal{D}_{w},j)\right|>\alpha^{2}(2c)^{k-1}$$

By theorem 3 we have

Theorem 4.

Let $\mathcal{D}$ be a distribution on $\mathcal{X}\times\mathcal{Y}$ labelled by $\chi_{J}$ with $|J|\leq k$ . Assume that for every $j\in J$ we have $p_{j}\in(\alpha,1-\alpha)$ and $|p_{j}-\frac{1}{2}|\geq c$ , for some $c,\alpha>0$ . Assume furthermore that $C$ is $\beta$ strongly concave and $\gamma$ -Lipschitz.

Then, if we sample $S\sim\mathcal{D}^{m}$ for $m\gtrsim\beta^{-2}\gamma^{2}(2c)^{-4k-4}\alpha^{-2k-8}k\log(\frac{n}{\delta})$ , then with probability at least $1-\delta$ the ID3 algorithm will build a tree with zero loss on $\mathcal{D}$

Note that since we assume $k\leq\log n$ , the runtime and sample complexity in the above theorem are polynomial in $n$ . We give the proof of this theorem in the rest of this section.
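To see why the bound is polynomial, it may help to substitute $k=\log_{2}n$ into the sample complexity of theorem 4. The following worked computation is a sketch under two stated assumptions: the logarithm is taken base 2, and $2c<1$ (which follows from $|p_{j}-\frac{1}{2}|\geq c$ with $p_{j}\in(0,1)$).

```latex
% Assumptions: k = \log_2 n, and 2c < 1.
\begin{align*}
\alpha^{-2k} &= 2^{2k\log_2(1/\alpha)} = n^{2\log_2(1/\alpha)}, \\
(2c)^{-4k}   &= 2^{4k\log_2\frac{1}{2c}} = n^{4\log_2\frac{1}{2c}}, \\
m &\gtrsim \beta^{-2}\gamma^{2}\,(2c)^{-4k-4}\,\alpha^{-2k-8}\,k\log\left(\frac{n}{\delta}\right) \\
  &= \beta^{-2}\gamma^{2}\,(2c)^{-4}\,\alpha^{-8}\;
     n^{\,4\log_2\frac{1}{2c}+2\log_2\frac{1}{\alpha}}\;
     \log_2(n)\,\log\left(\frac{n}{\delta}\right).
\end{align*}
```

For fixed $\alpha$ and $c$ the exponent of $n$ is a constant, so the required number of examples is polynomial in $n$ and in $\log\left(\frac{1}{\delta}\right)$.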

Proof.

Denote $\epsilon_{i}:=p_{i}-\frac{1}{2}$ , and $k^{\prime}:=|A|$ . For simplicity of notation, assume w.l.o.g that $A=[k^{\prime}]$ and $j=k^{\prime}$ . Observe the following:

[TABLE]

Similarly, we get that:

[TABLE]

Therefore, we get that:

[TABLE]

∎

2.4 Juntas

Lemma 6.

Fix some $w\in\{*,0,1\}^{k}$ , and assume not all examples in $\mathcal{D}_{w}$ have the same label. Let $A=\{i\in[k]~{}:~{}w_{i}=*\}$ . Assume $p_{i}\in(\alpha,1-\alpha)$ for $\alpha>0$ for every $i$ , and fix $\delta>0$ . Then there exists $i\in A\cap[k]$ such that with probability $1-\delta$ over the choice of $\Delta$ :

[TABLE]

By theorem 3 we get

Theorem 5.

Assume $C$ is $\beta$ strongly concave and $\gamma$ -Lipschitz. Fix $\delta_{1},\delta_{2}>0$ . Then, if we sample $S\sim\mathcal{D}^{m}$ for $m\gtrsim\beta^{-2}\gamma^{2}c^{-8k}\delta_{1}^{-8}\alpha^{-2k-8}k\log(\frac{n}{\delta_{2}})$ , then with probability at least $1-\delta_{1}-\delta_{2}$ the ID3 algorithm will build a tree with zero loss on $\mathcal{D}$

Proof.

For simplicity of notation, we assume w.l.o.g. that $A=[k^{\prime}]$ for some $k^{\prime}\leq k$ . Denote $f_{w}:\{0,1\}^{k^{\prime}}\to\{0,1\}$ , such that $f_{w}(x_{1},\dots,x_{k^{\prime}})=f(x_{1},\dots,x_{k^{\prime}},w_{k^{\prime}+1},\dots,w_{k})$ . Observe the Fourier coefficients of $f_{w}$ :

[TABLE]

Where $\chi_{I}=\prod_{i\in I}(2x_{i}-1)$, and note that the functions $\chi_{I}$ form a Fourier basis (w.r.t. the uniform distribution). Notice that $|\alpha_{I}|\geq\frac{1}{2^{k}}$ for every $\alpha_{I}\neq 0$. Indeed, we have:

[TABLE]

where $\chi_{I}({\bm{x}})f({\bm{x}})\in\{-1,0,1\}$ , and this gives the required. Since not all examples in $\mathcal{D}_{w}$ have the same label, we know that $f_{w}$ is not a constant function. Therefore, there exists $\emptyset\neq I_{0}\subseteq[k^{\prime}]$ such that $\alpha_{I_{0}}\neq 0$ . Fix some $i\in I_{0}$ , and we assume w.l.o.g. that $i=1$ (so $1\in I_{0}$ ). Now, we can write:

[TABLE]

Where: $g(x_{2},\dots,x_{k^{\prime}})=\sum_{I\subset[k^{\prime}],1\in I}\alpha_{I}\chi_{I\setminus\{1\}}({\bm{x}})$ .

and since $\alpha_{I_{0}}\neq 0$ and $1\in I_{0}$ we get $g\neq 0$ . Now, notice that since ${\bm{x}}\in\{0,1\}^{n}$ we get:

[TABLE]

And similarly: $\mathbb{E}_{\mathcal{D}_{w}}\left[f({\bm{x}})|x_{1}=1\right]=f_{w}(1,p_{2},\dots,p_{k^{\prime}})$ .

Therefore we get:

[TABLE]

Where $g_{0}$ is given by:

[TABLE]

for some choice of coefficients $\beta_{I}$. Denote $k_{0}=\deg(g_{0})$ and note that $k_{0}\leq k^{\prime}-1$. Notice that for some maximal $I\subset[k^{\prime}]$ with $1\in I$ and $\alpha_{I}\neq 0$ (so $|I|=k_{0}$), we have $\beta_{I}=2^{|I|}\alpha_{I}$, so $|\beta_{I}|\geq\frac{2^{k_{0}}}{2^{k^{\prime}}}$.

Now, denote $\xi_{i}=\frac{1}{c}\Delta_{i}$ , so we have $\xi_{i}\sim Uni([-1,1])$ , and observe the polynomial:

[TABLE]

And from what we have shown, $G_{0}$ is a polynomial of degree $k_{0}$ , and there exists $I$ with $|I|=k_{0}$ such that $|\gamma_{I}|\geq 1$ . Therefore, we can use Lemma 3 from [20] to get that:

[TABLE]

And therefore:

[TABLE]

So if we take $\epsilon=\delta^{2}\left(\frac{c}{2}\right)^{2k}$ we get that $\mathbb{P}\left[|g_{0}|\leq\epsilon\right]\leq\delta$ , which completes the proof.

∎

Bibliography

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
[3] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
[4] Avrim Blum. Rank-r decision trees are a subclass of r-decision lists. Information Processing Letters, 42(4):183–185, 1992.
[5] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, volume 94, pages 253–262, 1994.
[6] Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003.
[7] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
[8] Nader H Bshouty and Lynn Burroughs. On the proper learning of axis-parallel concepts. Journal of Machine Learning Research, 4(Jun):157–176, 2003.
