ID3 Learns Juntas for Smoothed Product Distributions
Alon Brutzkus, Amit Daniely, Eran Malach

TL;DR
This paper proves that the ID3 algorithm can efficiently learn k-Junta functions under smoothed analysis, demonstrating its effectiveness in noisy environments for functions depending on a logarithmic number of variables.
Contribution
It provides the first theoretical analysis showing ID3 learns k-Juntas in polynomial time when k = log n under smoothed analysis.
Findings
ID3 learns k-Juntas in polynomial time for k = log n
The analysis applies to noisy, smoothed distributions
Supports practical effectiveness of ID3 in noisy settings
Abstract
In recent years, there have been many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm when the target function is a k-Junta, a function that depends on k out of n variables of the input. We prove that when k = log n, the ID3 algorithm learns k-Juntas in polynomial time, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.
Alon Brutzkus: The Blavatnik School of Computer Science, Tel Aviv University, Israel
Amit Daniely: School of Computer Science, The Hebrew University, Israel
Eran Malach: School of Computer Science, The Hebrew University, Israel
1 Introduction
In recent years there has been a growing interest in analyzing machine learning algorithms that are commonly used in practice. A primary example is the gradient-descent algorithm for learning neural-networks, which achieves remarkable performance in practice but has very few formal guarantees. A main approach in studying such algorithms is proving that they are able to learn models that are known to be learnable. For example, it has been shown that SGD can learn neural-networks when the target function is linear, or belongs to a certain kernel space [7, 34, 12, 13, 27, 1, 2, 3, 28, 25, 24].
In this paper we take a similar approach, aiming to give theoretical guarantees for the ID3 algorithm [29], a popular algorithm for learning decision trees. We analyze the behavior of this algorithm when the target function is a k-Junta, a function that depends only on k bits from the input, and the underlying distribution is a product distribution, where the bits in the input examples are independent. While we cannot guarantee that the ID3 algorithm learns under any such distribution, as there are distributions on which the algorithm fails, we show that the algorithm can learn “most” such distributions. That is, we show that for any product distribution and a k-Junta, the ID3 algorithm learns the junta over a “noisy” variant of the original distribution. Such a result is in the spirit of smoothed analysis [33], which is often used to give results when a worst-case analysis is not satisfactory.
Related Work
There are a number of works studying the learnability of decision trees [30, 23, 4, 14, 9, 8, 10]. We next elaborate on papers that analyze decision trees under product distributions, as we do. The work of [20] gives learnability results of decision trees for product distributions with smoothed analysis, in a problem setting similar to ours. Their work analyzes an algorithm that estimates the Fourier coefficients of the target function in order to learn the decision tree. Another work [26] proves learnability of decision trees implementing monotone Boolean functions under the uniform distribution. Other algorithms for learning decision trees under the uniform distribution are given in [19, 18], again relying on Fourier analysis of the target function. Another work [11] gives an algorithm for learning stochastic decision trees under the uniform distribution. The work of [5] gives negative results on learning polynomial size decision trees under the uniform distribution in the statistical query setting.
While the above works study learnability of decision trees under various distributional assumptions, they all consider algorithms that are very different from algorithms used in practice. Our work, on the contrary, gives guarantees for algorithms that enjoy empirical success. In the current literature there are very few works that analyze such algorithms. Notably, the work of [17] studies the class of impurity-based algorithms, which contains the ID3 algorithm. This work shows that unate functions, like linear threshold functions or read-once DNFs, are learnable under the uniform distribution, using impurity-based algorithms. Our work, on the other hand, considers a different choice of target functions (Juntas), and shows learnability under “most” distributions, and not only for a fixed distribution. Another work that studies an algorithm used in practice [22] shows that the CART and C4.5 algorithms can leverage weak approximation of the target function, and thus can perform boosting. However, it is not clear whether such weak approximation typically happens, and in what cases this result can be applied. In contrast, our results apply to a concrete family of functions and distributions.
1.1 Problem Setting
The ID3 Algorithm
Let be the domain set and let be the label set. We next describe the ID3 algorithm, following the presentation in [31]. Define an impurity function to be any concave function , satisfying that and . Given an impurity function , a sample and an index , we define the gain measure to be as follows:
[TABLE]
Given a sample , the ID3 algorithm generates a decision tree in a recursive manner. At each step of the recursion, the algorithm chooses the feature to be assigned to a current node. The algorithm iterates over all the unused features, and calculates the gain measure with respect to the examples that reach the current node. Then, it chooses the feature that maximizes the gain. This algorithm is described formally in algorithm 1. The output of the algorithm is given by the initial call to .
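The recursion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: bits and labels are encoded as -1/+1, the Gini function stands in for the impurity function (it is concave, symmetric, and vanishes at 0 and 1), and the names `gini`, `gain`, `id3` and `predict` are hypothetical.

```python
import itertools

def gini(p):
    """Example impurity function (an assumption): concave, symmetric, zero at 0 and 1."""
    return 2.0 * p * (1.0 - p)

def frac_pos(sample):
    """Fraction of examples labeled +1 (0.0 for an empty sample)."""
    return sum(1 for _, y in sample if y == 1) / len(sample) if sample else 0.0

def gain(sample, i, impurity=gini):
    """Impurity of the sample minus the weighted impurity after splitting on bit i."""
    pos = [(x, y) for x, y in sample if x[i] == 1]
    neg = [(x, y) for x, y in sample if x[i] == -1]
    m = len(sample)
    after = (len(pos) / m) * impurity(frac_pos(pos)) + (len(neg) / m) * impurity(frac_pos(neg))
    return impurity(frac_pos(sample)) - after

def id3(sample, unused, impurity=gini):
    """Return a leaf label, or a tuple (index, subtree for bit = -1, subtree for bit = +1)."""
    labels = {y for _, y in sample}
    if len(labels) <= 1 or not unused:
        # Leaf: majority label, defaulting to +1 on an empty sample.
        return 1 if frac_pos(sample) >= 0.5 or not sample else -1
    j = max(unused, key=lambda i: gain(sample, i, impurity))  # feature with maximal gain
    rest = unused - {j}
    neg = [(x, y) for x, y in sample if x[j] == -1]
    pos = [(x, y) for x, y in sample if x[j] == 1]
    return (j, id3(neg, rest, impurity), id3(pos, rest, impurity))

def predict(tree, x):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        j, lo, hi = tree
        tree = hi if x[j] == 1 else lo
    return tree

# Usage: label every point of {-1,+1}^3 by a function of bits 0 and 2 only.
points = list(itertools.product([-1, 1], repeat=3))
target = lambda x: 1 if (x[0] == 1 and x[2] == 1) else -1
S = [(x, target(x)) for x in points]
tree = id3(S, frozenset(range(3)))
```

In this toy run the irrelevant bit 1 has zero gain at the root, while bits 0 and 2 have positive gain, so the tree splits only on the relevant bits.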
Learning Juntas
A k-Junta is a function that depends on k coordinates. Namely, there is a set of k coordinates and a function of those coordinates alone that agrees with the target function on every input. In this case, we will say that the target function is supported in that set. Throughout the paper, we assume that the examples are sampled from a product distribution that is realizable by a k-Junta. Namely, the coordinates of each input are drawn independently, and the label is the value of some k-Junta on the input. The main goal of this paper is to show that for “most” product distributions, the ID3 algorithm succeeds in learning k-Juntas with k = log n in polynomial time. Namely, it will return a tree whose generalization error is small (in fact, zero). We note that the sample complexity of learning Juntas on super-logarithmically many coordinates is super-polynomial; hence, Juntas on logarithmically many coordinates are the best that we can hope to learn in polynomial time.
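To make the setting concrete, the following sketch draws labeled examples from a product distribution realized by a Junta. It assumes a -1/+1 encoding in which `p[i]` is the probability that bit i equals +1; the helper names `sample_product` and `make_junta`, the bias values, and the choice of a 3-bit majority as the target are all illustrative.

```python
import random

def sample_product(p, m, rng):
    """Draw m points from {-1,+1}^n; bit i is +1 with probability p[i], independently."""
    return [tuple(1 if rng.random() < pi else -1 for pi in p) for _ in range(m)]

def make_junta(J, g):
    """Return f with f(x) = g(x restricted to J), so f depends only on the coordinates in J."""
    return lambda x: g(tuple(x[i] for i in J))

rng = random.Random(0)
n, J = 8, (1, 4, 6)                            # a 3-Junta over 8 input bits
majority = lambda z: 1 if sum(z) > 0 else -1   # sum of three +/-1 values is never 0
f = make_junta(J, majority)
p = [0.3, 0.6, 0.5, 0.7, 0.4, 0.5, 0.8, 0.2]   # illustrative per-bit biases
S = [(x, f(x)) for x in sample_product(p, 200, rng)]
```

Flipping any coordinate outside `J` leaves the label unchanged, which is exactly what it means for the target to be supported in `J`.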
1.2 Results
We will show two positive results for learning Juntas with k = log n. The first establishes learnability of parities, while the second is about learnability of general Juntas. Throughout, we assume that the impurity function is strongly concave and Lipschitz.
Learning Parities
A k-parity is a function that computes the parity of the input bits indexed by a set of k indices. Note that any k-parity is a k-Junta. We first consider learnability of k-parities with k = log n by the ID3 algorithm. (Footnote 1: As opposed to general k-Juntas, k-parities with any k are learnable in polynomial time. Yet, in the context of decision tree algorithms, we cannot hope to learn k-parities with k super-logarithmic in n. Indeed, such parities cannot be computed, or even approximated, by a poly-sized tree.)
Learning parity functions is a classical problem in machine learning, for which there exists an efficient algorithm [15, 16]. Still, parities often serve as a hard benchmark, as many common algorithms cannot learn these functions [6, 32]. In the case of the ID3 algorithm, when the underlying distribution is uniform (i.e., when every bit is unbiased), the algorithm fails to learn parity functions [21]. We show that the case of the uniform distribution is in some sense unique. That is, we show that for every distribution that is not “too close” to the uniform distribution, the ID3 algorithm succeeds in learning any such parity function. To this end, we say that is -distribution if and for any .
Theorem 1.
Fix . There is a polynomial for which the following holds. Suppose that the ID3 algorithm runs on examples from an -distribution that is realized by a -parity. Then, w.p. , ID3 will output a tree with .
Footnote 2: The polynomial depends on and the impurity function . See Theorem 4 for the detailed dependence.
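The contrast between the uniform distribution and a biased product distribution can be checked numerically. The sketch below computes the exact (population) gain of each coordinate by enumerating all inputs; it assumes a -1/+1 encoding with `p[j]` the probability that bit j is +1, uses the Gini function as an example impurity, and the name `population_gain` is hypothetical.

```python
import itertools
from math import prod

def gini(p):
    """Example impurity function (an assumption): concave, symmetric, zero at 0 and 1."""
    return 2.0 * p * (1.0 - p)

def population_gain(p, f, i, impurity=gini):
    """Exact gain of coordinate i under the product distribution with P[x_j = +1] = p[j]."""
    n = len(p)
    mass = {1: 0.0, -1: 0.0}   # P[x_i = b]
    pos = {1: 0.0, -1: 0.0}    # P[x_i = b and y = +1]
    for x in itertools.product([-1, 1], repeat=n):
        w = prod(p[j] if x[j] == 1 else 1.0 - p[j] for j in range(n))
        mass[x[i]] += w
        if f(x) == 1:
            pos[x[i]] += w
    before = impurity(pos[1] + pos[-1])
    after = sum(mass[b] * impurity(pos[b] / mass[b]) for b in (1, -1) if mass[b] > 0)
    return before - after

parity = lambda x: prod(x)     # parity of all bits in the +/-1 encoding
uniform = [0.5, 0.5, 0.5]
biased = [0.7, 0.7, 0.7]
gains_uniform = [population_gain(uniform, parity, i) for i in range(3)]
gains_biased = [population_gain(biased, parity, i) for i in range(3)]
```

Under the uniform distribution every coordinate of the parity has zero gain, so a gain-maximizing split has no signal to follow; once the bits are biased away from 1/2, every relevant coordinate has strictly positive gain.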
Smoothed Analysis of Learning General Juntas
For general Juntas, instead of standard worst-case analysis, where we require that the algorithm succeeds in learning any distribution, we will show that the algorithm learns most distributions. Namely, for every fixed distribution, we show that the algorithm succeeds in learning, with high probability, a “noisy” version of this distribution. Formally, a smoothened -distribution is a random distribution where for some and .
Theorem 2.
Fix . There is a polynomial for which the following holds. Suppose that the ID3 algorithm runs on examples from a smoothened -distribution that is realized by a -junta. Then, w.p. , ID3 will output a tree with .
Footnote 3: The polynomial again depends on and the impurity function . See Theorem 5 for the detailed dependence.
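As an illustration of the smoothing step, the sketch below perturbs the bias of every bit of a fixed product distribution by independent noise. The noise model used here, uniform on [-c, c], is an assumption made for the example, as are the name `smoothen` and the values of `p_hat` and `c`; the base biases are chosen in [c, 1 - c] so that the perturbed values remain valid probabilities.

```python
import random

def smoothen(p_hat, c, rng):
    """Perturb each bias independently by uniform noise in [-c, c] (assumed noise model)."""
    return [ph + rng.uniform(-c, c) for ph in p_hat]

rng = random.Random(1)
p_hat = [0.5] * 6   # the uniform distribution, the hard case for parities
c = 0.1
p = smoothen(p_hat, c, rng)
```

Even when the base distribution is exactly uniform, the perturbed biases are almost surely different from 1/2, which is the kind of distribution on which a positive gain for relevant coordinates becomes possible.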
1.3 Open Question
We now turn to discussing possible open questions and future directions arising from this work. Our main result applies to the case where the target function is a k-Junta, which can be implemented by a tree of depth k. An immediate open question is whether a similar learnability result can be shown for general trees of depth k. We conjecture that this is indeed the case.
Conjecture 1.
Fix . There is a polynomial for which the following holds. Suppose that the ID3 algorithm runs on examples from a smoothened -distribution that is realized by a -depth-tree. Then, w.p. , ID3 will output a tree with .
As we previously mentioned, our work could be viewed in a broader context of understanding heuristic learning algorithms that enjoy empirical success. In this field of research, a main challenge of the machine learning community is to understand the behavior of neural-networks learned with gradient-based algorithms. While our analysis is focused on proving results for the ID3 algorithm, we believe that similar techniques could be used to show similar results for learning neural-networks with gradient-descent. Specifically, we raise the following interesting question:
Open Question 1.
Can gradient-descent learn neural-networks when the target function is a k-Junta with k = log n, in the smoothed analysis setting?
2 Proofs
2.1 General Approach
Throughout, we assume that is a distribution that is realized by a Junta , supported in , with . We assume w.l.o.g. that .
To prove our result, we will show that w.h.p., the algorithm chooses only variables from , and furthermore, any root-to-leaf path will contain all the variables from . In this case, the resulting tree will have zero generalization error. To formalize this, we will use the following notation. We define the support of a vector as
[TABLE]
and let
[TABLE]
For a sample , we denote
[TABLE]
Finally, for a distribution we denote
Lemma 1.
Suppose that the sample is realized by . Assume that for any with we have and either of the following holds:
- All examples in have the same label.
- For all and we have
Then, the ID3 algorithm will build a tree with zero loss on .
Not surprisingly, the gain of coordinates outside of is always small. This is formalized in the following lemma.
Lemma 2.
Assume that is -Lipschitz. Fix , with , and . Assume we sample with . Then with probability at least we have and:
[TABLE]
Given lemma 2, in order to apply lemma 1, it remains to show that the gain of the coordinates in is large. To this end, we will use a measure of dependence between a coordinate and the label , which we define next. For a sample and an index , we let . Similarly, for a distribution over we let . Note that and are independent if and only if . The following lemma connects to .
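A dependence measure of this kind can be sketched on a finite sample as follows. The specific form used here, the absolute difference between the joint frequency of (bit = +1, label = +1) and the product of the two marginal frequencies, is an assumption chosen for illustration; for two binary variables this quantity is zero exactly when they are independent. The name `dependence` and the -1/+1 encoding are also assumptions.

```python
import itertools

def dependence(sample, i):
    """|P[x_i=+1, y=+1] - P[x_i=+1] * P[y=+1]| on a finite sample (an assumed form)."""
    m = len(sample)
    p_x = sum(1 for x, _ in sample if x[i] == 1) / m
    p_y = sum(1 for _, y in sample if y == 1) / m
    p_xy = sum(1 for x, y in sample if x[i] == 1 and y == 1) / m
    return abs(p_xy - p_x * p_y)

# Usage: the label copies bit 0 and ignores bit 1.
points = list(itertools.product([-1, 1], repeat=2))
S = [(x, x[0]) for x in points]
dep_relevant = dependence(S, 0)
dep_irrelevant = dependence(S, 1)
```

In this toy sample the relevant bit has dependence 0.25 and the irrelevant bit has dependence 0.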
Lemma 3.
Assume is strongly concave (i.e, is strongly convex). Assume for some , with and index we have . Fix . Then, if we sample for , then with probability at least we have and:
[TABLE]
Combining lemmas 1, 2 and 3, we get the following theorem:
Theorem 3.
Assume is strongly concave and -Lipschitz. Assume for any , with we have and either of the following holds:
- All examples in have the same label.
- For every index we have .
Fix . Then, if we sample for , then with probability at least the ID3 algorithm will build a tree with zero loss on
By the above theorem, in order to show that the ID3 algorithm succeeds in learning, it is enough to lower bound . This is done in the remaining sections, together with the proof of lemmas 1, 2 and 3.
2.2 Proof of the basic lemmas
Proof.
(of lemma 1) At every iteration, the ID3 algorithm assigns a splitting variable for a given node, or otherwise returns a leaf for this node. We will show that for every node that the algorithm iterates on, if the path from the root to this node contains only variables from , then either the algorithm adds a splitting variable from , or the algorithm returns a leaf. Indeed, assume that the path from the root to this node contains only variables from . We can encode the root-to-node path by a vector , where if the node is in the path, if the node is in the path, and otherwise. Therefore, by our assumption we have . Note that in this case, the algorithm observes the sample , so if all examples in have the same label, then the algorithm returns a leaf. Otherwise, by the assumption we get , so the algorithm chooses a splitting variable from .
From the above, the algorithm adds only splitting variables from , so it can build a tree of size at most before stopping. This tree has zero loss on the distribution. Indeed, for any , denote such that for every and for every . Then, since we assume , there exists a sample such that for every . By definition, the algorithm returns a tree that correctly labels the example , therefore it returns a function that agrees with . Since this is true for every choice of , the function returned by the tree agrees with the Junta defined by , so it gets zero loss. ∎
We next relate the empirical measure to .
Lemma 4.
Fix , with , , . Let with . Then with probability at least we have and:
[TABLE]
Proof.
Denote . Let and . We have
[TABLE]
By Hoeffding’s bound, with probability , we have
[TABLE]
dividing by we get
[TABLE]
Notice that from the above we get that . It follows that
[TABLE]
Similarly,
[TABLE]
In this case, we have ∎
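For a sense of scale, the standard two-sided Hoeffding bound for averages of [0,1]-valued variables states that the empirical mean deviates from its expectation by at least eps with probability at most 2·exp(-2·m·eps²). The sketch below solves this for the sample size m; it illustrates only the generic bound, not the exact constants of Lemma 4, and the function name is hypothetical.

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest m with 2*exp(-2*m*eps**2) <= delta, for averages of [0,1]-valued variables."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

m = hoeffding_sample_size(0.1, 0.05)
```

The dependence on the confidence parameter is only logarithmic, while the dependence on the accuracy is quadratic in 1/eps.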
Proof.
(of lemma 2) Notice that since and are independent, we have . By the choice of , from Lemma 4 we get that with probability :
[TABLE]
Denote . Notice that if or then , and the result trivially holds. We can therefore assume . Now, we have the following:
[TABLE]
Similarly, we get:
[TABLE]
Using the -Lipschitz property, we get:
[TABLE]
And similarly:
[TABLE]
Now plugging into the gain definition:
[TABLE]
∎
Proof.
(of lemma 3) By the choice of , from Lemma 4 we get that with probability :
[TABLE]
Since we assume , we get that . Therefore, we have that (again denoting ). Observe that we have the following:
[TABLE]
Therefore, we have:
[TABLE]
Since is strongly concave we get that for all we have:
[TABLE]
Using this property we get that:
[TABLE]
Plugging this into the gain equation we get:
[TABLE]
Since , we get .
∎
2.3 Parities
Lemma 5.
Let be a distribution on labelled by with . Assume that for every we have and , for some . Fix some . Then for every we have:
[TABLE]
By theorem 3 we have
Theorem 4.
Let be a distribution on labelled by with . Assume that for every we have and , for some . Assume furthermore that is strongly concave and -Lipschitz.
Then, if we sample for , then with probability at least the ID3 algorithm will build a tree with zero loss on
Note that since we assume , the runtime and sample complexity in the above theorem are polynomial in . We give the proof of this theorem in the rest of this section.
Proof.
Denote , and . For simplicity of notation, assume w.l.o.g that and . Observe the following:
[TABLE]
Similarly, we get that:
[TABLE]
Therefore, we get that:
[TABLE]
∎
2.4 Juntas
Lemma 6.
Fix some , and assume not all examples in have the same label. Let . Assume for for every , and fix . Then there exists such that with probability over the choice of :
[TABLE]
By theorem 3 we get
Theorem 5.
Assume is strongly concave and -Lipschitz. Fix . Then, if we sample for , then with probability at least the ID3 algorithm will build a tree with zero loss on
Proof.
For simplicity of notation, we assume w.l.o.g. that for some . Denote , such that . Observe the Fourier coefficients of :
[TABLE]
Where , and note that is a Fourier basis (w.r.t. the uniform distribution). Notice that for every . Indeed, we have:
[TABLE]
where , and this gives the required. Since not all examples in have the same label, we know that is not a constant function. Therefore, there exists such that . Fix some , and we assume w.l.o.g. that (so ). Now, we can write:
[TABLE]
Where: .
and since and we get . Now, notice that since we get:
[TABLE]
And similarly: .
Therefore we get:
[TABLE]
Where is given by:
[TABLE]
Denote and note that . For some choice of -s. Notice that for some maximal with and (so ), we have , so .
Now, denote , so we have , and observe the polynomial:
[TABLE]
And from what we have shown, is a polynomial of degree , and there exists with such that . Therefore, we can use Lemma 3 from [20] to get that:
[TABLE]
And therefore:
[TABLE]
So if we take we get that , which completes the proof.
∎
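The Fourier expansion used in the proof can be computed by brute force for small functions. The sketch below assumes the usual parity basis over {-1,+1}^k under the uniform distribution, where the basis function of an index set is the product of the corresponding bits; the name `fourier_coefficients` and the 2-bit AND example are illustrative.

```python
import itertools
from math import prod

def fourier_coefficients(f, k):
    """Coefficients of f: {-1,+1}^k -> R in the parity basis, under the uniform distribution."""
    points = list(itertools.product([-1, 1], repeat=k))
    coeffs = {}
    for r in range(k + 1):
        for I in itertools.combinations(range(k), r):
            # Coefficient of I is the average of f(x) times the parity of the bits in I.
            coeffs[I] = sum(f(x) * prod(x[i] for i in I) for x in points) / len(points)
    return coeffs

AND = lambda x: 1 if all(v == 1 for v in x) else -1
coeffs = fourier_coefficients(AND, 2)
```

Since AND is not constant, it has a non-zero coefficient on a non-empty index set, and for a -1/+1 valued function the squared coefficients sum to 1 (Parseval's identity).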
References
[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
[3] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
[4] Avrim Blum. Rank-r decision trees are a subclass of r-decision lists. Information Processing Letters, 42(4):183–185, 1992.
[5] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, volume 94, pages 253–262, 1994.
[6] Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003.
[7] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
[8] Nader H Bshouty and Lynn Burroughs. On the proper learning of axis-parallel concepts. Journal of Machine Learning Research, 4(Jun):157–176, 2003.
