A Tight Analysis of Greedy Yields Subexponential Time Approximation for Uniform Decision Tree
Ray Li, Percy Liang, Stephen Mussmann

TL;DR
This paper provides a tight analysis of the greedy algorithm for Uniform Decision Tree, showing its approximation ratio depends on the optimal cost, and introduces subexponential algorithms with implications for complexity theory.
Contribution
It establishes a precise approximation bound for greedy algorithms on Uniform Decision Tree and introduces subexponential algorithms, resolving a conjecture and connecting to Min Sum Set Cover.
Findings
Greedy algorithm achieves an $O\left(\frac{\log n}{\log C_{OPT}}\right)$ approximation.
Subexponential time algorithms with ratio $\frac{9.01}{\alpha}$ for all $\alpha \in (0,1)$.
Achieving super-constant approximation ratios is unlikely to be NP-hard under ETH.
Department of Computer Science, Stanford University (all authors). Ray Li and Stephen Mussmann: research supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1656518.
Abstract
Decision Tree is a classic formulation of active learning: given $n$ hypotheses with nonnegative weights summing to 1 and a set of tests that each partition the hypotheses, output a decision tree using the provided tests that uniquely identifies each hypothesis and has minimum (weighted) average depth. Previous works showed that the greedy algorithm achieves an $O(\log n)$ approximation ratio for this problem and that it is NP-hard to beat an $O(\log n)$ approximation, settling the complexity of the problem.
However, for Uniform Decision Tree, i.e. Decision Tree with uniform weights, the story is more subtle. The greedy algorithm's $O(\log n)$ approximation ratio was the best known, but the largest approximation ratio known to be NP-hard is $4-\varepsilon$. We prove that the greedy algorithm gives an $O\left(\frac{\log n}{\log C_{OPT}}\right)$ approximation for Uniform Decision Tree, where $C_{OPT}$ is the cost of the optimal tree, and show this is best possible for the greedy algorithm. As a corollary, we resolve a conjecture of Kosaraju, Przytycka, and Borgstrom [KPB99]. Our results also hold for instances of Decision Tree whose weights are not too far from uniform. Leveraging this result, for all $\alpha \in (0,1)$, we exhibit a $\frac{9.01}{\alpha}$ approximation algorithm for Uniform Decision Tree running in subexponential time $2^{\tilde{O}(n^{\alpha})}$. As a corollary, achieving any super-constant approximation ratio on Uniform Decision Tree is not NP-hard, assuming the Exponential Time Hypothesis. This work therefore adds approximating Uniform Decision Tree to a small list of natural problems that have subexponential time algorithms but no known polynomial time algorithms. Like the analysis of the greedy algorithm, our analysis of the subexponential time algorithm gives similar approximation guarantees even for slightly nonuniform weights. A key technical contribution of our work is showing a connection between greedy algorithms for Uniform Decision Tree and for Min Sum Set Cover.
1 Introduction
In Decision Tree (also known as Split Tree), one is given $n$ hypotheses with nonnegative weights summing to 1 and a set of $K$-ary tests that each partition the hypotheses, and must output a decision tree using the provided tests that uniquely identifies each hypothesis and has minimum (weighted) average depth. (We require that such a decision tree always exists in a valid Decision Tree instance.) Decision Tree is a classic problem that arises naturally in active learning [Das04, Now11, GB09] and hypothesis identification [Mor82]. Active learning with a well-specified and finite hypothesis class with noiseless tests is precisely Decision Tree where the tests are data points and the answers are their labels. Decision Tree was first proved to be NP-hard by Hyafil and Rivest [HR76]. Since then, a large number of works have provided algorithms for this question [GG74, Lov85, KPB99, Das04, CPR+11, CPRS09, GB09, GNR10, CJLM10, AH12].
A natural algorithm for Decision Tree is the greedy algorithm, which creates a decision tree by iteratively choosing the test that most evenly splits the set of remaining hypotheses. For binary tests ($K=2$), there is a natural notion of "most even split," but for $K>2$, there are multiple possible definitions (see discussion in Section 2). It is well known that the greedy algorithm achieves an $O(\log n)$ approximation ratio for Decision Tree, assuming a lower bound on the weights. It was first shown for binary tests and uniform weights (every hypothesis has weight $1/n$) [KPB99, AH12], then for $K$-ary tests [CPRS09], and finally for general non-uniform weights [GB09]. Furthermore, it is NP-hard to achieve an $o(\log n)$ approximation ratio for Decision Tree [CPR+11], settling the complexity of approximating Decision Tree.
However, there are still gaps in our knowledge. For Uniform Decision Tree, i.e. Decision Tree with uniform weights, the $O(\log n)$ approximation given by the greedy algorithm was previously the best known approximation achievable in polynomial time. Chakaravarthy et al. [CPR+11] proved that it is NP-hard to give a $(4-\varepsilon)$ approximation, giving the best known hardness of approximation result, and they asked whether the gap between the best approximation and hardness results could be improved. Previously, it was not even known whether the greedy algorithm could beat the $O(\log n)$ approximation ratio of previous analyses: the best lower bound on the greedy algorithm's approximation ratio is $\Omega\left(\frac{\log n}{\log\log n}\right)$ [KPB99, Das04]. In the setting where the optimal solution to Uniform Decision Tree has cost $O(\log n)$, Kosaraju et al. [KPB99] showed that the greedy algorithm indeed gives an $O\left(\frac{\log n}{\log\log n}\right)$ approximation, and they conjectured that the greedy algorithm gives an $O\left(\frac{\log n}{\log\log n}\right)$ approximation in general.
For an extended discussion of related works, see Section 8.
1.1 Our contributions
We summarize the main contributions of our work below. The approximation guarantees of our algorithms are captured in Figure 1.
- Greedy algorithm. We give a new analysis of the greedy algorithm for Decision Tree, showing that it gives an approximation whose ratio depends on the cost $C_{OPT}$ of the optimal tree and on the smallest and largest weights. This implies an $O\left(\frac{\log n}{\log C_{OPT}}\right)$ approximation for instances of Uniform Decision Tree and of Decision Tree whose weights are close to uniform. As $C_{OPT} \ge \log_K n$ always, this proves the conjecture of Kosaraju et al. [KPB99].
- Subexponential time algorithm. Leveraging the above greedy analysis, for $\alpha \in (0,1)$, we give a subexponential $2^{\tilde{O}(n^{\alpha})}$-time $\frac{9.01}{\alpha}$ approximation algorithm for Uniform Decision Tree. (Throughout this work, subexponential means for some absolute . We make a distinction when referring to runtimes.) Assuming the Exponential Time Hypothesis (ETH) [IP01, IPZ01] (ETH states that there are no $2^{o(n)}$ time algorithms for 3SAT), this algorithm implies that any superconstant approximation of Uniform Decision Tree is not NP-hard. Our work adds approximating Uniform Decision Tree to a small list of natural problems whose time complexity is known to be subexponential (and, for some approximation ratios, ) but not known to be polynomial. Examples of such problems include Factoring [LLMP93], Unique Games [Kho02, ABS15], Graph Isomorphism [Bab16], and approximating Nash Equilibrium [LMM03, Rub18], with the latter two having quasipolynomial-time algorithms. As in our analysis of the greedy algorithm, our subexponential time algorithm gives a similar approximation guarantee even for slightly nonuniform weights, in particular when .
- Approximation ratio tightness. We prove that the $O\left(\frac{\log n}{\log C_{OPT}}\right)$ approximation ratio for the greedy algorithm is tight for Uniform Decision Tree. We also prove that the term in the approximation ratio for the greedy algorithm is necessary, in the sense that no algorithm can give a approximation for Decision Tree when for some unless P=NP.
- Repeatable, noisy tests. Kääriäinen [Kää06] provides a method to convert a solution for Decision Tree into a solution for a variant of Decision Tree that handles noisy, repeatable tests. An immediate corollary of our result for the greedy algorithm is that the cost of a solution for the noisy problem derived from the greedy algorithm is at most . Previously, this cost was bounded by .
1.2 Techniques
Our work gives a new analysis of the greedy algorithm for Decision Tree. A key technical contribution of this work is to leverage upper bounds for Min Sum Set Cover and Set Cover for (Uniform) Decision Tree. Previously, only connections in the reverse direction (i.e. lower bounds) were known between these problems: NP-hardness of attaining a $(4-\varepsilon)$-approximation for Uniform Decision Tree was proved by reduction from Min Sum Set Cover, and NP-hardness of attaining an $o(\log n)$ approximation for Decision Tree was proved by reduction from Set Cover [CPR+11].
At a high level, our analysis goes as follows. By a simple double counting argument, we can compute the cost of a tree by summing the "weights" of the tree's interior vertices, rather than summing the depths of the hypotheses. However, rather than accounting for all the interior vertices at once, we separately analyze the vertices with "imbalanced" splits and those with "balanced" splits. Carefully choosing the definition of balanced and imbalanced is a key idea of the proof: previous analyses of the greedy algorithm [KPB99, CPR+11, GB09, AH12] either make no distinction between interior vertices or use a different distinction. A global entropy argument accounts for the vertices with balanced splits. For the vertices with imbalanced splits, we use the fact that the greedy algorithm gives a constant factor approximation for Min Sum Set Cover [FLT04]. For Uniform Decision Tree, putting the two bounds together gives the desired approximation result. For the general Decision Tree problem, we additionally prove and use a generalization of a result on the greedy algorithm's performance for Set Cover [Lov75, Joh74, Chv79, Ste74].
For the subexponential time algorithm, we leverage our new result that the greedy algorithm gives an $O\left(\frac{\log n}{\log C_{OPT}}\right)$ approximation. We first run the greedy algorithm. If the greedy algorithm returns a tree with cost at least , we return the greedy tree knowing we have an approximation. Otherwise, we find by brute force the "optimal tree up to depth " in time , then recurse.
1.3 Organization of paper
In Section 2, we formally introduce notation used throughout the paper. In Section 3, we state our results. In Section 4, we sketch a proof of Theorem 3.1, that the greedy algorithm gives an approximation on Decision Tree. Since the proof of Theorem 3.1 is involved, we prove the special case of Theorem 3.1 for Uniform Decision Tree with binary tests in Section 6, and give the full proof in Appendix A. In Section 5, we state the subexponential time approximation algorithm and give a sketch of the analysis, and we give a formal analysis in Section 7. In Section 8, we describe some related work. In Section 9, we conclude with some open problems.
We leave some details to the appendices. A lemma on the greedy algorithm's performance in a generalization of Set Cover that is used in the proof of Theorem 3.1 is proved in Appendix B. In Appendix C, we prove Propositions 3.3 and 3.4, which show two ways that Theorem 3.1 is tight. In Appendix D, we demonstrate a rounding trick that allows us to assume without changing the difficulty of approximating Decision Tree.
2 Preliminaries
For a positive integer $n$, let $[n] = \{1, \dots, n\}$. All logs are base 2 when the base is not specified. The Decision Tree problem is as follows: given a set of hypotheses with probabilities summing to 1, and distinct $K$-ary tests, output a decision tree with hypotheses as leaves, such that the weighted average of the depth of the leaves is minimal. Formally, a $K$-ary test is a map . We refer to $K$ as the branching factor of the test , and the elements of as the possible answers to the tests. We think of a test as defining a $K$-way partition of . A decision tree is a rooted tree such that each interior vertex has the index of some test, and the edge to the -th child of is labeled with . We say that a hypothesis is consistent with a vertex if, in the root-to- path, the edge following any vertex has label . We let denote the set of hypotheses that are consistent with . We say a decision tree is complete if, for all , there exists a (unique) leaf such that , and for a complete decision tree , let denote the depth of this vertex . The cost of a complete decision tree is defined to be the average depth of the leaves, weighted by , i.e.
[TABLE]
We set to be a complete decision tree that minimizes (in general, there may be more than one optimal decision tree), and abbreviate .
This paper is concerned with the greedy algorithm for Decision Tree. We call a decision tree greedy if the test of each interior vertex minimizes the (weighted) number of hypotheses of the largest partition in 's partitioning of . Formally, a decision tree is greedy if, for all interior vertices , we have
[TABLE]
where for . Given a Uniform Decision Tree instance, we let be a complete, greedy decision tree, choosing one arbitrarily if there is more than one. For brevity, we write .
We remark that, when $K>2$, our notion of a "greedy" algorithm for Decision Tree is not the only one. As mentioned in the previous paragraph, our definition of greedy chooses, at each vertex in the decision tree, the test that minimizes the (weighted) number of candidate hypotheses, assuming a worst-case answer to the test. Our definition corresponds to the definition by [CPRS09], but other choices include maximizing the (weighted) number of pairs of hypotheses that are distinguished [CPR+11, GB09] and maximizing the mutual information between the test and the remaining hypotheses [ZRB05]. For binary tests, $K=2$, these definitions are all equivalent.
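The greedy rule above can be illustrated with a minimal sketch, assuming uniform weights and a toy encoding in which `tests[t][h]` is the answer of test `t` on hypothesis `h`. The function names (`build_greedy`, `leaf_depths`, `average_depth`) and the example instance are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def build_greedy(hyps: List[int], tests: List[List[int]]) -> Dict:
    """Greedy tree under uniform weights: at each vertex, use the test
    whose largest part (worst-case answer) is as small as possible."""
    if len(hyps) == 1:
        return {"leaf": hyps[0]}
    best_test, best_parts, best_size = None, None, None
    for t, answers in enumerate(tests):
        # Partition the remaining hypotheses by their answer to test t.
        parts: Dict[int, List[int]] = {}
        for h in hyps:
            parts.setdefault(answers[h], []).append(h)
        if len(parts) < 2:
            continue  # this test distinguishes nothing at this vertex
        largest = max(len(p) for p in parts.values())
        if best_size is None or largest < best_size:
            best_test, best_parts, best_size = t, parts, largest
    if best_parts is None:
        raise ValueError("remaining hypotheses cannot be distinguished")
    children = {a: build_greedy(p, tests) for a, p in best_parts.items()}
    return {"test": best_test, "children": children}


def leaf_depths(tree: Dict, depth: int = 0) -> Dict[int, int]:
    """Map each hypothesis to the depth of its leaf."""
    if "leaf" in tree:
        return {tree["leaf"]: depth}
    depths: Dict[int, int] = {}
    for child in tree["children"].values():
        depths.update(leaf_depths(child, depth + 1))
    return depths


def average_depth(tree: Dict, n: int) -> float:
    """Cost of a complete tree under uniform weights 1/n."""
    return sum(leaf_depths(tree).values()) / n


# Toy instance: 4 hypotheses, 3 binary tests.
example_tests = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 1]]
example_tree = build_greedy([0, 1, 2, 3], example_tests)
example_cost = average_depth(example_tree, 4)
```

On this toy instance the root uses the first test, which splits the hypotheses two and two, and every hypothesis ends at depth 2. Ties are broken by test index here, which is one arbitrary choice among the greedy trees the definition allows.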
Define as Decision Tree with the guarantee that . In this notation, is Uniform Decision Tree.
3 Our results
3.1 Greedy algorithm
The main driver of this paper is Theorem 3.1, which relates the cost of the greedy algorithm to the optimal cost for Decision Tree.
Theorem 3.1.
For any instance of Decision Tree on $n$ hypotheses, we have
[TABLE]
Our theorem holds for any branching factor $K$, and when is the cost of an arbitrary tree produced by the greedy algorithm above. As $C_{OPT} \ge \log_K n$ always, our result implies that the greedy algorithm always gives an $O\left(\frac{\log n}{\log\log n}\right)$ approximation for Uniform Decision Tree when the branching factor is a constant, resolving the conjecture of [KPB99]. Additionally, if $C_{OPT}$ is $n^{\alpha}$ for a constant $\alpha \in (0,1)$ and the weights are uniform, then the greedy algorithm obtains a constant approximation. We use this fact crucially in designing our subexponential time approximation algorithms.
For the simpler case when $K=2$ and the weights are uniform, we give a sketch of the proof in Section 4 and a full proof in Section 6. The full result is sketched in Section 4 and proven in full in Appendix A. For Uniform Decision Tree, the constant 12 can be improved to 6, and, when is sufficiently large, , so that greedy gives a approximation (see Section 4).
Note that the terms and in the approximation ratio can be arbitrarily large. However, a rounding trick before running the greedy algorithm [GB09] allows us to assume that all the weights are at least , and hence and in (3). The details are given in Appendix D.
3.2 Subexponential time algorithm
Using Theorem 3.1, we give a subexponential time algorithm that achieves a constant factor approximation for the Decision Tree problem when the weights are close to uniform.
Theorem 3.2.
For any and , there exists an approximation algorithm for with runtime . For Uniform Decision Tree, for any $\alpha \in (0,1)$, we can achieve a $\frac{9.01}{\alpha}$ approximation in the same runtime.
In Section 5, the subexponential time algorithm is stated and an analysis is sketched. The analysis is given formally in Section 7. Importantly, this result implies that achieving a super-constant approximation ratio is not NP-hard, given the Exponential Time Hypothesis. As an informal proof, suppose for contradiction there was a polynomial reduction from 3-SAT to achieving a approximation ratio for Uniform Decision Tree for some as . By Theorem 3.2, there exists a -time algorithm to achieve a approximation for Uniform Decision Tree, and thus a $2^{o(n)}$-time algorithm to solve 3-SAT, contradicting the Exponential Time Hypothesis. This adds approximating Uniform Decision Tree to a list of interesting natural problems that have subexponential or time algorithms but are not known to be in P. Figure 1 illustrates the contrast between Decision Tree and Uniform Decision Tree.
3.3 Approximation ratio tightness
We also show that the $O\left(\frac{\log n}{\log C_{OPT}}\right)$ approximation ratio is tight up to a constant factor for the greedy algorithm by generalizing the example given by [Das04]. The proof is given in Appendix C.1.
Proposition 3.3.
There exists an such that for all and any , there exists an instance of Uniform Decision Tree with branching factor 2 for which
[TABLE]
We also show that, when the weights are non-uniform, the term in the approximation ratio of Theorem 3.1 is computationally necessary.
Proposition 3.4.
Let . Then, for sufficiently large, approximating to a factor of is NP-hard.
In other words, even if the ratio is guaranteed to be for a constant , one cannot give a approximation algorithm unless P = NP. The proof is given in Appendix C.2.
3.4 Decision tree with noise
Theorem 3.1 implies an improved black-box result for a noisy variant of Decision Tree. Kääriäinen [Kää06] considers a variant of Decision Tree with binary tests where the output of each test may be corrupted by i.i.d. noise. Formally, there exists such that querying any test on any hypothesis , outputs the correct answer with probability and the wrong answer with probability , for some . Tests are repeatable, with each one producing different draws of the noise. Kääriäinen [Kää06] gives an algorithm that turns a decision tree of cost for the noiseless problem into a decision tree with cost for the noisy problem by repeating queries sufficiently many times.
Combining Kääriäinen's result with the greedy algorithm for Uniform Decision Tree gives an algorithm for the noisy problem using an average of queries. Previously, using the bound , the noisy problem's cost was bounded by . However, by Theorem 3.1, we have , so we in fact have cost at most , improving the cost ratio to the optimal solution of the noiseless problem by a nearly quadratic factor.
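To make "repeating queries sufficiently many times" concrete, here is a minimal sketch of one standard way to denoise a repeatable binary test: repeat it and take a majority vote, choosing the number of repetitions from Hoeffding's inequality. This is an illustrative assumption, not necessarily the scheme used in [Kää06]; `noise_rate`, `failure_prob`, and both function names are hypothetical.

```python
import math
from typing import Sequence


def repetitions_for_majority(noise_rate: float, failure_prob: float) -> int:
    """An odd number r of repetitions such that, by Hoeffding's inequality,
    a majority vote over r independent noisy answers is wrong with
    probability at most exp(-2 * r * (1/2 - noise_rate)**2) <= failure_prob."""
    if not 0.0 <= noise_rate < 0.5:
        raise ValueError("noise_rate must lie in [0, 1/2)")
    if not 0.0 < failure_prob < 1.0:
        raise ValueError("failure_prob must lie in (0, 1)")
    gap = 0.5 - noise_rate
    r = max(1, math.ceil(math.log(1.0 / failure_prob) / (2.0 * gap * gap)))
    return r if r % 2 == 1 else r + 1  # odd r avoids tied votes


def majority(answers: Sequence[int]) -> int:
    """Majority vote over 0/1 answers (intended for an odd number of answers)."""
    return int(2 * sum(answers) > len(answers))


# Example: 25% noise, each denoised test allowed to fail with probability 1%.
example_repetitions = repetitions_for_majority(0.25, 0.01)
```

The repetition count grows as the noise rate approaches 1/2 and only logarithmically in the inverse of the allowed failure probability, which is why the noisy cost stays within a modest factor of the noiseless tree's cost.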
4 Sketch of proof of Theorem 3.1
In this section, we sketch a proof of Theorem 3.1. We first sketch the proof assuming that the branching factor is 2, so that is a binary tree, and that the distribution is uniform (every hypothesis has weight $1/n$). Since the proof of Theorem 3.1 is involved, we give the details of this easier result in Section 6. At the end of the section, we give the additional ideas necessary to complete the full proof of Theorem 3.1. The details of the full proof are given in Appendix A.
4.1 Uniform weights and binary tests
Recall that and that, as the weights are uniform, for all . By a simple double counting argument (Lemma 6.2), we can compute the cost of the greedy tree by summing the weights of the vertices rather than summing the depths of leaves. That is,
[TABLE]
where the sum is over the interior vertices of .
Defining balanced and imbalanced vertices.
We then define balanced and imbalanced vertices with respect to a parameter , which we eventually set to . These definitions are crucial to the proof. A vertex is imbalanced if there exists an integer (called the level) such that and . (We remark that imbalanced vertices can have arbitrarily close to , so the hypotheses at vertex are not necessarily split in an imbalanced way. However, as we show (Lemma 6.7), all balanced vertices are in fact split in a balanced way with , hence the terminology.) Here, is the child of containing a smaller weight of hypotheses in its subtree. We say is balanced if it is not imbalanced. Note that imbalanced vertices exist only for , where . We prove a structural result (Lemma 6.5) that shows that the level- imbalanced vertices of can be partitioned into downward paths, which we call chains, such that, for all , each leaf has vertices from at most one level- chain among its ancestors. The parameter quantifies how many chains we consider: smaller means fewer, longer chains, and larger means more, shorter chains. We optimize the choice of at the end of this proof sketch. In the remainder of the proof, we bound the weight of the balanced and imbalanced vertices separately.
Bounding the weight of balanced vertices.
To bound the weight of balanced vertices, we use an entropy argument. We consider the random variable corresponding to a uniformly random hypothesis from . On one hand, this random variable has entropy $\log n$. On the other hand, we can take a uniformly random hypothesis from by an appropriate random walk down the decision tree. Starting from the root, at each vertex, we step to a child with probability proportional to the number of hypotheses in that child's subtree. The total entropy of this process is given by , where is the entropy of the random walk's step at . A simple argument (Lemma 6.7) shows that, for all balanced vertices , we have and hence . We thus have
[TABLE]
Hence,
[TABLE]
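The identity behind the entropy argument can be checked numerically. The sketch below assumes uniform weights and a toy tree encoded as nested lists (a list is an interior vertex, anything else is a leaf holding one hypothesis). It verifies that the weighted sum of per-vertex step entropies of the random walk equals $\log n$. The encoding and function names are illustrative assumptions.

```python
import math


def count_leaves(tree) -> int:
    """Number of hypotheses consistent with this vertex."""
    if not isinstance(tree, list):
        return 1
    return sum(count_leaves(child) for child in tree)


def walk_entropy(tree, n: int) -> float:
    """Sum over interior vertices v of (|L(v)| / n) * H(v), where H(v) is the
    entropy (in bits) of the random walk's step at v and |L(v)| is the number
    of hypotheses consistent with v."""
    if not isinstance(tree, list):
        return 0.0
    size = count_leaves(tree)
    step_entropy = 0.0
    below = 0.0
    for child in tree:
        q = count_leaves(child) / size  # probability of stepping to this child
        step_entropy -= q * math.log2(q)
        below += walk_entropy(child, n)
    return (size / n) * step_entropy + below


# Toy unbalanced binary tree on 7 hypotheses.
example_tree = [[["a", "b"], "c"], ["d", ["e", ["f", "g"]]]]
example_n = count_leaves(example_tree)
example_total = walk_entropy(example_tree, example_n)
```

By the chain rule for entropy, `example_total` agrees with `math.log2(example_n)` up to floating-point error, whatever the shape of the tree. The proof then only needs a lower bound on the step entropy at balanced vertices to bound their total weight.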
Bounding the weight of imbalanced vertices.
To bound the cost of imbalanced vertices, we crucially use a connection to Min Sum Set Cover (MSSC). In MSSC, one is given a universe and a collection of sets over it, and needs to construct an ordering of the sets that minimizes the cost: the cost of a solution is the average of the cover times of the elements in the universe. That is, the cost of a solution is
[TABLE]
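As a minimal sketch of MSSC and of its standard greedy heuristic (repeatedly pick the set that covers the most still-uncovered elements), consider the following. The function names and the toy instance are illustrative assumptions, and the cost is taken as the average cover time, matching the description above.

```python
from typing import List, Set


def greedy_mssc(universe: Set[int], sets: List[Set[int]]) -> List[int]:
    """Return an ordering (list of set indices) built greedily: each step
    takes the remaining set covering the most uncovered elements."""
    uncovered = set(universe)
    remaining = list(range(len(sets)))
    order: List[int] = []
    while uncovered:
        if not remaining:
            raise ValueError("the sets do not cover the universe")
        best = max(remaining, key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the sets do not cover the universe")
        order.append(best)
        remaining.remove(best)
        uncovered -= sets[best]
    return order


def mssc_cost(universe: Set[int], sets: List[Set[int]], order: List[int]) -> float:
    """Average cover time: for each element, the 1-based position of the
    first set in the ordering that contains it."""
    total = 0
    for u in universe:
        total += next(pos for pos, i in enumerate(order, start=1) if u in sets[i])
    return total / len(universe)


# Toy instance with 6 elements and 4 sets.
example_universe = {0, 1, 2, 3, 4, 5}
example_sets = [{0, 1, 2, 3}, {0, 1, 4}, {2, 3, 5}, {4, 5}]
example_order = greedy_mssc(example_universe, example_sets)
example_cost = mssc_cost(example_universe, example_sets, example_order)
```

Here greedy first takes the set of size four, then the set `{4, 5}`, so four elements are covered at time 1 and two at time 2, for an average cover time of 8/6.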
A result by Feige, Lovász, and Tetali [FLT04] shows that the greedy algorithm gives a 4-approximation of MSSC, and they show this is tight by proving that finding a $(4-\varepsilon)$ approximation of MSSC is NP-hard. On the lower bound side, a connection between MSSC and Decision Tree was already known: Chakaravarthy et al. [CPR+11] proved that it is NP-hard to approximate Uniform Decision Tree with ratio better than $4-\varepsilon$ by a reduction from MSSC. The key technical contribution of our work is showing that there is also a connection on the upper bound side. Bounding the weight of imbalanced vertices works as follows.
1. For each chain $P$, define a corresponding instance MSSC(P) (Definition 6.9) of Min Sum Set Cover induced by the chain as follows:
   - Universe , the set of all hypotheses that are consistent with .
   - For , the set is the set of hypotheses in that give the minority answer of test with respect to hypotheses . (See Figure 2.)
   - For each , a set . These sets are included for technical reasons.

   Note we have a total of sets, so that a solution is a permutation . The sets for are chosen so that the second step below holds.
2. Prove that the weight of a chain is bounded by the cost of a greedy solution to MSSC(P) (Lemma 6.13), and hence, using a result of Feige, Lovász, and Tetali (Theorem 6.12), by 4 times the optimal cost of MSSC(P) (Corollary 6.14). That is, there exists a greedy solution to MSSC(P) such that
[TABLE]
   This step is somewhat technical, as one must show that the greediness of the greedy decision tree produces a greedy solution to MSSC(P). The choice of is natural: for , let be the index of the test used at vertex in the chain (see Figure 3). However, showing that this is in fact a greedy solution to MSSC(P) is a subtle argument that depends on the carefully chosen definition of a chain.
3. Prove that, for any integer , the sum, over all level- chains $P$, of the optimal cost of MSSC(P), is bounded by (Lemma 6.15). Hence,
[TABLE]
   This step is also technical, as one must draw the connection between the optimal MSSC solution and the optimal decision tree.
4. In total, we have
[TABLE]
   where the first inequality is by part 2 and the second inequality is by part 3. In other words, for any integer , the sum of the weights of all level chains is at most . Hence, the sum of the weights of vertices in any chain, and thus the total weight of all imbalanced vertices, is at most (Lemma 6.16), where is the number of levels. As , we have
[TABLE]
To finish the proof, we bound
[TABLE]
The above is optimized roughly when , giving the desired bound of . If is sufficiently large, taking yields .
4.2 General weights and larger $K$
The proof of the general Theorem 3.1 follows similarly to the specific case given above. The two differences are that Theorem 3.1 is stated for general $K$ and for general, not-necessarily-uniform distributions .
Adapting the proof to general $K$ is the easier step. The main difference is the definition of an imbalanced vertex. Now, we say a vertex is imbalanced if there is an integer such that and , where is the total weight of hypotheses in the subtrees of all children of except the majority vertex, , the child of with the largest weight of hypotheses. Under this definition, a similar analysis follows. Note that could be much larger than in this case, but this does not affect the proof much. A little more care is needed in the entropy argument for balanced vertices, and with the MSSC instance defined by a path now taking to be all hypotheses that do not take the majority answer of with respect to the MSSC universe. Note that, if we specialize to $K=2$, the value is simply .
In the weighted case, we again define to be imbalanced if there is an integer such that and . We again bound the cost of the balanced vertices by an entropy argument, and the cost of the imbalanced vertices via a connection to Min Sum Set Cover. However, because the entities are now weighted, we need to consider the greedy algorithm for a weighted generalization of MSSC called Weighted Min Sum Set Cover (WMSSC). In order to make the connection between the greedy decision tree and the greedy solution to WMSSC, we need a somewhat technical definition: call a vertex -heavy if is consistent with and . Define if there exists such that is -heavy, and set otherwise. One can easily check that, for any vertex , there is at most one such that is -heavy, so is well defined. Now, we follow the argument in the uniform case, bounding
[TABLE]
where and are the greedy solution and optimal solution, respectively, to the corresponding WMSSC. The first inequality holds because for all and every imbalanced vertex is in some chain (it is an inequality because some imbalanced vertices may be in multiple chains). The second inequality holds by a technical lemma (Lemma A.15) comparing the greedy decision tree with a greedy solution to WMSSC. Just as for MSSC, the greedy algorithm gives a 4-approximation for WMSSC, so the third inequality holds. Additionally, for all , we can still bound , the sum of all WMSSC costs in a single level, by , so the fourth inequality holds. To finish:
[TABLE]
The last inequality (Lemma A.20) comes from comparing, for fixed , the vertices of the greedy tree that are -heavy to an appropriate Set Cover instance, and using the fact that the greedy algorithm on a weighted generalization of Set Cover gives a approximation (Theorem A.19).
5 Sketch of proof of Theorem 3.2
5.1 Algorithm
We describe the algorithm that achieves a approximation for . In Section 7, we give the details and describe how the same algorithm with minor adjustments gives an improved approximation guarantee for Uniform Decision Tree.
The key idea in the algorithm is that, if the optimal tree has cost at least , then the greedy algorithm gives an approximation by Theorem 3.1. Fix . Our algorithm first computes the greedy tree. If the cost of the greedy tree is at least , we simply return the greedy tree. Otherwise, we perform an exhaustive search over decision trees of depth at most such that all hypotheses not consistent with vertices at depth are uniquely distinguished. We choose such a tree with minimum cost (see definition of below). Finally, at each leaf of at depth , we recursively compute a decision tree that distinguishes the hypotheses consistent with . The runtime of this algorithm is dominated by the exhaustive search, which we can solve in time using a divide-and-conquer algorithm.
Let denote the cost of a decision tree with respect to hypothesis set , given by
[TABLE]
where is the depth of the deepest vertex of consistent with . In this way, we have . To solve the Decision Tree instance, we run FullTree below.
5.2 Analysis sketch
We now sketch an analysis of the algorithm. First, it is easy to check that FullTree returns a valid decision tree. By Theorem 3.1, when the greedy tree is used in the recursive call FullTree, it gives an approximation to the instance induced by . Hence, by careful bookkeeping, the greedy trees included in the output tree contribute at most to the cost (Lemma 7.4). If the greedy tree is not used, then, in the optimal tree, the weighted average depth of the hypotheses is at most . Hence, by a simple counting argument, at each recursive call, the fraction of undistinguished hypotheses shrinks by a factor of , so the maximum depth of recursive calls is (Lemma 7.6). Careful bookkeeping shows that, for any , the outputs to PartialTree called from the th level of recursion collectively contribute at most to the cost of the output tree (Lemma 7.5). Hence, the trees computed by exhaustive search across all levels of recursion contribute a cost of . Hence, the cost of our output tree is .
6 Proof of Theorem 3.1 for uniform weights and $K=2$
We prove a special case of Theorem 3.1 when $K=2$ and the weights are uniform, that is, we show that the greedy algorithm gives an approximation for Uniform Decision Tree with binary tests. Throughout this section, we have a Uniform Decision Tree instance with hypotheses and tests .
Theorem 6.1.
For any instance of the Uniform Decision Tree problem on $n$ hypotheses with branching factor 2, and any greedy tree with average cost , we have
[TABLE]
6.1 Notation
We use the following notation for our proof. These notations help us reason about the greedy tree. We write to mean that is a vertex of tree , and we write to mean that is an interior vertex. We say the length of a path in the tree is the number of edges along the path. For , we say is an ancestor of if there is a (possibly degenerate) path from to going down the tree. In particular, is an ancestor of . We write this as . We call a descendant of if and only if is an ancestor of . For , let denote the set of hypotheses consistent with . For a subset of hypotheses, denote its weight or cost by . For brevity, let , denoting the weight of vertex , and we say the weight of a set of vertices is the sum of the weights of the individual vertices in the set.
6.2 The basic argument
The following lemma shows that, rather than accounting for the cost of the greedy tree by summing the depths of the leaves associated with the hypotheses, we can instead account for the cost by summing the weights of vertices of the tree.
Lemma 6.2.
We have .
Proof.
We have,
[TABLE]
where, in the third equality, we switched the order of summation. ∎
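The identity behind Lemma 6.2 can be checked numerically on a small tree. The following is a minimal sketch, not the paper's code: it assumes uniform weights normalized so that a vertex's weight is the fraction of hypotheses consistent with it, and it uses a hypothetical nested-tuple encoding of binary trees. The names `EXAMPLE`, `average_depth`, and `interior_weight_sum` are illustrative only.

```python
from fractions import Fraction

# Hypothetical encoding: a leaf is a hypothesis label, an interior vertex is a
# pair (left_subtree, right_subtree).


def leaves(tree):
    """Return the hypotheses consistent with the vertex at the root of `tree`."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def total_leaf_depth(tree, depth=0):
    """Sum of the depths of all leaves below this vertex."""
    if not isinstance(tree, tuple):
        return depth
    return sum(total_leaf_depth(child, depth + 1) for child in tree)


def total_interior_size(tree):
    """Sum, over interior vertices, of the number of hypotheses consistent with them."""
    if not isinstance(tree, tuple):
        return 0
    return len(leaves(tree)) + sum(total_interior_size(child) for child in tree)


def average_depth(tree):
    """Cost of the tree: average depth of a leaf under uniform weights."""
    return Fraction(total_leaf_depth(tree), len(leaves(tree)))


def interior_weight_sum(tree):
    """Sum of normalized weights of interior vertices (assumed normalization: 1/n each)."""
    return Fraction(total_interior_size(tree), len(leaves(tree)))


EXAMPLE = ((("a", "b"), "c"), ("d", ("e", ("f", "g"))))

if __name__ == "__main__":
    # Both quantities equal 3 for this seven-leaf tree.
    print(average_depth(EXAMPLE), interior_weight_sum(EXAMPLE))
```

Each hypothesis at depth d is counted once by each of its d interior ancestors, which is the exchange of summation used in the proof.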
At a high level, our proof defines balanced and imbalanced vertices (next subsection) using a parameter and bounds the weight of the balanced and imbalanced vertices separately. We bound the weight of the balanced vertices by an entropy argument, and the weight of the imbalanced vertices by partitioning the imbalanced vertices into paths, called chains, and bounding the weights of each chain separately. Overall, we get the following bound.
[TABLE]
Choosing gives .
For the rest of the proof, fix . Additionally, for convenience and without loss of generality, assume that our instance is nontrivial, i.e. there is some test such that both of and have at least 2 hypotheses, as otherwise the greedy tree is optimal and the theorem is true.
6.3 More notation: Majority and minority answers
We define majority (minority) answers, edges, and children. These definitions are useful for defining balanced and imbalanced vertices. We later show that imbalanced vertices form paths whose edges are majority edges. We call these paths chains. We then analyze the balanced and imbalanced vertices separately, and in particular analyze each path of majority edges separately.
For each vertex in the greedy tree, let denote the test used at . For each vertex , label its children by and so that , with ties broken (footnote 6: any tiebreaking procedure suffices, as long as the tiebreaking is consistent with the and notation in the next paragraph) by labeling by the vertex corresponding to a test outputting 1 (footnote 7: it is possible to have a vertex that has one child, namely a test that doesn't distinguish any pairs of hypotheses at a vertex, but such a test is useless and never appears in either the greedy or optimal tree, so we assume such vertices don't exist). Accordingly, we have for all . Call the edge from to a majority edge, and the edge from to a minority edge. This is illustrated in Figure 4.
In order to reason about the greedy tree precisely, we use the following notation which is more technical. For test and hypotheses , let be the answer to test that accounts for the maximum weight of hypotheses in , and let be the other index, with ties broken by . In other words, and are chosen so that . We call the majority answer of test with respect to hypothesis set . Call the other answer the minority answer of test with respect to hypothesis set . For all and , let
[TABLE]
We think of () as the set of hypotheses that, under test , output the majority (minority) answer to test with respect to set . Note that, with the above notation, we have and .
The following is a key property of the greedy tree : the weight of hypotheses consistent with the minority child decreases as we descend the tree.
Lemma 6.3.
For any vertices of with a descendant of , we have .
Proof.
Because was constructed greedily, for all , the test was chosen to maximize the weight of , the hypotheses in giving the minority answer . Hence, any other test, in particular, the test chosen at vertex , has a smaller weight of hypotheses of that give the minority answer of with respect to hypotheses . Hence, we have . Hence,
[TABLE]
The second inequality holds because . The third inequality holds because test defines a partition of into two parts, and is one of the two parts, so is one of or . ∎
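The greedy rule and the monotonicity property of Lemma 6.3 can be illustrated on a toy instance. The sketch below is an assumption-laden illustration, not the authors' implementation: it assumes uniform weights and binary tests, represents each test as a tuple of 0/1 answers indexed by hypothesis, and uses hypothetical names (`greedy_tree`, `minority_never_grows`, `HYPS`, `TESTS`).

```python
def split(hyps, test):
    """Partition hyps by the binary answer that `test` assigns to each hypothesis."""
    zeros = [h for h in hyps if test[h] == 0]
    ones = [h for h in hyps if test[h] == 1]
    return zeros, ones


def minority_size(hyps, test):
    """Number of hypotheses in hyps that give the less common answer to `test`."""
    return min(len(part) for part in split(hyps, test))


def greedy_tree(hyps, tests):
    """Build a greedy tree: at each vertex use the test with the largest minority side."""
    if len(hyps) == 1:
        return {"hyp": hyps[0]}
    best = max(range(len(tests)), key=lambda i: minority_size(hyps, tests[i]))
    zeros, ones = split(hyps, tests[best])
    if not zeros or not ones:
        raise ValueError("remaining hypotheses cannot be distinguished")
    return {
        "test": best,
        "minority": min(len(zeros), len(ones)),
        "children": [greedy_tree(zeros, tests), greedy_tree(ones, tests)],
    }


def minority_never_grows(node, bound=None):
    """Check that the minority size does not increase along any root-to-leaf path."""
    if "hyp" in node:
        return True
    if bound is not None and node["minority"] > bound:
        return False
    return all(minority_never_grows(c, node["minority"]) for c in node["children"])


# Toy instance (an assumption for illustration): six hypotheses, one
# "is it hypothesis i?" test per hypothesis, plus one perfectly balanced test.
HYPS = list(range(6))
TESTS = [tuple(int(h == i) for h in HYPS) for i in HYPS] + [(0, 0, 0, 1, 1, 1)]
TREE = greedy_tree(HYPS, TESTS)

if __name__ == "__main__":
    print(TREE["test"], TREE["minority"], minority_never_grows(TREE))
```

On this instance the root uses the balanced test (index 6), and every vertex below it has a minority side of size 1, consistent with the lemma.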
6.4 Defining balanced and imbalanced vertices
In the following definition, we identify balanced vertices and imbalanced vertices. By Lemma 6.2, we can separately bound the weights of the balanced and imbalanced vertices.
Definition 6.4.
Let be a positive integer.
1. We say a vertex is level- imbalanced if and .
2. We say a vertex is imbalanced if it is level- imbalanced for some , and balanced otherwise.
3. We say a level- imbalanced vertex is minimal if no descendant of is also a level- imbalanced vertex, and a level- imbalanced vertex is maximal if no ancestor of is level- imbalanced.
Let
[TABLE]
and note that level- imbalanced vertices exist only for . The following lemma proves a structural result about imbalanced vertices, with the punchline being item (iii), which permits Definition 6.6. For an illustration, see Figure 5.
Lemma 6.5.
Let be a positive integer.
- (i) If is a level- imbalanced vertex, then, among the children of , only can be a level- imbalanced vertex.
- (ii) Additionally, if and are level- imbalanced vertices and is an ancestor of , then every vertex on the path from to is a level- imbalanced vertex.
- (iii) Finally, the set of level- imbalanced vertices can be partitioned into vertex disjoint paths, each of which connects a maximal level- imbalanced vertex to a minimal level- imbalanced vertex and contains only majority edges.
Proof.
For (i), note that if is level- imbalanced, then , so cannot be level- imbalanced. Hence, among the children of , only can be level- imbalanced.
For (ii), let be three vertices in the tree. Suppose that and are level- imbalanced. We know that , and Lemma 6.3 gives . Hence is level- imbalanced.
For (iii), note that each level- imbalanced vertex has a maximal level- imbalanced ancestor (possibly itself), so we may partition the level- imbalanced vertices into sets based on their maximal level- imbalanced ancestor. We claim each set in the partition is a connected path. Let be the (unique) maximal level- imbalanced vertex in a set . For , if has a level- imbalanced child, let be that child, which is unique by the first item and in by definition. Let be the largest index such that is defined. Then has no level- imbalanced children. We claim are the only vertices in the set . Suppose not. Let be the largest index such that has a level- imbalanced descendant not among . Then, by the second item, every vertex on the path from to is level- imbalanced. If , this means has a level- imbalanced child, a contradiction of the maximality of . Otherwise, as is maximal, is not on the path from to , in which case, by (ii), is level- imbalanced, which contradicts (i). Thus, we always have a contradiction, so is the path . By (i), every edge along is a majority edge. This completes the proof. ∎
Lemma 6.5 motivates the following definition.
Definition 6.6.
Let be a positive integer. A level- chain, , is a path of level- imbalanced vertices starting at a maximal level- imbalanced vertex and ending at a minimal level- imbalanced vertex. By Lemma 6.5, the level- chains partition the level- imbalanced vertices. We therefore let denote the set of level- chains.
In general, for , a level- chain might overlap with a level- chain.
6.5 Bounding the weight of balanced vertices
We first prove a lemma that justifies the choice of the word "balanced".
Lemma 6.7.
For every balanced vertex , we have .
Proof.
Assume for contradiction that . Let be the real number such that . In this way, . Then
[TABLE]
This implies that is level- imbalanced, so is imbalanced, a contradiction. ∎
We now bound the contribution of the balanced vertices to the weight using an entropy argument.
Lemma 6.8.
We have
[TABLE]
Proof.
For a vertex with a test of index , let denote the binary random variable equal to for a hypothesis chosen uniformly at random from the hypotheses of . Let denote the entropy of a random variable. The entropy of a uniformly random hypothesis in equals . On the other hand, we can pick a uniformly random hypothesis in by starting at the root vertex , sampling an answer for the test at , stepping to , the child of corresponding to the chosen answer, and repeating with , until we reach a leaf. In this process, at any vertex , the probability of stepping to a child is exactly . Hence, by a simple induction, the probability of reaching any vertex in the tree during this process is exactly . The total entropy of this process is thus , as is the probability of reaching vertex . For a balanced vertex , Lemma 6.7 implies . Hence,
[TABLE]
We conclude
[TABLE]
and rearranging gives the desired result. ∎
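The chain-rule step in this entropy argument can be checked numerically: for any binary decision tree on n equally likely hypotheses, the reach-probability-weighted sum of the per-vertex split entropies equals the entropy of a uniform hypothesis. The sketch below is illustrative only. It assumes base-2 logarithms (the paper's base is not visible in this section) and a hypothetical nested-tuple tree encoding, and all names are made up for the example.

```python
import math

# Hypothetical encoding: a leaf is a hypothesis label, an interior vertex is a
# pair (left_subtree, right_subtree).


def size(tree):
    """Number of hypotheses consistent with the vertex at the root of `tree`."""
    if not isinstance(tree, tuple):
        return 1
    return sum(size(child) for child in tree)


def binary_entropy(q):
    """Entropy in bits of a binary random variable that equals 1 with probability q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def weighted_entropy_sum(tree, n):
    """Sum over interior vertices of (probability of reaching it) * (entropy of its split)."""
    if not isinstance(tree, tuple):
        return 0.0
    s = size(tree)
    reach_probability = s / n            # uniform prior over the n hypotheses
    left_fraction = size(tree[0]) / s    # probability of stepping to the left child
    here = reach_probability * binary_entropy(left_fraction)
    return here + sum(weighted_entropy_sum(child, n) for child in tree)


EXAMPLE = ((("a", "b"), "c"), ("d", ("e", ("f", "g"))))

if __name__ == "__main__":
    n = size(EXAMPLE)
    print(weighted_entropy_sum(EXAMPLE, n), math.log2(n))
```

The two printed values agree up to floating-point rounding. The proof then lower bounds the entropy of each balanced vertex's split to turn this identity into a bound on the total weight of balanced vertices.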
6.6 Bounding the weight of imbalanced vertices
We now bound the weight of imbalanced vertices using a connection to MSSC.
6.6.1 Defining Min Sum Set Cover
Recall that and for all and .
Definition 6.9.
Let denote the instance of Min Sum Set Cover that is induced by the chain . This instance is given by
- universe ,
- for , sets , and
- for each , a singleton set . (Footnote 8: Some of these sets are empty, but we include them for notational convenience.)
Note we have a total of sets. A solution to the instance is a permutation corresponding to an ordering of the sets, and the cost of a solution is the average of the cover times of the elements in the universe . Formally,
[TABLE]
Note that the cost of any solution is finite, as each hypothesis is in some set . We sometimes refer to a solution by the sets .
Remark 6.10.
Since the initial Uniform Decision Tree instance always has a solution, any two hypotheses can be distinguished by one of the tests. Hence, there is at most one hypothesis such that, for all , we have . In other words, all but one of the sets for are not used.
Definition 6.11.
We say a solution to is greedy at index if the set covers the maximum number of elements not covered by sets . We say a solution is greedy if it is greedy at index for all .
Note that, in the case of ties, there may be multiple greedy solutions to . Note also that, for any partial assignment , one can always complete the solution greedily, so that is greedy at indices . Definition 6.11 lets us leverage the following theorem, due to Feige, Lovász, and Tetali.
Theorem 6.12 (Theorem 1 of [FLT04]).
The greedy algorithm gives a 4-approximation to the MSSC problem. Formally, let be any greedy solution to the instance , and let denote an optimal solution to . We have
[TABLE]
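To make the objects in Theorem 6.12 concrete, the following sketch computes the MSSC cost of an ordering as the average cover time, builds a greedy ordering, and compares it against a brute-force optimum on a toy instance. It is a minimal illustration under stated assumptions: cover times are taken to be 1-based positions, ties are broken by list order, and the instance `U`, `SETS` and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def mssc_cost(universe, ordered_sets):
    """Average cover time: an element is covered at the 1-based position of the
    first set in the ordering that contains it."""
    total = 0
    for element in universe:
        total += next(i for i, s in enumerate(ordered_sets, start=1) if element in s)
    return Fraction(total, len(universe))


def greedy_order(universe, sets):
    """Repeatedly pick a set covering the most still-uncovered elements."""
    remaining = list(sets)
    uncovered = set(universe)
    order = []
    while remaining:
        best = max(remaining, key=lambda s: len(s & uncovered))
        remaining.remove(best)
        order.append(best)
        uncovered -= best
    return order


def optimal_cost(universe, sets):
    """Brute-force optimum over all orderings; only feasible for tiny instances."""
    return min(mssc_cost(universe, order) for order in permutations(sets))


# Toy instance (an assumption for illustration); every element lies in some set.
U = frozenset(range(1, 7))
SETS = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 5}),
    frozenset({3, 4, 6}),
    frozenset({5, 6}),
]

if __name__ == "__main__":
    greedy = mssc_cost(U, greedy_order(U, SETS))
    best = optimal_cost(U, SETS)
    print(greedy, best, greedy <= 4 * best)
```

On this instance greedy happens to be optimal, so the factor 4 is far from tight here. The example only shows how the quantities in the theorem are computed.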
6.6.2 Bounding chain weight above by MSSC cost
This section shows that the weight of a chain is bounded by the cost of its corresponding MSSC instance. To do this, we need the following technical lemma, which shows that following the choices of the greedy tree yields a greedy solution to the MSSC instance, and hence the two weights are comparable.
Lemma 6.13.
Let be a positive integer and let be a level- chain. Then there exists a greedy solution to , such that
[TABLE]
Proof.
Let . Let be the universe and be the sets of the instance . For , let be the test used at vertex . Define a solution to by setting for , and completing the solution greedily. We claim is a greedy solution. To prove this, we show the following.
- (i) For all and , the majority answer for test with respect to is the same as the majority answer for test with respect to . Equivalently, for all and , we have . As an immediate consequence, we know contains all of and none of .
- (ii) The set of hypotheses of not covered by is exactly .
- (iii) For each , among sets , the set covers the maximum number of hypotheses in , i.e. we have
[TABLE]
These points suffice, as (ii) and (iii) tell us that is greedy at indices , so by construction is greedy.
To show (i), fix and . As is level- imbalanced, we also have and and , so accounts for more than half of the hypotheses in . On the other hand, as is level- imbalanced, we have , so the majority answer for test with respect to hypothesis set also accounts for more than half of the hypotheses in . Hence for all and .
Item (ii) follows because is the set of hypotheses consistent with , which was obtained by following the majority edges from . This means contains all the hypotheses of not consistent with a minority child of one of . By (i), this is exactly .
For (iii), at vertex in the greedy decision tree, the test index maximizes the weight . By (i), this index equivalently maximizes , as desired. This completes the proof that is greedy.
We now return to the proof of Lemma 6.13. Take the greedy solution given above. For , the set of vertices of not covered by is exactly , which has weight . Hence, by (27),
[TABLE]
as desired. ∎
By Theorem 6.12, we have the following immediate corollary.
Corollary 6.14.
Let be any chain, and be the optimal solution to . Then
[TABLE]
6.6.3 Bounding MSSC cost above by
We now show that the optimal MSSC solution can be compared to the optimal decision tree cost, . For all chains , let be a greedy solution to given by Lemma 6.13, and let be an optimal solution to .
Lemma 6.15.
Let be a positive integer. We have
[TABLE]
Proof.
Let be the universe of the instance , and let be the sets. Construct a path in such that is the root and is a leaf, which is identified with some hypothesis , and, for , if the test at vertex has index , the edge to its child corresponds to the answer , the majority answer of test with respect to set . Since we follow the edges with label , this corresponds to following the path for an hypothesis contained in . In other words, we have, for ,
[TABLE]
Thus the sequence covers , and thus gives a valid solution to the instance , where , and on larger indices is arbitrarily chosen. Note that the depth of the leaf for hypothesis is at least the number of vertices of that are on the root-to-leaf path of , and this number is , except for , in which case it is 1 smaller. Furthermore, for some , the depth of leaf is at least , because otherwise all the branches leaving the path have one leaf, which can only happen if our Uniform Decision Tree instance is trivial, and it is not trivial by assumption. Thus,
[TABLE]
The and account for the lower order terms described above. Summing (6.6.3) over gives
[TABLE]
where in the first inequality, we used that every leaf has at most one maximal level- imbalanced ancestor, and hence it is in at most one MSSC universe . ∎
6.6.4 Bounding imbalanced vertex weight above by
We now finish our bound of the weight of imbalanced vertices.
Lemma 6.16.
We have
[TABLE]
Proof.
Each imbalanced vertex is level- imbalanced for some positive integer , so it is part of some level- chain, . Hence,
[TABLE]
The first inequality is not an equality because some vertices may be level- imbalanced for more than one integer . The second inequality is by Corollary 6.14. The third inequality is by Lemma 6.15. ∎
6.7 Finishing the proof
Proof of Theorem 3.1.
We have
[TABLE]
as desired. In the first inequality, we used Lemma 6.8 and Lemma 6.16. ∎
7 Proof of Theorem 3.2
For the entirety of this section, fix , , and an instance of . We first analyze Algorithm 1, showing that running FullTree gives a approximation for , and then describe how the algorithm can be modified to give a approximation for Uniform Decision Tree in subexponential time.
7.1 Runtime
Lemma 7.1.
Algorithm 2 runs in time .
Proof.
Each call to Algorithm 2 makes at most recursive calls with one less depth. This means the total number of recursive calls is . Since the local runtime of each call is , the total runtime is . ∎
Lemma 7.2.
Algorithm 1 runs in time .
Proof.
The cost is dominated by the cost of Algorithm 2. The depth of recursive calls is at most and the width of the recursive call tree is at most , thus the total runtime is at most times the runtime of Algorithm 2. Thus the runtime is . ∎
7.2 Notation
To formally analyze the approximation guarantees of Algorithm 1, we need to generalize some earlier definitions. We say a decision tree is complete with respect to hypothesis set up to depth if, for all hypotheses , either there exists a leaf of with , or . Note that is complete with respect to hypothesis set if it is complete up to depth for all .
Given the tests and a subset of hypotheses, let be the instance induced by . It is given by hypotheses and the tests restricted to domain . It is easy to check that, for this instance, we indeed have . We let denote a greedy tree for the instance . We let denote an optimal tree for instance . We also define the optimal tree for up to depth by
[TABLE]
Importantly, is computable in time by Algorithm 2, a straightforward recursive algorithm. For convenience, let
[TABLE]
In this way, we have
[TABLE]
7.3 Approximation guarantee
It is easy to see that PartialTree computes (or one such tree, if there are several). Let be the family of hypothesis sets such that FullTree is called at the th level of recursion and the greedy tree is not returned. We consider FullTree to be the 0th level of recursion so that . Let denote the family of hypothesis sets such that FullTree is called and is returned in that call. Let be a partition such that if and only if . We know that, for any and , if FullTree is called recursively from FullTree, then . Thus, under these definitions, we know that, for all , the hypothesis sets of are pairwise disjoint, and the hypothesis sets of are pairwise disjoint.
By a double-counting argument, the cost of the output tree is the weighted sum of the partial trees and greedy trees computed in the recursion. Formally, if is the output tree,
[TABLE]
Our proof bounds the depth of the recursion, as well as the summand components.
Lemma 7.3.
Let be a collection of disjoint subsets of . Then
Proof.
Recall that is the optimal tree of the instance. By optimality of for the instance induced by , we have . Hence,
[TABLE]
Lemma 7.4.
We have
[TABLE]
Proof.
For all hypothesis sets , Algorithm 1 guarantees that . By Theorem 3.1, the greedy algorithm gives a approximation on the instance induced by . Hence, for all . By Theorem 3.1 again, for all , we have
[TABLE]
Hence,
[TABLE]
The last inequality is by Lemma 7.3 and the fact that the are disjoint. Adding (48) and (49) gives the desired result. ∎
Lemma 7.5.
For all , we have
[TABLE]
Proof.
For any , we have . Summing over all , we have
[TABLE]
where the last inequality follows from Lemma 7.3 and that are disjoint. ∎
Lemma 7.6.
The maximum recursion depth in Algorithm 1 is at most
Proof.
We show by induction that, for all ,
[TABLE]
This suffices, as then, for , we have . By Algorithm 1, we have for all (otherwise we take the greedy tree). Thus, the maximum depth of the recursion in Algorithm 1 is less than .
Note that equality holds in (52) for , so the base case is true. For the inductive step, fix , and let . Note that
[TABLE]
Consider a random variable equal to the depth in tree of a random hypothesis in where is chosen with probability proportional to . By above, . Hence, by Markov's inequality, . Thus, the total weight of hypotheses in that are in the next recursive call, i.e. in for some , is at most . This holds for any , so we conclude
[TABLE]
This completes the induction, proving the lemma. ∎
Lemma 7.7.
Let be the tree returned by FullTree. Then .
Proof.
By (43) and Lemmas 7.4, 7.5, and 7.6, we have
[TABLE]
7.4 Uniform Decision Tree
We now describe how to modify Algorithm 1 to give a approximation for Uniform Decision Tree in subexponential time. By the remark at the end of Theorem 3.1, for all there exists an such that for all , the greedy algorithm gives a approximation on Uniform Decision Tree. Hence, the following modified greedy algorithm runs in polynomial time and gives a approximation: for , run the greedy algorithm, and for , compute the optimal tree by brute force in constant time. For Uniform Decision Tree, set , use the modified greedy algorithm instead of the greedy algorithm, and return the output of the modified greedy algorithm if (rather than ) and keep the rest of Algorithm 1 the same. Lemma 7.3 still holds. For uniform weights, we have , so . Similar to Lemma 7.4, we are guaranteed that for all , and thus
[TABLE]
Lemma 7.5 still holds. In Lemma 7.6 the maximum depth of recursion is now as the weight of hypotheses at each recursive call shrinks by a factor of and the weight of hypotheses at each nonempty level is at most . Hence, the cost of the output tree has a contribution of at most from the greedy trees and at most from the outputs of the PartialTree, for a total cost of at most .
8 Related Work
Several other works analyze Decision Tree in a variety of settings, aiming to achieve the gold standard . While we examined the case with -ary tests and non-uniform weights, we assumed that the tests had equal costs. Other works [GB09, GNR10] analyze the case where the test costs are non-uniform. [GB09] shows that the greedy algorithm yields when either the costs are non-uniform or the weights are non-uniform (with the rounding trick) but not both. [GNR10] introduces a new algorithm that achieves with both non-uniform weights and costs.
In this work we studied the average depth of decision trees. We remark that, in the worst-case decision tree problem, where the cost of a tree is defined to be the maximum depth of a leaf in the tree, the approximability is known. The greedy algorithm gives an approximation [AMM+98], and obtaining a approximation is NP-hard [LN04].
For the worst-case decision tree problem, there is a line of work that examines the absolute query rate rather than the query rate relative to the optimal. In this line of work, the chief goal is to identify conditions where the greedy algorithm achieves the information-theoretically optimal rate . One such condition that ensures the rate is "sample-rich" [NJC12], which states that every binary partition of the hypotheses has a test with matching pre-images. [Now09, Now11] introduced the more lenient -neighborly condition, which requires that every two tests be connected by a sequence of tests where neighboring tests disagree on at most hypotheses. An even more general condition is the split-neighborly condition [ML18], which is satisfied if every two tests are connected by a sequence of tests where neighboring tests must have every subset of the disagreeing hypotheses be evenly split by some other test.
9 Conclusion
There are several open questions left by our work.
1. Could one prove hardness of approximation results for Uniform Decision Tree for ratios larger than ? It would be interesting to prove either NP-hardness results for larger constant factor approximations, or fine-grained complexity results for larger approximation ratios such as in [MR17].
2. On the flip side, could one find faster, perhaps polynomial time algorithms for approximating Uniform Decision Tree for ratios where we now have subexponential time algorithms?
3. One can also consider a generalization of Decision Tree when the test costs are non-uniform [GB09, GNR10]. Could one obtain similar results in this setting?
10 Acknowledgements
The authors thank Joshua Brakensiek for helpful discussions and feedback on an earlier draft of this paper. The authors also thank Mary Wootters and the anonymous reviewers for helpful feedback on an earlier draft of this paper.
Appendix A Proof of Theorem 3.1
We now give a proof of Theorem 3.1, highlighting the differences with the proof of the special case in Section 6, and suppressing parts of the proof that are identical.
A.1 Notation
We reuse all of the notation in Section 6.1. The only difference is that, in this section, is not necessarily equal to . Just as in Section 6, fix .
A.2 The basic argument
Lemma 6.2 is still true, and we restate it for completeness.
Lemma A.1 (Lemma 6.2, restated).
We have .
At a high level, our proof defines balanced and imbalanced vertices (next subsection) using the parameter and bounds the weight of the balanced and imbalanced vertices separately. We bound the weight of the balanced vertices by an entropy argument, and the weight of the imbalanced vertices by partitioning the imbalanced vertices into paths, called chains, and bounding the weights of each chain separately. If a vertex has a heavy hypothesis (defined in Section A.6.1), we set , and otherwise we set . To bound the cost of imbalanced vertices, we also need to bound the costs of heavy hypotheses . Overall, we get the following bounds.
[TABLE]
A.3 More notation: Majority and minority answers
Again, we define majority (minority) answers, edges, children, which are useful for defining balanced and imbalanced vertices.
For each vertex in the greedy tree, let denote the test used at . For each vertex , label its children by and so that for all , with ties broken (footnote 9: any tiebreaking procedure suffices, as long as the tiebreaking is consistent with the and notation in the next paragraph) by labeling by the vertex corresponding to the largest answer (footnote 10: it is possible to have a vertex that has one child, namely a test that doesn't distinguish any pairs of hypotheses at a vertex, but such a test is useless and never appears in either the greedy or optimal tree, so we assume it doesn't exist). Call the edge from to a majority edge (footnote 11: here, we may have , so the weight of hypotheses consistent with does not necessarily constitute a majority; however, this difference does not affect the proof, and we keep the wording to stay consistent with Section 6), and the edges from to minority edges. Call the minority child of and call the minority children of . Let be the hypotheses consistent with the minority children of , and let their weight be . Accordingly, we have for all . This is illustrated in Figure 4.
In order to reason about the greedy tree precisely, we use the following notation which is more technical. For test and hypotheses , let be the answer to test that accounts for the maximum weight of hypotheses in , with ties broken by choosing the largest indexed answer . We call the majority answer of test with respect to hypothesis set . Call the other answers the minority answers of test with respect to hypothesis set . For all and , let
[TABLE]
We think of () as the set of hypotheses that, under test , output the majority (minority) answer to test with respect to set . Note that, with the above notation, we have and . Under these more general definitions, a generalization of Lemma 6.3 holds. The proof is identical to that of Lemma 6.3, so we omit it.
Lemma A.2.
For any vertices of with a descendant of , we have .
A.4 Defining balanced and imbalanced vertices
In the following definition, we identify balanced vertices and imbalanced vertices. By Lemma A.1, we can separately bound the weights of the balanced and imbalanced vertices.
Definition A.3.
Let be a positive integer.
1. We say a vertex is level- imbalanced if and .
2. We say a vertex is imbalanced if it is level- imbalanced for some , and balanced otherwise.
3. We say a level- imbalanced vertex is minimal if no descendant of is also a level- imbalanced vertex, and a level- imbalanced vertex is maximal if no ancestor of is level- imbalanced.
Let
[TABLE]
and note that interior level- imbalanced vertices exist only for . The following lemma proves a structural result about imbalanced vertices, with the punchline being item (iii), which permits Definition A.5. The proof of Lemma A.4 is nearly identical to that of Lemma 6.5. We include a proof of item (i) because of a subtle difference from the proof of item (i) of Lemma 6.5. However, the proofs of the other two parts are identical, so we omit them.
Lemma A.4.
Let be a positive integer.
- (i) If is a level- imbalanced vertex, then, among the children of , only can be a level- imbalanced vertex.
- (ii) Additionally, if and are level- imbalanced vertices and is an ancestor of , then every vertex on the path from to is a level- imbalanced vertex.
- (iii) Finally, the set of level- imbalanced vertices can be partitioned into vertex disjoint paths, each of which connects a maximal level- imbalanced vertex to a minimal level- imbalanced vertex and contains only majority edges.
Proof of (i).
Note that if is level- imbalanced, then , which means every different from satisfies , so such cannot be level- imbalanced. Hence, among the children of , only can be level- imbalanced. ∎
Lemma A.4 motivates the following definition.
Definition A.5.
Let be a positive integer. A level- chain, , is a sequence of level- imbalanced vertices starting at a maximal level- imbalanced vertex and ending at a minimal level- imbalanced vertex. By Lemma A.4, the level- chains partition the level- imbalanced vertices. We therefore let denote the level- chains.
In general, for , a level- chain might overlap with a level- chain.
A.5 Bounding the weight of balanced vertices
Under these definitions, a generalization of Lemma 6.7 is still true. The proof is identical to that of Lemma 6.7, so we omit it.
Lemma A.6.
For every balanced vertex , we have .
We now bound the contribution of the balanced vertices to the weight using an entropy argument. Now, in the general case, the entropy argument requires a little more care when bounding the entropy of a single -ary test.
Lemma A.7.
We have
[TABLE]
Proof.
For a vertex with a test of index , let denote the random variable supported on that is equal to for an hypothesis chosen randomly from the elements of , where the probability of choosing is proportional to . Let denote the entropy of a random variable, and by abuse of notation, let . By abuse of notation, for nonnegative summing to 1, let where is taken to be 0. The entropy of a random element chosen according to the prior distribution is at most . On the other hand, we can pick a random hypothesis in according to the distribution p by setting to the root of , sampling an answer for the test at , setting to the child of corresponding to the chosen answer , and repeating, until we reach a leaf. In this process, at any vertex , the probability of stepping to a child is exactly . Hence, by a simple induction, the probability of reaching any vertex in the tree during this process is exactly . The total entropy of this process is thus , as is the probability of reaching vertex and is the entropy of the random variable chosen at vertex .
Fix a balanced vertex . We claim that . Let denote the region given by the constraints for all , and , and . We claim that the minimum of for is . To see this, note first that this region is closed and bounded, so the function attains a minimum. Furthermore, note that, for , by concavity of , for any and any , setting gives . Similarly pushing and apart by the same positive also decreases the value of . Hence, the minimum cannot be attained when two of are positive, nor can it be attained when . It follows that the only local minima in the region occur when some is and .
For , let denote the probability that . By Lemma A.6, when is balanced, must, up to a permutation in coordinates, be in region . Hence, by the above we have
[TABLE]
Putting the above two paragraphs together, we conclude
[TABLE]
and rearranging gives the desired result. ∎
A.6 Bounding the weight of imbalanced vertices
We now bound the weight of imbalanced vertices using a connection to Weighted Min Sum Set Cover. For each hypothesis , let denote the leaf in the greedy tree for which hypothesis is consistent. Since is complete, this leaf exists and is unique.
A.6.1 Technical definition: heavy vertices
We need the following technical definition to make the connection between the greedy decision tree and a greedy WMSSC solution.
Definition A.8.
For a vertex and an hypothesis , we say is -heavy if is consistent with and .
Lemma A.9.
Let be a hypothesis.
- (i) If is -heavy, then every vertex on the path from to leaf is -heavy.
- (ii) Additionally, if is -heavy, then every edge on the path from to leaf is a majority edge.
- (iii) Lastly, for any vertex , there exists at most one hypothesis such that is -heavy.
Proof.
Item (i) is true by Lemma A.2, which says that decreases as one descends the tree.
For (ii), it suffices to prove, by the first part, that for every -heavy vertex , the first edge on the path from to is a majority edge. Suppose for contradiction that there exists and an -heavy vertex with a minority child such that is a descendant of . Then , which contradicts the definition of being heavy.
For (iii), suppose for contradiction that there exist two hypotheses and such that is both -heavy and -heavy. Since our Decision Tree instance is well defined, there exists some test that distinguishes and , i.e. . As is -heavy, we have . Hence the answer for hypothesis under test is , the answer to accounting for the maximum weight of hypotheses in : if not, choosing test at vertex would make the weight of hypotheses consistent with a minority child to be . This is a contradiction as and the tree is greedy. However, as is also -heavy, we have, by the same reasoning, that . This is a contradiction, as test was chosen to distinguish and . ∎
We now define some notation for dealing with non-uniform weights , which are well-defined by Lemma A.9.
Definition A.10.
For hypothesis , let be the maximal ancestor of that is -heavy. For vertex , if there exists an such that is -heavy, let , and otherwise let .
A.6.2 Defining Weighted Min Sum Set Cover
Recall and .
Definition A.11.
Let denote the instance of weighted min sum set cover that is induced by the chain . This instance is given by
- universe with weights ,
- for , sets , and
- for each , a set consisting of one element. (Footnote 12: Some of these sets are empty, but we include them for notational convenience.)
Note we have a total of sets. A solution to the WMSSC problem is a permutation corresponding to an ordering of the sets, and the cost of a solution is the weighted sum of the cover times of the elements in the universe . Formally,
[TABLE]
Note that this instance is well defined, as each hypothesis is in some set . We sometimes refer to a solution by the sets .
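The cost just defined can be illustrated with a minimal Python sketch. It assumes the standard reading of "cover time" as the 1-based position of the first set in the ordering that contains the element; the function name `wmssc_cost` and the toy instance are hypothetical and not taken from the paper.

```python
from typing import Dict, Hashable, Sequence, Set


def wmssc_cost(weights: Dict[Hashable, float],
               sets: Sequence[Set[Hashable]],
               order: Sequence[int]) -> float:
    """Return the weighted sum of cover times for an ordering of the sets.

    The cover time of an element is the 1-based position in `order` of the
    first set that contains it (an assumption of this sketch).
    """
    cover_time: Dict[Hashable, int] = {}
    for position, set_index in enumerate(order, start=1):
        for element in sets[set_index]:
            # Only the first set containing an element fixes its cover time.
            cover_time.setdefault(element, position)
    uncovered = set(weights) - set(cover_time)
    if uncovered:
        raise ValueError(f"ordering leaves elements uncovered: {uncovered!r}")
    return sum(w * cover_time[e] for e, w in weights.items())


# Hypothetical toy instance: three elements, three sets.
toy_weights = {"a": 0.5, "b": 0.3, "c": 0.2}
toy_sets = [{"a", "b"}, {"c"}, {"b", "c"}]

cost_in_given_order = wmssc_cost(toy_weights, toy_sets, order=[0, 1, 2])
cost_in_other_order = wmssc_cost(toy_weights, toy_sets, order=[2, 0, 1])
```

On this toy instance the first ordering costs about 1.2 and the second about 1.5, since delaying the heaviest element "a" to position 2 is penalized by its weight.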
Remark A.12.
Since the initial Decision Tree instance is well defined, any two elements can be distinguished by one of the tests. Hence, there is at most one element such that, for all , we have . In other words, all but one of the sets for are unnecessary.
Definition A.13.
We say a solution to is greedy at index if the set covers the maximum number of elements not covered by sets . We say a solution is greedy if it is greedy at index for all .
Note that, in the case of ties, there may be multiple greedy solutions to . Note also that, for any partial assignment , one can always complete the solution greedily, so that is greedy at indices . Definition A.13 lets us leverage the following theorem, due to Golovin and Krause, which generalizes Theorem 6.12. (Footnote 13: In fact, [GK11] considers an even more general problem called Adaptive Stochastic Min-Sum Cover.)
Theorem A.14 (Theorem 5.10 of [GK11]).
The greedy algorithm gives a 4-approximation to the WMSSC problem. Formally, let be any greedy solution to , and let denote an optimal solution. We have
[TABLE]
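As a sanity check of the statement above, the following Python sketch compares a greedy ordering with a brute-force optimum on a tiny instance. It is an illustration under stated assumptions, not the construction used in the proofs: "greedy" is taken to mean picking, at each step, a set covering the most uncovered weight (ties broken by lowest index), every set is assumed to be a subset of the universe, and all function names and data are hypothetical.

```python
from itertools import permutations
from typing import Dict, Hashable, List, Sequence, Set


def ordering_cost(weights: Dict[Hashable, float],
                  sets: Sequence[Set[Hashable]],
                  order: Sequence[int]) -> float:
    """Weighted sum of cover times; assumes `order` covers every element."""
    covered: Set[Hashable] = set()
    total = 0.0
    for position, idx in enumerate(order, start=1):
        newly_covered = sets[idx] - covered
        total += position * sum(weights[e] for e in newly_covered)
        covered |= newly_covered
    return total


def greedy_ordering(weights: Dict[Hashable, float],
                    sets: Sequence[Set[Hashable]]) -> List[int]:
    """Order the sets by repeatedly taking the one covering most new weight."""
    remaining = list(range(len(sets)))
    covered: Set[Hashable] = set()
    order: List[int] = []
    while remaining:
        best = max(remaining,
                   key=lambda i: sum(weights[e] for e in sets[i] - covered))
        order.append(best)
        remaining.remove(best)
        covered |= sets[best]
    return order


def optimal_cost(weights: Dict[Hashable, float],
                 sets: Sequence[Set[Hashable]]) -> float:
    """Brute force over all orderings; only feasible for a handful of sets."""
    return min(ordering_cost(weights, sets, perm)
               for perm in permutations(range(len(sets))))


# Hypothetical toy instance with five elements and five sets.
weights = {1: 0.4, 2: 0.1, 3: 0.1, 4: 0.2, 5: 0.2}
sets = [{1, 2}, {3, 4}, {2, 3, 5}, {4, 5}, {1}]

greedy_order = greedy_ordering(weights, sets)
greedy_cost = ordering_cost(weights, sets, greedy_order)
best_cost = optimal_cost(weights, sets)
```

Brute force enumerates all orderings, so this check is only meant for very small instances; it confirms the factor-4 bound on one example and says nothing about tightness.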
A.6.3 Bounding chain weight above by WMSSC cost
Lemma A.15.
Let be a positive integer and let be a level- chain. Then there exists a greedy solution to , such that
[TABLE]
Proof.
Let be the universe of the instance , and let be the sets. For , let be the test used at vertex . Let be the largest index such that is not -heavy for any , or 0 if no such index exists. If , let be the hypothesis such that is -heavy. By Lemma A.4, all the edges along the path are majority edges. By Lemma A.9, for all , vertex is -heavy. Define a solution to as follows.
- If , for , let and complete the solution greedily.
- Otherwise, for , let , let , let for , and complete the solution greedily.
We claim is a greedy solution. To prove this, we show the following.
- (i) For all and , the majority answer for test with respect to vertex is the same as the majority answer for test with respect to vertex . Equivalently, for all and , we have . As an immediate consequence, we know contains all the hypotheses in and none of the hypotheses in .
- (ii) The set of hypotheses of not covered by is exactly .
- (iii) For each , among sets , set covers the maximum weight of hypotheses in , i.e. we have
[TABLE]
- (iv) For each , among sets , set covers the maximum weight of hypotheses in .
- (v) If , then, among sets , set covers the maximum weight of hypotheses in .
- (vi) If , then, for , among sets , set covers the maximum weight of hypotheses in .
These points suffice for proving that is greedy. If , items (ii) and (iv) tell us that is greedy at indices , so by construction is greedy. If , then (iv), (v), and (vi) tell us that is greedy at indices , so is greedy.
To show (i), fix and . As is level- imbalanced, we also have and and , so is the unique answer in accounting for more than half of the weight of hypotheses in . On the other hand, as vertex is level- imbalanced, we have , so the majority answer for test with respect to hypothesis set is exactly the answer described in the previous sentence. Hence .
Item (ii) follows because is the set of hypotheses consistent with , which was obtained by following the majority edges from . This means contains all the hypotheses of not consistent with a minority child of one of . By the last paragraph, this is exactly .
For (iii), at vertex in the greedy decision tree, the test index maximizes the weight . By (i), this index equivalently maximizes , as desired.
For (iv), at step for , by (i), the set covers a weight of hypotheses in , which is more than by definition of . By (iii), covers at least as much weight of hypotheses in as any of , and, by Remark A.12 and the previous sentence, at least as much as any of .
For (v), by maximality of , we have . Hence, the singleton covers more weight of hypotheses in than any of , and thus, by Remark A.12, than any of .
For (vi), if there exists such that some is -heavy, then, for any , we have and . Hence the set among that covers the most weight of is by (iv). By Remark A.12, the only set among that could cover a larger weight of is , but it in fact covers 0 weight of , so among sets , set covers the maximum weight of hypotheses in . This completes the proof that is greedy.
We now return to the proof of Lemma A.15. Take the greedy solution given above. If , then the set of vertices of not covered by is exactly , which has weight . Hence, by (27),
[TABLE]
Now suppose . Recall that the definition of implies are all -heavy. Hence, we have for , and for . Thus,
[TABLE]
as desired. In the last equality, we used that for . ∎
Let be the greedy solution to given by Lemma A.15, and let be an optimal solution to .
A.6.4 Bounding WMSSC cost above by
Lemma A.16.
Let be a positive integer. We have
[TABLE]
Proof.
Let be the universe of the instance , and let be the sets. Construct a path in such that is the root, is a leaf for hypothesis , and, for , if the test at vertex has index , the edge to its child corresponds to the answer , the majority answer of test with respect to set . Suppose the test at vertex in the optimal tree has index . Since we follow the edges with label , this corresponds to following the path for an hypothesis contained in . In other words, we have, for ,
[TABLE]
Thus the sequence covers , and hence , and thus gives a valid solution to the instance , where , and on larger indices is arbitrarily chosen. Note that the depth of a hypothesis in the tree is at least the number of vertices of that are on the root-to-leaf path of , and this number is , except for , in which case it is 1 smaller. Then we have
[TABLE]
Summing over gives
[TABLE]
∎
Lemma A.17.
We have
[TABLE]
Proof.
Each imbalanced vertex is level- imbalanced for some positive integer , so it is part of some level- chain, . Note that for all vertices . Hence,
[TABLE]
The first inequality is because for all . The second inequality is because for all , and because every imbalanced is in some chain. The third inequality is by Lemma A.15. The fourth inequality is by Theorem A.14. The fifth inequality is by Lemma A.16. Rearranging gives the desired result. ∎
A.7 Bounding the cost contribution of heavy vertices
We bound the cost contribution of the heavy vertices via a connection to SET-COVER. A theorem due to Lovasz [Lov75], Johnson [Joh74], Chvatal [Chv79], and Stein [Ste74] states that, in any instance of SET-COVER where no set covers more than elements, the greedy algorithm gives a approximation. We show a generalization of this result, based on the following definition.
Definition A.18.
Let be an instance of SET-COVER with a universe and sets . Let be a sequence of weights assigned to the elements of . A p-weighted greedy algorithm for SET-COVER repeatedly chooses the set that minimizes , where is the set of uncovered elements.
Theorem A.19.
Let be an instance of SET-COVER with a universe and sets . Let be a sequence of weights assigned to the elements of . Then, if the optimal solution to uses at most sets, then the p-weighted greedy algorithm uses at most sets, where .
While this argument may be known, we are not aware of a reference, so we provide a proof for completeness in Appendix B.
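A p-weighted greedy procedure for SET-COVER can be sketched in Python as follows. The exact selection criterion of Definition A.18 is not recoverable from the text here, so this sketch assumes the rule suggested by the proof of Lemma A.20: at each step, choose a set whose uncovered elements have the largest total weight (ties broken by lowest index). The function name and the toy instance are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Sequence, Set


def weighted_greedy_cover(universe: Iterable[Hashable],
                          sets: Sequence[Set[Hashable]],
                          p: Dict[Hashable, float]) -> List[int]:
    """Return indices of sets chosen by a weight-guided greedy cover.

    Assumed rule: pick the set maximizing the total weight p[x] over its
    still-uncovered elements x, until the universe is covered.
    """
    uncovered = set(universe)
    chosen: List[int] = []
    while uncovered:
        best = max(range(len(sets)),
                   key=lambda i: sum(p[x] for x in sets[i] & uncovered))
        gained = sets[best] & uncovered
        if not gained:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical toy instance: six elements, four sets.
universe = range(1, 7)
sets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6}]

uniform_p = {x: 1.0 for x in universe}
skewed_p = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 5.0, 6: 1.0}

chosen_uniform = weighted_greedy_cover(universe, sets, uniform_p)
chosen_skewed = weighted_greedy_cover(universe, sets, skewed_p)
```

With uniform weights the procedure reduces to the classical greedy algorithm and picks sets 0 and 3; with the skewed weights the heavy element 5 pulls set 1 to the front, followed by set 2. Both covers use two sets here, which is consistent with the theorem bounding the number of sets rather than their identity.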
We now bound .
Lemma A.20.
For all , we have
[TABLE]
Proof.
Fix . Let denote the path from vertex to leaf in the greedy tree. Let denote the SET-COVER instance with the following parameters:
- Universe
- For , sets .
Let denote the cost of the optimal solution to , and for , let denote the test at vertex .
We make the following observations. First, by definition of , for all , the set does not contain hypothesis : vertex satisfies so the answer of a test that accounts for the largest weight of hypotheses in always contains the hypothesis , as any other answer, by the definition of the greedy algorithm, has weight at most .
Second, the sets form a p-weighted greedy solution for this SET-COVER instance , where . For , the first sets in the above sequence cover all of except the elements of . In the greedy decision tree, the chosen test index maximizes . Note that, for any , the set consists of the hypotheses for the answers of that exclude hypothesis . However, the set also contains exactly the hypotheses for the answers of that exclude hypothesis . Hence , so maximizes . Thus, sets form a p-weighted greedy solution for this SET-COVER instance . Hence, we may apply Theorem A.19. Since and , we have, by Theorem A.19,
[TABLE]
It remains to prove . Let be the root-to-leaf path for hypothesis in the optimal tree . For , set to be the test chosen at vertex in the optimal tree. By the first point, the set contains the hypotheses for the answers of test that exclude hypothesis . As the optimal tree is a decision tree, any hypothesis is distinguished from by one of the tests , so cover all of , and thus covers . Furthermore, , so there is a solution to of size . Hence, , as desired. ∎
As a corollary, we have
Lemma A.21.
[TABLE]
Proof.
We have
[TABLE]
The first equality uses Lemma A.9, which tells us that every vertex between and is -heavy, and no such vertex is -heavy for . Furthermore, these are the only -heavy vertices as is the maximal -heavy vertex. The inequality is by Lemma A.20. ∎
A.8 Finishing the proof
Proof of Theorem 3.1.
We have
[TABLE]
as desired. In the last inequality, we used that (i) , (ii) , (iii) , and . ∎
Appendix B Proof of Theorem A.19
We closely follow the argument of Chvatal [Chv79]. Suppose the p-weighted greedy algorithm uses sets. By re-indexing the sets, we may assume without loss of generality that the p-weighted greedy algorithm chooses sets in that order. For and , let denote the elements of set not covered by the first chosen sets. For and , let denote the sum of the weights of the elements of . For , let , where is the index at which element is first covered. Equivalently, is the unique index such that . In this way, we have
[TABLE]
and, for all ,
[TABLE]
where is the largest index such that . Hence using that is a non-increasing sequence, we have
[TABLE]
Let denote the indices of the optimal cover for . Applying (82) and summing (83) for , we have
[TABLE]
as desired.
Appendix C Tightness of Theorem 3.1
In this section, we prove Propositions 3.3 and 3.4, which show two ways that Theorem 3.1 is tight.
C.1 Proof of Proposition 3.3
Proof.
We prove Proposition 3.3 with the stronger guarantee that when is an integer. Then, taking gives the desired result.
When is sufficiently large, for , the statement is trivial, as any instance for which , satisfies the requirements. Number the hypotheses . Let be such that . Place the hypotheses in a grid with columns and rows, numbered , so that each grid square contains at most 1 hypothesis. Recursively identify a family of good sets of rows as follows: is good, and for every good set containing rows, create a partition such that and , and identify and as good. Define four types of tests:
1. For each of , a test that outputs 1 if the hypothesis is and 2 otherwise.
2. For each column , a test that outputs 1 if the hypothesis is in column and 2 otherwise.
3. For each column and , a test that outputs 1 if the hypothesis is in column and the th digit of the row number's binary expansion is a one, and 2 otherwise.
4. For each good set , a test that outputs 1 if the row of is in and 2 otherwise.
Let be the unknown hypothesis. There is a strategy that first checks whether is one of , for a total of at most queries. If not, the strategy identifies the column containing in at most queries using tests of type 2 and then identifies the corresponding row using tests of type 3, which takes queries. We thus need at most queries for each hypothesis, so here
[TABLE]
The greedy strategy uses tests of type 4, trying to first find the row containing . This is because, for tests 1, 2, and 3, one answer accounts for at least fraction of the remaining hypotheses (it accounts for at least fraction of the columns in the grid), and if the candidate set of rows containing is a good set , under the membership test for the good set , all answers account for less than fraction of the remaining hypotheses.
When is chosen uniformly at random from , the row containing is a uniformly random row. While there are at least candidate rows containing , each test gives at most bits of information about the row containing in expectation (over the randomness of ). Since the row containing has at least bits of information, and the row has at most bits of information when there are at most candidate rows remaining, we have, by an analysis similar to Lemma 6.8, the greedy algorithm takes at least queries to identify the row containing on average. Hence,
[TABLE]
as desired. In the last inequality, we used that for . ∎
C.2 Proof of Proposition 3.4
In this appendix, we use that it is NP-hard to approximate Set Cover to within a factor of [Mos12].
Theorem C.1.
Let . Then, for sufficiently large, approximating to a factor of is NP-hard.
Proof.
We design a reduction from Set Cover to . Suppose we are given a Set Cover instance with elements, sets where , and an optimal cover of size . In polynomial time, we construct a instance on hypotheses such that, if is the optimal decision tree cost, then, for some ,
[TABLE]
The theorem follows as a approximation to Set Cover is NP-hard, and here .
Let . Let . Identify the hypotheses by elements of . In this way, there are hypotheses. Let the elements of have weight , and let for all other hypotheses . In this way, for sufficiently large, we have . Create tests of the following forms:
1. For each , define a test that outputs on hypotheses if and only if .
2. For each , and , define a binary test that outputs 1 if and only if , and the th bit of 's binary representation is 1.
3. For each , define a binary test on that outputs 1 if and only if the th bit of 's binary representation is .
Consider a cover of using of the sets. We can define a tree that, given a hypothesis , first determines using tests of type 3. Then, in each subtree, the set of consistent hypotheses is exactly . In each subtree, one can isolate the hypothesis in queries, using tests of type 1, and use tests of type 2 to identify the remaining hypotheses in tests each. In each subtree, all of the hypotheses of the form have weight at most and depth at most , and all other hypotheses have weight and depth at most , so their total contribution to the cost of the tree is at most . Assuming is sufficiently large,
[TABLE]
Now suppose we are given a solution to the with cost . In the optimal tree, for all hypotheses , at least tests of type-1 or type-2 must appear on the root-to-leaf path of : if not, there exists , such that at most tests of type-1 with parameters or type-2 with parameters were used. By taking the indices used in these tests, there are at most sets covering , which is a contradiction. Thus, each hypothesis has depth at least . Since the hypotheses account for at least half of the weight of the hypotheses, we have
[TABLE]
This completes the proof. ∎
Appendix D Rounding weights
Proposition D.1.
Suppose a Decision Tree instance has weights and a cost function . Then, there exist weights such that and the cost function of the associated instance satisfies for all decision trees .
Proof.
Let and define . We have and . Let , so that for all . Hence, for all decision trees ,
[TABLE]
References
- [ABS15] Sanjeev Arora, Boaz Barak, and David Steurer. Subexponential algorithms for unique games and related problems. J. ACM, 62(5):42:1–42:25, 2015.
- [AH12] Micah Adler and Brent Heeringa. Approximating optimal binary decision trees. Algorithmica, 62(3-4):1112–1121, 2012.
- [AMM+98] Esther M Arkin, Henk Meijer, Joseph SB Mitchell, David Rappaport, and Steven S Skiena. Decision trees for geometric models. International Journal of Computational Geometry & Applications, 8(03):343–363, 1998.
- [Bab16] László Babai. Graph isomorphism in quasipolynomial time [extended abstract]. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 684–697, 2016.
- [Chv79] Vasek Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
- [CJLM10] Ferdinando Cicalese, Tobias Jacobs, Eduardo Laber, and Marco Molinaro. On greedy algorithms for decision trees. In International Symposium on Algorithms and Computation, pages 206–217. Springer, 2010.
- [CPR+11] Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, Pranjal Awasthi, and Mukesh K. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans. Algorithms, 7(2):15:1–15:22, 2011.
- [CPRS09] Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, and Yogish Sabharwal. Approximating decision trees with multiway branches. In Automata, Languages and Programming, 36th International Colloquium, ICALP 2009, Rhodes, Greece, July 5-12, 2009, Proceedings, Part I, pages 210–221, 2009.
