A Tight Analysis of Greedy Yields Subexponential Time Approximation for Uniform Decision Tree

Ray Li; Percy Liang; Stephen Mussmann

arXiv:1906.11385·cs.DS·October 23, 2019

TL;DR

This paper provides a tight analysis of the greedy algorithm for Uniform Decision Tree, showing its approximation ratio depends on the optimal cost, and introduces subexponential algorithms with implications for complexity theory.

Contribution

It establishes a precise approximation bound for greedy algorithms on Uniform Decision Tree and introduces subexponential algorithms, resolving a conjecture and connecting to Min Sum Set Cover.

Findings

01

The greedy algorithm achieves an $O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ approximation for Uniform Decision Tree.

02

Subexponential time algorithms with approximation ratio $\frac{9.01}{\alpha}$ for all $\alpha\in(0,1)$.

03

Assuming the Exponential Time Hypothesis (ETH), achieving any super-constant approximation ratio for Uniform Decision Tree is not NP-hard.

Equations

$$j_{v}\in\operatorname*{arg\,min}_{j}\left(\max_{k\in[K]}p\left(L(v)\cap\tau_{j}^{-1}(k)\right)\right),$$

$$C_{G}\leq\left(\frac{12\cdot\log\left(\frac{1}{p_{\text{min}}}\right)}{\log C_{\textnormal{OPT}}}+\ln\left(\frac{p_{\text{max}}}{p_{\text{min}}}\right)\right)\cdot C_{\textnormal{OPT}}.$$

$$C_{G}\geq\frac{C^{*}\log n}{16\log C^{*}},\quad\text{and}\quad C_{\textnormal{OPT}}\leq 8C^{*}.$$

$$C_{G}=\sum_{v\in\mathcal{T}_{G}^{\circ}}p(v),$$

$$\textnormal{MSSC}(\sigma)\stackrel{{\scriptstyle\rm def}}{{=}}\frac{1}{n}\sum_{h\in S}\min\{\ell:h\in A_{\sigma(\ell)}\}.$$

$$\sum_{v\in P}p(v)\leq\textnormal{MSSC}^{(P)}(\sigma_{G})\leq 4\,\textnormal{MSSC}^{(P)}(\sigma_{\textnormal{OPT}})$$

$$\sum_{P\in\mathcal{P}_{s}}\textnormal{MSSC}^{(P)}(\sigma_{\textnormal{OPT}})\leq C_{\textnormal{OPT}}.$$

$$\sum_{v\text{ imbalanced}}p(v)\leq\frac{4\log n}{\log\frac{1}{\delta}}\cdot C_{\textnormal{OPT}}.$$

$$C_{G}=\sum_{v\in\mathcal{T}_{G}^{\circ}}p(v)=\sum_{v\text{ balanced}}p(v)+\sum_{v\text{ imbalanced}}p(v)\leq\frac{\log n}{\log\frac{2}{\delta}}\cdot\frac{2}{\delta}+\frac{4\log n}{\log\frac{1}{\delta}}\cdot C_{\textnormal{OPT}}.$$

$${}\leq\sum_{s=1}^{s_{\text{max}}}\sum_{P\in\mathcal{P}_{s}}4\,\textnormal{WMSSC}^{(P)}(\sigma_{\textnormal{OPT}}^{(P)})\leq\sum_{s=1}^{s_{\text{max}}}C_{\textnormal{OPT}}=s_{\text{max}}C_{\textnormal{OPT}},$$

$$\sum_{v\text{ imbalanced}}p(v)\leq s_{\text{max}}C_{\textnormal{OPT}}+\sum_{v\in\mathcal{T}_{G}}q(v)\leq s_{\text{max}}C_{\textnormal{OPT}}+\left(1+\ln\frac{p_{\text{max}}}{p_{\text{min}}}\right)C_{\textnormal{OPT}}.$$

$$C(\mathcal{T};H)\stackrel{{\scriptstyle\rm def}}{{=}}\frac{1}{p(H)}\sum_{h\in H}p_{h}\,d_{\mathcal{T}}(h),$$

$$I_{j,S}^{+}\stackrel{{\scriptstyle\rm def}}{{=}}\tau_{j}^{-1}(k_{j,S}^{+}),\quad I_{j,S}^{-}\stackrel{{\scriptstyle\rm def}}{{=}}\tau_{j}^{-1}(k_{j,S}^{-}).$$

$$p(v^{-})\geq p(I_{j_{u},L(v)}^{-}\cap L(v))\geq p(I_{j_{u},L(v)}^{-}\cap L(u))\geq p(u^{-}).$$

$$s_{\text{max}}\stackrel{{\scriptstyle\rm def}}{{=}}\frac{\log n}{\log\frac{1}{\delta}}$$

$$p(v)>\frac{2}{\delta}p(v^{-})=\frac{2}{\delta}\delta^{s^{\prime}}n=2\delta^{s^{\prime}-1}n>2\delta^{\lfloor s^{\prime}\rfloor}n.$$

$$\sum_{v\text{ balanced}}p(v)\leq\frac{\log n}{\log\frac{2}{\delta}}\cdot\frac{2}{\delta}.$$

$$\mathbb{H}(X_{v})=p(v^{+})\log\frac{1}{p(v^{+})}+p(v^{-})\log\frac{1}{p(v^{-})}\geq p(v^{-})\log\frac{1}{p(v^{-})}\geq\frac{\delta}{2}\log\frac{2}{\delta}.$$

$$\textnormal{MSSC}^{(P)}(\sigma)\stackrel{{\scriptstyle\rm def}}{{=}}\frac{1}{n}\sum_{h\in S}\min\{\ell:h\in A_{\sigma(\ell)}\}=\sum_{\ell=1}^{m+n}p\left(S\setminus(A_{\sigma(1)}\cup\dots\cup A_{\sigma(\ell-1)})\right).$$

$$\textnormal{MSSC}^{(P)}(\sigma_{G})\leq 4\cdot\textnormal{MSSC}^{(P)}(\sigma_{\textnormal{OPT}}).$$

$$\sum_{v\in P}p(v)\leq\textnormal{MSSC}^{(P)}(\sigma).$$

$$p(A_{j_{\ell}}\cap L(P_{\ell}))=\max_{j\in[m+n]}p(A_{j}\cap L(P_{\ell})).$$

Taxonomy

Topics: Machine Learning and Algorithms · Complexity and Algorithms in Graphs · Formal Methods in Verification

Full text

Ray Li, Percy Liang, Stephen Mussmann

Ray Li: Department of Computer Science, Stanford University. Research supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1656518. Email: [email protected]

Percy Liang: Department of Computer Science, Stanford University. Email: [email protected]

Stephen Mussmann: Department of Computer Science, Stanford University. Research supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1656518. Email: [email protected]

Abstract

Decision Tree is a classic formulation of active learning: given $n$ hypotheses with nonnegative weights summing to 1 and a set of tests that each partition the hypotheses, output a decision tree using the provided tests that uniquely identifies each hypothesis and has minimum (weighted) average depth. Previous works showed that the greedy algorithm achieves an $O(\log n)$ approximation ratio for this problem and that it is NP-hard to beat an $O(\log n)$ approximation, settling the complexity of the problem.

However, for Uniform Decision Tree, i.e. Decision Tree with uniform weights, the story is more subtle. The greedy algorithm’s $O(\log n)$ approximation ratio was the best known, but the largest approximation ratio known to be NP-hard is $4-\varepsilon$. We prove that the greedy algorithm gives an $O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ approximation for Uniform Decision Tree, where $C_{\textnormal{OPT}}$ is the cost of the optimal tree, and show this is best possible for the greedy algorithm. As a corollary, we resolve a conjecture of Kosaraju, Przytycka, and Borgstrom [KPB99]. Our results also hold for instances of Decision Tree whose weights are not too far from uniform. Leveraging this result, for all $\alpha\in(0,1)$, we exhibit a $\frac{9.01}{\alpha}$ approximation algorithm for Uniform Decision Tree running in subexponential time $2^{\tilde{O}(n^{\alpha})}$. As a corollary, achieving any super-constant approximation ratio on Uniform Decision Tree is not NP-hard, assuming the Exponential Time Hypothesis. This work therefore adds approximating Uniform Decision Tree to a small list of natural problems that have subexponential time algorithms but no known polynomial time algorithms. Like the analysis of the greedy algorithm, our analysis of the subexponential time algorithm gives similar approximation guarantees even for slightly nonuniform weights. A key technical contribution of our work is showing a connection between greedy algorithms for Uniform Decision Tree and for Min Sum Set Cover.

1 Introduction

In Decision Tree (also known as Split Tree), one is given $n$ hypotheses with nonnegative weights $p_{1},\dots,p_{n}$ summing to 1 and a set of $m$ $K$-ary tests that each partition the hypotheses, and must output a decision tree using the provided tests that uniquely identifies each hypothesis and has minimum (weighted) average depth. (We require that such a decision tree always exists in a valid Decision Tree instance.) Decision Tree is a classic problem that arises naturally in active learning [Das04, Now11, GB09] and hypothesis identification [Mor82]. Active learning with a well-specified and finite hypothesis class with noiseless tests is precisely Decision Tree where the tests are data points and the answers are their labels. Decision Tree was first proved to be NP-hard by Hyafil and Rivest [HR76]. Since then, a large number of works have provided algorithms for this question [GG74, Lov85, KPB99, Das04, CPR+11, CPRS09, GB09, GNR10, CJLM10, AH12].

A natural algorithm for Decision Tree is the greedy algorithm, which creates a decision tree by iteratively choosing the test that most evenly splits the set of remaining hypotheses. For binary tests ($K=2$), there is a natural notion of “most even split,” but for $K>2$, there are multiple possible definitions (see discussion in Section 2). It is well known that the greedy algorithm achieves an $O(\log n)$ approximation ratio for Decision Tree assuming all weights are at least $\frac{1}{\operatorname*{poly}(n)}$. It was first shown for binary tests and uniform weights ($p_{h}=1/n$ for all $h$) [KPB99, AH12], then $K$-ary tests [CPRS09], and finally, general non-uniform weights [GB09]. Furthermore, it is NP-hard to achieve an $o(\log n)$ approximation ratio for Decision Tree [CPR+11], settling the complexity of approximating Decision Tree.

However, there are still gaps in our knowledge. For Uniform Decision Tree, i.e. Decision Tree with uniform weights, the $O(\log n)$ approximation given by the greedy algorithm was previously the best known approximation achievable in polynomial time. Chakaravarthy et al. [CPR+11] proved that it is NP-hard to give a $(4-\varepsilon)$ approximation, giving the best known hardness of approximation result, and they asked whether the gap between the best approximation and hardness results could be improved. Previously, it was not even known whether the greedy algorithm could beat the $O(\log n)$ approximation ratio in previous analyses: the best lower bound on the greedy algorithm’s approximation ratio is $\Omega(\frac{\log n}{\log\log n})$ [KPB99, Das04]. In the setting where the optimal solution to Uniform Decision Tree has cost $O(\log n)$, Kosaraju et al. [KPB99] showed that the greedy algorithm indeed gives an $O(\frac{\log n}{\log\log n})$ approximation, and they conjectured that the greedy algorithm gives an $O(\frac{\log n}{\log\log n})$ approximation in general.

For an extended discussion of related works, see Section 8.

1.1 Our contributions

We summarize the main contributions of our work below. The approximation guarantees of our algorithms are captured in Figure 1.

• Greedy algorithm. We give a new analysis of the greedy algorithm, showing that it gives an $O(\frac{\log n}{\log C_{\textnormal{OPT}}}+\log\frac{p_{\max}}{p_{\min}})$ approximation for Decision Tree, where $C_{\textnormal{OPT}}$ is the cost of the optimal tree, $p_{\max}=\max_{h}p_{h}$, and $p_{\min}=\min_{h}p_{h}$. This implies an $O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ approximation for instances of Uniform Decision Tree and of Decision Tree whose weights are close to uniform. As $C_{\textnormal{OPT}}\geq\log_{K}n$ always, this proves the conjecture of Kosaraju et al. [KPB99].

• Subexponential time algorithm. Leveraging the above greedy analysis, for $\alpha<1$, we give a subexponential $2^{\tilde{O}(n^{\alpha})}$-time $\frac{9.01}{\alpha}$ approximation algorithm for Uniform Decision Tree. (Throughout this work, subexponential means $2^{n^{\alpha}}$ for some absolute $\alpha\in(0,1)$; we make a distinction when referring to $2^{n^{o(1)}}$ runtimes.) Assuming the Exponential Time Hypothesis (ETH) [IP01, IPZ01], which states that there are no $2^{n^{o(1)}}$ time algorithms for 3SAT, this algorithm implies that any superconstant approximation of Uniform Decision Tree is not NP-hard. Our work adds approximating Uniform Decision Tree to a small list of natural problems whose time complexity is known to be subexponential (and, for some approximation ratios, $2^{n^{o(1)}}$) but not known to be polynomial. Examples of such problems include Factoring [LLMP93], Unique Games [Kho02, ABS15], Graph Isomorphism [Bab16], and approximating Nash Equilibrium [LMM03, Rub18], with the latter two having $2^{n^{o(1)}}$-time algorithms. As in our analysis of the greedy algorithm, our subexponential time algorithm gives a similar approximation guarantee even for slightly nonuniform weights, in particular when $\frac{p_{\max}}{p_{\min}}\leq 2^{O(1/\alpha)}$.

• Approximation ratio tightness. We prove that the $O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ approximation ratio for the greedy algorithm is tight for Uniform Decision Tree. We also prove that the $O(\log\frac{p_{\max}}{p_{\min}})$ term in the approximation ratio for the greedy algorithm is necessary, in the sense that no algorithm can give an $o(\log\frac{p_{\max}}{p_{\min}})$ approximation for Decision Tree when $\frac{p_{\max}}{p_{\min}}=n^{r}$ for some $r\in(0,1)$ unless P=NP.

• Repeatable, noisy tests. Kääriäinen [Kää06] provides a method to convert a solution for Decision Tree into a solution for a variant of Decision Tree that handles noisy, repeatable tests. An immediate corollary of our result for the greedy algorithm is that the cost of a solution for the noisy problem derived from the greedy algorithm is at most $C_{\textnormal{OPT}}\cdot O(\log n\log\log n)$. Previously, this cost was bounded by $C_{\textnormal{OPT}}\cdot O(\log^{2}n\log\log n)$.

1.2 Techniques

Our work gives a new analysis of the greedy algorithm for Decision Tree. A key technical contribution of this work is to leverage upper bounds of Min Sum Set Cover and Set Cover for (Uniform) Decision Tree. Previously, only connections in the reverse direction (i.e. lower bounds) were known between these problems: NP-hardness of attaining a $(4-\varepsilon)$-approximation for Uniform Decision Tree was proved by reduction from Min Sum Set Cover, and NP-hardness of attaining an $o(\log n)$ approximation for Decision Tree was proved by reduction from Set Cover [CPR+11].

At a high level, our analysis goes as follows. By a simple double counting argument, we can compute the cost of a tree by summing the “weights” of the tree’s interior vertices, rather than summing the depths of the hypotheses. However, rather than accounting for all the interior vertices at once, we separately analyze the vertices with “imbalanced” splits and those with “balanced” splits. Carefully choosing the definition of balanced and imbalanced is a key idea of the proof: previous analyses of the greedy algorithm [KPB99, CPR+11, GB09, AH12] either make no distinction between interior vertices or use a different distinction. A global entropy argument accounts for the vertices with balanced splits. For the vertices with imbalanced splits, we use the fact that the greedy algorithm gives a constant factor approximation for Min Sum Set Cover [FLT04]. For Uniform Decision Tree, putting the two bounds together gives the desired approximation result. For the general Decision Tree problem, we additionally prove and use a generalization of a result on the greedy algorithm’s performance for Set Cover [Lov75, Joh74, Chv79, Ste74].

For the subexponential time algorithm, we leverage our new result that the greedy algorithm gives an $O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ approximation. We first run the greedy algorithm. If the greedy algorithm returns a tree with cost at least $n^{3\alpha/4}$ , we return the greedy tree knowing we have an $O(1/\alpha)$ approximation. Otherwise, we find by brute force the “optimal tree up to depth $n^{\alpha}$ ” in time $2^{\tilde{O}(n^{\alpha})}$ , then recurse.

1.3 Organization of paper

In Section 2, we formally introduce notation used throughout the paper. In Section 3, we state our results. In Section 4, we sketch a proof of Theorem 3.1, that the greedy algorithm gives an $O(\frac{\log n}{\log C_{\textnormal{OPT}}}+\log\frac{p_{\max}}{p_{\min}})$ approximation on Decision Tree. Since the proof of Theorem 3.1 is involved, we prove the special case of Theorem 3.1 for Uniform Decision Tree with binary tests in Section 6, and give the full proof in Appendix A. In Section 5, we state the subexponential time approximation algorithm and give a sketch of the analysis, and we give a formal analysis in Section 7. In Section 8, we describe some related work. In Section 9, we conclude with some open problems.

We leave some details to the appendices. A lemma on the greedy algorithm’s performance in a generalization of Set Cover that is used in the proof of Theorem 3.1 is proved in Appendix B. In Appendix C, we prove Propositions 3.3 and 3.4, which show two ways that Theorem 3.1 is tight. In Appendix D, we demonstrate a rounding trick that allows us to assume $p_{\min}\geq\frac{1}{\operatorname*{poly}(n)}$ without changing the difficulty of approximating Decision Tree.

2 Preliminaries

For a positive integer $a$ , let $[a]=\{1,\dots,a\}$ . All logs are base 2 when the base is not specified. The Decision Tree problem is as follows: given a set of hypotheses $[n]$ with probabilities $p_{1},\dots,p_{n}$ summing to 1, and $m$ distinct $K$ -ary tests, output a decision tree $\mathcal{T}$ with hypotheses as leaves, such that the weighted average of the depth of the leaves is minimal. Formally, a $K$ -ary test is a map $\tau:[n]\to[K]$ . We refer to $K$ as the branching factor of the test $\tau$ , and the elements of $[K]$ as the possible answers to the tests. We think of a test $\tau$ as defining a $K$ -way partition of $[n]$ . A decision tree $\mathcal{T}$ is a rooted tree such that each interior vertex $v$ has the index $j_{v}\in[m]$ of some test, and the edge to the $i$ -th child of $v$ is labeled with $i\in[K]$ . We say that a hypothesis $h\in[n]$ is consistent with a vertex $v$ if, in the root-to- $v$ path, the edge following any vertex $u$ has label $\tau_{j_{u}}(h)$ . We let $L(v)$ denote the set of hypotheses $h$ that are consistent with $v$ . We say a decision tree $\mathcal{T}$ is complete if, for all $h\in[n]$ , there exists a (unique) leaf $v\in\mathcal{T}$ such that $L(v)=\{h\}$ , and for a complete decision tree $\mathcal{T}$ , let $d_{\mathcal{T}}(h)$ denote the depth of this vertex $v$ . The cost of a complete decision tree $\mathcal{T}$ is defined to be the average depth of the leaves, weighted by $p_{h}$ , i.e.

$$C(\mathcal{T})\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{h\in[n]}p_{h}\,d_{\mathcal{T}}(h).$$

We set $\mathcal{T}_{\textnormal{OPT}}$ to be a complete decision tree that minimizes $C(\mathcal{T}_{\textnormal{OPT}})$ (in general, there may be more than one optimal decision tree), and abbreviate $C_{\textnormal{OPT}}\stackrel{{\scriptstyle\rm def}}{{=}}C(\mathcal{T}_{\textnormal{OPT}})$ .
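As a minimal sketch of the cost definition above, the following computes the weighted average leaf depth of a small tree. The nested-tuple tree encoding, the 0-indexed hypotheses and answers, and the names `leaf_depths` and `cost` are assumptions made for illustration; they are not notation from the paper. The sketch does not check that the tree is consistent with any particular set of tests.

```python
from fractions import Fraction

def leaf_depths(tree, depth=0):
    """Return {hypothesis: depth} for a tree given as nested tuples.

    Assumed encoding: a leaf is an int h; an interior vertex is a pair
    (j, children), where j is a test index and children maps each answer
    to a subtree.
    """
    if isinstance(tree, int):
        return {tree: depth}
    _, children = tree
    depths = {}
    for child in children.values():
        depths.update(leaf_depths(child, depth + 1))
    return depths

def cost(tree, p):
    """Weighted average leaf depth: the sum over h of p[h] * depth(h)."""
    return sum(p[h] * d for h, d in leaf_depths(tree).items())

# Four hypotheses with uniform weights. The root test separates {0, 1} from
# {2, 3}; a second test then separates the two hypotheses on each side.
p = [Fraction(1, 4)] * 4
balanced = (0, {0: (1, {0: 0, 1: 1}), 1: (1, {0: 2, 1: 3})})
print(cost(balanced, p))
```

Exact fractions are used so that costs of different trees can be compared without floating-point error.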

This paper is concerned with the greedy algorithm for Decision Tree. We call a decision tree greedy if the test $\tau_{j_{v}}$ of each interior vertex $v$ minimizes the (weighted) number of hypotheses of the largest partition in $\tau_{j_{v}}$ ’s partitioning of $L(v)$ . Formally, a decision tree $\mathcal{T}$ is greedy if, for all interior vertices $v\in\mathcal{T}$ , we have

$$j_{v}\in\operatorname*{arg\,min}_{j}\left(\max_{k\in[K]}p\left(L(v)\cap\tau_{j}^{-1}(k)\right)\right),$$

where $p(S)=\sum_{h\in S}p_{h}$ for $S\subseteq[n]$ . Given a Uniform Decision Tree instance, we let $\mathcal{T}_{G}$ be a complete, greedy decision tree, choosing one arbitrarily if there is more than one. For brevity, we write $C_{G}\stackrel{{\scriptstyle\rm def}}{{=}}C(\mathcal{T}_{G})$ .

We remark that, when $K>2$, our notion of a “greedy” algorithm for Decision Tree is not the only one. As mentioned in the previous paragraph, our definition of greedy chooses, at each vertex in the decision tree, the test that minimizes the (weighted) number of candidate hypotheses, assuming a worst-case answer to the test. Our definition corresponds to the definition by [CPRS09], but other choices include maximizing the (weighted) number of pairs of hypotheses that are distinguished [CPR+11, GB09] and maximizing the mutual information between the test and the remaining hypotheses [ZRB05]. For binary tests, $K=2$, these definitions are all equivalent.
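The greedy rule used in this paper (pick the test whose heaviest answer class on the remaining hypotheses is as light as possible) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the list-of-lists test encoding, the name `greedy_tree`, skipping tests that do not split the remaining hypotheses, and breaking ties by lowest test index are all assumptions.

```python
from fractions import Fraction

def greedy_tree(hyps, tests, p):
    """Build a greedy decision tree over the hypotheses in `hyps`.

    tests[j][h] is the answer of test j on hypothesis h. Returns an int
    leaf, or a pair (j, children) with children keyed by answer.
    """
    if len(hyps) == 1:
        return hyps[0]
    best_j, best_parts, best_score = None, None, None
    for j, tau in enumerate(tests):
        parts = {}
        for h in hyps:
            parts.setdefault(tau[h], []).append(h)
        if len(parts) < 2:
            continue  # this test does not split the remaining hypotheses
        # Weight of the heaviest answer class: the worst-case remaining weight.
        score = max(sum(p[h] for h in part) for part in parts.values())
        if best_score is None or score < best_score:
            best_j, best_parts, best_score = j, parts, score
    if best_j is None:
        raise ValueError("remaining hypotheses cannot be distinguished")
    children = {a: greedy_tree(part, tests, p) for a, part in best_parts.items()}
    return (best_j, children)

tests = [
    [0, 0, 1, 1],  # test 0: {0, 1} versus {2, 3}
    [0, 1, 0, 1],  # test 1: {0, 2} versus {1, 3}
    [1, 0, 0, 0],  # test 2: {0} versus {1, 2, 3}
]
p = [Fraction(1, 4)] * 4
tree = greedy_tree([0, 1, 2, 3], tests, p)
print(tree)
```

On this instance the root uses test 0, since tests 0 and 1 both leave weight 1/2 in the worst case while test 2 leaves 3/4, and the assumed tie-break prefers the lower index.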

Define $\textnormal{DT}(R)$ as Decision Tree with the guarantee that $\frac{p_{\text{max}}}{p_{\text{min}}}\leq R$ . In this notation, $\textnormal{DT}(1)$ is Uniform Decision Tree.

3 Our results

3.1 Greedy algorithm

The main driver of this paper is Theorem 3.1, which relates the cost of the greedy algorithm to the optimal cost for Decision Tree.

Theorem 3.1.

For any instance of Decision Tree on $n$ hypotheses, we have

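$$C_{G}\leq\left(\frac{12\cdot\log\left(\frac{1}{p_{\text{min}}}\right)}{\log C_{\textnormal{OPT}}}+\ln\left(\frac{p_{\text{max}}}{p_{\text{min}}}\right)\right)\cdot C_{\textnormal{OPT}}.\tag{3}$$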

Our theorem holds for any branching factor $K$ , and when $C_{G}$ is the cost of an arbitrary tree produced by the greedy algorithm above. As $C_{\textnormal{OPT}}\geq\log_{K}n$ always, our result implies that the greedy algorithm always gives an $O(\frac{\log n}{\log\log n})$ approximation for Uniform Decision Tree when the branching factor is a constant, resolving the conjecture of [KPB99]. Additionally, if $C_{\textnormal{OPT}}$ is $\Omega(n^{\alpha})$ for constant $\alpha$ and the weights are uniform, then the greedy algorithm obtains a constant $O(1/\alpha)$ approximation. We use this fact crucially in designing our subexponential time approximation algorithms.
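To make the dependence on $C_{\textnormal{OPT}}$ concrete, the sketch below evaluates the approximation factor in Theorem 3.1, namely $12\log(1/p_{\text{min}})/\log C_{\textnormal{OPT}}+\ln(p_{\text{max}}/p_{\text{min}})$, for uniform weights. The function name `greedy_ratio_bound` and the chosen values of $n$ and $C_{\textnormal{OPT}}$ are assumptions for illustration only.

```python
import math

def greedy_ratio_bound(p_min, p_max, c_opt):
    """Factor multiplying C_OPT in Theorem 3.1 (logs base 2, ln natural)."""
    if c_opt <= 1:
        raise ValueError("the bound needs C_OPT > 1 so that log C_OPT > 0")
    return 12 * math.log2(1 / p_min) / math.log2(c_opt) + math.log(p_max / p_min)

n = 2 ** 20
uniform = 1 / n

# C_OPT = n^(1/2): the factor equals 12 / alpha with alpha = 1/2, a constant.
large_opt = greedy_ratio_bound(uniform, uniform, n ** 0.5)

# C_OPT = log2(n): the smallest value possible with binary tests, where the
# factor grows like log n / log log n.
small_opt = greedy_ratio_bound(uniform, uniform, math.log2(n))

print(large_opt, small_opt)
```

The two calls illustrate the two regimes discussed above: a polynomially large optimum gives a constant factor, while a logarithmic optimum gives the $\frac{\log n}{\log\log n}$ behaviour.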

For the simpler case when $K=2$ and the weights $p_{h}$ are uniform, we give a sketch of the proof in Section 4 and a full proof in Section 6. The proof of the full result is also sketched in Section 4 and given in full in Appendix A. For Uniform Decision Tree, the constant 12 can be improved to 6, and, when $n$ is sufficiently large, to $4+\varepsilon$, so that greedy gives a $\frac{(4+\varepsilon)\log n}{\log C_{\textnormal{OPT}}}$ approximation (see Section 4).

Note that the terms $\log(\frac{1}{p_{\text{min}}})$ and $\ln(\frac{p_{\text{max}}}{p_{\text{min}}})$ in the approximation ratio can be arbitrarily large. However, a rounding trick before running the greedy algorithm [GB09] allows us to assume that all the weights are at least $\frac{1}{n^{2}}$, and hence $\log(\frac{1}{p_{\text{min}}})\leq 2\log n$ and $\ln(\frac{p_{\text{max}}}{p_{\text{min}}})\leq 2\ln n$ in (3). The details are given in Appendix D.

3.2 Subexponential time algorithm

Using Theorem 3.1, we give a subexponential time algorithm that achieves a constant factor approximation for the Decision Tree problem when the weights are close to uniform.

Theorem 3.2.

For any $\alpha\in(0,1)$ and $R\geq 1$ , there exists an $(\frac{25}{\alpha}+\log R)$ approximation algorithm for $\textnormal{DT}(R)$ with runtime $2^{O(n^{\alpha}\log(Rn)\log m)}$ . For Uniform Decision Tree, for any $\varepsilon>0$ , we can achieve a $\frac{9+\varepsilon}{\alpha}$ approximation in the same runtime.

In Section 5, the subexponential time algorithm is stated and an analysis is sketched. The analysis is given formally in Section 7. Importantly, this result implies that achieving a super-constant approximation ratio is not NP-hard, given the Exponential Time Hypothesis. As an informal proof, suppose for contradiction that there were a polynomial-time reduction from 3-SAT to achieving an $f(n)$ approximation ratio for Uniform Decision Tree for some $f(n)\to\infty$ as $n\to\infty$. By Theorem 3.2, there exists a $2^{n^{o(1)}}$-time algorithm to achieve an $f(n)$ approximation for Uniform Decision Tree, and thus a $2^{n^{o(1)}}$-time algorithm to solve 3-SAT, contradicting the Exponential Time Hypothesis. This adds approximating Uniform Decision Tree to a list of interesting natural problems that have subexponential or $2^{n^{o(1)}}$ time algorithms but are not known to be in P. Figure 1 illustrates the contrast between Decision Tree and Uniform Decision Tree.

3.3 Approximation ratio tightness

We also show that the $O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ approximation ratio is tight up to a constant factor for the greedy algorithm by generalizing the example given by [Das04]. The proof is given in Appendix C.1.

Proposition 3.3.

There exists an $n_{0}$ such that for all $n\geq n_{0}$ and any $C^{*}\in[\log n,n]$ , there exists an instance of Uniform Decision Tree with branching factor 2 for which

$$C_{G}\geq\frac{C^{*}\log n}{16\log C^{*}},\quad\text{and}\quad C_{\textnormal{OPT}}\leq 8C^{*}.$$

We also show that, when the weights are non-uniform, the $\ln\left(\frac{p_{\text{max}}}{p_{\text{min}}}\right)$ term in the approximation ratio of Theorem 3.1 is computationally necessary.

Proposition 3.4.

Let $r\in(0,1)$ . Then, for $n$ sufficiently large, approximating $\textnormal{DT}(2n^{r}\log n)$ to a factor of $\frac{r}{12}\log n$ is NP-hard.

In other words, even if the ratio $p_{\text{max}}/p_{\text{min}}$ is guaranteed to be $O(n^{r})$ for a constant $r\in(0,1)$ , one cannot give a $o(\log n)$ approximation algorithm unless $\text{P}=\text{NP}$ . The proof is given in Appendix C.2.

3.4 Decision tree with noise

Theorem 3.1 implies an improved black-box result for a noisy variant of Decision Tree. Kääriäinen [Kää06] considers a variant of Decision Tree with binary tests where the output of each test may be corrupted by i.i.d. noise. Formally, there exists $\varepsilon>0$ such that querying any test $\tau$ on any hypothesis $h$ , outputs the correct answer $\tau(h)$ with probability $1-\delta_{\tau,h}$ and the wrong answer with probability $\delta_{\tau,h}$ , for some $\delta_{\tau,h}\in[0,1/2-\varepsilon)$ . Tests are repeatable, with each one producing different draws of the noise. Kääriäinen [Kää06] gives an algorithm that turns a decision tree of cost $C$ for the noiseless problem into a decision tree with cost $O(C\log C\log\log C)$ for the noisy problem by repeating queries sufficiently many times.

Combining Kääriäinen’s result with the greedy algorithm for Uniform Decision Tree gives an algorithm for the noisy problem using an average of $O(C_{G}\log C_{G}\log\log C_{G})$ queries. Previously, using the bound $C_{G}\leq\max(C_{\textnormal{OPT}}\cdot O(\log n),n)$ , the noisy problem’s cost was bounded by $C_{\textnormal{OPT}}\cdot O(\log^{2}n\log\log n)$ . However, by Theorem 3.1, we have $C_{G}\leq C_{\textnormal{OPT}}\cdot O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ , so we in fact have cost at most $C_{\textnormal{OPT}}\cdot O(\log n\log\log n)$ , improving the cost ratio to the optimal solution of the noiseless problem by a nearly quadratic factor.

4 Sketch of proof of Theorem 3.1

In this section, we sketch a proof of Theorem 3.1. We first sketch the proof assuming that the branching factor is 2, so that $\mathcal{T}_{G}$ is a binary tree, and that the distribution is uniform ( $p_{h}=1/n$ for all $h\in[n]$ ). Since the proof of Theorem 3.1 is involved, we give the details of this easier result in Section 6. At the end of the section, we give the additional ideas necessary to complete the full proof of Theorem 3.1. The details of the full proof are given in Appendix A.

4.1 Uniform weights and binary tests

Recall that $p(v)\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{h\in L(v)}p_{h}$ and that, as the weights are uniform, $p_{h}=\frac{1}{n}$ for all $h\in[n]$ . By a simple double counting argument (Lemma 6.2), we can compute the cost of the greedy tree by summing the weights of the vertices rather than summing the depths of leaves. That is,

$$C_{G}=\sum_{v\in\mathcal{T}_{G}^{\circ}}p(v),$$

where the sum is over the interior vertices of $\mathcal{T}_{G}$ .
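The double counting identity can be checked on a small example: each hypothesis at depth $d$ lies below exactly $d$ interior vertices, so summing $p(v)$ over interior vertices counts $p_h$ once per unit of depth. The sketch below assumes a nested-tuple tree encoding (an int leaf, or a pair of test index and children); the function names are hypothetical.

```python
from fractions import Fraction

def leaves(tree):
    """List the hypotheses at the leaves below `tree`."""
    if isinstance(tree, int):
        return [tree]
    return [h for child in tree[1].values() for h in leaves(child)]

def cost_by_depth(tree, p, depth=0):
    """Sum over leaves h of p[h] times the depth of h."""
    if isinstance(tree, int):
        return p[tree] * depth
    return sum(cost_by_depth(c, p, depth + 1) for c in tree[1].values())

def cost_by_interior_weight(tree, p):
    """Sum over interior vertices v of p(v), the total weight of leaves below v."""
    if isinstance(tree, int):
        return 0
    weight = sum(p[h] for h in leaves(tree))
    return weight + sum(cost_by_interior_weight(c, p) for c in tree[1].values())

# A path-like tree on four uniform hypotheses: leaf depths are 1, 2, 3, 3.
p = [Fraction(1, 4)] * 4
tree = (0, {1: 0, 0: (1, {1: 1, 0: (2, {1: 2, 0: 3})})})
print(cost_by_depth(tree, p), cost_by_interior_weight(tree, p))
```

Here the interior vertices have weights 1, 3/4, and 1/2, which sum to the same value as the average leaf depth.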

Defining balanced and imbalanced vertices.

We then define balanced and imbalanced vertices with respect to a parameter $\delta\in(0,1)$, which we eventually set to $O(\frac{1}{C_{\textnormal{OPT}}})$. These definitions are crucial to the proof. A vertex is imbalanced if there exists an integer $s$ (called the level) such that $p(v)>2\delta^{s}$ and $p(v^{-})\leq\delta^{s}$. Here, $v^{-}$ is the child of $v$ containing a smaller weight of hypotheses in its subtree. We say $v$ is balanced if it is not imbalanced. (We remark that imbalanced vertices can have $p(v^{-})$ arbitrarily close to $\frac{p(v)}{2}$, so the hypotheses at an imbalanced vertex $v$ are not necessarily split in an imbalanced way. However, as we show (Lemma 6.7), all balanced vertices are in fact split in a balanced way with $p(v^{-})/p(v)\geq\delta/2$, hence the terminology.) Note that imbalanced vertices exist only for $s\leq s_{\text{max}}$, where $s_{\text{max}}\stackrel{{\scriptstyle\rm def}}{{=}}\frac{\log n}{\log\frac{1}{\delta}}$. We prove a structural result (Lemma 6.5) that shows that the level-$s$ imbalanced vertices of $\mathcal{T}_{G}$ can be partitioned into downward paths, which we call chains, such that, for all $s$, each leaf has vertices from at most one level-$s$ chain among its ancestors. The parameter $\delta$ quantifies how many chains we consider: smaller $\delta$ means fewer, longer chains, and larger $\delta$ means more, shorter chains. We optimize the choice of $\delta$ at the end of this proof sketch. In the remainder of the proof, we bound the weight of the balanced and imbalanced vertices separately.
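The imbalanced test can be sketched directly from its definition: search for an integer $s$ with $p(v)>2\delta^{s}$ and $p(v^{-})\leq\delta^{s}$. The function name `imbalance_level` is hypothetical, and returning the smallest qualifying $s$ is an assumption, since this section does not say which level to use when several qualify.

```python
from fractions import Fraction

def imbalance_level(p_v, p_minus, delta):
    """Return the smallest integer s with p_v > 2*delta**s and
    p_minus <= delta**s, or None if the vertex is balanced.

    p_v is the weight below v, p_minus the weight below its lighter child.
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    if not 0 < p_minus <= p_v <= 1:
        raise ValueError("need 0 < p_minus <= p_v <= 1")
    # s >= 1 is forced: p_v <= 1 and p_v > 2*delta**s give delta**s < 1/2.
    s, power = 1, delta
    while power >= p_minus:  # second condition: p_minus <= delta**s
        if p_v > 2 * power:  # first condition: p_v > 2*delta**s
            return s
        s += 1
        power *= delta
    return None

delta = Fraction(1, 4)
# A lopsided split at the root: imbalanced at level 1.
print(imbalance_level(Fraction(1), Fraction(1, 8), delta))
# An even split at the root: balanced.
print(imbalance_level(Fraction(1), Fraction(1, 2), delta))
# Imbalanced at level 2 even though the lighter child holds 4/9 of the weight.
print(imbalance_level(Fraction(9, 64), Fraction(1, 16), delta))
```

The last call illustrates the remark above that an imbalanced vertex may still be split almost evenly.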

Bounding the weight of balanced vertices.

To bound the weight of balanced vertices, we use an entropy argument. We consider the random variable corresponding to a uniformly random hypothesis from $[n]$ . On one hand, this random variable has entropy $\log n$ . On the other hand, we can take a uniformly random hypothesis from $[n]$ by an appropriate random walk down the decision tree. Starting from the root, at each vertex, we step to a child with probability proportional to the number of hypotheses in that child’s subtree. The total entropy of this process is given by $\sum_{v\in\mathcal{T}_{G}}p(v)\mathbb{H}(v)$ , where $\mathbb{H}(v)\stackrel{{\scriptstyle\rm def}}{{=}}\mathbb{H}(\frac{p(v^{-})}{p(v)})$ is the entropy of the random walk’s step at $v$ . A simple argument (Lemma 6.7) shows that, for all balanced vertices $v$ , we have $p(v^{-})/p(v)\geq\delta/2$ and hence $\mathbb{H}(v)\geq\mathbb{H}(\delta/2)\geq\frac{\delta}{2}\log\frac{2}{\delta}$ . We thus have

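$$\log n=\sum_{v\in\mathcal{T}_{G}}p(v)\mathbb{H}(v)\geq\sum_{v\text{ balanced}}p(v)\cdot\frac{\delta}{2}\log\frac{2}{\delta}.$$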

Hence,

$$\sum_{v\text{ balanced}}p(v)\leq\frac{2\log n}{\delta\log\frac{2}{\delta}}.$$

Bounding the weight of imbalanced vertices.

To bound the cost of imbalanced vertices, we crucially use a connection to Min Sum Set Cover (MSSC). In MSSC, one is given a universe $S$ and sets $A_{1},\dots,A_{M}$ , and needs to construct an ordering $\sigma:[M]\to[M]$ of the sets that minimizes the cost: the cost of a solution $\sigma$ is the average of the cover times of the elements in the universe $S$ . That is, the cost $\textnormal{MSSC}(\sigma)$ of a solution $\sigma$ is

[TABLE]

A result by Feige, Lovász, and Tetali shows that the greedy algorithm gives a 4-approximation for MSSC, and they show this is tight by proving that finding a $(4-\varepsilon)$-approximation for MSSC is NP-hard. On the lower bound side, a connection between MSSC and Decision Tree was already known: Chakaravarthy et al. [CPR+11] proved that it is NP-hard to approximate Uniform Decision Tree with ratio better than $4-\varepsilon$ by a reduction from MSSC. The key technical contribution of our work is showing that there is also a connection on the upper bound side. Bounding the weight of imbalanced vertices works as follows.

1. For each chain $P$, define a corresponding instance $\textnormal{MSSC}^{(P)}$ (Definition 6.9) of Min Sum Set Cover induced by the chain $P=(P_{1},\dots,P_{|P|})$ as follows:

• Universe $S\stackrel{{\scriptstyle\rm def}}{{=}}L(P_{1})$, the set of all hypotheses that are consistent with $P_{1}$.

• For $j=1,\dots,m$, the set $A_{j}$ is the set of hypotheses in $S$ that give the minority answer of test $j$ with respect to hypothesis set $S$ (see Figure 2).

• For each $h=1,\dots,n$, a set $A_{m+h}\stackrel{{\scriptstyle\rm def}}{{=}}\{h\}\cap S$. These sets are included for technical reasons.

Note we have a total of $m+n$ sets, so that a solution is a permutation $\sigma:[m+n]\to[m+n]$. The sets $A_{j}$ for $j=1,\dots,m$ are chosen so that the second step below holds.

2. Prove that the weight of a chain $P$ is bounded by the cost of a greedy solution to $\textnormal{MSSC}^{(P)}$ (Lemma 6.13), and hence, using a result of Feige, Lovász, and Tetali (Theorem 6.12), by 4 times the optimal cost of $\textnormal{MSSC}^{(P)}$ (Corollary 6.14). That is, there exists a greedy solution $\sigma_{G}$ to $\textnormal{MSSC}^{(P)}$ such that

[TABLE]

This step is somewhat technical, as one must show that the greediness of the greedy decision tree $\mathcal{T}_{G}$ produces a greedy solution to $\textnormal{MSSC}^{(P)}$. The choice of $\sigma_{G}$ is natural: for $\ell=1,\dots,|P|$, let $\sigma_{G}(\ell)$ be the index of the test used at vertex $P_{\ell}$ in the chain $P$ (see Figure 3). However, showing that this $\sigma_{G}$ is in fact a greedy solution to $\textnormal{MSSC}^{(P)}$ is a subtle argument that depends on the carefully chosen definition of a chain.

3. Prove that, for any integer $s$, the sum, over all level-$s$ chains $P$, of the optimal cost of $\textnormal{MSSC}^{(P)}$ is bounded by $C_{\text{OPT}}$ (Lemma 6.15). Hence,

[TABLE]

This step is also technical, as one must draw the connection between the optimal MSSC solution and the optimal decision tree.

4. In total, we have

[TABLE]

where the first inequality is by part 2 and the second inequality is by part 3. In other words, for any integer $s$ , the sum of the weights of all level $s$ chains is at most $4C_{\text{OPT}}$ . Hence, the sum of the weights of vertices in any chain, and thus the total weight of all imbalanced vertices, is at most $4s_{\text{max}}C_{\text{OPT}}$ (Lemma 6.16), where $s_{\text{max}}$ is the number of levels. As $s_{\text{max}}\leq\frac{\log n}{\log\frac{1}{\delta}}$ , we have

$$\sum_{v\text{ imbalanced}}p(v)\leq\frac{4\log n}{\log\frac{1}{\delta}}\cdot C_{\text{OPT}}.$$

To finish the proof, we bound

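$$C_{G}=\sum_{v\text{ balanced}}p(v)+\sum_{v\text{ imbalanced}}p(v)\leq\frac{2\log n}{\delta\log\frac{2}{\delta}}+\frac{4\log n}{\log\frac{1}{\delta}}\cdot C_{\text{OPT}}.$$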

The above is optimized roughly when $\delta=\frac{1}{C_{\text{OPT}}}$ , giving the desired bound of $C_{G}\leq\frac{6\log n}{\log C_{\text{OPT}}}\cdot C_{\text{OPT}}$ . If $n$ is sufficiently large, taking $\delta=\frac{10}{\varepsilon C_{\text{OPT}}}$ yields $C_{G}\leq\frac{(4+\varepsilon)\log n}{\log C_{\textnormal{OPT}}}\cdot C_{\text{OPT}}$ .
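As a worked check of the constant 6, the substitution $\delta=\frac{1}{C_{\text{OPT}}}$ can be carried out explicitly. This is a sketch that assumes $C_{\text{OPT}}>1$ and uses only the two estimates described above: at most $\frac{2\log n}{\delta\log(2/\delta)}$ for the balanced vertices (from $\mathbb{H}(v)\geq\frac{\delta}{2}\log\frac{2}{\delta}$ and total entropy $\log n$) and at most $4s_{\text{max}}C_{\text{OPT}}$ for the imbalanced vertices.

```latex
\begin{align*}
C_{G}
 &\leq \frac{2\log n}{\delta\log\frac{2}{\delta}}
      + \frac{4\log n}{\log\frac{1}{\delta}}\cdot C_{\text{OPT}} \\
 &= \frac{2\,C_{\text{OPT}}\log n}{\log(2C_{\text{OPT}})}
      + \frac{4\,C_{\text{OPT}}\log n}{\log C_{\text{OPT}}}
      && \text{(substituting } \delta = 1/C_{\text{OPT}}\text{)} \\
 &\leq \frac{2\,C_{\text{OPT}}\log n}{\log C_{\text{OPT}}}
      + \frac{4\,C_{\text{OPT}}\log n}{\log C_{\text{OPT}}}
      && \text{(since } \log(2C_{\text{OPT}}) \geq \log C_{\text{OPT}} > 0\text{)} \\
 &= \frac{6\log n}{\log C_{\text{OPT}}}\cdot C_{\text{OPT}}.
\end{align*}
```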

4.2 General weights and larger $K$

The proof of the general Theorem 3.1 follows similarly to the specific case given above. The two differences are that Theorem 3.1 is stated for general $K$ and for general, not-necessarily-uniform distributions $p_{1},\dots,p_{n}$ .

Adapting the proof to general $K$ is the easier step. The main difference is the definition of an imbalanced vertex. Now, we say a vertex $v$ is imbalanced if there is an integer $s$ such that $p(v)>2\delta^{s}$ and $p^{-}(v)\leq\delta^{s}$, where $p^{-}(v)$ is the total weight of hypotheses in the subtrees of all children of $v$ except the majority vertex, $v^{+}$, the child of $v$ with the largest weight of hypotheses. Under this definition, a similar analysis follows. Note that $p^{-}(v)$ could be much larger than $p(v^{+})$ in this case, but this does not affect the proof much. A little more care is needed in the entropy argument for balanced vertices, and the MSSC instance defined by a path $P$ now takes $A_{j}$ to be all hypotheses that do not take the majority answer of $\tau_{j}$ with respect to the MSSC universe. Note that, if we specialize to $K=2$, the value $p^{-}(v)$ is simply $p(v^{-})$.

In the weighted case, we again define $v$ to be imbalanced if there is an integer $s$ such that $p(v)>2\delta^{s}$ and $p^{-}(v)\leq\delta^{s}$. We again bound the cost of the balanced vertices by an entropy argument, and the cost of the imbalanced vertices via a connection to Min Sum Set Cover. However, because the entities are now weighted, we need to consider the greedy algorithm for a weighted generalization of MSSC called Weighted Min Sum Set Cover (WMSSC). In order to make the connection between the greedy decision tree and the greedy solution to WMSSC, we need a somewhat technical definition: call a vertex $v$ $h$-heavy if $h$ is consistent with $v$ and $p_{h}>p^{-}(v)$. Define $q(v)=p_{h}$ if there exists $h$ such that $v$ is $h$-heavy, and set $q(v)=0$ otherwise. One can easily check that, for any vertex $v$, there is at most one $h$ such that $v$ is $h$-heavy, so $q(v)$ is well defined. Now, we follow the argument in the uniform case, bounding

[TABLE]

where $\sigma_{G}^{(P)}$ and $\sigma_{\text{OPT}}^{(P)}$ are the greedy solution and optimal solution, respectively, to the corresponding WMSSC. The first inequality holds because $p(v)-q(v)\geq 0$ for all $v$ and every imbalanced vertex is in some chain. (Footnote 5: It is an inequality rather than an equality because some imbalanced vertices may be in multiple chains.) The second inequality holds by a technical lemma (Lemma A.15) comparing the greedy decision tree with a greedy solution to WMSSC. Just as for MSSC, the greedy algorithm gives a 4-approximation for WMSSC, so the third inequality holds. Additionally, for all $s=1,\dots,s_{\text{max}}$, we can still bound $\sum_{P\in\mathcal{P}_{s}}\textnormal{WMSSC}^{(P)}(\sigma_{\text{OPT}})$, the sum of all WMSSC costs in a single level, by $C_{\text{OPT}}$, so the fourth inequality holds. To finish:

[TABLE]

The last inequality (Lemma A.20) comes from comparing, for fixed $h$ , the vertices $v$ of the greedy tree $\mathcal{T}_{G}$ that are $h$ -heavy to an appropriate SET-COVER instance, and using the fact that the greedy algorithm on a weighted generalization of SET-COVER gives a $1+\ln\left(\frac{p_{\text{max}}}{p_{\text{min}}}\right)$ approximation (Theorem A.19).

5 Sketch of proof of Theorem 3.2

5.1 Algorithm

We describe the algorithm that achieves a $\frac{25}{\alpha}+\log R$ approximation for $\textnormal{DT}(R)$ . In Section 7, we give the details and describe how the same algorithm with minor adjustments gives an improved approximation guarantee for Uniform Decision Tree.

The key idea in the algorithm is that, if the optimal tree has cost at least $n^{3\alpha/4}$ , then the greedy algorithm gives an $O(1/\alpha)$ approximation by Theorem 3.1. Fix $b\stackrel{{\scriptstyle\rm def}}{{=}}\lceil{(12\log n+\log R)n^{\alpha}}\rceil$ . Our algorithm first computes the greedy tree. If the cost of the greedy tree is at least $n^{-\alpha/4}b$ , we simply return the greedy tree. Otherwise, we perform an exhaustive search over decision trees of depth at most $b$ such that all hypotheses not consistent with vertices at depth $b$ are uniquely distinguished. We choose such a tree $\mathcal{T}$ with minimum cost (see definition of $C(\mathcal{T};H)$ below). Finally, at each leaf $v$ of $\mathcal{T}$ at depth $b$ , we recursively compute a decision tree that distinguishes the hypotheses consistent with $v$ . The runtime of this algorithm is dominated by the exhaustive search, which we can solve in time $2^{\tilde{O}(n^{\alpha})}$ using a divide-and-conquer algorithm.

Let $C(\mathcal{T};H)$ denote the cost of a decision tree $\mathcal{T}$ with respect to hypothesis set $H$ , given by

[TABLE]

where $d_{\mathcal{T}}(h)$ is the depth of the deepest vertex of $\mathcal{T}$ consistent with $h$. In this way, we have $C(\mathcal{T})=C(\mathcal{T};[n])$. To solve the Decision Tree instance, we run FullTree$_{\alpha}([n])$ below.

5.2 Analysis sketch

We now sketch an analysis of the algorithm. First, it is easy to check that FullTree$_{\alpha}([n])$ returns a valid decision tree. By Theorem 3.1, when the greedy tree is used in the recursive call FullTree$_{\alpha}(H)$, it gives an $O(1/\alpha)+\log R$ approximation to the $\textnormal{DT}(R)$ instance induced by $H$. Hence, by careful bookkeeping, the greedy trees included in the output tree contribute at most $(O(1/\alpha)+\log R)C_{\text{OPT}}$ to the cost (Lemma 7.4). If the greedy tree is not used, then, in the optimal tree, the weighted average depth of the hypotheses is at most $n^{3\alpha/4}$. Hence, by a simple counting argument, at each recursive call, the fraction of undistinguished hypotheses shrinks by a factor of $n^{\alpha/4}$, so the maximum depth of recursive calls is $O(1/\alpha)$ (Lemma 7.6). Careful bookkeeping shows that, for any $i=1,\dots,O(1/\alpha)$, the outputs of PartialTree$(H,b)$ called from the $i$th level of recursion collectively contribute at most $C_{\text{OPT}}$ to the cost of the output tree (Lemma 7.5). Hence, the trees computed by exhaustive search across all levels of recursion contribute a cost of $O(C_{\text{OPT}}/\alpha)$. Altogether, the cost of our output tree is $(O(1/\alpha)+\log R)C_{\text{OPT}}$.

6 Proof of Theorem 3.1 for uniform weights and $K=2$

We prove a special case of Theorem 3.1 when $K=2$ and the weights $p_{1}=\cdots=p_{n}=\frac{1}{n}$ are uniform, that is, we show that the greedy algorithm for Uniform Decision Tree with binary tests gives an $O(\frac{\log n}{\log C_{\textnormal{OPT}}})$ approximation. Throughout this section, we have a Uniform Decision Tree instance with hypotheses $[n]$ and tests $\tau_{1},\dots,\tau_{m}:[n]\to[2]$.

Theorem 6.1.

For any instance of the Uniform Decision Tree problem on $n$ hypotheses with branching factor 2, and any greedy tree $\mathcal{T}_{G}$ with average cost $C_{G}$ , we have

$$C_{G}\leq\frac{6\log n}{\log C_{\textnormal{OPT}}}\cdot C_{\textnormal{OPT}}.$$

6.1 Notation

We use the following notation for our proof. These notations help us reason about the greedy tree. We write $v\in\mathcal{T}_{G}$ to mean that $v$ is a vertex of tree $\mathcal{T}_{G}$, and we write $v\in\mathcal{T}_{G}^{\mathrm{o}}$ to mean that $v$ is an interior vertex. We say the length of a path in the tree is the number of edges along the path. For $v,u\in\mathcal{T}_{G}$, we say $v$ is an ancestor of $u$ if there is a (possibly degenerate) path from $v$ to $u$ going down the tree. In particular, $v$ is an ancestor of $v$. We write this as $u\sqsubseteq v$. We call $u$ a descendant of $v$ if and only if $v$ is an ancestor of $u$. For $v\in\mathcal{T}_{G}$, let $L(v)\subseteq[n]$ denote the set of hypotheses consistent with $v$. For a subset $S\subseteq[n]$ of hypotheses, denote its weight or cost by $p(S)\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{h\in S}p_{h}=\frac{|S|}{n}$. For brevity, let $p(v)\stackrel{{\scriptstyle\rm def}}{{=}}p(L(v))$ denote the weight of vertex $v$, and we say the weight of a set of vertices is the sum of the weights of the individual vertices in the set.

6.2 The basic argument

The following lemma shows that, rather than accounting the cost of the greedy tree by summing the depths of the leaves associated with the hypotheses, we can instead account the cost by summing the weights of vertices of the tree.

Lemma 6.2.

We have $C_{G}=\sum_{v\in\mathcal{T}_{G}^{\mathrm{o}}}p(v)$ .

Proof.

We have,

[TABLE]

where, in the third equality, we switched the order of summation. ∎
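Lemma 6.2 can be sanity-checked on a toy instance. The following is a minimal sketch, not the paper's implementation: the names `greedy_tree`, `total_depth`, and `interior_weight` and the example tests are assumptions made for illustration. It builds a greedy tree for uniform weights and binary tests (picking, at each vertex, the test with the largest minority side) and compares the average leaf depth with the sum of interior vertex weights using exact arithmetic.

```python
from fractions import Fraction

def greedy_tree(hyps, tests):
    """Greedy tree on `hyps`: a leaf is a hypothesis, an interior vertex is a pair."""
    if len(hyps) == 1:
        return hyps[0]

    def minority_size(test):
        ones = sum(1 for h in hyps if test[h] == 1)
        return min(ones, len(hyps) - ones)

    best = max(tests, key=minority_size)  # most balanced split
    if minority_size(best) == 0:
        raise ValueError("no test separates the remaining hypotheses")
    side1 = [h for h in hyps if best[h] == 1]
    side2 = [h for h in hyps if best[h] != 1]
    return (greedy_tree(side1, tests), greedy_tree(side2, tests))

def num_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return num_leaves(tree[0]) + num_leaves(tree[1])

def total_depth(tree, depth=0):
    """Sum of leaf depths."""
    if not isinstance(tree, tuple):
        return depth
    return total_depth(tree[0], depth + 1) + total_depth(tree[1], depth + 1)

def interior_weight(tree, n):
    """Sum of p(v) = |L(v)|/n over interior vertices v."""
    if not isinstance(tree, tuple):
        return Fraction(0)
    here = Fraction(num_leaves(tree), n)
    return here + interior_weight(tree[0], n) + interior_weight(tree[1], n)

# Hypothetical instance: 5 hypotheses, singleton tests plus one 2-vs-3 split.
n = 5
tests = [tuple(1 if h == i else 2 for h in range(n)) for i in range(n)]
tests.append((1, 1, 2, 2, 2))
tree = greedy_tree(list(range(n)), tests)
average_depth = Fraction(total_depth(tree), n)
assert average_depth == interior_weight(tree, n)
```

The identity holds for any decision tree, not only the greedy one, since each hypothesis contributes $p_h$ once for every interior vertex on its root-to-leaf path.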

At a high level, our proof defines balanced and imbalanced vertices (next subsection) using a parameter $\delta$ and bounds the weight of the balanced and imbalanced vertices separately. We bound the weight of the balanced vertices by an entropy argument, and the weight of the imbalanced vertices by partitioning the imbalanced vertices into paths, called chains, and bounding the weights of each chain separately. Overall, we get the following bound.

[TABLE]

Choosing $\delta=\frac{1}{C_{\textnormal{OPT}}}$ gives $C_{G}\leq\frac{6\log n}{\log C_{\textnormal{OPT}}}\cdot C_{\textnormal{OPT}}$ .

For the rest of the proof, fix $\delta=\frac{1}{C_{\textnormal{OPT}}}$ . Additionally, for convenience and without loss of generality, assume that our instance is nontrivial, i.e. there is some test $j$ such that both of $\tau^{-1}_{j}(1)$ and $\tau^{-1}_{j}(2)$ have at least 2 hypotheses, as otherwise the greedy tree is optimal and $C_{G}=C_{\textnormal{OPT}}$ and the theorem is true.

6.3 More notation: Majority and minority answers

We define majority (minority) answers, edges, children. These definitions are useful for defining balanced and imbalanced vertices. We later show that imbalanced vertices form paths whose edges are majority edges. We call these paths chains. We then analyze the balanced and imbalanced vertices separately, and in particular analyze each path of majority edges separately.

For each vertex $v$ in the greedy tree, let $j_{v}$ denote the test used at $v$. For each vertex $v$, label its children by $v^{+}$ and $v^{-}$ so that $p(v^{+})\geq p(v^{-})$, with ties broken by labeling $v^{+}$ by the vertex corresponding to a test outputting 1. (Footnote 6: Any tiebreaking procedure suffices, as long as the tiebreaking is consistent with the $k_{j,S}^{+}$ and $k_{j,S}^{-}$ notation in the next paragraph. Footnote 7: It is possible to have a vertex that has one child, namely when a test doesn't distinguish any pairs of hypotheses at a vertex, but such a test is useless and never appears in either the greedy or optimal tree, so we assume such vertices don't exist.) Accordingly, we have $p(v)=p(v^{+})+p(v^{-})$ for all $v\in\mathcal{T}_{G}^{\mathrm{o}}$. Call the edge from $v$ to $v^{+}$ a majority edge, and the edge from $v$ to $v^{-}$ a minority edge. This is illustrated in Figure 4.

In order to reason about the greedy tree precisely, we use the following notation which is more technical. For test $j\in[m]$ and hypotheses $S\subseteq[n]$ , let $k^{+}_{j,S}\in[2]$ be the answer to test $\tau_{j}$ that accounts for the maximum weight of hypotheses in $S$ , and let $k^{-}_{j,S}$ be the other index, with ties broken by $k^{+}_{j,S}=1$ . In other words, $k^{+}_{j,S}$ and $k^{-}_{j,S}$ are chosen so that $|\tau_{j}^{-1}(k^{+}_{j,S})\cap S|\geq|\tau_{j}^{-1}(k^{-}_{j,S})\cap S|$ . We call $k^{+}_{j,S}$ the majority answer of test $j$ with respect to hypothesis set $S$ . Call the other answer $k^{-}_{j,S}$ the minority answer of test $j$ with respect to hypothesis set $S$ . For all $j\in[m]$ and $S\subseteq[n]$ , let

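$$I_{j,S}^{+}\stackrel{{\scriptstyle\rm def}}{{=}}\tau_{j}^{-1}(k_{j,S}^{+}),\qquad I_{j,S}^{-}\stackrel{{\scriptstyle\rm def}}{{=}}\tau_{j}^{-1}(k_{j,S}^{-}).$$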

We think of $I_{j,S}^{+}$ ( $I_{j,S}^{-}$ ) as the set of hypotheses that, under test $j$ , output the majority (minority) answer to test $j$ with respect to set $S$ . Note that, with the above notation, we have $L(v^{+})=I_{j_{v},L(v)}^{+}\cap L(v)$ and $L(v^{-})=I_{j_{v},L(v)}^{-}\cap L(v)$ .
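The majority and minority notation can be made concrete with a short sketch. The function names `majority_answer` and `answer_sets`, and the representation of a test as a tuple of answers in $\{1,2\}$ indexed by hypothesis, are assumptions for illustration. Note that $I_{j,S}^{+}$ and $I_{j,S}^{-}$ are subsets of all $n$ hypotheses; only the choice of majority answer depends on $S$.

```python
def majority_answer(test, S):
    """k^+_{j,S}: the answer in {1, 2} given by most hypotheses of S (ties go to 1)."""
    ones = sum(1 for h in S if test[h] == 1)
    return 1 if ones >= len(S) - ones else 2

def answer_sets(test, S):
    """Return (I^+_{j,S}, I^-_{j,S}) as sets of hypothesis indices."""
    k_plus = majority_answer(test, S)
    everything = set(range(len(test)))
    i_plus = {h for h in everything if test[h] == k_plus}
    return i_plus, everything - i_plus

# Hypothetical test on 5 hypotheses: hypotheses 0 and 1 answer 1, the rest answer 2.
test = (1, 1, 2, 2, 2)
i_plus, i_minus = answer_sets(test, {0, 1, 2})
# With S = {0, 1, 2} the majority answer is 1, so L(v^-) would be i_minus & S = {2}.
```

Changing $S$ can flip which answer is the majority, which is why Lemma 6.13 later needs to show that the majority answer stays fixed along a chain.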

The following is a key property of the greedy tree $\mathcal{T}_{G}$: the weight of hypotheses consistent with the minority child $v^{-}$ never increases as we descend the tree.

Lemma 6.3.

For any vertices $u$ and $v$ with $u$ a descendant of $v$, we have $p(v^{-})\geq p(u^{-})$.

Proof.

Because $\mathcal{T}_{G}$ was constructed greedily, for all $v\in\mathcal{T}_{G}^{\mathrm{o}}$ , the test $j_{v}$ was chosen to maximize the weight of $I_{j_{v},L(v)}^{-}\cap L(v)$ , the hypotheses in $L(v)$ giving the minority answer $k_{j_{v},L(v)}^{-}$ . Hence, any other test, in particular, the test $j_{u}$ chosen at vertex $u$ , has a smaller weight of hypotheses of $L(v)$ that give the minority answer of $j_{u}$ with respect to hypotheses $L(v)$ . Hence, we have $p(v^{-})\geq p(I_{j_{u},L(v)}^{-}\cap L(v))$ . Hence,

$$p(v^{-})\geq p(I_{j_{u},L(v)}^{-}\cap L(v))\geq p(I_{j_{u},L(v)}^{-}\cap L(u))\geq p(u^{-}).$$

The second inequality holds because $L(u)\subseteq L(v)$ . The third inequality holds because test $j_{u}$ defines a partition of $[n]$ into two parts, and $I_{j_{u},L(v)}^{-}$ is one of the two parts, so $I_{j_{u},L(v)}^{-}\cap L(u)$ is one of $L(u^{-})$ or $L(u^{+})$ . ∎
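The monotonicity in Lemma 6.3 can be checked empirically. The sketch below is illustrative only: `minority_never_grows` and the randomly generated instance are assumptions, and the greedy rule (choose the test with the largest minority side) follows the description in the proof above. It walks the greedy tree and verifies that each vertex's minority side is no larger than its parent's, which by transitivity gives the statement for all ancestors.

```python
import random

def minority_never_grows(hyps, tests, parent_minority=None):
    """Walk the greedy tree on `hyps`; True if no vertex has a larger minority side than its parent."""
    if len(hyps) <= 1:
        return True

    def split(test):
        ones = [h for h in hyps if test[h] == 1]
        twos = [h for h in hyps if test[h] != 1]
        # Return (minority side, majority side).
        return (ones, twos) if len(ones) <= len(twos) else (twos, ones)

    # Greedy choice: the test whose minority side is largest.
    minority, majority = max((split(t) for t in tests), key=lambda s: len(s[0]))
    if not minority:
        raise ValueError("no test separates the remaining hypotheses")
    if parent_minority is not None and len(minority) > parent_minority:
        return False
    return all(
        minority_never_grows(child, tests, len(minority))
        for child in (minority, majority)
    )

# Hypothetical instance: a few random tests plus singleton tests so that
# every pair of hypotheses can be distinguished.
random.seed(0)
n = 12
tests = [tuple(random.choice((1, 2)) for _ in range(n)) for _ in range(6)]
tests += [tuple(1 if h == i else 2 for h in range(n)) for i in range(n)]
result = minority_never_grows(list(range(n)), tests)
```

Because the same set of tests is available at every vertex, the lemma predicts that `result` is true for any instance on which the greedy tree exists.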

6.4 Defining balanced and imbalanced vertices

In the following definition, we identify balanced vertices and imbalanced vertices. By Lemma 6.2, we can separately bound the weights of the balanced and imbalanced vertices.

Definition 6.4.

Let $s$ be a positive integer.

1. We say a vertex $v\in\mathcal{T}_{G}^{\mathrm{o}}$ is level-$s$ imbalanced if $p(v^{-})\leq\delta^{s}$ and $p(v)>2\delta^{s}$.

2. We say a vertex $v$ is imbalanced if it is level-$s$ imbalanced for some $s$, and balanced otherwise.

3. We say a level-$s$ imbalanced vertex $v$ is minimal if no proper descendant of $v$ is also level-$s$ imbalanced, and a level-$s$ imbalanced vertex $v$ is maximal if no proper ancestor of $v$ is level-$s$ imbalanced.

Let

$$s_{\max}\stackrel{{\scriptstyle\rm def}}{{=}}\frac{\log n}{\log\frac{1}{\delta}}$$

and note that level-$s$ imbalanced vertices exist only for $s\leq s_{\max}$. The following lemma proves a structural result about imbalanced vertices, with the punchline being item (iii), which permits Definition 6.6. For an illustration, see Figure 5.
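Definition 6.4 can be read as a simple predicate on the pair $(p(v), p(v^{-}))$. The following sketch is an illustration under assumptions: the function name `imbalance_levels` is hypothetical, and the caller is assumed to pass an integer upper bound `s_max` on the levels to test. Exact fractions avoid floating-point issues at the boundary cases.

```python
from fractions import Fraction

def imbalance_levels(p_v, p_minus, delta, s_max):
    """Levels s in 1..s_max with p(v^-) <= delta^s and p(v) > 2 * delta^s."""
    return [
        s for s in range(1, s_max + 1)
        if p_minus <= delta ** s and p_v > 2 * delta ** s
    ]

delta = Fraction(1, 4)

# Imbalanced at level 2 only: the level-1 test fails because p(v) > 2*delta is strict.
only_level_two = imbalance_levels(Fraction(1, 2), Fraction(1, 16), delta, 5)

# A vertex can be imbalanced at several levels at once.
two_levels = imbalance_levels(Fraction(1), Fraction(1, 20), delta, 5)

# Balanced: no level satisfies both conditions.
balanced = imbalance_levels(Fraction(1, 2), Fraction(1, 4), delta, 5)
```

A vertex is balanced exactly when the returned list is empty; a vertex with several levels belongs to one chain per level.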

Lemma 6.5.

Let $s$ be a positive integer.

(i) If $v$ is a level-$s$ imbalanced vertex, then, among the children of $v$, only $v^{+}$ can be a level-$s$ imbalanced vertex.

(ii) Additionally, if $v$ and $u$ are level-$s$ imbalanced vertices and $v$ is an ancestor of $u$, then every vertex on the path from $v$ to $u$ is a level-$s$ imbalanced vertex.

(iii) Finally, the set of level-$s$ imbalanced vertices can be partitioned into vertex disjoint paths, each of which connects a maximal level-$s$ imbalanced vertex to a minimal level-$s$ imbalanced vertex and contains only majority edges.

Proof.

For (i), note that if $v$ is level- $s$ imbalanced, then $p(v^{-})\leq\delta^{s}<2\delta^{s}$ , so $v^{-}$ cannot be level- $s$ imbalanced. Hence, among the children of $v$ , only $v^{+}$ can be level- $s$ imbalanced.

For (ii), let $v\sqsupseteq w\sqsupseteq u$ be three vertices in the tree. Suppose that $v$ and $u$ are level- $s$ imbalanced. We know that $p(v)\geq p(w)\geq p(u)>2\delta^{s}$ , and Lemma 6.3 gives $\delta^{s}\geq p(v^{-})\geq p(w^{-})\geq p(u^{-})$ . Hence $w$ is level- $s$ imbalanced.

For (iii), note that each level- $s$ imbalanced vertex has a maximal level- $s$ imbalanced ancestor (possibly itself), so we may partition the level- $s$ imbalanced vertices into sets based on their maximal level- $s$ imbalanced ancestor. We claim each set in the partition is a connected path. Let $v_{1}$ be the (unique) maximal level- $s$ imbalanced vertex in a set $P$ . For $\ell=1,2,\dots$ , if $v_{\ell}$ has a level- $s$ imbalanced child, let $v_{\ell+1}$ be that child, which is unique by the first item and in $P$ by definition. Let $\ell_{P}$ be the largest index such that $v_{\ell_{P}}$ is defined. Then $v_{\ell_{P}}$ has no level- $s$ imbalanced children. We claim $v_{1},\dots,v_{\ell_{P}}$ are the only vertices in the set $P$ . Suppose not. Let $\ell$ be the largest index such that $v_{\ell}$ has a level- $s$ imbalanced descendant $u$ not among $v_{1},\dots,v_{\ell_{P}}$ . Then, by the second item, every vertex on the path from $v_{\ell}$ to $u$ is level- $s$ imbalanced. If $\ell=\ell_{P}$ , this means $v_{\ell_{P}}$ has a level- $s$ imbalanced child, a contradiction of the maximality of $\ell_{P}$ . Otherwise, as $\ell$ is maximal, $v_{\ell+1}$ is not on the path from $v_{\ell}$ to $u$ , in which case, by (ii), $v_{\ell}^{-}$ is level- $s$ imbalanced, which contradicts (i). Thus, we always have a contradiction, so $P$ is the path $v_{1},\dots,v_{\ell_{P}}$ . By (i), every edge along $P$ is a majority edge. This completes the proof. ∎

Lemma 6.5 motivates the following definition.

Definition 6.6.

Let $s$ be a positive integer. A level- $s$ chain, $P=(P_{1},\dots,P_{|P|})$ , is a path of level- $s$ imbalanced vertices starting at a maximal level- $s$ imbalanced vertex and ending at a minimal level- $s$ imbalanced vertex. By Lemma 6.5, the level- $s$ chains partition the level- $s$ imbalanced vertices. We therefore let $\mathcal{P}_{s}$ denote the set of level- $s$ chains.

In general, for $s\neq s^{\prime}$ , a level- $s$ chain might overlap with a level- $s^{\prime}$ chain.

6.5 Bounding the weight of balanced vertices

We first prove a lemma that justifies the choice of the word “balanced”.

Lemma 6.7.

For every balanced vertex $v$ , we have $p(v^{-})\geq\frac{\delta}{2}p(v)$ .

Proof.

Assume for contradiction that $p(v^{-})<\frac{\delta}{2}p(v)$. Let $s^{\prime}>1$ be the real number such that $p(v^{-})=\delta^{s^{\prime}}$. In this way, $p(v^{-})\leq\delta^{\lfloor s^{\prime}\rfloor}$. Then

[TABLE]

This implies that $v$ is level- $\lfloor s^{\prime}\rfloor$ imbalanced, so $v$ is imbalanced, a contradiction. ∎

We now bound the contribution of the balanced vertices to the weight using an entropy argument.

Lemma 6.8.

We have

$$\sum_{v\text{ balanced}}p(v)\leq\frac{2\log n}{\delta\log\frac{2}{\delta}}.$$

Proof.

For a vertex $v$ with a test of index $j$, let $X_{v}$ denote the binary random variable equal to $\tau_{j}(h)$ for a hypothesis $h$ chosen uniformly at random from the hypotheses of $L(v)$. Let $\mathbb{H}(\cdot)$ denote the entropy of a random variable. The entropy of a uniformly random hypothesis in $[n]$ equals $\log n$. On the other hand, we can pick a uniformly random hypothesis in $[n]$ by starting at the root vertex $v$, sampling an answer $X_{v}\in[2]$ for the test at $v$, stepping to $u$, the child of $v$ corresponding to the chosen answer, and repeating with $v=u$, until we reach a leaf. In this process, at any vertex $v$, the probability of stepping to a child $u$ is exactly $\frac{p(u)}{p(v)}$. Hence, by a simple induction, the probability of reaching any vertex $v$ in the tree during this process is exactly $p(v)$. The total entropy of this process is thus $\sum_{v\in\mathcal{T}_{G}}p(v)\cdot\mathbb{H}(X_{v})$, as $p(v)$ is the probability of reaching vertex $v$. For a balanced vertex $v$, Lemma 6.7 implies $\frac{\delta}{2}\leq\frac{p(v^{-})}{p(v)}\leq\frac{1}{2}$. Hence,

[TABLE]

We conclude

[TABLE]

and rearranging gives the desired result. ∎
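The entropy identity used in this proof, $\sum_{v}p(v)\mathbb{H}(X_{v})=\log n$, holds for any binary tree whose leaves are the $n$ hypotheses, and it can be checked numerically. The sketch below is illustrative: `walk_entropy` and the splitting rule are assumptions, and logarithms are taken base 2 (the identity is the same in any fixed base). A vertex is represented only by the number of hypotheses consistent with it.

```python
import math

def binary_entropy(x):
    """Entropy (in bits) of a coin with bias x."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def walk_entropy(size, n, split):
    """Sum of p(v) * H(X_v) over interior vertices of the tree in which a vertex
    with `size` hypotheses has children with split(size) and size - split(size)."""
    if size <= 1:
        return 0.0
    small = split(size)
    assert 1 <= small < size, "split must leave both children non-empty"
    here = (size / n) * binary_entropy(small / size)
    return here + walk_entropy(small, n, split) + walk_entropy(size - small, n, split)

# Hypothetical unbalanced splitting rule: roughly one third versus two thirds.
n = 12
total = walk_entropy(n, n, lambda k: max(1, k // 3))
assert abs(total - math.log2(n)) < 1e-9
```

The lemma then follows by keeping only the balanced vertices in the sum and lower-bounding each $\mathbb{H}(X_{v})$ by $\frac{\delta}{2}\log\frac{2}{\delta}$.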

6.6 Bounding the weight of imbalanced vertices

We now bound the weight of imbalanced vertices using a connection to MSSC.

6.6.1 Defining Min Sum Set Cover

Recall that $I_{j,S}^{+}=\tau_{j}^{-1}(k_{j,S}^{+})$ and $I_{j,S}^{-}=\tau_{j}^{-1}(k_{j,S}^{-})$ for all $j\in[m]$ and $S\subseteq[n]$ .

Definition 6.9.

Let $\textnormal{MSSC}^{(P)}$ denote the instance of Min Sum Set Cover that is induced by the chain $P=(P_{1},\dots,P_{|P|})$ . This instance is given by

• universe $S\stackrel{{\scriptstyle\rm def}}{{=}}L(P_{1})$,

• for $j=1,\dots,m$, sets $A_{j}\stackrel{{\scriptstyle\rm def}}{{=}}I_{j,S}^{-}\cap S$, and

• for each $h=1,\dots,n$, a singleton set $A_{m+h}\stackrel{{\scriptstyle\rm def}}{{=}}\{h\}\cap S$. (Footnote 8: Some of these sets are empty, but we include them for notational convenience.)

Note we have a total of $m+n$ sets. A solution to the instance $\textnormal{MSSC}^{(P)}$ is a permutation $\sigma:[m+n]\to[m+n]$ corresponding to an ordering of the sets, and the cost $\textnormal{MSSC}^{(P)}(\sigma)$ of a solution is the average of the cover times of the elements in the universe $S$ . Formally,

[TABLE]

Note that the cost of any solution is finite, as each hypothesis $h\in S$ is in some set $A_{j}$. We sometimes refer to a solution $\sigma$ by the sets $A_{\sigma(1)},\dots,A_{\sigma(m+n)}$.

Remark 6.10.

Since the initial Uniform Decision Tree instance always has a solution, any two hypotheses can be distinguished by one of the $m$ tests. Hence, there is at most one hypothesis $h\in S$ such that, for all $j=1,\dots,m$, we have $h\notin A_{j}$. In other words, all but at most one of the sets $A_{m+h}$ for $h\in[n]$ are unnecessary.

Definition 6.11.

We say a solution $\sigma:[m+n]\to[m+n]$ to $\textnormal{MSSC}^{(P)}$ is greedy at index $\ell$ if the set $A_{\sigma(\ell)}$ covers the maximum number of elements not covered by sets $A_{\sigma(1)},\dots,A_{\sigma(\ell-1)}$. We say a solution $\sigma:[m+n]\to[m+n]$ is greedy if it is greedy at index $\ell$ for all $\ell\in[m+n]$.
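A greedy solution in the sense of Definition 6.11 can be computed directly. The sketch below is a generic illustration, not tied to a specific chain: the names `greedy_mssc_order` and `cover_time_sum` and the toy instance are assumptions, sets are indexed from 0, and ties are broken by lowest index. The second function returns the unnormalized sum of cover times, so dividing by the appropriate normalizer gives the cost.

```python
def greedy_mssc_order(universe, sets):
    """Order the sets so that each one covers the most still-uncovered elements."""
    uncovered = set(universe)
    remaining = list(range(len(sets)))
    order = []
    while remaining:
        # max() returns the first maximizer, so ties go to the lowest index.
        best = max(remaining, key=lambda i: len(sets[i] & uncovered))
        order.append(best)
        remaining.remove(best)
        uncovered -= sets[best]
    return order

def cover_time_sum(universe, sets, order):
    """Sum over elements of the (1-based) position of the first set covering them."""
    total = 0
    for h in universe:
        total += next(t for t, i in enumerate(order, start=1) if h in sets[i])
    return total

# Hypothetical toy instance.
universe = {0, 1, 2, 3}
sets = [{0, 1, 2}, {2, 3}, {3}, {0}]
order = greedy_mssc_order(universe, sets)
greedy_sum = cover_time_sum(universe, sets, order)
```

On this instance the first set covers three elements at time 1 and the second set covers the remaining element at time 2, so the sum of cover times is 5.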

Note that, in the case of ties, there may be multiple greedy solutions to $\textnormal{MSSC}^{(P)}$ . Note also that, for any partial assignment $\sigma(1),\dots,\sigma(\ell)$ , one can always complete the solution greedily, so that $\sigma:[m+n]\to[m+n]$ is greedy at indices $\ell+1,\ell+2,\dots,m+n$ . Definition 6.11 lets us leverage the following theorem, due to Feige, Lovász, and Tetali.

Theorem 6.12 (Theorem 1 of [FLT04]).

The greedy algorithm gives a 4-approximation to the MSSC problem. Formally, let $\sigma_{G}$ be any greedy solution to the instance $\textnormal{MSSC}^{(P)}$ , and let $\sigma_{\textnormal{OPT}}$ denote an optimal solution to $\textnormal{MSSC}^{(P)}$ . We have

$$\textnormal{MSSC}^{(P)}(\sigma_{G})\leq 4\cdot\textnormal{MSSC}^{(P)}(\sigma_{\textnormal{OPT}}).$$

6.6.2 Bounding chain weight above by MSSC cost

This section shows that the weight of a chain $P$ is bounded by the cost of its corresponding MSSC instance. To do this, we need the following technical lemma, which shows that following the choices of the greedy tree yields a greedy solution to the MSSC instance, and hence the two weights are comparable.

Lemma 6.13.

Let $s$ be a positive integer and let $P$ be a level- $s$ chain. Then there exists a greedy solution $\sigma:[m+n]\to[m+n]$ to $\textnormal{MSSC}^{(P)}$ , such that

[TABLE]

Proof.

Let $P=(P_{1},\dots,P_{|P|})$ . Let $S=S^{(P)}$ be the universe and $A_{1},\dots,A_{m+n}$ be the sets of the instance $\textnormal{MSSC}^{(P)}$ . For $\ell=1,\dots,|P|$ , let $j_{\ell}$ be the test used at vertex $P_{\ell}$ . Define a solution $\sigma:[m+n]\to[m+n]$ to $\textnormal{MSSC}^{(P)}$ by setting $\sigma(\ell)=j_{\ell}$ for $\ell=1,\dots,|P|$ , and completing the solution greedily. We claim $\sigma$ is a greedy solution. To prove this, we show the following.

(i) For all $j\in[m]$ and $1\leq\ell\leq|P|$, the majority answer for test $j$ with respect to $S=L(P_{1})$ is the same as the majority answer for test $j$ with respect to $L(P_{\ell})$. Equivalently, for all $j\in[m]$ and $1\leq\ell\leq|P|$, we have $A_{j}=I_{j,S}^{-}\cap S=I_{j,L(P_{\ell})}^{-}\cap S$. As an immediate consequence, we know $A_{j_{\ell}}$ contains all of $L(P_{\ell}^{-})$ and none of $L(P_{\ell}^{+})$.

(ii) The set of hypotheses of $S$ not covered by $A_{j_{1}},\dots,A_{j_{\ell-1}}$ is exactly $L(P_{\ell})$.

(iii) For each $1\leq\ell\leq|P|$, among sets $A_{1},\dots,A_{m+n}$, the set $A_{j_{\ell}}$ covers the maximum number of hypotheses in $L(P_{\ell})$, i.e., we have

[TABLE]

These points suffice, as (ii) and (iii) tell us that $\sigma$ is greedy at indices $1,\dots,|P|$ , so by construction $\sigma$ is greedy.

To show (i), fix $j\in[m]$ and $\ell\in\{1,\dots,|P|\}$ . As $P_{\ell}$ is level- $s$ imbalanced, we also have $p(P_{\ell})>2\delta^{s}$ and $p(P^{-}_{\ell})\leq\delta^{s}$ and $p(P^{+}_{\ell})>\delta^{s}$ , so $k_{j,L(P_{\ell})}^{+}$ accounts for more than half of the hypotheses in $L(P_{\ell})$ . On the other hand, as $P_{1}$ is level- $s$ imbalanced, we have $p(I_{j,S}^{-}\cap L(P_{\ell}))\leq p(I_{j,S}^{-}\cap S)\leq p(P^{-}_{1})\leq\delta^{s}$ , so the majority answer for test $j$ with respect to hypothesis set $S$ also accounts for more than half of the hypotheses in $L(P_{\ell})$ . Hence $k_{j,S}^{+}=k_{j,L(P_{\ell})}^{+}$ for all $j\in[m]$ and $1\leq\ell\leq|P|$ .

Item (ii) follows because $L(P_{\ell})$ is the set of hypotheses consistent with $P_{\ell}$ , which was obtained by following the majority edges from $P_{1}$ . This means $L(P_{\ell})$ contains all the hypotheses of $S$ not consistent with a minority child of one of $P_{1},\dots,P_{\ell-1}$ . By (i), this is exactly $S\setminus(A_{j_{1}}\cup\cdots\cup A_{j_{\ell-1}})$ .

For (iii), at vertex $P_{\ell}$ in the greedy decision tree, the test index $j=j_{\ell}$ maximizes the weight $p(I_{j,L(P_{\ell})}^{-}\cap L(P_{\ell}))$ . By (i), this index $j$ equivalently maximizes $p(A_{j}\cap L(P_{\ell}))$ , as desired. This completes the proof that $\sigma$ is greedy.

We now return to the proof of Lemma 6.13. Take the greedy solution $\sigma$ given above. For $1\leq\ell\leq|P|$, the set of hypotheses of $S$ not covered by $A_{\sigma(1)},\dots,A_{\sigma(\ell-1)}$ is exactly $L(P_{\ell})$, which has weight $p(P_{\ell})$. Hence, by (27),

[TABLE]

as desired. ∎

By Theorem 6.12, we have the following immediate corollary.

Corollary 6.14.

Let $P$ be any chain, and $\sigma_{\textnormal{OPT}}$ be the optimal solution to $\textnormal{MSSC}^{(P)}$ . Then

$$\sum_{\ell=1}^{|P|}p(P_{\ell})\leq 4\cdot\textnormal{MSSC}^{(P)}(\sigma_{\textnormal{OPT}}).$$

6.6.3 Bounding MSSC cost above by $C_{\textnormal{OPT}}$

We now show that the optimal MSSC solution can be compared to the optimal decision tree cost, $C_{\textnormal{OPT}}$ . For all chains $P$ , let $\sigma_{G}^{(P)}:[m+n]\to[m+n]$ be a greedy solution to $\textnormal{MSSC}^{(P)}$ given by Lemma 6.13, and let $\sigma_{\textnormal{OPT}}^{(P)}$ be an optimal solution to $\textnormal{MSSC}^{(P)}$ .

Lemma 6.15.

Let $s$ be a positive integer. We have

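$$\sum_{P\in\mathcal{P}_{s}}\textnormal{MSSC}^{(P)}(\sigma_{\textnormal{OPT}}^{(P)})\leq C_{\textnormal{OPT}}.$$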

Proof.

Let $S^{(P)}$ be the universe of the instance $\textnormal{MSSC}^{(P)}$ , and let $A_{1},\dots,A_{m+n}$ be the sets. Construct a path $w_{1},\dots,w_{\ell^{*}}$ in $\mathcal{T}_{\textnormal{OPT}}$ such that $w_{1}$ is the root and $w_{\ell^{*}}$ is a leaf, which is identified with some hypothesis $h^{*}$ , and, for $\ell=1,\dots,\ell^{*}-1$ , if the test at vertex $w_{\ell}$ has index $j_{\ell}$ , the edge to its child $w_{\ell+1}$ corresponds to the answer $k_{j_{\ell},S^{(P)}}^{+}$ , the majority answer of test $j_{\ell}$ with respect to set $S^{(P)}$ . Since we follow the edges with label $k^{+}_{j_{\ell},S^{(P)}}$ , this corresponds to following the path for an hypothesis contained in $I_{j_{\ell},S^{(P)}}^{+}$ . In other words, we have, for $\ell=1,\dots,\ell^{*}-1$ ,

[TABLE]

Thus the sequence $A_{j_{1}},\dots,A_{j_{\ell^{*}}},A_{m+h^{*}}$ covers $S^{(P)}$ , and thus gives a valid solution $\sigma_{\text{TREE}}$ to the instance $\textnormal{MSSC}^{(P)}$ , where $\sigma_{\text{TREE}}(1)=j_{1},\sigma_{\text{TREE}}(2)=j_{2},\dots,\sigma_{\text{TREE}}(\ell^{*})=j_{\ell^{*}},\sigma_{\text{TREE}}(\ell^{*}+1)=m+h^{*}$ , and $\sigma_{\text{TREE}}$ on larger indices is arbitrarily chosen. Note that the depth of the leaf for hypothesis $h$ is at least the number of vertices of $w_{1},\dots,w_{\ell^{*}}$ that are on the root-to-leaf path of $h$ , and this number is $\min\{\ell:h\in A_{j_{\ell}}\}$ , except for $h^{*}$ , in which case it is 1 smaller. Furthermore, for some $h\in S^{(P)}$ , the depth of leaf $h$ is at least $\min\{\ell:h\in A_{j_{\ell}}\}+1$ , because otherwise all the branches leaving the path have one leaf, which can only happen if our Uniform Decision Tree instance is trivial, and it is not trivial by assumption. Thus,

[TABLE]

The $\frac{-1}{n}$ and $\frac{1}{n}$ account for the lower order terms described above. Summing (6.6.3) over $P\in\mathcal{P}_{s}$ gives

[TABLE]

where in the first inequality, we used that every leaf has at most one maximal level- $s$ imbalanced ancestor, and hence it is in at most one MSSC universe $S^{(P)}$ . ∎

6.6.4 Bounding imbalanced vertex weight above by $C_{\textnormal{OPT}}$

We now finish our bound of the weight of imbalanced vertices.

Lemma 6.16.

We have

[TABLE]

Proof.

Each imbalanced vertex $v$ is level- $s$ imbalanced for some positive integer $s$ , so it is part of some level- $s$ chain, $P$ . Hence,

[TABLE]

The first inequality is not equality because some vertices may be level- $s$ imbalanced for more than one integer $s$ . The second inequality is by Corollary 6.14. The third inequality is by Lemma 6.15. ∎

6.7 Finishing the proof

Proof of Theorem 3.1.

We have

[TABLE]

as desired. In the first inequality, we used Lemma 6.8 and Lemma 6.16. ∎

7 Proof of Theorem 3.2

For the entirety of this section, fix $\alpha<1$ , $R\geq 1$ , and an instance of $\textnormal{DT}(R)$ . We first analyze Algorithm 1, showing that running FullTree ${}_{\alpha}([n])$ gives a $(\frac{25}{\alpha}+\log R)$ approximation for $\textnormal{DT}(R)$ , and then describe how the algorithm can be modified to give a $\frac{9.01}{\alpha}$ approximation for Uniform Decision Tree in subexponential time.

7.1 Runtime

Lemma 7.1.

Algorithm 2 runs in time $O(n(Km)^{b+1})$ .

Proof.

Each call to Algorithm 2 makes at most $mK$ recursive calls, each with depth parameter one smaller. This means the total number of recursive calls is $O((Km)^{b})$ . Since the local runtime of each call is $O(nmK)$ , the total runtime is $O(n(Km)^{b+1})$ . ∎
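The counting in this proof can be written out as a geometric sum. The following is only a sketch of that arithmetic, assuming $Km\geq 2$ ; the depth index $d$ is introduced here for illustration and does not appear in the original argument.

```latex
\begin{align*}
\#\{\text{recursive calls}\}
  &\le \sum_{d=0}^{b} (Km)^{d}
   = \frac{(Km)^{b+1}-1}{Km-1}
   \le 2(Km)^{b}, \\
\text{total time}
  &\le 2(Km)^{b}\cdot O(nmK)
   = O\!\left(n(Km)^{b+1}\right).
\end{align*}
```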

Lemma 7.2.

Algorithm 1 runs in time $O(n^{3}(Km)^{b+1})=2^{O(n^{\alpha}\log(Rn)\log m)}$ .

Proof.

The cost is dominated by the cost of Algorithm 2. The depth of recursive calls is at most $n$ and the width of the recursive call tree is at most $n$ , thus the total runtime is at most $n^{2}$ times the runtime of Algorithm 2. Thus the runtime is $O(n^{3}(Km)^{b+1})$ . ∎

7.2 Notation

To formally analyze the approximation guarantees of Algorithm 1, we need to generalize some earlier definitions. We say a decision tree $\mathcal{T}$ is complete with respect to hypothesis set $H$ up to depth $b$ if, for all hypotheses $h\in H$ , either there exists a leaf $v$ of $\mathcal{T}$ with $L(v)\cap H=\{h\}$ , or $d_{\mathcal{T}}(h)\geq b$ . Note that $\mathcal{T}$ is complete with respect to hypothesis set $H$ if it is complete up to depth $b$ for all $b$ .

Given the tests $\tau_{1},\dots,\tau_{m}:[n]\to[2]$ and a subset $H$ of hypotheses, let $\Phi_{H}$ be the $\textnormal{DT}(R)$ instance induced by $H$ . It is given by hypotheses $H$ and the tests $\tau_{1},\dots,\tau_{m}:H\to[2]$ restricted to domain $H$ . It is easy to check that, for this instance, we indeed have $\frac{p_{\max}^{\prime}}{p_{\min}^{\prime}}\leq R$ . We let $\mathcal{T}_{G}(H)$ denote a greedy tree for the instance $\Phi_{H}$ . We let $\mathcal{T}_{\textnormal{OPT}}(H)$ denote an optimal tree for instance $\Phi_{H}$ . We also define the optimal tree for $\Phi_{H}$ up to depth $b$ by

[TABLE]

Importantly, $\mathcal{T}_{\textnormal{OPT}}(H,b)$ is computable by a straightforward recursive algorithm, Algorithm 2, in time $O(n(Km)^{b+1})$ (Lemma 7.1). For convenience, let
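To make the recursion concrete, here is a minimal Python sketch of a depth-bounded exhaustive search. It is not the authors' Algorithm 2, which is not reproduced in this section. It assumes that the cost of a tree truncated at depth $b$ is the weighted sum of leaf depths, with every hypothesis that is not yet isolated charged depth $b$ , and that the instance is well defined (some test splits every set of two or more hypotheses). The names `partial_opt_cost`, `tests`, and `p` are hypothetical.

```python
from collections import defaultdict


def partial_opt_cost(H, b, tests, p):
    """Minimum of sum_h p[h] * depth(h) over trees on H truncated at depth b.

    tests: list of sequences, tests[j][h] is the answer of test j on hypothesis h.
    p:     sequence of hypothesis weights.
    Assumption (not from the paper): unresolved hypotheses are charged depth b.
    """
    H = frozenset(H)
    if len(H) <= 1 or b == 0:
        return 0.0  # nothing left to pay: one hypothesis, or depth budget spent
    weight = sum(p[h] for h in H)
    best = float("inf")
    for tau in tests:  # at most m choices of test
        parts = defaultdict(set)
        for h in H:
            parts[tau[h]].add(h)
        if len(parts) < 2:
            continue  # this test does not split H, so it is useless here
        # Every hypothesis in H pays one unit of depth for this test,
        # then each of the (at most K) children is solved with budget b - 1.
        cost = weight + sum(
            partial_opt_cost(S, b - 1, tests, p) for S in parts.values()
        )
        best = min(best, cost)
    return best


# Toy instance: four uniform hypotheses, two binary tests.
tests = [[0, 0, 1, 1], [0, 1, 0, 1]]
p = [0.25, 0.25, 0.25, 0.25]
print(partial_opt_cost(range(4), 2, tests, p))
```

Dividing the returned value by the total weight of `H` gives a weighted average depth. Each call tries at most $m$ tests with at most $K$ children each, which is the branching factor counted in Lemma 7.1.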

[TABLE]

In this way, we have

[TABLE]

7.3 Approximation guarantee

It is easy to see that PartialTree $(H,b)$ computes $\mathcal{T}_{\textnormal{OPT}}(H,b)$ (or one such tree, if there are several). Let $\mathcal{F}_{i}$ be the family of hypothesis sets $H$ such that FullTree ${}_{\alpha}(H)$ is called at the $i$ th level of recursion and the greedy tree $\mathcal{T}_{G}(H)$ is not returned. We consider FullTree ${}_{\alpha}([n])$ to be the 0th level of recursion so that $\mathcal{F}_{0}=\{[n]\}$ . Let $\mathcal{F}_{\text{greedy}}$ denote the family of hypothesis sets $H$ such that FullTree ${}_{\alpha}(H)$ is called and $\mathcal{T}_{G}(H)$ is returned in that call. Let $\mathcal{F}_{\text{greedy}}=\mathcal{F}_{\text{greedy}}^{(1)}\cup\mathcal{F}_{\text{greedy}}^{(2)}$ be a partition such that $H\in\mathcal{F}_{\text{greedy}}^{(2)}$ if and only if $p(H)<\frac{1}{n^{2}}$ . We know that, for any $H$ and $H^{\prime}$ , if FullTree ${}_{\alpha}(H^{\prime})$ is called recursively from FullTree ${}_{\alpha}(H)$ , then $H^{\prime}\subseteq H$ . Thus, under these definitions, we know that, for all $i$ , the hypothesis sets of $\mathcal{F}_{i}$ are pairwise disjoint, and the hypothesis sets of $\mathcal{F}_{\text{greedy}}$ are pairwise disjoint.

By a double-counting argument, the cost of the output tree is the weighted sum of the partial trees and greedy trees computed in the recursion. Formally, if $\mathcal{T}_{out}$ is the output tree,

[TABLE]

Our proof bounds the depth of the recursion, as well as the summand components.

Lemma 7.3.

Let $\{H_{i}\}$ be a collection of disjoint subsets of $[n]$ . Then $\sum_{i}p(H_{i})C_{\textnormal{OPT}}(H_{i})\leq C_{\textnormal{OPT}}$ .

Proof.

Recall that $\mathcal{T}_{\textnormal{OPT}}\stackrel{{\scriptstyle\rm def}}{{=}}\mathcal{T}_{\textnormal{OPT}}([n])$ is the optimal tree of the $\textnormal{DT}(R)$ instance. By optimality of $\mathcal{T}_{\textnormal{OPT}}(H_{i})$ for the instance induced by $H_{i}$ , we have $C_{\textnormal{OPT}}(H_{i})=C(\mathcal{T}_{\textnormal{OPT}}(H_{i});H_{i})\leq C(\mathcal{T}_{\textnormal{OPT}};H_{i})$ . Hence,

[TABLE]

Lemma 7.4.

We have

[TABLE]

Proof.

For all hypothesis sets $H\in\mathcal{F}_{\text{greedy}}^{(1)}$ , Algorithm 1 guarantees that $C_{G}(H)\geq(12\log n+\log R)n^{3\alpha/4}$ . By Theorem 3.1, the greedy algorithm gives a $(12\log n+\log R)$ approximation on the $\textnormal{DT}(R)$ instance induced by $H$ . Hence, $C_{\textnormal{OPT}}(H)\geq n^{3\alpha/4}$ for all $H\in\mathcal{F}_{\text{greedy}}^{(1)}$ . By Theorem 3.1 again, for all $H\in\mathcal{F}_{\text{greedy}}^{(1)}$ , we have

[TABLE]

Hence,

[TABLE]

The last inequality is by Lemma 7.3 and the fact that the sets $H\in\mathcal{F}_{\text{greedy}}$ are disjoint. Adding (48) and (49) gives the desired result. ∎

Lemma 7.5.

For all $i=0,1,\dots$ , we have

[TABLE]

Proof.

For any $H\in\mathcal{F}_{i}$ , we have $C_{\textnormal{OPT}}(H,b)\leq C_{\textnormal{OPT}}(H)$ . Summing over all $H\in\mathcal{F}_{i}$ , we have

[TABLE]

where the last inequality follows from Lemma 7.3 and that $H\in\mathcal{F}_{i}$ are disjoint. ∎

Lemma 7.6.

The maximum recursion depth in Algorithm 1 is at most $8/\alpha$ .

Proof.

We show by induction that, for all $i$ ,

[TABLE]

This suffices, as then, for $i=\lfloor 8/\alpha\rfloor+1$ , we have $\sum_{H\in\mathcal{F}_{i}}p(H)<\frac{1}{n^{2}}$ . By Algorithm 1, we have $p(H)\geq\frac{1}{n^{2}}$ for all $H\in\mathcal{F}_{i}$ (otherwise we take the greedy tree), so $\mathcal{F}_{i}$ must be empty. Thus, the maximum depth of the recursion in Algorithm 1 is at most $\lfloor 8/\alpha\rfloor$ .

Note that equality holds in (52) for $i=0$ , so the base case is true. For the inductive step, fix $i\geq 1$ , and let $H\in\mathcal{F}_{i}$ . Note that

[TABLE]

Consider a random variable $X$ equal to the depth in tree $\mathcal{T}_{\textnormal{OPT}}(H,b)$ of a random hypothesis in $H$ where $h\in H$ is chosen with probability proportional to $p_{h}$ . By above, $\mathbb{E}[X]\leq n^{-\alpha/4}b$ . Hence, by Markov’s inequality, $\Pr[X\geq b]\leq\frac{\mathbb{E}[X]}{b}\leq n^{-\alpha/4}$ . Thus, the total weight of hypotheses in $H$ that are in the next recursive call, i.e. in $H^{\prime}$ for some $H^{\prime}\in\mathcal{F}_{i+1}$ , is at most $n^{-\alpha/4}p(H)$ . This holds for any $H$ , so we conclude

[TABLE]

This completes the induction, proving the lemma. ∎
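The induction can be summarized in two lines. This is a sketch under the assumption that the inductive claim (52) is the bound $\sum_{H\in\mathcal{F}_{i}}p(H)\leq n^{-i\alpha/4}$ , which matches the base case (equality at $i=0$ ) and the per-level shrinkage of $n^{-\alpha/4}$ obtained from Markov's inequality above.

```latex
\begin{align*}
\sum_{H'\in\mathcal{F}_{i+1}} p(H')
  &\le n^{-\alpha/4}\sum_{H\in\mathcal{F}_{i}} p(H)
   \le n^{-\alpha/4}\cdot n^{-i\alpha/4}
   = n^{-(i+1)\alpha/4}, \\
i=\lfloor 8/\alpha\rfloor+1 > \frac{8}{\alpha}
  &\;\Longrightarrow\;
  n^{-i\alpha/4} < n^{-2}.
\end{align*}
```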

Lemma 7.7.

Let $\mathcal{T}_{\text{out}}$ be the tree returned by FullTree ${}_{\alpha}([n])$ . Then $C(\mathcal{T}_{\text{out}},[n])\leq(\frac{25}{\alpha}+\log R)C_{\textnormal{OPT}}$ .

Proof.

By (43) and Lemmas 7.4, 7.5, and 7.6, we have

[TABLE]

7.4 Uniform Decision Tree

We now describe how to modify Algorithm 1 to give a $\frac{9+\varepsilon}{\alpha}$ approximation for Uniform Decision Tree in subexponential time. By the remark at the end of Theorem 3.1, for all $\varepsilon>0$ there exists an $n_{0}=n_{0}(\varepsilon)$ such that for all $n\geq n_{0}$ , the greedy algorithm gives a $\frac{(4+\frac{2}{3}\varepsilon)\log n}{\log C_{\textnormal{OPT}}}$ approximation on Uniform Decision Tree. Hence, the following modified greedy algorithm runs in polynomial time and gives a $\frac{(4+\frac{2}{3}\varepsilon)\log n}{\log C_{\textnormal{OPT}}}$ approximation: for $n\geq n_{0}$ , run the greedy algorithm, and for $n<n_{0}$ , compute the optimal tree by brute force in constant time. For Uniform Decision Tree, set $b=\lceil{(4+\frac{2}{3}\varepsilon)\log(n)n^{\alpha}}\rceil$ , use the modified greedy algorithm instead of the greedy algorithm, and return the output of the modified greedy algorithm if $C(\mathcal{T}_{G}(H);H)\geq n^{-\alpha/3}b$ (rather than $n^{-\alpha/4}b$ ) and keep the rest of Algorithm 1 the same. Lemma 7.3 still holds. For uniform weights, we have $\mathcal{F}_{\text{greedy}}^{(2)}=\emptyset$ , so $\mathcal{F}_{\text{greedy}}^{(1)}=\mathcal{F}_{\text{greedy}}$ . Similar to Lemma 7.4, we are guaranteed that $C_{\textnormal{OPT}}(H)\geq n^{2\alpha/3}$ for all $H\in\mathcal{F}_{\text{greedy}}$ , and thus

[TABLE]

Lemma 7.5 still holds. In Lemma 7.6 the maximum depth of recursion is now $3/\alpha$ , as the weight of hypotheses at each recursive call shrinks by a factor of $n^{\alpha/3}$ and the weight of hypotheses at each nonempty level is at least $\frac{1}{n}$ . Hence, the cost of the output tree has a contribution of at most $\frac{6+\varepsilon}{\alpha}C_{\textnormal{OPT}}$ from the greedy trees and at most $\frac{3}{\alpha}C_{\textnormal{OPT}}$ from the outputs of PartialTree, for a total cost of at most $\frac{9+\varepsilon}{\alpha}C_{\textnormal{OPT}}$ .
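For reference, the constants in this subsection combine as follows. This is a sketch of the arithmetic only, using the bound $C_{\textnormal{OPT}}(H)\geq n^{2\alpha/3}$ for the greedy trees and the per-level shrinkage of $n^{-\alpha/3}$ for the recursion depth; the level index $i$ is introduced here for illustration.

```latex
\begin{align*}
\frac{(4+\frac{2}{3}\varepsilon)\log n}{\log C_{\textnormal{OPT}}(H)}
  &\le \frac{(4+\frac{2}{3}\varepsilon)\log n}{\frac{2\alpha}{3}\log n}
   = \frac{6+\varepsilon}{\alpha}, \\
\frac{1}{n} \le n^{-i\alpha/3}
  &\;\Longrightarrow\; i \le \frac{3}{\alpha}, \\
\frac{6+\varepsilon}{\alpha} + \frac{3}{\alpha}
  &= \frac{9+\varepsilon}{\alpha}.
\end{align*}
```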

8 Related Work

Several other works analyze Decision Tree in a variety of settings, aiming for the gold-standard $O(\log n)$ approximation ratio. While we examined the case with $K$ -ary tests and non-uniform weights, we assumed that the tests had equal costs. Other works [GB09, GNR10] analyze the case where the test costs are non-uniform. [GB09] shows that the greedy algorithm yields an $O(\log n)$ approximation when either the costs are non-uniform or the weights are non-uniform (with the rounding trick) but not both. [GNR10] introduces a new algorithm that achieves an $O(\log n)$ approximation with both non-uniform weights and costs.

In this work we studied the average depth of decision trees. We remark that, in the worst-case decision tree problem, where the cost of a tree is defined to be the maximum depth of a leaf in the tree, the approximability is known. The greedy algorithm gives an $O(\log n)$ approximation [AMM+98], and obtaining an $o(\log n)$ approximation is NP-hard [LN04].

For the worst-case decision tree problem, there is a line of work that examines the absolute query rate rather than the query rate relative to the optimal. In this line of work, the chief goal is to identify conditions where the greedy algorithm achieves the information-theoretically optimal rate $O(\log n)$ . One such condition that ensures the $O(\log n)$ rate is “sample-rich” [NJC12], which states that every binary partition of the hypotheses has a test with matching pre-images. [Now09, Now11] introduced the more lenient $k$ -neighborly condition, which requires that every two tests be connected by a sequence of tests where neighboring tests disagree on at most $k$ hypotheses. An even more general condition is the split-neighborly condition [ML18], which is satisfied if every two tests are connected by a sequence of tests where neighboring tests must have every subset of the disagreeing hypotheses be evenly split by some other test.

9 Conclusion

There are several open questions left by our work.

1. Could one prove hardness of approximation results for Uniform Decision Tree for ratios larger than $4-\varepsilon$ ? It would be interesting to prove either NP-hardness results for larger constant factor approximations, or fine-grained complexity results for larger approximation ratios such as in [MR17].

2. On the flip side, could one find faster, perhaps polynomial time algorithms for approximating Uniform Decision Tree for ratios where we now have subexponential time algorithms?

3. One can also consider a generalization of Decision Tree when the test costs are non-uniform [GB09, GNR10]. Could one obtain similar results in this setting?

10 Acknowledgements

The authors thank Joshua Brakensiek for helpful discussions and feedback on an earlier draft of this paper. The authors thank Mary Wootters for helpful feedback on an earlier draft of this paper. The authors thank anonymous reviewers for helpful feedback on an earlier draft of this paper.

Appendix A Proof of Theorem 3.1

We now give a proof of Theorem 3.1, highlighting the differences with the proof of the special case in Section 6, and suppressing parts of the proof that are identical.

A.1 Notation

We reuse all of the notation in Section 6.1. The only difference is that, in this section, $p(S)\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{h\in S}p_{h}$ is not necessarily equal to $\frac{|S|}{n}$ . Just as in Section 6, fix $\delta=\frac{1}{C_{\textnormal{OPT}}}$ .

A.2 The basic argument

Lemma 6.2 is still true, and we restate it for completeness.

Lemma A.1 (Lemma 6.2, restated).

We have $C_{G}=\sum_{v\in\mathcal{T}_{G}^{\mathrm{o}}}p(v)$ .

At a high level, our proof defines balanced and imbalanced vertices (next subsection) using the parameter $\delta$ and bounds the weight of the balanced and imbalanced vertices separately. We bound the weight of the balanced vertices by an entropy argument, and the weight of the imbalanced vertices by partitioning the imbalanced vertices into paths, called chains, and bounding the weights of each chain separately. If a vertex $v$ has a heavy hypothesis $h$ (defined in Section A.6.1), we set $q(v)=p_{h}$ , and otherwise we set $q(v)=0$ . To bound the cost of imbalanced vertices, we also need to bound the costs of heavy hypotheses $q(v)$ . Overall, we get the following bounds.

[TABLE]

A.3 More notation: Majority and minority answers

Again, we define majority (minority) answers, edges, children, which are useful for defining balanced and imbalanced vertices.

For each vertex $v$ in the greedy tree, let $j_{v}$ denote the test used at $v$ . For each vertex $v$ , label its children by $v^{+}$ and $v^{1,-},v^{2,-},\dots,$ so that $p(v^{+})\geq p(v^{\ell,-})$ for all $\ell$ , with ties broken by labeling $v^{+}$ by the vertex corresponding to the largest answer. (Footnote 9: Any tiebreaking procedure suffices, as long as the tiebreaking is consistent with the $k_{j,S}^{+}$ and $k_{j,S}^{-}$ notation in the next paragraph.) (Footnote 10: It is possible to have a vertex that has one child, namely a test that doesn’t distinguish any pairs of hypotheses at a vertex, but such a test is useless and never appears in either the greedy or optimal tree, so we assume it doesn’t exist.) Call the edge from $v$ to $v^{+}$ a majority edge, and the edges from $v$ to $v^{\ell,-}$ minority edges. (Footnote 11: Here, we may have $p(v^{+})<\frac{1}{2}p(v)$ , so the weight of hypotheses consistent with $v^{+}$ does not necessarily constitute a majority. However, this difference does not affect the proof, and we keep the wording to stay consistent with Section 6.) Call $v^{+}$ the majority child of $v$ and call $v^{1,-},v^{2,-},\dots,$ the minority children of $v$ . Let $L^{-}(v)\stackrel{{\scriptstyle\rm def}}{{=}}\cup_{\ell}L(v^{\ell,-})$ be the hypotheses consistent with the minority children of $v$ , and let their weight be $p^{-}(v)=p(L^{-}(v))$ . Accordingly, we have $p(v)=p(v^{+})+p^{-}(v)$ for all $v\in\mathcal{T}_{G}^{\mathrm{o}}$ . This is illustrated in Figure 4.

In order to reason about the greedy tree precisely, we use the following notation which is more technical. For test $j\in[m]$ and hypotheses $S\subseteq[n]$ , let $k^{+}_{j,S}\in[K]$ be the answer to test $\tau_{j}$ that accounts for the maximum weight of hypotheses in $S$ , with ties broken by choosing the largest indexed answer $k^{+}_{j,S}$ . We call $k^{+}_{j,S}$ the majority answer of test $j$ with respect to hypothesis set $S$ . Call the other answers the minority answers of test $j$ with respect to hypothesis set $S$ . For all $j\in[m]$ and $S\subset[n]$ , let

[TABLE]

We think of $I_{j,S}^{+}$ ( $I_{j,S}^{-}$ ) as the set of hypotheses that, under test $j$ , output the majority (minority) answer to test $j$ with respect to set $S$ . Note that, with the above notation, we have $L(v^{+})=I_{j_{v},L(v)}^{+}\cap L(v)$ and $L^{-}(v)=I_{j_{v},L(v)}^{-}\cap L(v)$ . Under these more general definitions, a generalization of Lemma 6.3 holds. The proof is identical to that of Lemma 6.3, so we omit it.

Lemma A.2.

For any vertices $u$ and $v$ with $u$ a descendant of $v$ , we have $p^{-}(v)\geq p^{-}(u)$ .

A.4 Defining balanced and imbalanced vertices

In the following definition, we identify balanced vertices and imbalanced vertices. By Lemma A.1, we can separately bound the weights of the balanced and imbalanced vertices.

Definition A.3.

Let $s$ be a positive integer.

1. We say a vertex $v\in\mathcal{T}_{G}^{\mathrm{o}}$ is level- $s$ imbalanced if $p^{-}(v)\leq\delta^{s}$ and $p(v)>2\delta^{s}$ .

2. We say a vertex $v$ is imbalanced if it is level- $s$ imbalanced for some $s$ , and balanced otherwise.

3. We say a level- $s$ imbalanced vertex $v$ is minimal if no descendant of $v$ is also a level- $s$ imbalanced vertex, and a level- $s$ imbalanced vertex $v$ is maximal if no ancestor of $v$ is level- $s$ imbalanced.
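As a small illustration of Definition A.3, the following Python sketch lists the levels at which a vertex is imbalanced, given only its weight $p(v)$ , its minority weight $p^{-}(v)$ , and $\delta$ . The function names and the search cap `s_cap` are hypothetical and are not the $s_{\max}$ of the text.

```python
def imbalance_levels(p_v, p_minus_v, delta, s_cap=64):
    """Return every s in 1..s_cap with p^-(v) <= delta^s and p(v) > 2 * delta^s."""
    return [
        s
        for s in range(1, s_cap + 1)
        if p_minus_v <= delta ** s and p_v > 2 * delta ** s
    ]


def is_balanced(p_v, p_minus_v, delta, s_cap=64):
    """A vertex is balanced when it is level-s imbalanced for no s (up to s_cap)."""
    return not imbalance_levels(p_v, p_minus_v, delta, s_cap)


# With delta = 1/4, a vertex of weight 0.6 and minority weight 0.05
# satisfies the definition at more than one level.
print(imbalance_levels(0.6, 0.05, 0.25))
print(is_balanced(0.6, 0.3, 0.25))
```

The first call shows that a single vertex can be level- $s$ imbalanced for several values of $s$ , which is why chains of different levels may overlap.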

Let

[TABLE]

and note that interior level- $s$ imbalanced vertices exist only for $s\leq s_{\max}$ . The following lemma proves a structural result about imbalanced vertices, with the punchline being item (iii), which permits Definition A.5. The proof of Lemma A.4 is nearly identical to that of Lemma 6.5. We include a proof of item (i) because of a subtle difference to the proof of item (i) of Lemma 6.5. However, the proofs of the other two parts are identical, so we omit them.

Lemma A.4.

Let $s$ be a positive integer.

(i)

If $v$ is a level- $s$ imbalanced vertex, then, among the children of $v$ , only $v^{+}$ can be a level- $s$ imbalanced vertex.

(ii)

Additionally, if $v$ and $u$ are level- $s$ imbalanced vertices and $v$ is an ancestor of $u$ , then every vertex on the path from $v$ to $u$ is a level- $s$ imbalanced vertex.

(iii)

Finally, the set of level- $s$ imbalanced vertices can be partitioned into vertex disjoint paths, each of which connects a maximal level- $s$ imbalanced vertex to a minimal level- $s$ imbalanced vertex and contains only majority edges.

Proof of (i).

Note that if $v$ is level- $s$ imbalanced, then $p^{-}(v)\leq\delta^{s}$ , which means every $u$ different from $v^{+}$ satisfies $p(u)\leq p^{-}(v)\leq\delta^{s}<2\delta^{s}$ , so such $u$ cannot be level- $s$ imbalanced. Hence, among the children of $v$ , only $v^{+}$ can be level- $s$ imbalanced. ∎

Lemma A.4 motivates the following definition.

Definition A.5.

Let $s$ be a positive integer. A level- $s$ chain, $P=(P_{1},\dots,P_{|P|})$ , is a sequence of level- $s$ imbalanced vertices starting at a maximal level- $s$ imbalanced vertex and ending at a minimal level- $s$ imbalanced vertex. By Lemma A.4, the level- $s$ chains partition the level- $s$ imbalanced vertices. We therefore let $\mathcal{P}_{s}$ denote the level- $s$ chains.

In general, for $s\neq s^{\prime}$ , a level- $s$ chain might overlap with a level- $s^{\prime}$ chain.

A.5 Bounding the weight of balanced vertices

Under these definitions, a generalization of Lemma 6.7 is still true. The proof is identical to that of Lemma 6.7, so we omit it.

Lemma A.6.

For every balanced vertex $v$ , we have $p^{-}(v)\geq\frac{\delta}{2}p(v)$ .

We now bound the contribution of the balanced vertices to the weight using an entropy argument. Now, in the general case, the entropy argument requires a little more care when bounding the entropy of a single $K$ -ary test.

Lemma A.7.

We have

[TABLE]

Proof.

For a vertex $v$ with a test of index $j$ , let $X_{v}$ denote the random variable supported on $[K]$ that is equal to $\tau_{j}(h)$ for an hypothesis $h$ chosen randomly from the elements of $L(v)$ , where the probability of choosing $h$ is proportional to $p_{h}$ . Let $\mathbb{H}(\cdot)$ denote the entropy of a random variable, and by abuse of notation, let $\mathbb{H}(v)\stackrel{{\scriptstyle\rm def}}{{=}}\mathbb{H}(X_{v})$ . By abuse of notation, for nonnegative $\alpha_{1},\dots,\alpha_{K}$ summing to 1, let $\mathbb{H}(\alpha_{1},\dots,\alpha_{K})=\sum_{i=1}^{K}-\alpha_{i}\log\alpha_{i}$ where $0\log 0$ is taken to be 0. The entropy of a random element of $[n]$ chosen according to the prior distribution $\textbf{p}=(p_{1},\dots,p_{n})$ is at most $\log n$ . On the other hand, we can pick a random hypothesis in $[n]$ according to the distribution $\textbf{p}$ by setting $v$ to the root of $\mathcal{T}_{G}$ , sampling an answer $X_{v}\in[K]$ for the test at $v$ , setting $v$ to the child of $v$ corresponding to the chosen answer $X_{v}$ , and repeating, until we reach a leaf. In this process, at any vertex $v$ , the probability of stepping to a child $u$ is exactly $\frac{p(u)}{p(v)}$ . Hence, by a simple induction, the probability of reaching any vertex $v$ in the tree during this process is exactly $p(v)$ . The total entropy of this process is thus $\sum_{v\in\mathcal{T}_{G}}p(v)\cdot\mathbb{H}(v)$ , as $p(v)$ is the probability of reaching vertex $v$ and $\mathbb{H}(v)$ is the entropy of the random variable $X_{v}$ chosen at vertex $v$ .
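The identity used in this paragraph, that the entropy of the prior equals $\sum_{v}p(v)\cdot\mathbb{H}(v)$ over the vertices of a complete tree, can be checked numerically on a toy example. The sketch below is illustrative only: it represents a tree as nested Python lists whose integer leaves are hypothesis indices (an assumed encoding, not the paper's), and it uses base-2 logarithms.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, with 0 log 0 taken to be 0."""
    return -sum(q * math.log2(q) for q in probs if q > 0)


def weight(tree, p):
    """p(v): total weight of the hypotheses at the leaves below this vertex."""
    if isinstance(tree, int):
        return p[tree]
    return sum(weight(child, p) for child in tree)


def process_entropy(tree, p):
    """sum over interior vertices v of p(v) * H(v) for the top-down sampling process."""
    if isinstance(tree, int):
        return 0.0  # leaves make no random choice
    p_v = weight(tree, p)
    child_probs = [weight(child, p) / p_v for child in tree]
    return p_v * entropy(child_probs) + sum(
        process_entropy(child, p) for child in tree
    )


p = [0.5, 0.25, 0.125, 0.125]
tree = [[0, 1], [2, 3]]  # root splits {0,1} from {2,3}; each child isolates its leaves
print(entropy(p), process_entropy(tree, p))
```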

Fix a balanced vertex $v$ . We claim that $\mathbb{H}(v)\geq\frac{\delta}{2}\log\frac{2}{\delta}$ . Let $\mathcal{R}\subset\mathbb{R}^{K}$ denote the region given by the constraints $0\leq\alpha_{k}\leq\alpha_{1}\leq 1$ for all $k=2,\dots,K$ , and $\alpha_{1}+\cdots+\alpha_{K}=1$ , and $\alpha_{2}+\cdots+\alpha_{K}\geq\delta/2$ . We claim that the minimum of $\mathbb{H}(\alpha_{1},\dots,\alpha_{K})$ for $(\alpha_{1},\dots,\alpha_{K})\in\mathcal{R}$ is $\mathbb{H}(1-\delta/2,\delta/2)$ . To see this, note first that the region $\mathcal{R}$ is closed and bounded, so the function $\mathbb{H}(\alpha_{1},\dots,\alpha_{K})$ attains a minimum. Furthermore, note that, for $(\alpha_{1},\dots,\alpha_{K})\in\mathcal{R}$ , by concavity of $-x\log x$ , for any $k=2,\dots,K$ and any positive $\varepsilon\leq\alpha_{k}$ , setting $(\alpha_{1}^{\prime},\dots,\alpha_{K}^{\prime})=(\alpha_{1}+\varepsilon,\alpha_{2},\dots,\alpha_{k-1},\alpha_{k}-\varepsilon,\alpha_{k+1},\dots,\alpha_{K})$ gives $\mathbb{H}(\alpha_{1},\dots,\alpha_{K})>\mathbb{H}(\alpha_{1}^{\prime},\dots,\alpha_{K}^{\prime})$ . Similarly, pushing $\alpha_{k}$ and $\alpha_{k^{\prime}}$ apart by the same positive $\varepsilon$ also decreases the value of $\mathbb{H}$ . Hence, the minimum cannot be attained when two of $\alpha_{2},\dots,\alpha_{K}$ are positive, nor can it be attained when $\alpha_{2}+\cdots+\alpha_{K}>\delta/2$ . It follows that the only local minima in the region occur when some $\alpha_{k}$ is $\frac{\delta}{2}$ and $\alpha_{1}=1-\frac{\delta}{2}$ .

For $k=1,\dots,K$ , let $\alpha_{k}$ denote the probability that $X_{v}=k$ . By Lemma A.6, when $v$ is balanced, $(\alpha_{1},\dots,\alpha_{K})$ must, up to a permutation in coordinates, be in region $\mathcal{R}$ . Hence, by the above we have

[TABLE]

Putting the above two paragraphs together, we conclude

[TABLE]

and rearranging gives the desired result. ∎
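The per-vertex entropy bound from this proof can be sanity-checked numerically. The sketch below tests whether a point lies in the region $\mathcal{R}$ and compares its entropy against $\mathbb{H}(1-\delta/2,\delta/2)$ . It is a numerical illustration rather than a proof; the function names, the tolerance `tol`, and the choice of base-2 logarithms are assumptions made for the example.

```python
import math


def entropy(alphas):
    """Shannon entropy in bits, with 0 log 0 taken to be 0."""
    return -sum(a * math.log2(a) for a in alphas if a > 0)


def in_region(alphas, delta, tol=1e-9):
    """Constraints of R: alpha_1 is the largest coordinate, the coordinates
    sum to 1, and alpha_2 + ... + alpha_K >= delta / 2."""
    first, rest = alphas[0], alphas[1:]
    return (
        all(-tol <= a <= first + tol for a in rest)
        and first <= 1 + tol
        and abs(sum(alphas) - 1) <= tol
        and sum(rest) >= delta / 2 - tol
    )


def entropy_floor(delta):
    """Claimed minimum of the entropy over R: H(1 - delta/2, delta/2)."""
    return entropy([1 - delta / 2, delta / 2])


delta = 0.2
for point in [(0.9, 0.1, 0.0), (0.7, 0.2, 0.1), (0.4, 0.3, 0.3)]:
    assert in_region(point, delta)
    print(point, entropy(point) >= entropy_floor(delta) - 1e-12)
```

Since $\mathbb{H}(1-x,x)\geq x\log\frac{1}{x}$ for $x\in(0,1)$ , the floor is at least $\frac{\delta}{2}\log\frac{2}{\delta}$ , which is the form used in the claim.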

A.6 Bounding the weight of imbalanced vertices

We now bound the weight of imbalanced vertices using a connection to Weighted Min Sum Set Cover. For each hypothesis $h$ , let $u_{h}^{\bot}$ denote the leaf in the greedy tree $\mathcal{T}_{G}$ with which hypothesis $h$ is consistent. Since $\mathcal{T}_{G}$ is complete, this leaf exists and is unique.

A.6.1 Technical definition: heavy vertices

We need the following technical definition to make the connection between the greedy decision tree and a greedy WMSSC solution.

Definition A.8.

For a vertex $v$ and an hypothesis $h\in[n]$ , we say $v$ is $h$ -heavy if $h$ is consistent with $v$ and $p_{h}>p^{-}(v)$ .

Lemma A.9.

Let $h\in[n]$ be a hypothesis.

(i)

If $v$ is $h$ -heavy, then every vertex on the path from $v$ to leaf $u_{h}^{\bot}$ is $h$ -heavy.

(ii)

Additionally, if $v$ is $h$ -heavy, then every edge on the path from $v$ to leaf $u_{h}^{\bot}$ is a majority edge.

(iii)

Lastly, for any vertex $v$ , there exists at most one hypothesis $h$ such that $v$ is $h$ -heavy.

Proof.

Item (i) is true by Lemma A.2, which says that $p^{-}(v)$ decreases as one descends the tree.

For (ii), it suffices to prove, by the first part, that for every $h$ -heavy vertex $v$ , the first edge on the path from $v$ to $u_{h}^{\bot}$ is a majority edge. Suppose for contradiction that there exists $h$ and an $h$ -heavy vertex $v$ with a minority child $u$ such that $h$ is consistent with $u$ . Then $p^{-}(v)\geq p(u)\geq p_{h}$ , which contradicts the definition of $v$ being $h$ -heavy.

For (iii), suppose for contradiction there exist two hypotheses $h$ and $h^{\prime}$ such that $v$ is both $h$ -heavy and $h^{\prime}$ -heavy. Since our Decision Tree instance is well defined, there exists some test $j$ that distinguishes $h$ and $h^{\prime}$ , i.e. $\tau_{j}(h)\neq\tau_{j}(h^{\prime})$ . As $v$ is $h$ -heavy, we have $p_{h}>p^{-}(v)$ . Hence the answer for hypothesis $h$ under test $j$ is $k^{+}_{j,L(v)}$ , the answer to $\tau_{j}$ accounting for the maximum weight of hypotheses in $L(v)$ : if not, choosing test $j$ at vertex $v$ would make the weight of hypotheses consistent with the minority children of $v$ at least $p_{h}$ , which is a contradiction, as $p_{h}>p^{-}(v)$ and the tree $\mathcal{T}_{G}$ is greedy. However, as $v$ is also $h^{\prime}$ -heavy, we have, by the same reasoning, that $\tau_{j}(h^{\prime})=k^{+}_{j,L(v)}$ . This is a contradiction, as test $j$ was chosen to distinguish $h$ and $h^{\prime}$ . ∎

We now define some notation for dealing with non-uniform weights $p_{h}$ , which is well defined by Lemma A.9.

Definition A.10.

For hypothesis $h\in[n]$ , let $u_{h}^{\top}$ be the maximal ancestor of $h$ that is $h$ -heavy. For vertex $v$ , if there exists an $h$ such that $v$ is $h$ -heavy, let $q(v)=p_{h}$ , and otherwise let $q(v)=0$ .

A.6.2 Defining Weighted Min Sum Set Cover

Recall $I_{j,S}^{+}=\tau_{j}^{-1}(k_{j,S}^{+})$ and $I_{j,S}^{-}=[n]\setminus I_{j,S}^{+}$ .
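As a concrete reading of this notation, the following Python sketch computes the majority answer $k^{+}_{j,S}$ (ties broken toward the largest answer index) together with $I_{j,S}^{+}$ and $I_{j,S}^{-}$ . Representing a test as a dictionary from hypotheses to answers is an assumption made for the example, and the function names are hypothetical.

```python
def majority_answer(tau_j, S, p):
    """Answer of test tau_j carrying the most weight within S.

    tau_j: dict mapping every hypothesis in [n] to its answer in [K].
    S:     non-empty iterable of hypotheses.
    p:     dict mapping hypotheses to weights.
    Ties are broken by choosing the largest answer index.
    """
    totals = {}
    for h in S:
        totals[tau_j[h]] = totals.get(tau_j[h], 0.0) + p[h]
    return max(totals, key=lambda k: (totals[k], k))


def split_by_majority(tau_j, S, p):
    """Return (k_plus, I_plus, I_minus), where I_plus is the preimage of the
    majority answer over all of [n] and I_minus is its complement."""
    k_plus = majority_answer(tau_j, S, p)
    I_plus = {h for h in tau_j if tau_j[h] == k_plus}
    I_minus = set(tau_j) - I_plus
    return k_plus, I_plus, I_minus


tau = {1: 1, 2: 2, 3: 2, 4: 1}
p = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
S = {1, 2}
k_plus, I_plus, I_minus = split_by_majority(tau, S, p)
print(k_plus, I_plus, I_minus & S)  # the last value plays the role of A_j
```

Note that $I_{j,S}^{+}$ and $I_{j,S}^{-}$ are subsets of $[n]$ , not of $S$ ; intersecting with $S$ , as in the last line, gives the sets used in the set cover instance.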

Definition A.11.

Let $\textnormal{WMSSC}^{(P)}$ denote the instance of weighted min sum set cover that is induced by the chain $P=(P_{1},\dots,P_{|P|})$ . This instance is given by

•

universe $S\stackrel{{\scriptstyle\rm def}}{{=}}L(P_{1})$ with weights $(p_{h})_{h\in S}$ ,

•

for $j=1,\dots,m$ , sets $A_{j}\stackrel{{\scriptstyle\rm def}}{{=}}I_{j,S}^{-}\cap S$ , and

•

for each $h=1,\dots,n$ , a set $A_{m+h}\stackrel{{\scriptstyle\rm def}}{{=}}\{h\}\cap S$ consisting of at most one element. (Footnote 12: Some of these sets are empty, but we include them for notational convenience.)

Note we have a total of $m+n$ sets. A solution to the WMSSC problem is a permutation $\sigma:[m+n]\to[m+n]$ corresponding to an ordering of the sets, and the cost $\textnormal{WMSSC}^{(P)}(\sigma)$ of a solution is the weighted sum of the cover times of the elements in the universe $S$ . Formally,

[TABLE]

Note that this instance is well defined, as each hypothesis $h\in S$ is in some set $A_{j}$ . We sometimes refer to a solution $\sigma$ by the sets $A_{\sigma(1)},\dots,A_{\sigma(m+n)}$ .

Remark A.12.

Since the initial Decision Tree instance is well defined, any two elements can be distinguished by one of the $m$ tests. Hence, there is at most one element $h\in S$ such that, for all $j=1,\dots,m$ , we have $h\notin A_{j}$ . In other words, all but one of the sets $A_{m+h}$ for $h\in[n]$ are unnecessary.

Definition A.13.

We say a solution $\sigma:[m+n]\to[m+n]$ to $\textnormal{WMSSC}^{(P)}$ is greedy at index $\ell$ if the set $A_{\sigma(\ell)}$ covers the maximum weight of elements not covered by sets $A_{\sigma(1)},\dots,A_{\sigma(\ell-1)}$ . We say a solution $\sigma:[m+n]\to[m+n]$ is greedy if it is greedy at index $\ell$ for all $\ell\in[m+n]$ .

Note that, in the case of ties, there may be multiple greedy solutions to $\textnormal{WMSSC}^{(P)}$ . Note also that, for any partial assignment $\sigma(1),\dots,\sigma(\ell)$ , one can always complete the solution greedily, so that $\sigma:[m+n]\to[m+n]$ is greedy at indices $\ell+1,\ell+2,\dots,m+n$ . Definition A.13 lets us leverage the following theorem, due to Golovin and Krause, which generalizes Theorem 6.12. (Footnote 13: In fact, [GK11] considers an even more general problem called Adaptive Stochastic Min-Sum Cover.)

Theorem A.14 (Theorem 5.10 of [GK11]).

The greedy algorithm gives a 4-approximation to the WMSSC problem. Formally, let $\sigma$ be any greedy solution to $\textnormal{WMSSC}^{(P)}$ , and let $\sigma_{\textnormal{OPT}}$ denote an optimal solution. We have

[TABLE]
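To illustrate the objects in Definition A.13 and Theorem A.14, the following Python sketch computes the weighted cover-time cost of an ordering, builds one greedy ordering, and compares it with a brute-force optimum on a toy instance. It assumes the cost is $\sum_{h\in S}p_{h}\cdot\min\{\ell:h\in A_{\sigma(\ell)}\}$ , as described in the text, and it breaks ties toward the smallest set index, which is one arbitrary choice among the possible greedy solutions. All function names are hypothetical.

```python
from itertools import permutations


def wmssc_cost(order, sets, weights):
    """Weighted sum of cover times: element h pays weights[h] times the
    position of the first set in the ordering that contains it."""
    cost, covered = 0.0, set()
    for position, index in enumerate(order, start=1):
        newly_covered = sets[index] - covered
        cost += position * sum(weights[h] for h in newly_covered)
        covered |= newly_covered
    if set(weights) - covered:
        raise ValueError("ordering does not cover the universe")
    return cost


def greedy_order(sets, weights):
    """At each step pick an unused set covering the most uncovered weight."""
    uncovered = set(weights)
    unused = sorted(sets)
    order = []
    while unused:
        best = max(
            unused, key=lambda i: sum(weights[h] for h in sets[i] & uncovered)
        )
        order.append(best)
        unused.remove(best)
        uncovered -= sets[best]
    return order


def brute_force_optimum(sets, weights):
    """Exact optimum by trying every ordering; only viable for tiny instances."""
    return min(wmssc_cost(o, sets, weights) for o in permutations(sorted(sets)))


weights = {1: 0.5, 2: 0.3, 3: 0.2}
sets = {1: {1, 2}, 2: {3}, 3: {1}, 4: {2, 3}}
order = greedy_order(sets, weights)
print(order, wmssc_cost(order, sets, weights), brute_force_optimum(sets, weights))
```

On this instance the greedy ordering happens to be optimal; the theorem only promises a factor of 4 in general.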

A.6.3 Bounding chain weight above by WMSSC cost

Lemma A.15.

Let $s$ be a positive integer and let $P=(P_{1},\dots,P_{|P|})$ be a level- $s$ chain. Then there exists a greedy solution $\sigma:[m+n]\to[m+n]$ to $\textnormal{WMSSC}^{(P)}$ , such that

[TABLE]

Proof.

Let $S$ be the universe of the instance $\textnormal{WMSSC}^{(P)}$ , and let $A_{1},\dots,A_{m+n}$ be the sets. For $\ell=1,\dots,|P|$ , let $j_{\ell}$ be the test used at vertex $P_{\ell}$ . Let $\ell_{0}\leq|P|$ be the largest index such that $P_{\ell_{0}}$ is not $h$ -heavy for any $h$ , or 0 if no such index exists. If $\ell_{0}<|P|$ , let $h_{0}$ be the hypothesis such that $P_{\ell_{0}+1}$ is $h_{0}$ -heavy. By Lemma A.4, all the edges along the path $P$ are majority edges. By Lemma A.9, for all $\ell_{0}<\ell\leq|P|$ , vertex $P_{\ell}$ is $h_{0}$ -heavy. Define a solution $\sigma:[m+n]\to[m+n]$ to $\textnormal{WMSSC}^{(P)}$ as follows.

•

If $\ell_{0}=|P|$ , for $\ell=1,\dots,|P|$ , let $\sigma(\ell)=j_{\ell}$ and complete the solution $\sigma$ greedily.

•

Otherwise, for $1\leq\ell\leq\ell_{0}$ , let $\sigma(\ell)=j_{\ell}$ , let $\sigma(\ell_{0}+1)=m+h_{0}$ , let $\sigma(\ell+1)=j_{\ell}$ for $\ell_{0}<\ell\leq|P|$ , and complete the solution $\sigma$ greedily.
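The two cases of this construction amount to splicing the singleton set for $h_{0}$ into the sequence of chain tests immediately after position $\ell_{0}$ . A minimal Python sketch of the resulting prefix of $\sigma$ follows; it does not perform the final greedy completion, and the function and argument names are hypothetical.

```python
def chain_prefix(chain_tests, l0, h0, m):
    """First entries of sigma induced by a chain.

    chain_tests: [j_1, ..., j_|P|], the tests used along the chain.
    l0:          largest index whose vertex is not heavy (0 if none).
    h0:          the heavy hypothesis when l0 < |P|, else None.
    m:           number of tests, so the singleton {h0} has index m + h0.
    """
    if l0 == len(chain_tests):
        return list(chain_tests)  # no heavy vertex on the chain
    if h0 is None:
        raise ValueError("h0 is required when l0 < |P|")
    # sigma(l) = j_l for l <= l0, sigma(l0 + 1) = m + h0,
    # and sigma(l + 1) = j_l for l0 < l <= |P|.
    return list(chain_tests[:l0]) + [m + h0] + list(chain_tests[l0:])


print(chain_prefix([5, 2, 7], l0=1, h0=3, m=10))
```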

We claim $\sigma$ is a greedy solution. To prove this, we show the following.

(i)

For all $j\in[m]$ and $\ell=1,\dots,|P|$ , the majority answer for test $j$ with respect to vertex $P_{1}$ is the same as the majority answer for test $j$ with respect to vertex $P_{\ell}$ . Equivalently, for all $j\in[m]$ and $\ell=1,\dots,|P|$ , we have $A_{j}=I_{j,S}^{-}\cap S=I_{j,L(P_{\ell})}^{-}\cap S$ . As an immediate consequence, we know $A_{j_{\ell}}$ contains all the hypotheses in $L^{-}(P_{\ell})$ and none of the hypotheses in $L(P_{\ell}^{+})$ .

(ii)

The set of hypotheses of $S$ not covered by $A_{j_{1}},\dots,A_{j_{\ell-1}}$ is exactly $L(P_{\ell})$ .

(iii)

For each $1\leq\ell\leq|P|$ , among sets $A_{1},\dots,A_{m}$ , set $A_{j_{\ell}}$ covers the maximum weight of hypotheses in $L(P_{\ell})$ , i.e. we have

[TABLE]

(iv)

For each $1\leq\ell\leq\ell_{0}$ , among sets $A_{1},\dots,A_{m+n}$ , set $A_{j_{\ell}}$ covers the maximum weight of hypotheses in $L(P_{\ell})$ .

(v)

If $\ell_{0}<|P|$ , then, among sets $A_{1},\dots,A_{m+n}$ , set $A_{m+h_{0}}=\{h_{0}\}$ covers the maximum weight of hypotheses in $L(P_{\ell_{0}+1})$ .

(vi)

If $\ell_{0}<|P|$ , then, for $\ell_{0}<\ell\leq|P|$ , among sets $A_{1},\dots,A_{m+n}$ , set $A_{j_{\ell}}$ covers the maximum weight of hypotheses in $L(P_{\ell})\setminus\{h_{0}\}$ .

These points suffice for proving that $\sigma$ is greedy. If $\ell_{0}=|P|$ , items (ii) and (iv) tell us that $\sigma$ is greedy at indices $1,\dots,|P|$ , so by construction $\sigma$ is greedy. If $\ell_{0}<|P|$ , then (iv), (v), and (vi) tell us that $\sigma$ is greedy at indices $1,\dots,|P|+1$ , so $\sigma$ is greedy.

To show (i), fix $j\in[m]$ and $\ell\in\{1,\dots,|P|\}$ . As $P_{\ell}$ is level- $s$ imbalanced, we also have $p(P_{\ell})>2\delta^{s}$ and $p^{-}(P_{\ell})\leq\delta^{s}$ and $p(P_{\ell}^{+})>\delta^{s}$ , so $k_{j,L(P_{\ell})}^{+}$ is the unique answer in $[K]$ accounting for more than half of the weight of hypotheses in $L(P_{\ell})$ . On the other hand, as vertex $P_{1}$ is level- $s$ imbalanced, we have $p(I_{j,S}^{-}\cap L(P_{\ell}))\leq p(I_{j,S}^{-}\cap S)\leq p^{-}(P_{1})\leq\delta^{s}$ , so the majority answer $k_{j,S}^{+}$ for test $j$ with respect to hypothesis set $S$ is exactly the answer described in the previous sentence. Hence $k_{j,S}^{+}=k_{j,L(P_{\ell})}^{+}$ .

Item (ii) follows because $L(P_{\ell})$ is the set of hypotheses consistent with $P_{\ell}$ , which was obtained by following the majority edges from $P_{1}$ . This means $L(P_{\ell})$ contains all the hypotheses of $S$ not consistent with a minority child of one of $P_{1},\dots,P_{\ell-1}$ . By the last paragraph, this is exactly $S\setminus(A_{j_{1}}\cup\cdots\cup A_{j_{\ell-1}})$ .

For (iii), at vertex $P_{\ell}$ in the greedy decision tree, the test index $j=j_{\ell}$ maximizes the weight $p(I_{j,L(P_{\ell})}^{-}\cap L(P_{\ell}))$ . By (i), this index $j$ equivalently maximizes $p(A_{j}\cap L(P_{\ell}))$ , as desired.

For (iv), at step $\ell$ for $\ell\leq\ell_{0}$ , by (i), the set $A_{j_{\ell}}$ covers a $p^{-}(P_{\ell})$ weight of hypotheses in $L(P_{\ell})$ , which is more than $p_{h_{0}}$ by definition of $\ell_{0}$ . By (iii), $A_{j_{\ell}}$ covers at least as much weight of hypotheses in $L(P_{\ell})$ as any of $A_{1},\dots,A_{m}$ , and, by Remark A.12 and the previous sentence, at least as much as any of $A_{1},\dots,A_{m+n}$ .

For (v), by maximality of $\ell_{0}$ , we have $p_{h_{0}}>p^{-}(P_{\ell_{0}+1})$ . Hence, the singleton $\{h_{0}\}$ covers more weight of hypotheses in $L(P_{\ell_{0}+1})$ than any of $A_{1},\dots,A_{m}$ , and thus, by Remark A.12, than any of $A_{1},\dots,A_{m+n}$ .

For (vi), if there exists $h_{0}$ such that some $P_{\ell}$ is $h_{0}$ -heavy, then, for any $j=1,\dots,m$ , we have $h_{0}\notin A_{j}$ and $A_{j}\cap L(P_{\ell})=A_{j}\cap(L(P_{\ell})\setminus\{h_{0}\})$ . Hence the $A_{j}$ among $A_{1},\dots,A_{m}$ that covers the most weight of $L(P_{\ell})\setminus\{h_{0}\}$ is $A_{j_{\ell}}$ by (iv). By Remark A.12, the only set among $A_{m+1},\dots,A_{m+n}$ that could cover a larger weight of $L(P_{\ell})\setminus\{h_{0}\}$ is $\{h_{0}\}$ , but it in fact covers 0 weight of $L(P_{\ell})\setminus\{h_{0}\}$ , so among sets $A_{1},\dots,A_{m+n}$ , set $A_{j_{\ell}}$ covers the maximum weight of hypotheses in $L(P_{\ell})\setminus\{h_{0}\}$ . This completes the proof that $\sigma$ is greedy.

We now return to the proof of Lemma A.15. Take the greedy solution $\sigma$ given above. If $\ell_{0}=|P|$ , then, by (ii), for each $\ell$ the set of hypotheses of $S$ not covered by $A_{\sigma(1)},\dots,A_{\sigma(\ell-1)}$ is exactly $L(P_{\ell})$ , which has weight $p(P_{\ell})$ . Hence, by (27),

[TABLE]

Now suppose $\ell_{0}<|P|$ . Recall that the definition of $\ell_{0}$ implies $P_{\ell_{0}+1},P_{\ell_{0}+2},\dots,P_{|P|}$ are all $h_{0}$ -heavy. Hence, we have $q(P_{\ell})=0$ for $1\leq\ell\leq\ell_{0}$ , and $q(P_{\ell})=p_{h_{0}}$ for $\ell_{0}<\ell\leq|P|$ . Thus,

[TABLE]

as desired. In the last equality, we used that $h_{0}\in L(P_{\ell})$ for $\ell=1,\dots,|P|$ . ∎

Let $\sigma_{G}^{(P)}:[m+n]\to[m+n]$ be the greedy solution to $\textnormal{WMSSC}^{(P)}$ given by Lemma A.15, and let $\sigma_{\textnormal{OPT}}^{(P)}$ be an optimal solution to $\textnormal{WMSSC}^{(P)}$ .
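To make the two orderings $\sigma_{G}^{(P)}$ and $\sigma_{\textnormal{OPT}}^{(P)}$ concrete, the following Python sketch evaluates orderings on a toy instance. It assumes the usual weighted min sum set cover cost, in which each element pays its weight times the position of the first set covering it; this matches the quantity $\min\{\ell:h\in A_{j_{\ell}}\}$ that appears in the proof of Lemma A.16. The instance, the names `mssc_cost` and `greedy_order`, and the brute-force optimum are illustrative assumptions, not objects from the paper.

```python
from fractions import Fraction
from itertools import permutations

def mssc_cost(order, sets, p):
    """Weighted min sum set cover cost of an ordering of set indices.

    Each element h pays p[h] times the 1-based position of the first set
    in `order` containing it. Assumes the sets cover every key of p.
    """
    covered, cost = set(), Fraction(0)
    for position, j in enumerate(order, start=1):
        newly = sets[j] - covered
        cost += position * sum((p[h] for h in newly), Fraction(0))
        covered |= newly
    return cost

def greedy_order(sets, p):
    """Repeatedly pick the set covering the most uncovered weight
    (ties broken by smallest index, an arbitrary choice)."""
    remaining, covered, order = list(range(len(sets))), set(), []
    while remaining:
        best = max(
            remaining,
            key=lambda j: sum((p[h] for h in sets[j] - covered), Fraction(0)),
        )
        order.append(best)
        remaining.remove(best)
        covered |= sets[best]
    return order

# Hypothetical instance: m = 3 sets followed by n = 4 singletons {h},
# mirroring the shape A_1, ..., A_m, A_{m+1}, ..., A_{m+n}.
p = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
sets = [{1, 2}, {2, 3}, {1, 3}] + [{h} for h in p]

sigma_greedy = greedy_order(sets, p)
cost_greedy = mssc_cost(sigma_greedy, sets, p)
cost_opt = min(mssc_cost(o, sets, p) for o in permutations(range(len(sets))))
```

On this instance the greedy ordering starts with the singleton of the heaviest element, which is the behaviour items (v) and (vi) describe for an $h_{0}$ -heavy chain.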

A.6.4 Bounding WMSSC cost above by $C_{\textnormal{OPT}}$

Lemma A.16.

Let $s$ be a positive integer. We have

[TABLE]

Proof.

Let $S^{(P)}$ be the universe of the instance $\textnormal{WMSSC}^{(P)}$ , and let $A_{1},\dots,A_{m+n}$ be the sets. Construct a path $w_{1},\dots,w_{\ell^{*}}$ in $\mathcal{T}_{\textnormal{OPT}}$ such that $w_{1}$ is the root, $w_{\ell^{*}}$ is a leaf for hypothesis $h^{*}$ , and, for $\ell=1,\dots,\ell^{*}-1$ , if the test at vertex $w_{\ell}$ has index $j_{\ell}$ , the edge to its child $w_{\ell+1}$ corresponds to the answer $k_{j_{\ell},S^{(P)}}^{+}$ , the majority answer of test $j_{\ell}$ with respect to set $S^{(P)}$ . Since we follow the edges with label $k^{+}_{j_{\ell},S^{(P)}}$ , this corresponds to following the path for a hypothesis contained in $I_{j_{\ell},S^{(P)}}^{+}$ . In other words, we have, for $\ell=1,\dots,\ell^{*}-1$ ,

[TABLE]

Thus the sequence $A_{j_{1}},\dots,A_{j_{\ell^{*}}},\{h^{*}\}$ covers $[n]$ , and hence $S^{(P)}$ , and thus gives a valid solution $\sigma_{TREE}$ to the instance $\textnormal{WMSSC}^{(P)}$ , where $\sigma_{TREE}(1)=j_{1},\sigma_{TREE}(2)=j_{2},\dots,\sigma_{TREE}(\ell^{*})=j_{\ell^{*}},\sigma_{TREE}(\ell^{*}+1)=m+h^{*}$ , and $\sigma_{TREE}$ on larger indices is arbitrarily chosen. Note that the depth of a hypothesis $h$ in the tree is at least the number of vertices of $w_{1},\dots,w_{\ell^{*}}$ that are on the root-to-leaf path of $h$ , and this number is $\min\{\ell:h\in A_{j_{\ell}}\}$ , except for $h^{*}$ , in which case it is 1 smaller. Then we have

[TABLE]

Summing over $P\in\mathcal{P}_{s}$ gives

[TABLE]

∎

Lemma A.17.

We have

[TABLE]

Proof.

Each imbalanced vertex $v$ is level- $s$ imbalanced for some positive integer $s$ , so it is part of some level- $s$ chain, $P$ . Note that $p(v)-q(v)\geq 0$ for all vertices $v$ . Hence,

[TABLE]

The first inequality is because $q(v)\geq 0$ for all $v$ . The second inequality is because $p(v)-q(v)\geq 0$ for all $v$ , and that every imbalanced $v$ is in some chain. The third inequality is by Lemma A.15. The fourth inequality is by Theorem A.14. The fifth inequality is by Lemma A.16. Rearranging gives the desired result. ∎

A.7 Bounding the cost contribution of heavy vertices

We bound the cost contribution of the heavy vertices via a connection to SET-COVER. A theorem due to Lovász [Lov75], Johnson [Joh74], Chvátal [Chv79], and Stein [Ste74] states that, in any instance of SET-COVER where no set covers more than $h$ elements, the greedy algorithm gives a $1+\ln h$ approximation. We show a generalization of this result, based on the following definition.

Definition A.18.

Let $\Phi$ be an instance of SET-COVER with a universe $S$ and sets $A_{1},\dots,A_{m}$ . Let $\textbf{p}=(p_{h})_{h\in S}$ be a sequence of weights assigned to the elements of $S$ . A p-weighted greedy algorithm for SET-COVER repeatedly chooses the set $A_{j}$ that maximizes $\sum_{h\in A_{j}\cap S^{\prime}}p_{h}$ , where $S^{\prime}$ is the set of uncovered elements.

Theorem A.19.

Let $\Phi$ be an instance of SET-COVER with a universe $S$ and sets $A_{1},\dots,A_{m}$ . Let $\textbf{p}=(p_{h})_{h\in S}$ be a sequence of weights assigned to the elements of $S$ . Then, if the optimal solution to $\Phi$ uses at most $\ell_{\textnormal{OPT}}$ sets, then the p-weighted greedy algorithm uses at most $(1+\ln\frac{p^{*}}{\min_{h}p_{h}})\ell_{\textnormal{OPT}}$ sets, where $p^{*}\stackrel{{\scriptstyle\rm def}}{{=}}\max_{j\in[m]}\sum_{h\in A_{j}}p_{h}$ .

While this argument may be known, we are not aware of a known reference, so we provide a proof for completeness in Appendix B.
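As an illustration of Definition A.18 and Theorem A.19, here is a minimal Python sketch of the p-weighted greedy rule, read as choosing the set that covers the largest weight of uncovered elements (the reading used in the proof of Lemma A.20). The toy instance, the function names, and the brute-force computation of $\ell_{\textnormal{OPT}}$ are assumptions made for the example only.

```python
import math
from itertools import combinations

def weighted_greedy_cover(universe, sets, p):
    """Indices chosen by the p-weighted greedy rule, in order."""
    uncovered, chosen = set(universe), []
    while uncovered:
        # Pick the set covering the largest total weight of uncovered elements.
        j = max(range(len(sets)),
                key=lambda i: sum(p[h] for h in sets[i] & uncovered))
        if not sets[j] & uncovered:
            raise ValueError("the sets do not cover the universe")
        chosen.append(j)
        uncovered -= sets[j]
    return chosen

def optimal_cover_size(universe, sets):
    """Smallest number of sets covering the universe (brute force, tiny inputs only)."""
    target = set(universe)
    for k in range(1, len(sets) + 1):
        for combo in combinations(sets, k):
            if target <= set().union(*combo):
                return k
    raise ValueError("the sets do not cover the universe")

# Hypothetical instance with six weighted elements and five sets.
universe = range(6)
p = {0: 0.3, 1: 0.2, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.1}
sets = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]

chosen = weighted_greedy_cover(universe, sets, p)
l_opt = optimal_cover_size(universe, sets)
p_star = max(sum(p[h] for h in A) for A in sets)
bound = (1 + math.log(p_star / min(p.values()))) * l_opt
```

Here `len(chosen)` can be compared directly against `bound`, the quantity $(1+\ln\frac{p^{*}}{\min_{h}p_{h}})\ell_{\textnormal{OPT}}$ from the theorem statement.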

We now bound $\sum_{v\in\mathcal{T}_{G}^{\mathrm{o}}}q(v)$ .

Lemma A.20.

For all $h\in[n]$ , we have

[TABLE]

Proof.

Fix $h$ . Let $v_{1}=u_{h}^{\top},\dots,v_{\ell_{G}}=u_{h}^{\bot}$ denote the path from vertex $u_{h}^{\top}$ to leaf $u_{h}^{\bot}$ in the greedy tree. Let $\textnormal{SC}^{(h)}$ denote the SET-COVER instance with the following parameters:

•

Universe $S=L(u_{h}^{\top})\setminus\{h\}$

•

For $j=1,\dots,m$ , sets $A_{j}=I_{j,S}^{-}\cap S$ .

Let $\ell_{\textnormal{OPT}}$ denote the cost of the optimal solution to $\textnormal{SC}^{(h)}$ , and for $\ell=1,\dots,\ell_{G}$ , let $j_{\ell}$ denote the test at vertex $v_{\ell}$ .
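The sets of $\textnormal{SC}^{(h)}$ can be computed mechanically once tests are given as answer tables. The sketch below assumes that $I_{j,S}^{-}$ collects the hypotheses whose answer to test $j$ differs from the answer of largest weight within $S$ , which is how the set is used in this proof. The list-of-answers representation, the tie-breaking rule, and the example data are hypothetical.

```python
from collections import defaultdict

def majority_answer(test, S, p):
    """Answer of `test` carrying the largest total weight among hypotheses in S.

    Ties are broken by first occurrence, an arbitrary choice for this sketch.
    """
    weight = defaultdict(float)
    for h in sorted(S):
        weight[test[h]] += p[h]
    return max(weight, key=weight.get)

def minority_sets(tests, S, p):
    """For each test j, the set A_j of hypotheses in S with a non-majority answer."""
    S = set(S)
    result = []
    for test in tests:
        k_plus = majority_answer(test, S, p)
        result.append({h for h in S if test[h] != k_plus})
    return result

# Hypothetical vertex with consistent hypotheses {0, ..., 5}; hypothesis 5 is heavy.
p = {0: 0.1, 1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.5}
h = 5
S = set(p) - {h}          # universe of the SET-COVER instance
tests = [                  # tests[j][x] is the answer of test j on hypothesis x
    [1, 1, 1, 2, 2, 1],
    [1, 2, 2, 2, 2, 2],
    [1, 1, 2, 1, 3, 1],
]
A = minority_sets(tests, S, p)
```

In this example the heavy hypothesis gives the majority answer on every test, which is the situation described by the first observation in the proof.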

We make the following observations. First, by definition of $u_{h}^{\top}$ , for all $j\in[m]$ , the set $I_{j,S}^{-}$ does not contain hypothesis $h$ : vertex $u_{h}^{\top}$ satisfies $p^{-}(u_{h}^{\top})<p_{h}$ so the answer $k^{+}_{j,S}$ of a test $j$ that accounts for the largest weight of hypotheses in $S$ always contains the hypothesis $h$ , as any other answer, by the definition of the greedy algorithm, has weight at most $p^{-}(u_{h}^{\top})$ .

Second, the sets $A_{j_{1}},A_{j_{2}},\dots,A_{j_{\ell_{G}}}$ form a p-weighted greedy solution for this SET-COVER instance $\textnormal{SC}^{(h)}$ , where $\textbf{p}=(p_{h})_{h\in S}$ . For $\ell\leq\ell_{G}$ , the first $\ell-1$ sets in the above sequence cover all of $S$ except the elements of $L(v_{\ell})$ . In the greedy decision tree, the index $j=j_{\ell}$ maximizes $p(I_{j,L(v_{\ell})}^{-}\cap L(v_{\ell}))$ . Note that, for any $j$ , the set $I_{j,L(v_{\ell})}^{-}$ consists of the hypotheses for the $K-1$ answers of $\tau_{j}$ that exclude hypothesis $h$ . However, the set $I_{j,S}^{-}$ also contains exactly the hypotheses for the $K-1$ answers of $\tau_{j}$ that exclude hypothesis $h$ . Hence $I_{j,L(v_{\ell})}^{-}\cap S=I_{j,S}^{-}\cap S=A_{j}$ , so $j=j_{\ell}$ maximizes $p(A_{j}\cap L(v_{\ell}))$ . Thus, sets $A_{j_{1}},A_{j_{2}},\dots,A_{j_{\ell_{G}}}$ form a p-weighted greedy solution for this SET-COVER instance $\textnormal{SC}^{(h)}$ . Hence, we may apply Theorem A.19. Since $\max_{j\in[m]}p(A_{j})=p^{-}(u_{h}^{\top})<p_{h}\leq p_{\max}$ and $\min_{h\in S}p_{h}\geq p_{\min}$ , we have, by Theorem A.19,

[TABLE]

It remains to prove $\ell_{\textnormal{OPT}}\leq d_{\textnormal{OPT}}(h)$ . Let $w_{1},\dots,w_{\ell_{T}}=h$ be the root-to-leaf path for hypothesis $h$ in the optimal tree $\mathcal{T}_{\textnormal{OPT}}$ . For $\ell=1,\dots,\ell_{T}-1$ , set $j_{\ell}^{\prime}$ to be the test chosen at vertex $w_{\ell}$ in the optimal tree. By the first point, the set $I_{j,S}^{-}$ contains the hypotheses for the $K-1$ answers of test $\tau_{j}$ that exclude hypothesis $h$ . As the optimal tree is a decision tree, every hypothesis other than $h$ is distinguished from $h$ by one of the tests $j_{1}^{\prime},\dots,j_{\ell_{T}-1}^{\prime}$ , so $A_{j_{1}^{\prime}},A_{j_{2}^{\prime}},\dots,A_{j_{\ell_{T}-1}^{\prime}}$ cover all of $[n]\setminus\{h\}$ , and thus cover $S$ . Furthermore, $\ell_{T}-1=d_{\textnormal{OPT}}(h)$ , so there is a solution to $\textnormal{SC}^{(h)}$ of size $d_{\textnormal{OPT}}(h)$ . Hence, $d_{\textnormal{OPT}}(h)\geq\ell_{\textnormal{OPT}}$ , as desired. ∎

As a corollary, we have

Lemma A.21.

[TABLE]

Proof.

We have

[TABLE]

The first equality uses Lemma A.9, which tells us that every vertex between $u_{h}^{\top}$ and $h$ is $h$ -heavy, and no such vertex is $h^{\prime}$ -heavy for $h^{\prime}\neq h$ . Furthermore, these are the only $h$ -heavy vertices as $u_{h}^{\top}$ is the maximal $h$ -heavy vertex. The inequality is by Lemma A.20. ∎

A.8 Finishing the proof

Proof of Theorem 3.1.

We have

[TABLE]

as desired. In the last inequality, we used that (i) $\log 2C_{\textnormal{OPT}}>\log C_{\textnormal{OPT}}$ , (ii) $\log n\leq\log\frac{1}{p_{\min}}$ , (iii) $1+C_{\textnormal{OPT}}<2C_{\textnormal{OPT}}$ , and (iv) $1\leq\frac{\log\frac{1}{p_{\min}}}{\log C_{\textnormal{OPT}}}$ . ∎

Appendix B Proof of Theorem A.19

We closely follow the argument of Chvátal [Chv79]. Suppose the p-weighted greedy algorithm uses $\ell_{G}$ sets. By re-indexing the sets, we may assume without loss of generality that the p-weighted greedy algorithm chooses sets $A_{1},\dots,A_{\ell_{G}}$ in that order. For $r=1,\dots,\ell_{G}$ and $j=1,\dots,m$ , let $A_{j}^{(r)}$ denote the elements of set $A_{j}$ not covered by the first $r-1$ chosen sets, and let $\rho_{j}^{(r)}=\sum_{h\in A_{j}^{(r)}}p_{h}$ denote the sum of the weights of the elements of $A_{j}^{(r)}$ . For $h=1,\dots,n$ , let $y_{h}=\frac{1}{\rho_{r}^{(r)}}$ , where $r$ is the index at which element $h$ is first covered. Equivalently, $r$ is the unique index such that $h\in A_{r}^{(r)}$ . In this way, we have

[TABLE]

and, for all $j=1,\dots,m$ ,

[TABLE]

where $\ell_{j}$ is the largest index such that $\rho_{j}^{(\ell_{j})}>0$ . Hence using that $\rho_{j}^{(1)},\rho_{j}^{(2)},\dots,\rho_{j}^{(\ell_{j})}$ is a non-increasing sequence, we have

[TABLE]

Let $J\subset[m]$ denote the indices of the optimal cover for $\Phi$ . Applying (82) and summing (83) for $j\in J$ , we have

[TABLE]

as desired.
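The price-assignment argument can be checked numerically. The Python sketch below assigns $y_{h}=1/\rho_{r}^{(r)}$ exactly as defined above and then tests two facts in the style of Chvátal's argument: the weighted prices sum to the number of greedy sets, and no single set collects more than $1+\ln\frac{p^{*}}{\min_{h}p_{h}}$ in weighted price. Since the displayed equations of this proof are not reproduced here, these two checks are our reading of them and should be treated as an assumption; the instance and function name are hypothetical.

```python
import math

def greedy_with_prices(universe, sets, p):
    """Run the p-weighted greedy rule and record the price y[h] of each element.

    y[h] = 1 / rho, where rho is the uncovered weight of the set that first
    covers h, measured at the moment that set is chosen.
    """
    uncovered, chosen, y = set(universe), [], {}
    while uncovered:
        j = max(range(len(sets)),
                key=lambda i: sum(p[h] for h in sets[i] & uncovered))
        newly = sets[j] & uncovered
        if not newly:
            raise ValueError("the sets do not cover the universe")
        rho = sum(p[h] for h in newly)
        for h in newly:
            y[h] = 1.0 / rho
        chosen.append(j)
        uncovered -= newly
    return chosen, y

# Hypothetical instance (weights need not sum to 1 for this check).
universe = range(6)
p = {0: 0.3, 1: 0.2, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.1}
sets = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]

chosen, y = greedy_with_prices(universe, sets, p)
total_price = sum(p[h] * y[h] for h in universe)          # equals the greedy count
per_set_price = [sum(p[h] * y[h] for h in A) for A in sets]
p_star = max(sum(p[h] for h in A) for A in sets)
cap = 1 + math.log(p_star / min(p.values()))               # per-set ceiling
```

Summing the per-set ceiling over the sets of an optimal cover then bounds `total_price`, and hence the number of greedy sets, by `cap` times the optimal cover size.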

Appendix C Tightness of Theorem 3.1

In this section, we prove Propositions 3.3 and 3.4, which show two ways that Theorem 3.1 is tight.

C.1 Proof of Proposition 3.3

Proof.

We prove Proposition 3.3 with the stronger guarantee that $C_{\textnormal{OPT}}\leq 4C^{*}$ when $C^{*}$ is an integer. Then, taking $(C^{*})^{\prime}=\lceil{C^{*}}\rceil\leq 2C^{*}$ gives the desired result.

When $n$ is sufficiently large, for $C^{*}\geq n^{1/4}$ , the statement is trivial, as any instance for which $C_{OPT}\in[\frac{1}{4}C^{*},4C^{*}]$ satisfies the requirements. Number the hypotheses $1,\dots,n$ . Let $n^{*}\in(n-C^{*},n]$ be such that $C^{*}|n^{*}$ . Place the hypotheses $1,\dots,n^{*}$ in a grid with $C^{*}$ columns and $r\stackrel{{\scriptstyle\rm def}}{{=}}\frac{n^{*}}{C^{*}}$ rows, numbered $1,\dots,r$ , so that each grid square contains at most 1 hypothesis. Recursively identify a family of good sets of rows as follows: $[r]$ is good, and for every good set $A^{\prime}$ containing $r^{\prime}>1$ rows, create a partition $A^{\prime}=A^{\prime}_{1}\cup A^{\prime}_{2}$ such that $|A^{\prime}_{1}|=1+\lfloor r^{\prime}/C^{*}\rfloor$ and $|A^{\prime}_{2}|=r^{\prime}-|A_{1}^{\prime}|$ , and identify $A^{\prime}_{1}$ and $A^{\prime}_{2}$ as good. Define four types of tests:

1. For each of $h=n^{*}+1,\dots,n$ , a test that outputs 1 if the hypothesis is $h$ and 2 otherwise.

2. For each column $c$ , a test that outputs 1 if the hypothesis is in column $c$ and 2 otherwise.

3. For each column $c$ and $t\in[0,\log r]$ , a test that outputs 1 if the hypothesis is in column $c$ and the $t$ th digit of the row number’s binary expansion is a one, and 2 otherwise.

4. For each good set $A^{\prime}$ , a test that outputs 1 if the row of $h$ is in $A^{\prime}$ and 2 otherwise.

Let $h$ be the unknown hypothesis. There is a strategy that first checks whether $h$ is one of $n^{*}+1,\dots,n$ , for a total of at most $n-n^{*}<C^{*}$ queries. If not, the strategy identifies the column containing $h$ in at most $C^{*}$ queries using tests of type 2 and then identifies the corresponding row using tests of type 3, which takes $\log r<\log n\leq 2C^{*}$ queries. We thus need at most $4C^{*}$ queries for each hypothesis, so here

$$C_{\textnormal{OPT}}\leq 4C^{*}.$$

The greedy strategy uses tests of type 4, trying to first find the row containing $h$ . This is because, for tests 1, 2, and 3, one answer accounts for at least $1-\frac{1}{C^{*}}$ fraction of the remaining hypotheses (it accounts for at least $1-\frac{1}{C^{*}}$ fraction of the columns in the grid), and if the candidate set of rows containing $h$ is a good set $A^{\prime}$ , under the membership test for the good set $A_{1}^{\prime}$ , all answers account for less than $1-\frac{1}{C^{*}}$ fraction of the remaining hypotheses.

When $h$ is chosen uniformly at random from $1,\dots,n^{*}$ , the row containing $h$ is a uniformly random row. While there are at least $C^{*}$ candidate rows containing $h$ , each test gives at most $\mathbb{H}(2/C^{*})$ bits of information about the row containing $h$ in expectation (over the randomness of $h$ ). Since the row containing $h$ has at least $\log r=\log(n^{*}/C^{*})>\frac{1}{2}\log n$ bits of information, and the row has at most $\log C^{*}$ bits of information when there are at most $C^{*}$ candidate rows remaining, we have, by an analysis similar to Lemma 6.8, that the greedy algorithm takes at least $\frac{\frac{1}{2}\log n-\log C^{*}}{\mathbb{H}(2/C^{*})}$ queries to identify the row containing $h$ on average. Hence,

[TABLE]

as desired. In the last inequality, we used that $\mathbb{H}(2x)\leq-4x\log(x)$ for $x\in[0,1]$ . ∎
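The gap between the two strategies in this construction can be explored numerically. The following Python sketch counts the worst-case queries of the singleton, column, then row-bit strategy, and separately computes the average number of good-set (type 4) tests needed to isolate a uniformly random row when each good set $A^{\prime}$ is split into parts of sizes $1+\lfloor r^{\prime}/C^{*}\rfloor$ and the remainder. It models only the row-finding phase of greedy, uses base-2 logarithms, and assumes $C^{*}\geq 3$ so that both parts are nonempty; the parameter values and function names are illustrative assumptions.

```python
import math
from functools import lru_cache

def strategy_queries(n, c_star):
    """Worst-case queries of the strategy: type 1, then type 2, then type 3 tests."""
    n_star = n - n % c_star                    # largest multiple of C* that is <= n
    r = n_star // c_star                       # number of rows in the grid
    row_bits = math.floor(math.log2(r)) + 1    # one test per t in [0, log r]
    return (n - n_star) + c_star + row_bits

@lru_cache(maxsize=None)
def avg_good_set_tests(rows, c_star):
    """Average number of good-set tests to isolate a uniformly random row."""
    if c_star < 3:
        raise ValueError("this sketch assumes C* >= 3 so both parts are nonempty")
    if rows <= 1:
        return 0.0
    a1 = 1 + rows // c_star                    # |A'_1|
    a2 = rows - a1                             # |A'_2|
    return 1 + (a1 * avg_good_set_tests(a1, c_star)
                + a2 * avg_good_set_tests(a2, c_star)) / rows

# Hypothetical parameters with C* below n^(1/4) and log2(n) <= 2 C*.
n, c_star = 1_000_005, 16
rows = (n - n % c_star) // c_star
worst_case_strategy = strategy_queries(n, c_star)
greedy_row_average = avg_good_set_tests(rows, c_star)
```

Comparing `greedy_row_average` with `worst_case_strategy` for growing `n` at fixed `c_star` gives a feel for why the unbalanced good-set splits are costly for greedy.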

C.2 Proof of Proposition 3.4

In this appendix, we use that it is NP-hard to approximate Set Cover to within a factor of $\frac{1}{2}\log n_{0}$ [Mos12].

Theorem C.1.

Let $r\in(0,1)$ . Then, for $n$ sufficiently large, approximating $\textnormal{DT}(2n^{r}\log n)$ to a factor of $\frac{1}{12}\log(n^{r})$ is NP-hard.

Proof.

We design a reduction from Set Cover to $\textnormal{DT}(2n^{r}\log n)$ . Suppose we are given a Set Cover instance with $n_{0}$ elements, $M$ sets $\{S_{i}\}$ where $S_{i}\subseteq[n_{0}]$ , and an optimal cover of size $B_{OPT}$ . In polynomial time, we construct a $\textnormal{DT}(2n^{r}\log n)$ instance on $n\leq n_{0}^{1/r}$ hypotheses such that, if $C_{OPT}$ is the optimal decision tree cost, then, for some $q$ ,

[TABLE]

The theorem follows as a $\frac{1}{2}\log n_{0}$ approximation to Set Cover is NP-hard, and here $n^{r}\leq n_{0}$ .

Let $q=\lfloor\frac{1}{r}\log n_{0}\rfloor$ . Let $\ell=\lfloor n_{0}^{1/r}/(n_{0}q+1)\rfloor$ . Identify the hypotheses by elements of $[\ell]\times([q]\times[n_{0}]\cup\{\perp\})$ . In this way, there are $n\sim n_{0}^{1/r}$ hypotheses. Let the elements of $[\ell]\times\{\perp\}$ have weight $\frac{1}{2n}+\frac{1}{2\ell}$ , and let $p_{h}=\frac{1}{2n}$ for all other hypotheses $h\in[\ell]\times[q]\times[n_{0}]$ . In this way, for $n$ sufficiently large, we have $p_{max}/p_{min}=1+\frac{n}{\ell}\leq 2n^{r}\log n$ . Create tests of the following forms:

1. For each $i_{1}\in[\ell],i_{2}\in[q],j\in[M]$ , define a test that outputs $1$ on hypotheses $(h_{1},h_{2},h_{3})$ if and only if $i_{1}=h_{1},i_{2}=h_{2},h_{3}\in S_{j}$ .

2. For each $i_{1}\in[\ell],i_{2}\in[q],j\in[M]$ , and $t\in[0,\log n_{0}]$ , define a binary test that outputs 1 if and only if $i_{1}=h_{1},i_{2}=h_{2},h_{3}\in S_{j}$ , and the $t$ th bit of $h_{3}$ ’s binary representation is 1.

3. For each $t\in[0,\log\ell]$ , define a binary test on $(h_{1},*)$ that outputs 1 if and only if the $t$ th bit of $h_{1}$ ’s binary representation is $1$ .
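The hypothesis set and weights defined before the list of tests can be sanity-checked with exact arithmetic. The Python sketch below computes $q$ , $\ell$ , and $n=\ell(qn_{0}+1)$ for given $n_{0}$ and $r$ , and confirms that the weights sum to 1 and that $p_{max}/p_{min}=1+\frac{n}{\ell}$ . Logarithms are assumed to be base 2, and the sample values $n_{0}=64$ , $r=1/2$ are hypothetical.

```python
import math
from fractions import Fraction

def reduction_parameters(n0, r):
    """Sizes and weights of the Decision Tree instance built from Set Cover."""
    q = math.floor(math.log2(n0) / r)                 # q = floor((1/r) log n0)
    ell = math.floor(n0 ** (1 / r) / (n0 * q + 1))    # number of blocks
    n = ell * (q * n0 + 1)                            # total number of hypotheses
    heavy = Fraction(1, 2 * n) + Fraction(1, 2 * ell) # weight of each (h1, bottom)
    light = Fraction(1, 2 * n)                        # weight of every other hypothesis
    total = ell * heavy + (n - ell) * light
    return q, ell, n, heavy, light, total

q, ell, n, heavy, light, total = reduction_parameters(64, 0.5)
ratio = heavy / light                                 # p_max / p_min
limit = 2 * n ** 0.5 * math.log2(n)                   # 2 n^r log n with r = 1/2
```

For these sample values the ratio already sits below `limit`, although the text only claims the inequality for $n$ sufficiently large.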

Consider a cover of $[n_{0}]$ using $B_{\text{OPT}}$ of the $M$ sets. We can define a tree that, given a hypothesis $h=(h_{1},*)$ , first determines $h_{1}$ using tests of type 3. Then, in each subtree, the set of consistent hypotheses is exactly $\{h_{1}\}\times([q]\times[n_{0}]\cup\{\perp\})$ . In each subtree, one can isolate the $\perp$ hypothesis in $qB_{OPT}$ queries, using tests of type 1, and use tests of type 2 to identify the remaining hypotheses in $1+\log n_{0}$ tests each. In each subtree, all of the hypotheses of the form $(h_{1},\perp)$ have weight at most $\frac{1}{\ell}$ and depth at most $qB_{OPT}$ , and all other hypotheses have weight $\frac{1}{2n}$ and depth at most $qB_{OPT}+1+\log n_{0}$ , so their total contribution to the cost of the tree is at most $qB_{OPT}+1+\log n_{0}$ . Assuming $n_{0}$ is sufficiently large,

[TABLE]

Now suppose we are given a solution to the $\textnormal{DT}(2n^{r}\log n)$ instance with cost $C_{OPT}$ . In the optimal tree, for all hypotheses $(h_{1},\perp)$ , at least $qB_{OPT}$ tests of type-1 or type-2 must appear on the root-to-leaf path of $(h_{1},\perp)$ : if not, there exists $h_{2}\in[q]$ , such that at most $B_{OPT}-1$ tests of type-1 with parameters $(h_{1},h_{2},j)$ or type-2 with parameters $(h_{1},h_{2},j,t)$ were used. By taking the indices $j$ used in these tests, we obtain at most $B_{OPT}-1$ sets covering $[n_{0}]$ , which is a contradiction. Thus, each hypothesis $(h_{1},\perp)$ has depth at least $qB_{OPT}$ . Since the hypotheses $(h_{1},\perp)$ account for at least half of the weight of the hypotheses, we have

[TABLE]

This completes the proof. ∎

Appendix D Rounding weights

Proposition D.1.

Suppose a Decision Tree instance has weights $p_{1},\dots,p_{n}$ and a cost function $C(\cdot)$ . Then, there exist weights $p_{1}^{\prime},\dots,p_{n}^{\prime}$ such that $\min_{i}p_{i}^{\prime}\geq\frac{1}{n(n-1)}$ and the cost function $C^{\prime}(\cdot)$ of the associated instance satisfies $|C^{\prime}(\mathcal{T})-C(\mathcal{T})|\leq 1$ for all decision trees $\mathcal{T}$ .

Proof.

Let $w_{i}^{\prime}=\max(p_{i},\frac{1}{(n-1)^{2}})$ and define $W^{\prime}=\sum_{i}w_{i}^{\prime}$ . We have $W^{\prime}\geq 1$ and $W^{\prime}\leq 1+\frac{n-1}{(n-1)^{2}}=\frac{n}{n-1}$ . Let $p_{i}^{\prime}=\frac{w_{i}^{\prime}}{W^{\prime}}$ , so that $p_{i}^{\prime}\geq\frac{1}{n(n-1)}$ for all $i$ . Hence, for all decision trees $\mathcal{T}$ ,

[TABLE]
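The rounding in this proof is easy to reproduce. The Python sketch below applies $w_{i}^{\prime}=\max(p_{i},\frac{1}{(n-1)^{2}})$ followed by normalization, then checks the lower bound on the rounded weights and the change in cost for one tree. It assumes the cost of a tree is the weighted average depth, as in the problem statement, and represents a tree only by its leaf depths; the weights, depths, and function names are hypothetical.

```python
from fractions import Fraction

def round_weights(p):
    """Raise every weight to at least 1/(n-1)^2, then renormalize to sum to 1."""
    n = len(p)
    if n < 2:
        raise ValueError("need at least two hypotheses")
    floor_weight = Fraction(1, (n - 1) ** 2)
    w = [max(Fraction(x), floor_weight) for x in p]
    total = sum(w)                      # W' in the proof, between 1 and n/(n-1)
    return [x / total for x in w]

def cost(p, depths):
    """Weighted average depth of a tree whose i-th hypothesis sits at depths[i]."""
    return sum(pi * d for pi, d in zip(p, depths))

# Hypothetical instance: one dominant hypothesis and three very light ones.
p = [Fraction(97, 100), Fraction(1, 100), Fraction(1, 100), Fraction(1, 100)]
depths = [1, 2, 3, 3]                   # leaf depths of some decision tree

p_rounded = round_weights(p)
n = len(p)
change = abs(cost(p_rounded, depths) - cost(p, depths))
```

Using `Fraction` keeps the comparison with $\frac{1}{n(n-1)}$ exact rather than subject to floating-point error.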

Bibliography

[ABS15] Sanjeev Arora, Boaz Barak, and David Steurer. Subexponential algorithms for unique games and related problems. J. ACM, 62(5):42:1–42:25, 2015.
[AH12] Micah Adler and Brent Heeringa. Approximating optimal binary decision trees. Algorithmica, 62(3-4):1112–1121, 2012.
[AMM+98] Esther M Arkin, Henk Meijer, Joseph SB Mitchell, David Rappaport, and Steven S Skiena. Decision trees for geometric models. International Journal of Computational Geometry & Applications, 8(03):343–363, 1998.
[Bab16] László Babai. Graph isomorphism in quasipolynomial time [extended abstract]. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 684–697, 2016.
[Chv79] Vašek Chvátal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
[CJLM10] Ferdinando Cicalese, Tobias Jacobs, Eduardo Laber, and Marco Molinaro. On greedy algorithms for decision trees. In International Symposium on Algorithms and Computation, pages 206–217. Springer, 2010.
[CPR+11] Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, Pranjal Awasthi, and Mukesh K. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans. Algorithms, 7(2):15:1–15:22, 2011.
[CPRS09] Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, and Yogish Sabharwal. Approximating decision trees with multiway branches. In Automata, Languages and Programming, 36th International Colloquium, ICALP 2009, Rhodes, Greece, July 5-12, 2009, Proceedings, Part I, pages 210–221, 2009.
