Testing Mixtures of Discrete Distributions

Maryam Aliakbarpour; Ravi Kumar; Ronitt Rubinfeld

arXiv:1907.03190·math.ST·July 9, 2019

TL;DR

This paper introduces a new noise model for distribution testing, where the noisy distribution is a known mixture of the original and noise, and demonstrates that testing in this setting can be as sample-efficient as in the noise-free case.

Contribution

The authors propose a tractable mixture noise model for distribution testing and show that testing complexity remains unchanged compared to classical methods.

Findings

01 Sample complexity matches classical non-mixture testing

02 Mixture testing is more tractable under the proposed noise model

03 Results apply to identity and closeness testing problems

Abstract

There has been significant study on the sample complexity of testing properties of distributions over large domains. For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size $n$, distinguishing the uniform distribution from distributions that are far from uniform in $ℓ_{1}$-distance uses only $O(\sqrt{n})$ samples. However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish if samples are coming from a distribution that is $ϵ$-close to uniform from the case where the distribution is $(1 - ϵ)$-far from uniform. The latter task requires nearly linear in $n$ samples [Valiant 2008, Valiant and Valiant 2011]. In this work, we present a noise model that on one hand is more tractable for the…

Equations

\alpha^{*}=\frac{q_{1}(S)-p(S)}{q_{1}(S)-q_{2}(S)}\geq\frac{q_{1}(S)-(w_{S}+\epsilon/4)}{q_{1}(S)-q_{2}(S)}=\alpha.

\begin{array}{ll}\|p-q_{\alpha}\|_{1}&=\sum\limits_{i}\left|p(i)-q_{\alpha}(i)\right|=\sum\limits_{i}\left|(1-\alpha^{*})q_{1}(i)+\alpha^{*}\,q_{2}(i)-(1-\alpha)q_{1}(i)-\alpha\,q_{2}(i)\right|\\ &=\sum\limits_{i}\left|(\alpha-\alpha^{*})(q_{1}(i)-q_{2}(i))\right|=\left|\alpha-\alpha^{*}\right|\cdot\|q_{1}-q_{2}\|_{1}\\ &=2\left|\alpha-\alpha^{*}\right|\cdot\left(q_{1}(S)-q_{2}(S)\right)=2\left|p(S)-w_{S}-\frac{\epsilon}{4}\right|\leq\epsilon.\end{array}

a_{i}:=\left\lfloor n\,q_{\alpha}(i)\right\rfloor+\left\lfloor\frac{n\,|q_{\alpha}(i)-q_{2}(i)|}{\|q_{\alpha}-q_{2}\|_{1}}\right\rfloor+1.

\begin{array}{ll}|D|&=\sum\limits_{i=1}^{n}a_{i}\leq\sum\limits_{i=1}^{n}\left(nq_{\alpha}(i)+\frac{n|q_{\alpha}(i)-q_{2}(i)|}{\|q_{\alpha}-q_{2}\|_{1}}+1\right)\\ &=n\left(\sum\limits_{i=1}^{n}q_{\alpha}(i)\right)+\frac{n}{\|q_{\alpha}-q_{2}\|_{1}}\left(\sum\limits_{i=1}^{n}|q_{\alpha}(i)-q_{2}(i)|\right)+n=3n\,.\end{array}

\begin{array}{ll}\left|p^{\prime}(i,j)-q^{\prime}_{\alpha}(i,j)\right|&=\frac{\left|p(i)-q_{\alpha}(i)\right|}{a_{i}}=\frac{\beta\cdot|q_{\alpha}(i)-q_{2}(i)|}{a_{i}}\leq\frac{\beta\cdot d\cdot|q_{\alpha}(i)-q_{2}(i)|}{n\cdot|q_{\alpha}(i)-q_{2}(i)|}=\frac{\beta d}{n}\\ &=\frac{1}{n}\sum\limits_{i=1}^{n}\beta|q_{\alpha}(i)-q_{2}(i)|=\frac{1}{n}\sum\limits_{i=1}^{n}|p(i)-q_{\alpha}(i)|=\frac{\|p-q_{\alpha}\|_{1}}{n}\leq\frac{\epsilon^{\prime}}{n}.\end{array}

\|p^{\prime}-q^{\prime}_{\alpha}\|_{2}\leq\sqrt{\frac{|D|\cdot\epsilon^{\prime 2}}{n^{2}}}\leq\sqrt{\frac{\epsilon^{2}}{12n}}\leq\frac{\epsilon}{2\sqrt{|D|}}.

f(\alpha):=\sum\limits_{i=1}^{n}(X_{i}-(1-\alpha)Y_{i}-\alpha Z_{i})^{2}-X_{i}-(1-\alpha)^{2}Y_{i}-\alpha^{2}Z_{i}.

\|q_{1}-p\|_{2}^{2}={\alpha^{*}}^{2}\cdot\|q_{1}-q_{2}\|_{2}^{2}\leq\|q_{1}-q_{2}\|_{2}^{2}\leq\frac{\epsilon^{2}}{4n}.

A:=\sum\limits_{i=1}^{n}(Y_{i}-Z_{i})^{2}-Z_{i}-Y_{i},\quad B:=2\sum\limits_{i=1}^{n}Y_{i}+X_{i}Y_{i}+Y_{i}Z_{i}-Y_{i}^{2}-X_{i}Z_{i},\quad C:=\sum\limits_{i=1}^{n}(X_{i}-Y_{i})^{2}-X_{i}-Y_{i}.

\mathrm{E}_{X,Y,Z}\left[A\right]=s^{2}\|q_{1}-q_{2}\|_{2}^{2}\mbox{ and }\mathrm{Var}_{X,Y,Z}\left[A\right]\leq 8s^{3}\|q_{1}-q_{2}\|_{4}^{2}\,b+8s^{2}b.

\mathrm{E}\left[R\right]=c_{1}s^{2}\|q_{1}-q_{2}\|_{2}^{2},\quad\mathrm{Var}\left[R\right]\leq c_{2}s^{3}\|q_{1}-q_{2}\|_{4}^{2}\,b+c_{3}s^{2}b,

A=c_{A}\cdot s^{2}\|q_{1}-q_{2}\|_{2}^{2}.

B=-2c_{B}\cdot\alpha^{*}\|q_{1}-q_{2}\|_{2}^{2}.

\mathrm{E}_{X,Y,Z}\left[f(\alpha)\right]=s^{2}\cdot\|p-q_{\alpha}\|_{2}^{2}\mbox{ and }\mathrm{Var}_{X,Y,Z}\left[f(\alpha)\right]\leq 8s^{3}\cdot b\cdot\|p-q_{\alpha}\|_{4}^{2}+8s^{2}\cdot b.

|f(\alpha^{*})|\leq s^{2}\gamma=T.

\hat{\alpha}_{1}=\mathop{\mathrm{arg\,min}}\limits_{\alpha\in[\alpha_{\min},1],\,f(\alpha)\leq T}f(\alpha),\mbox{ and }\hat{\alpha}_{2}=\mathop{\mathrm{arg\,min}}\limits_{\alpha\in[0,\alpha_{\min}],\,f(\alpha)\leq T}f(\alpha).

\mathrm{Pr}\left[\left|R-\mathrm{E}\left[R\right]\right|\geq c_{1}\tau s^{2}\right]\leq\frac{c_{2}\|q_{1}-q_{2}\|_{2}^{2}\,b}{c_{1}^{2}\tau^{2}s}+\frac{c_{3}b}{c_{1}^{2}\tau^{2}s^{2}}\leq\frac{c_{2}b}{c_{1}^{2}\tau s}+\frac{c_{3}b}{c_{1}^{2}\tau^{2}s^{2}}\leq 0.01,

\begin{array}{ll}\mathrm{Pr}\left[\left|R-\mathrm{E}\left[R\right]\right|\geq 0.1\cdot\mathrm{E}\left[R\right]\right]&\leq\frac{100c_{2}b}{c_{1}^{2}s\|q_{1}-q_{2}\|_{2}^{2}}+\frac{100c_{3}b}{c_{1}^{2}s^{2}\|q_{1}-q_{2}\|_{2}^{4}}\\ &\leq\frac{100c_{2}b}{c_{1}^{2}\tau s}+\frac{100c_{3}b}{c_{1}^{2}\tau^{2}s^{2}}\leq 0.01,\end{array}

B_{i}:=Y_{i}+X_{i}Y_{i}+Y_{i}Z_{i}-Y_{i}^{2}-X_{i}Z_{i}.

\mathrm{E}_{X,Y,Z}\left[B_{i}\right]=-\alpha^{*}s^{2}(q_{1}(i)-q_{2}(i))^{2}.

\mathrm{E}_{X,Y,Z}\left[-B\right]=-\sum\limits_{i=1}^{n}2B_{i}=\sum\limits_{i=1}^{n}2\alpha^{*}s^{2}(q_{1}(i)-q_{2}(i))^{2}=2\alpha^{*}s^{2}\|q_{1}-q_{2}\|_{2}^{2},

\begin{array}{ll}\mathrm{Var}_{X,Y,Z}\left[B_{i}\right]&=s^{3}\alpha^{*}(1+\alpha^{*})(q_{1}(i)-q_{2}(i))^{2}(q_{1}(i)+q_{2}(i))+2s^{3}(q_{1}(i)-q_{2}(i))^{2}q_{1}(i)\\ &\quad+s^{2}\left(\alpha^{*}(q_{2}(i)^{2}-q_{1}(i)^{2})+q_{1}(i)(3q_{1}(i)+2q_{2}(i))\right)\\ &\leq 4s^{3}(q_{1}(i)+q_{2}(i))(q_{1}(i)-q_{2}(i))^{2}+s^{2}(q_{1}(i)+q_{2}(i))^{2}+3s^{2}q_{1}(i)^{2}.\end{array}

\begin{array}{ll}\mathrm{Var}_{X,Y,Z}\left[B\right]&\leq 16s^{3}\sum\limits_{i=1}^{n}(q_{1}(i)+q_{2}(i))(q_{1}(i)-q_{2}(i))^{2}+8\|q_{2}\|_{2}^{2}+20s^{2}\|q_{1}\|_{2}^{2}\\ &\leq 16s^{3}\left(\sum\limits_{i=1}^{n}(q_{1}(i)-q_{2}(i))^{4}\right)\cdot\left(\sum\limits_{i=1}^{n}(q_{1}(i)+q_{2}(i))^{2}\right)+28s^{2}b\\ &\leq 32s^{3}\|q_{1}-q_{2}\|_{4}^{2}\,b+28s^{2}b.\end{array}

-B=2c_{B}\alpha^{*}\|q_{1}-q_{2}\|_{2}^{2},

f(\alpha)=\sum\limits_{i=1}^{n}(X_{i}-(1-\alpha)Y_{i}-\alpha Z_{i})^{2}-X_{i}-(1-\alpha)^{2}Y_{i}-\alpha^{2}Z_{i}.

Taxonomy

Topics: Complexity and Algorithms in Graphs · Machine Learning and Algorithms · Cryptography and Data Security

Full text

Testing Mixtures of Discrete Distributions

Maryam Aliakbarpour

CSAIL, MIT

[email protected]

MA is supported by funds from the MIT-IBM Watson AI Lab (Agreement No. W1771646) and the NSF grants IIS-1741137 and CCF-1733808.

Ravi Kumar

Google

[email protected]

Ronitt Rubinfeld

CSAIL, MIT, TAU

[email protected]

RR is supported by funds from the MIT-IBM Watson AI Lab (Agreement No. W1771646) and the NSF grants CCF-1650733, CCF-1733808, IIS-1741137, and CCF-1740751.

Abstract

There has been significant study on the sample complexity of testing properties of distributions over large domains. For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size $n$ , distinguishing the uniform distribution from distributions that are far from uniform in $\ell_{1}$ -distance uses only $O(\sqrt{n})$ samples.

However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish if samples are coming from a distribution that is $\epsilon$ -close to uniform from the case where the distribution is $(1-\epsilon)$ -far from uniform. The latter task requires nearly linear in $n$ samples [Val08, VV17b].

In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known a priori. Focusing on the identity and closeness testing problems leads to the following mixture testing question: Given samples of distributions $p,q_{1},q_{2}$, can we test if $p$ is a mixture of $q_{1}$ and $q_{2}$? We consider this general question in various scenarios that differ in terms of how the tester can access the distributions, and show that indeed this problem is more tractable. Our results show that the sample complexity of our testers is exactly the same as for the classical non-mixture case.

1 Introduction

Distribution testing [BFR+13] has been studied extensively for many years (see [Can15] for a survey). In the vanilla version, the problem is to quickly test if a discrete distribution has a certain property or is statistically far from any distribution with that property. The tester has access to samples from the distribution and strives to be as frugal as possible in the number of samples it uses. Many statistical properties, including various distances between distributions, are well understood in this model. There have been several relaxations to the basic testing model including tolerant testing (where the tester should also accept if the distribution is close to having a property), the conditional samples model (where the tester can access the distribution conditioned on a specified subset), making stylized assumptions about the distribution (monotone, sparse support, high-dimensional, etc.), and so on. In each of these works, the aim has been to push the boundaries of our understanding: when do sample-efficient testers exist? Here, by sample-efficient, we mean the number of samples should be sub-linear in the domain size.

There are many scenarios in which a distribution is observed along with noise; in some cases, even the form of the noise is known a priori. One such scenario is the so-called identity testing problem in which the tester has a known (explicitly specified) distribution and its goal is to check if a given distribution, available as samples, is close to the known distribution. For example, assume that the distribution of the top million queries to a web search engine is known in advance. Then, identity testing would be a quick way to check how close the daily query distribution is to this known distribution. However, in reality, there are natural minor variations to the daily query distribution, which may cause the identity tester to fail. This is clearly undesirable.

An option to tackle the noise would be to use testers that are tolerant to noise. Unfortunately, even simple versions of tolerant testers are faced with near-linear lower bounds on the sample complexity, making this option uninteresting. For example, one can distinguish if a distribution on a domain of size $n$ is uniform or far from uniform in $\ell_{1}$ -distance using $\Theta(\sqrt{n})$ samples [Pan08]. However, an algorithm that distinguishes between near-uniform distributions and distributions that are far from uniform requires $\Omega(n/\log n)$ samples [Val08, VV17a]. Hence, to achieve sub-linear sample complexity, we need more judicious, stylized assumptions about the noise—how it is available to the tester and if it is adversarial.

A different yet natural way to model the above scenario is to view it as a mixture of distributions. In the above example, one of the components of the mixture can be interpreted as the signal and the other component can be thought of as the noise. More generally, the tester is given the components of a mixture of two distributions. However, it does not know the mixing parameter, i.e., the magnitude of the contribution of each component to the mixture. The mixture testing problem is then to test if a distribution is close to a mixture of two given distributions or is far from any manner in which the two distributions can be mixed. As we will see, by making reasonable assumptions on the form of the noise and how it is available to the tester, the tolerant testing lower bounds can be circumvented and one can obtain testers with sub-linear sample complexity.

Main contributions.

In this work, we consider distribution testing of mixtures of two distributions $q_{1}$ and $q_{2}$ . For ease of exposition, let us call the first component $q_{1}$ the original distribution and the second component $q_{2}$ the noise, and let $[n]=\{1,\ldots,n\}$ be the domain of both $q_{1}$ and $q_{2}$ . The simplest version of our problem is: given sample access to distribution $p$ , and for known distributions $q_{1},q_{2}$ , is $p=\alpha q_{1}+(1-\alpha)q_{2}$ for some $\alpha$ , or is $p$ far from $\alpha q_{1}+(1-\alpha)q_{2}$ for every $\alpha\in[0,1]$ ? Note that the tester is not given the mixture parameter $\alpha$ . We further study the case when $q_{1},q_{2}$ are not given explicitly to the algorithm, as well as other generalizations.

We mainly focus on identity and closeness testing, which are two basic instances of hypothesis testing that have received much attention in the theory, machine learning, and statistics communities; see the works of [GGR98, BFF+01, BFR+13, Bat01, BDKR02, BKR04, Pan08, Val08, GR11, ILR12, LRR13, DDS+13, AJOS14, CDVV14, FJO+15, DKN15b, DKN15a, ADK15, ABR16, CDGR16, CDKS17, DP17, VV17a, DKS18, SDC18, BH18] and the surveys of [Rub12] and [Can15].

The mixture testing problem has a more constrained model compared to the tolerant testing problem, so one might hope to bypass the existing lower bounds. However, the mixture testing problem can also run into near-linear sample complexity lower bounds if one does not provide the tester with sufficient access to the mixture components. Indeed, if the tester does not have access to the noise, we show the mixture testing problem becomes as hard as tolerant testing, necessitating $\Omega(n/\log n)$ samples (Theorem 3.4). Hence, to show nontrivial positive results, the tester must have access to some kind of information about the noise. We consider the following three cases for the noise, namely, (i) when the noise is given as an explicitly specified distribution, (ii) when the tester does not explicitly know the noise distribution, but does have sample access to it, and (iii) when there is no explicit description or access to samples from the noise distribution, but it is known that the noise distribution comes from a class of distributions, e.g., the set of $k$ -histogram distributions. For the first, we obtain a tester with sample complexity $\Theta(\sqrt{n}/\epsilon^{2})$ and for the second, we obtain a tester with sample complexity $\Theta(\sqrt{n}/\epsilon^{2}+n^{2/3}/\epsilon^{4/3})$ where $\epsilon$ is a given proximity parameter; these show that the complexity of our testers is exactly the same as for the classical non-mixture case. For the third, when the noise is assumed to come from the set of $k$ -histogram distributions, we obtain an identity tester that uses $\tilde{O}(\sqrt{kn})$ samples.

2 Preliminaries

For the rest of the paper, we use the following notation. For a distribution $p$ over $[n]$ , we use $p(i)$ to denote the probability of element $i\in[n]$ and for a subset $S\subseteq[n]$ , let $p(S)=\sum_{i\in S}p(i)$ . We use $\|.\|_{p}$ to indicate the $\ell_{p}$ -norm of a vector. We typically use the $\ell_{1}$ -distance and say $p$ and $q$ are $\epsilon$ -close if $\|p-q\|_{1}<\epsilon$ and $\epsilon$ -far otherwise. Let $\mathcal{U}_{n}$ denote the uniform distribution on $[n]$ ; we drop the subscript when the domain is clear from the context. Distribution $p$ is a mixture of $q_{1}$ and $q_{2}$ if there exists $\alpha\in[0,1]$ such that $p=(1-\alpha)\,q_{1}+\alpha q_{2}$ . We call $\alpha$ the mixture parameter. We use $q_{\alpha}$ to denote the mixture $(1-\alpha)q_{1}+\alpha q_{2}$ when the components $q_{1}$ and $q_{2}$ are clear from the context.

Background.

Throughout this paper we consider several distribution testing problems. For a given property of distributions, we use $\Pi$ to denote the set of distributions that satisfy the property. The distance of distribution $p$ to $\Pi$ is the $\ell_{1}$-distance between $p$ and the closest distribution $q$ in $\Pi$. In a distribution testing problem, the goal is to distinguish whether $p$ is in $\Pi$ or is $\epsilon$-far from $\Pi$. We say an algorithm is a tester for property $\Pi$ if each of the following holds with probability at least $2/3$. (Footnote 1: The success probability of $2/3$ is arbitrary here. Given such a tester, we can achieve a success probability of $1-\delta$, via standard amplification methods, at the cost of a $\log(1/\delta)$ multiplicative increase in the sample complexity.)

Completeness: If $p$ is in $\Pi$, then the algorithm outputs accept.

Soundness: If $p$ is $\epsilon$-far from $\Pi$, the algorithm outputs reject.
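The standard amplification of the $2/3$ success probability is usually done by majority vote over independent runs. The following is a minimal Python sketch of that idea, not code from the paper: `run_tester` is a hypothetical zero-argument callable that draws fresh samples on each call and returns `True` for accept, and the number of repetitions (on the order of $\log(1/\delta)$) is left to the caller.

```python
def amplify(run_tester, repetitions):
    """Majority vote over independent runs of a tester.

    run_tester: hypothetical zero-argument callable; it is assumed to draw
        fresh samples on every call and to return True for "accept".
    repetitions: number of independent runs (an odd number avoids ties).
    """
    accepts = sum(1 for _ in range(repetitions) if run_tester())
    # Accept only if strictly more than half of the runs accepted.
    return 2 * accepts > repetitions
```

Each run consumes its own batch of samples, which is where the multiplicative increase in sample complexity comes from.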

The algorithm is an $(\epsilon^{\prime},\epsilon)$ -tolerant tester, if it also satisfies the stronger completeness property that when $p$ is $\epsilon^{\prime}$ -close to some distribution in $\Pi$ , then the algorithm outputs accept (with probability at least $2/3$ ). These definitions can be extended to the case of properties of collections of more than one distribution. Although in the standard setting we receive samples from at least one distribution in the collection, the testing problems may be defined with respect to other methods of access.

We make one of the following three assumptions regarding the algorithm’s view of the distributions: (i) The distribution is explicitly given or known if the algorithm knows the probability of each domain element under the distribution. (ii) The distribution is given by samples if the algorithm has access to an oracle that provides samples from the distribution. (iii) The distribution is neither known nor given by samples but is a member of a given class of distributions.

The term identity testing is used to refer to the setting in which we test if a distribution, to which we have sample access, is equal to a known one. Note that this is equivalent to testing property $\Pi=\{q\}$. The term closeness testing refers to the setting in which we test if two distributions, both available via samples, are equal or not; in this case, $\Pi$ is the set of pairs of equal distributions.

Mixture testing problems.

Suppose $p$, $q_{1}$, and $q_{2}$ are distributions over $[n]$. Let $\Pi_{q_{1},q_{2}}\coloneqq\{(1-\alpha)\,q_{1}+\alpha\,q_{2}\mid\alpha\in[0,1]\}$ (we usually drop the subscripts $q_{1},q_{2}$ when they are clear from context). In a mixture testing problem, the goal is to distinguish whether a distribution $p$ given via samples is in $\Pi_{q_{1},q_{2}}$ or $\epsilon$-far from any distribution in $\Pi_{q_{1},q_{2}}$ with probability at least $2/3$. We investigate the following problems, which differ in the way the mixture testing algorithm can access $q_{1},q_{2}$. Note however that the mixture parameter $\alpha$ is not given to the tester. (i) An algorithm is an identity tester in the presence of known noise if it solves the mixture testing problem when $q_{1},q_{2}$ are known to the tester. (ii) An algorithm is a closeness tester in the presence of noise that is accessible via samples if it solves the mixture testing problem when $q_{1},q_{2}$ are not explicitly given, but samples of each are provided to the tester. (iii) An algorithm is an identity tester in the presence of class $\mathcal{C}$-noise if it can distinguish whether $p$ is a mixture of a known distribution $q_{1}$ and some $q_{2}\in\mathcal{C}$. Note that such an algorithm is a property tester for $\Pi\coloneqq\{(1-\alpha)\,q_{1}+\alpha\,q_{2}\mid q_{2}\in\mathcal{C}\mbox{, and }\alpha\in[0,1]\}$.

Note that one can also define “closeness testing in the presence of known noise”, and “identity testing in the presence of noise that is given via samples”, but our lower bounds will show that the sample complexity of these tasks is the same as the sample complexity of closeness testing in the presence of noise that is given via samples.

3 An overview of our results and techniques

3.1 Testing identity in the presence of known noise

We first consider the problem of testing if distribution $p$, given via samples, can be expressed as a mixture of known distributions $q_{1}$ and $q_{2}$. We show the following.

Theorem 3.1.

Given two known distributions $q_{1}$, $q_{2}$, and $\epsilon>0$, there is an identity tester in the presence of known noise that uses $O(\sqrt{n}/\epsilon^{2})$ samples. Furthermore, $\Omega(\sqrt{n}/\epsilon^{2})$ samples are required.

At a high level, we take the following steps to test if $p$ is a mixture or $\epsilon$ -far from it. First, we develop an algorithm (learner) to learn mixture distributions. The learner receives samples from $p$ and outputs a mixture distribution $q_{\alpha}$ . If $p$ is a mixture, then we show that the learner finds a mixture distribution $q_{\alpha}$ that is $\epsilon^{\prime}$ -close to $p$ for some proximity parameter $\epsilon^{\prime}\coloneqq\Theta(\epsilon)$ ; and if $p$ is not a mixture, the learner outputs $q_{\alpha}$ with no specific guarantee. Second, we use the distance between $p$ and $q_{\alpha}$ as a measure to decide about $p$ : if $p$ is $\epsilon^{\prime}$ -close to $q_{\alpha}$ , we accept $p$ ; and if $p$ is $\epsilon$ -far from $q_{\alpha}$ , we reject it. This approach results in a tester for $p$ . In fact, if $p$ is a mixture, then we show that the learner finds a $q_{\alpha}$ that is $\epsilon^{\prime}$ -close and if $p$ is $\epsilon$ -far from being a mixture, then we show that $p$ has to be $\epsilon$ -far from any mixture distribution, including $q_{\alpha}$ .

The challenge in this approach is to distinguish whether $p$ is close to $q_{\alpha}$ or far from it. In general, testing whether two distributions are $\epsilon^{\prime}$ -close or $\epsilon$ -far from each other requires ${\Omega}(n/\log n)$ samples. However, we show that we can exploit the structural properties of mixture distributions to achieve a sample-efficient algorithm. Below we provide a more detailed description of the steps.

The learner.

The algorithm begins by assuming that the given distribution $p$ is indeed a mixture, and attempts to learn the mixture parameter: If $p$ is a mixture, then we show that it can be learned to error $\epsilon^{\prime}=\Theta(\epsilon)$ using $O(1/\epsilon^{2})$ samples given $q_{1}$ and $q_{2}$ . The algorithm picks a subset $S$ of elements such that it contains every element $i$ for which $q_{1}(i)\geq q_{2}(i)$ and estimates the weight of these elements according to $p$ , i.e., $p(S)$ . $S$ satisfies that $q_{1}(S)-q_{2}(S)$ is exactly the total variation distance between $q_{1}$ and $q_{2}$ . Comparing $p(S)$ with the weight of these elements according to $q_{1}$ and $q_{2}$ guides us to choose a mixture parameter $\alpha$ , and allows us to bound the distance between $p$ and $q_{\alpha}\coloneqq(1-\alpha)q_{1}+\alpha q_{2}$ . (Instead of learning $\alpha$ , one might do a grid search on $\alpha\in[0,1]$ ; however the granularity required could make the resulting algorithm sub-optimal.)
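The learner's estimate of the mixture parameter can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's exact procedure: distributions are plain lists of probabilities over `range(n)`, samples are integer indices, the function name is hypothetical, and any slack terms used in the formal analysis are omitted.

```python
def learn_mixture_parameter(samples, q1, q2):
    """Estimate alpha such that p is close to (1 - alpha) * q1 + alpha * q2.

    samples: list of indices drawn from p (assumed i.i.d.).
    q1, q2: known distributions as lists of probabilities of equal length.
    """
    n = len(q1)
    # S contains every element on which q1 puts at least as much mass as q2.
    S = {i for i in range(n) if q1[i] >= q2[i]}
    q1_S = sum(q1[i] for i in S)
    q2_S = sum(q2[i] for i in S)
    gap = q1_S - q2_S  # equals the total variation distance between q1 and q2
    if gap < 1e-12:
        return 0.0  # q1 and q2 coincide, so every alpha gives the same mixture
    # Empirical estimate of p(S).
    p_S_hat = sum(1 for x in samples if x in S) / len(samples)
    # If p = q_alpha, then p(S) = q1(S) - alpha * gap; solve for alpha and clip.
    alpha = (q1_S - p_S_hat) / gap
    return min(1.0, max(0.0, alpha))
```

Because only the single scalar $p(S)$ is estimated, the number of samples needed does not grow with the domain size $n$.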

Assessing the distance between $p$ and $q_{\alpha}$ .

After obtaining $q_{\alpha}$ , the task of distinguishing whether distribution $p$ is a mixture or $\epsilon$ -far from a mixture boils down to testing if $p$ is $\epsilon^{\prime}$ -close to $q_{\alpha}$ or is $\epsilon$ -far from it. We propose a scheme to reshape the distributions $p$ and $q_{\alpha}$ and get two new distributions $p^{\prime}$ and $q^{\prime}_{\alpha}$ such that for $p$ that is a mixture, the $\ell_{2}$ -distance between $p^{\prime}$ and $q^{\prime}_{\alpha}$ is at most $O(\epsilon/\sqrt{n})$ . Furthermore, in the case where $p$ is $\epsilon$ -far from being a mixture, $p^{\prime}$ is $\epsilon$ -far from $q^{\prime}_{\alpha}$ . It is known that one can efficiently distinguish the case that $\|p^{\prime}-q^{\prime}_{\alpha}\|_{2}\leq O(\epsilon/\sqrt{n})$ versus $\|p^{\prime}-q^{\prime}_{\alpha}\|_{1}\geq\epsilon$ using $O(\sqrt{n}/\epsilon^{2})$ samples [DK16, CDVV14].

Here, we elaborate further on how we reshape the distributions. Similar techniques have been used previously to reduce the $\ell_{2}$ -norm, e.g., in [DK16]. Here, we use it to bound the $\ell_{2}$ -distance between $p^{\prime}$ and $q^{\prime}_{\alpha}$ . The reshaping process is as follows. Define $p^{\prime}$ , the reshaped distribution of $p$ with a new domain which is larger than the domain of $p$ . For each element $i$ , we determine an integer $a_{i}$ solely based on $q_{1}$ , $q_{2}$ , and $q_{\alpha}$ . Then we add $(i,j)$ for all $j$ in $[a_{i}]$ to the domain of $p^{\prime}$ . We set the probability of element $(i,j)$ to be $p(i)/a_{i}$ . Also, we reshape $q_{\alpha}$ according to the same process and get $q^{\prime}_{\alpha}$ .

But how can reshaping reduce the $\ell_{2}$ -distance? Given that $p$ is a mixture, for each element $i$ in the domain, the discrepancy between the probability of $i$ according to $p^{\prime}$ and $q^{\prime}_{\alpha}$ , $|p(i)-q_{\alpha}(i)|/a_{i}$ , is proportional to $|q_{1}(i)-q_{2}(i)|/a_{i}$ . With this observation, we set the $a_{i}$ ’s such that they make the discrepancy $O(\epsilon/n)$ for each element. This ensures the $\ell_{2}$ -distance between $p^{\prime}$ and $q^{\prime}_{\alpha}$ is $O(\epsilon/\sqrt{n})$ .
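The reshaping step can be sketched as follows in Python. The bucket count used here, $a_{i}=\lfloor n\,q_{\alpha}(i)\rfloor+\lfloor n\,|q_{\alpha}(i)-q_{2}(i)|/\|q_{\alpha}-q_{2}\|_{1}\rfloor+1$, follows the formula for $a_{i}$ listed among the paper's equations; the function names and the list-based representation of distributions are assumptions made for illustration.

```python
import math
import random

def bucket_counts(q_alpha, q2):
    """Number of copies a_i for each element i; depends only on q_alpha and q2."""
    n = len(q_alpha)
    l1 = sum(abs(a - b) for a, b in zip(q_alpha, q2))
    counts = []
    for qa, qb in zip(q_alpha, q2):
        extra = math.floor(n * abs(qa - qb) / l1) if l1 > 0 else 0
        counts.append(math.floor(n * qa) + extra + 1)
    return counts

def reshape_distribution(p, counts):
    """Explicit reshaped distribution: element (i, j) gets mass p(i) / a_i."""
    return {(i, j): p[i] / a for i, a in enumerate(counts) for j in range(a)}

def reshape_sample(i, counts, rng=random):
    """Turn a sample i from p into a sample (i, j) from p', with j uniform in [a_i]."""
    return (i, rng.randrange(counts[i]))
```

Splitting element $i$ into $a_{i}$ equal parts leaves the $\ell_{1}$-distance between the two reshaped distributions unchanged, while each individual discrepancy shrinks by a factor of $a_{i}$; the new domain has at most $3n$ elements.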

The arguments described above are formalized in Theorem 4.3. In addition, in the case where $q_{1}$ and $q_{2}$ are uniform, this problem is as hard as testing if a distribution is uniform, which needs at least $\Omega(\sqrt{n}/\epsilon^{2})$ samples ([Pan08]), showing that the sample complexity of our algorithm is tight. Furthermore, we match the sample complexity of the standard identity tester where there is no noise involved.

3.2 Testing closeness in the presence of noise that is accessible via samples

We next investigate the problem of testing closeness of distributions in the presence of noise that is accessible via samples. Suppose we have sample access to three distributions, $p$ , $q_{1}$ , and $q_{2}$ , over $[n]$ . The goal is to test if there is a mixture parameter $\alpha^{*}$ such that $p=(1-\alpha^{*})\,q_{1}+\alpha^{*}q_{2}$ , or $p$ is $\epsilon$ -far from any distribution in this form.

Similarly to the identity testing algorithm explained earlier, our approach is to first attempt to learn $p$. That is, we design an algorithm that finds a candidate mixture distribution, $q_{\alpha}\coloneqq(1-\alpha)q_{1}+\alpha\,q_{2}$, such that if $p$ is a mixture of $q_{1}$ and $q_{2}$, then $q_{\alpha}$ will be $(\epsilon/\sqrt{n})$-close to $p$ in $\ell_{2}$-distance; and if $p$ is not a mixture, the algorithm finds a distribution $q_{\alpha}$ with no specific guarantees. Then, we test to see if $p$ is $(\epsilon/(2\sqrt{n}))$-close to $q_{\alpha}$ in $\ell_{2}$-distance, or $(\epsilon/\sqrt{n})$-far from it. The answer of the test dictates if we should accept or reject $p$. Indeed, if $p$ is a mixture distribution, $q_{\alpha}$ is very close to $p$, and the test will accept $p$. If $p$ is $\epsilon$-far from being a mixture, then $p$ is $\epsilon$-far from $q_{\alpha}$, and furthermore $p$ and $q_{\alpha}$ are $(\epsilon/\sqrt{n})$-far from each other in $\ell_{2}$-distance, so that the test will reject $p$.

But how do we learn $p$? Since we are looking for $q_{\alpha}$, which is close to $p$ in $\ell_{2}$-distance, we study the problem of estimating the $\ell_{2}$-distance between $p$ and a mixture distribution of $q_{1}$ and $q_{2}$. Inspired by the $\ell_{2}$-distance estimator proposed by [CDVV14], we propose a statistic that, given $\alpha$, estimates the squared $\ell_{2}$-distance between $p$ and $q_{\alpha}$ up to a factor of $s^{2}$: $f(\alpha):=\sum\limits_{i=1}^{n}(X_{i}-(1-\alpha)Y_{i}-\alpha Z_{i})^{2}-X_{i}-(1-\alpha)^{2}Y_{i}-{\alpha}^{2}Z_{i},$ where $X_{i}$, $Y_{i}$, and $Z_{i}$ are the number of instances of element $i$ among samples from $p$, $q_{1}$, and $q_{2}$ respectively. The statistic is designed such that it is equal to $s^{2}\|p-q_{\alpha}\|_{2}^{2}$ in expectation, where $s$ is the number of samples from each of the distributions $p$, $q_{1}$, and $q_{2}$.
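To make the statistic concrete, here is a small Python sketch that evaluates $f(\alpha)$ directly from three sample lists. The function name and the representation of samples as integer indices in `range(n)` are assumptions for illustration. Expectation identities of this kind are usually derived under Poissonized sample sizes; this sketch only evaluates the formula on whatever counts it is given.

```python
from collections import Counter

def mixture_statistic(x_samples, y_samples, z_samples, alpha, n):
    """Evaluate f(alpha) from samples of p, q1, and q2 over the domain range(n).

    X[i], Y[i], Z[i] count the occurrences of element i among the samples
    from p, q1, and q2 respectively.
    """
    X, Y, Z = Counter(x_samples), Counter(y_samples), Counter(z_samples)
    total = 0.0
    for i in range(n):
        residual = X[i] - (1 - alpha) * Y[i] - alpha * Z[i]
        # Squared residual minus the correction terms from the definition of f.
        total += residual ** 2 - X[i] - (1 - alpha) ** 2 * Y[i] - alpha ** 2 * Z[i]
    return total
```

Setting `alpha = 0` reduces the expression to $\sum_{i}(X_{i}-Y_{i})^{2}-X_{i}-Y_{i}$, which compares $p$ with $q_{1}$ alone.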

Given the sample sets, the goal is to use the quadratic function $f$ to find a candidate $\alpha$. For now, assume $p$ is a mixture of $q_{1}$ and $q_{2}$ with parameter $\alpha^{*}$. We make two observations about $f(\alpha^{*})$: (i) the expectation of $f(\alpha)$ is minimum, in fact zero, when $\alpha=\alpha^{*}$, and (ii) we provide a threshold $T$ for which $|f(\alpha^{*})|$ is at most $T$ with high probability. Although $\alpha^{*}$ is not given to the algorithm, we wish to pick a candidate $\alpha$ that is very close to $\alpha^{*}$. We use the above two observations as a guide to take the following strategy: find $\alpha$ that minimizes $f(\alpha)$ subject to $|f(\alpha)|$ being at most $T$. This method may produce several candidate $\alpha$'s. We establish that if $p$ is a mixture, then one of the candidate $\alpha$'s will result in a mixture distribution $q_{\alpha}$ that is $(\epsilon/(2\sqrt{n}))$-close to $p$ in $\ell_{2}$-distance. (Once again, a grid search on $\alpha\in[0,1]$ will not yield an optimal sample complexity.)

From there on, we only need to test whether any of the candidates we found is $(\epsilon/(2\sqrt{n}))$-close to $p$ in $\ell_{2}$-distance. If $p$ is a mixture, we are promised that one of the candidates will pass the test. Otherwise, all candidates give distributions that are, by definition, $\epsilon$-far from $p$ (implying $(\epsilon/\sqrt{n})$-far in $\ell_{2}$-distance), so all of them will fail. Our approach yields the following result:

Theorem 3.2.

*Assume we have sample access to three distributions $p$, $q_{1}$, and $q_{2}$ over $[n]$. There exists a closeness tester in the presence of noise that uses $O(\sqrt{n}/\epsilon^{2}+n^{2/3}/\epsilon^{4/3})$ samples. Furthermore, $\Omega(\sqrt{n}/\epsilon^{2}+n^{2/3}/\epsilon^{4/3})$ samples are required.*

See Theorem 5.6 for the formal statement of the result. For the lower bound of sample complexity, we establish that the lower bound for standard closeness testers holds in the mixture setting as well, even in the case where $q_{1}$ or $q_{2}$ is known. In particular, we show given sample access to $p$ and $q_{1}$ , testing whether $p$ is a mixture of $q_{1}$ and the uniform distribution requires $\Omega(\sqrt{n}/\epsilon^{2}+n^{2/3}/\epsilon^{4/3})$ samples (Proposition 1). Hence, one cannot hope to achieve a better sample complexity.

3.3 Testing identity in the presence of $k$ -flat noise

On the one hand, the sample complexity of distribution testing under arbitrary noise is significantly worse than that of noise-free distribution testing. On the other hand, we have seen that the sample complexity of distribution testing with noise (either known or given via sample access) is very similar to the sample complexity of noise-free distribution testing. This raises the question of whether one can relax the requirement of the access to the noise by the tester and still achieve better sample complexity. The next problem we consider is the identity testing problem when there is no direct access to the noise (either via samples, or an explicit description) except for the promise that the noise comes from a class, in particular, the class of $k$ -flat distributions.

We say a distribution is $k$ -flat if the probability mass function of the distribution is a piece-wise constant function with $k$ pieces. We investigate the following problem: given a known distribution $q$ and having sample access to $p$ , can we distinguish if $p$ is a mixture of $q$ and some $k$ -flat distribution, or $p$ is $\epsilon$ -far from any such distribution? We provide an algorithm that uses $\tilde{O}(\sqrt{kn})$ samples.
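As a small illustration of the definition, the sketch below counts the constant pieces of a probability mass function given as a list. It reads "with $k$ pieces" as "with at most $k$ pieces", which is an assumption, and the names `count_flat_pieces` and `is_k_flat` are hypothetical.

```python
def count_flat_pieces(pmf, tol=0.0):
    """Number of maximal runs of consecutive entries that differ by at most tol."""
    if len(pmf) == 0:
        return 0
    pieces = 1
    for prev, cur in zip(pmf, pmf[1:]):
        if abs(cur - prev) > tol:
            pieces += 1  # a new constant piece starts here
    return pieces

def is_k_flat(pmf, k, tol=0.0):
    """True if pmf is piecewise constant with at most k pieces (assumed reading)."""
    return count_flat_pieces(pmf, tol) <= k
```

For example, the mass function `[0.1, 0.1, 0.3, 0.3, 0.2]` has three pieces. The tolerance `tol` compares neighbouring entries only, so with a positive tolerance slow drift inside one piece is not detected.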

Inspired by the identity tester proposed in [BFF+01], we propose the following approach. First, we guess the $k$ intervals on which the noise is constant. Then, we take the elements of each interval and further partition them into subsets (not necessarily contiguous) such that in each subset the probabilities of the elements according to $q$ are very similar to each other (similar enough so that we can show that $q$ is nearly uniform on each subset). For a mixture distribution $p$, if we have guessed the intervals correctly, $p$ is almost uniform within each subset since it is a mixture of an almost uniform $q$ and a constant function (noise). Hence, to see if $p$ is a mixture, we first test each of these subsets and see if $p$ is close to uniform on them. We then estimate the total weights that $p$ assigns to each of these subsets and determine if the weights are consistent with a mixture of $q$ and some $k$-flat distribution. One challenge is to find a sampling method that guarantees good results for all initial guesses of the $k$ intervals describing the noise. See Section 6 for details.

It is not hard to see that in the case where $q$ is uniform, identity testing in the presence of $k$ -flat noise is as hard as testing whether a distribution is $k$ -flat or $\epsilon$ -far from any such distribution. Therefore, there exists a lower bound of $\Omega(\sqrt{n}/\epsilon^{2}+k/(\epsilon\,\log k))$ for our problem derived from the lower bound for testing $k$ -flat distributions in [Can16].

3.4 Lower bounds

We show that testing identity with respect to the uniform distribution when the noise component can be an arbitrary distribution requires a near-linear number of samples, i.e., $\Omega(n/\log n)$. More specifically,

Theorem.

Assume $p$ is a distribution on $[n]$. There exists a constant parameter $\epsilon$ such that distinguishing the following cases with probability at least $2/3$ requires $\Omega(n/\log n)$ samples.

• There exists a noise distribution on $[n]$, namely $\eta$, and an $\alpha\leq\epsilon_{1}$ such that $p$ is a mixture of uniform and $\eta$ with parameter $\alpha$, i.e., $p=(1-\alpha)\,\mathcal{U}+\alpha\,\eta$.

• There is no noise distribution $\eta$ such that $p=(1-\alpha)\,\mathcal{U}+\alpha\,\eta$ unless $\alpha=1$.

The main idea is to reduce this problem to that of testing the $T$-bigness property [AGP+19], which holds if all probabilities are above a given threshold $T$.

4 Identity testing of mixtures in the presence of known noise

In this section, we give an algorithm which tests if $p$ is close to a mixture where both the components are explicitly known. As before, we assume the mixture parameter $\alpha$ is unknown.

The main idea is to attempt to learn a mixture distribution $q_{\alpha}$ that is close to $p$. Using $q_{1}$, $q_{2}$, and $q_{\alpha}$, we then reshape the distribution $p$ to another distribution $p^{\prime}$ and use the same reshaping to transform $q_{\alpha}$ to $q^{\prime}_{\alpha}$. The reshaping has the property that if $p$ is indeed a mixture, then $p^{\prime}$ and $q^{\prime}_{\alpha}$ will be extremely close to each other in $\ell_{2}$-distance, and if $p$ is not a mixture, then $p^{\prime}$ and $q^{\prime}_{\alpha}$ will be quite far from each other. Thus we can use a (non-tolerant) identity tester on $p^{\prime}$ and $q^{\prime}_{\alpha}$.

In the rest of this section, we present the three main steps of the testing algorithm. The first step is a learner algorithm that finds $q_{\alpha}$ (Section 4.1). The second step is a reshaping process that transforms the distributions $p$ and $q_{\alpha}$ into $p^{\prime}$ and $q^{\prime}_{\alpha}$ respectively (Section 4.2). The third step is to put these pieces together to get the identity tester (Section 4.3).

4.1 The learner

At a high level, the learner proceeds as follows. Observe that if $p$ is a mixture of $q_{1}$ and $q_{2}$, then there is a parameter $\alpha^{*}\in[0,1]$ such that $p=(1-\alpha^{*})q_{1}+\alpha^{*}q_{2}$, so to learn $p$ it is sufficient to learn $\alpha^{*}$. Let $S$ be the set of all domain elements $x$ where $q_{1}(x)$ is at least $q_{2}(x)$. By definition, for $S\subseteq[n]$, we have $p(S)=(1-\alpha^{*})q_{1}(S)+\alpha^{*}q_{2}(S)$, which leads to $\alpha^{*}=\left(q_{1}(S)-p(S)\right)/\left(q_{1}(S)-q_{2}(S)\right)$. The idea then is to replace $p(S)$ with its estimate, say $w_{S}$, to get an estimate $\alpha$ of $\alpha^{*}$. We formally describe the procedure in Algorithm 1 and prove its correctness in Lemma 4.1.

Lemma 4.1.

Suppose $p=(1-\alpha^{*})q_{1}+\alpha^{*}q_{2}$. Using $O(1/\epsilon^{2})$ samples, Algorithm 1 outputs a mixture parameter $\alpha\leq\alpha^{*}$ such that $\mathbf{Pr}\left[\|q_{\alpha}-p\|_{1}<\epsilon\right]\geq 5/6$.

Proof.

First, observe that if $q_{1}$ and $q_{2}$ are $\epsilon$ -close, then $p$ is $\epsilon$ -close to $q_{1}$ as well, so distribution $q_{1}$ which is a mixture with parameter $\alpha=0$ is a valid output. For the remainder of the proof, we assume $q_{1}$ and $q_{2}$ are $\epsilon$ -far from each other.

Let $S=\{i\in[n]\mid q_{1}(i)>q_{2}(i)\}$. We have $\alpha^{*}=\frac{q_{1}(S)-p(S)}{q_{1}(S)-q_{2}(S)}.$ The only unknown value in the above expression is $p(S)$, which we estimate using $O(1/\epsilon^{2})$ samples from $p$. We show that replacing $p(S)$ with this estimate yields a viable estimate for $\alpha^{*}$.

Let $w_{S}$ be the fraction of the samples from $p$ that fall in $S$. By the Hoeffding bound, we have $\mathbf{Pr}\left[|p(S)-w_{S}|\leq\epsilon/4\right]\geq 5/6.$ We define our estimate of $\alpha^{*}$ as: $\alpha\coloneqq\frac{q_{1}(S)-(w_{S}+\epsilon/4)}{q_{1}(S)-q_{2}(S)}\,.$ The reason that we add $\epsilon/4$ to $w_{S}$ is to ensure an overestimate of $p(S)$, so that $\alpha$ is at most $\alpha^{*}$ with high probability. That is, since with high probability $p(S)\leq w_{S}+\epsilon/4$, we get:

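$\displaystyle{\alpha^{*}=\frac{q_{1}(S)-p(S)}{q_{1}(S)-q_{2}(S)}\geq\frac{q_{1}(S)-(w_{S}+\epsilon/4)}{q_{1}(S)-q_{2}(S)}=\alpha\,.}$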

Below, we show $q_{\alpha}$ is close to $p$ in $\ell_{1}$ -distance. Based on the definition of $S$ , $q_{1}(S)-q_{2}(S)$ is equal to the total variation distance (i.e., half of the $\ell_{1}$ -distance) between $q_{1}$ and $q_{2}$ . With probability $5/6$ ,

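$\displaystyle{\|q_{\alpha}-p\|_{1}=(\alpha^{*}-\alpha)\,\|q_{1}-q_{2}\|_{1}=2\,(\alpha^{*}-\alpha)\left(q_{1}(S)-q_{2}(S)\right)=2\left(w_{S}+\epsilon/4-p(S)\right)\leq\epsilon\,.}$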

∎
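The estimate analyzed in this proof can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's Algorithm 1: closeness of $q_{1}$ and $q_{2}$ is measured in $\ell_{1}$-distance, the output is clipped to $[0,1]$, elements are 0-indexed, and the function names and the explicit Hoeffding constant in `num_samples` are choices made here.

```python
import math
import numpy as np

def num_samples(eps):
    """Samples so that Hoeffding gives |w_S - p(S)| <= eps/4 w.p. >= 5/6.

    Solves 2 * exp(-2 * m * (eps/4)^2) <= 1/6 for m (constant chosen here).
    """
    return math.ceil(8.0 * math.log(12.0) / eps**2)

def learn_alpha(samples_p, q1, q2, eps):
    """Estimate the mixture parameter from samples of p and known q1, q2."""
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    in_S = q1 > q2                              # S = {i : q1(i) > q2(i)}
    gap = q1[in_S].sum() - q2[in_S].sum()       # total variation distance
    if 2.0 * gap <= eps:                        # q1, q2 are eps-close in l1
        return 0.0                              # q1 itself is a valid output
    w_S = np.mean(in_S[np.asarray(samples_p, dtype=int)])  # fraction in S
    alpha = (q1[in_S].sum() - (w_S + eps / 4.0)) / gap
    return float(min(max(alpha, 0.0), 1.0))     # clipping is an assumption
```

With $q_{1}$ supported on the first half of the domain, $q_{2}$ on the second half, and $70\%$ of the samples landing in $S$, calling `learn_alpha` with $\epsilon=0.2$ returns $0.25$, slightly below the value $0.3$ that an exact $p(S)=0.7$ would give.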

4.2 Reshaping the distributions

Using Algorithm 1, given $p$ , we can obtain a mixture parameter $\alpha$ and a mixture distribution $q_{\alpha}$ for which (i) if $p$ is the mixture of $q_{1}$ and $q_{2}$ with parameter $\alpha^{*}$ , then $q_{\alpha}$ is $\epsilon^{\prime}$ -close to $p$ for a proximity parameter $\epsilon^{\prime}=\epsilon/6$ , and $\alpha\leq\alpha^{*}$ and (ii) if $p$ is $\epsilon$ -far from being a mixture, then $p$ is $\epsilon$ -far from $q_{\alpha}$ . Ideally, we wish to use an identity tester to see if $q_{\alpha}$ and $p$ are roughly the same or far from each other. Unfortunately, this is not possible in general, unless $p$ and $q_{\alpha}$ are very close on every domain element. To resolve this issue, the goal in this section is to introduce two distributions $p^{\prime}$ and $q^{\prime}_{\alpha}$ such that (i) when $p$ and $q_{\alpha}$ are close, $p^{\prime}$ and $q^{\prime}_{\alpha}$ are very close to each other on every domain element and (ii) when $p$ and $q_{\alpha}$ are far, $p^{\prime}$ and $q^{\prime}_{\alpha}$ are far. Our reshaping process is inspired by the method of [DK16]: For each element $i\in[n]$ , using $q_{\alpha}$ , we define:

[TABLE]

Note that the process in [DK16] uses only the first and third terms of the above sum in defining $a_{i}$. We start the reshaping process by associating $a_{i}\geq 1$ buckets to each domain element $i\in[n]$ to form a new domain $D=\{(i,j)\mid i\in[n]\text{ and }j\in[a_{i}]\}$. To draw a sample from $p^{\prime}$, we first draw a sample $i$ from $p$, then we sample $j$ uniformly at random from $[a_{i}]$, and return the pair $(i,j)$ as the sample from $p^{\prime}$. We say $p^{\prime}$ is a reshaping of $p$ with respect to $q_{\alpha}$. Clearly $p^{\prime}(i,j)=p(i)/a_{i}$. In a similar manner, we define the reshaping $q^{\prime}_{\alpha}$ of $q_{\alpha}$ and once again, we have $q^{\prime}_{\alpha}(i,j)=q_{\alpha}(i)/a_{i}$.
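The bucket construction can be sketched as follows. The bucket counts are passed in as a plain list `buckets` (a hypothetical argument standing in for the $a_{i}$ defined above), elements are 0-indexed, and the function names are illustrative.

```python
import numpy as np

def reshape_pmf(dist, buckets):
    """Reshaped mass function over D: maps (i, j) to dist[i] / a_i."""
    return {(i, j): dist[i] / a for i, a in enumerate(buckets) for j in range(a)}

def sample_reshaped(dist, buckets, rng):
    """Draw i from dist, then a bucket j uniformly from the a_i buckets of i."""
    i = int(rng.choice(len(dist), p=dist))
    j = int(rng.integers(buckets[i]))
    return i, j

def l1_distance(d1, d2):
    """l1-distance between two mass functions stored as dicts on the same keys."""
    return sum(abs(d1[key] - d2[key]) for key in d1)
```

Because every bucket of element $i$ carries the same share $1/a_{i}$ of its mass under both distributions, `l1_distance(reshape_pmf(p, a), reshape_pmf(q, a))` coincides (up to rounding) with the $\ell_{1}$-distance between `p` and `q`, whatever positive bucket counts are used.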

We next prove several crucial properties of the reshaped distributions.

Lemma 4.2.

Let $p^{\prime}$ and $q_{\alpha}^{\prime}$ be the result of the reshaping of $p$ and $q_{\alpha}$ with respect to $q_{\alpha}$ as described above. Then, the following hold:

(i) The $\ell_{1}$-distance after reshaping does not change: $\|p-q_{\alpha}\|_{1}=\|p^{\prime}-q^{\prime}_{\alpha}\|_{1}$.

(ii) The domain size of $p^{\prime}$ and $q^{\prime}_{\alpha}$ satisfies $|D|\leq 3n$.

(iii) The $\ell_{2}$-norm of $q^{\prime}_{\alpha}$ satisfies $\|q^{\prime}_{\alpha}\|_{2}\leq\sqrt{3/n}$.

(iv) If $p$ is a mixture distribution, $q_{\alpha}$ is $\epsilon^{\prime}$-close to $p$, and $\alpha$ is at most $\alpha^{*}$, then $|p^{\prime}(i,j)-q^{\prime}_{\alpha}(i,j)|\leq\epsilon^{\prime}/n$ for all $(i,j)\in D$.

Proof.

To prove (i), note that

$\displaystyle{\|p^{\prime}-q^{\prime}_{\alpha}\|_{1}=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{a_{i}}|p^{\prime}(i,j)-q^{\prime}_{\alpha}(i,j)|=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{a_{i}}\frac{|p(i)-q_{\alpha}(i)|}{a_{i}}=\sum\limits_{i=1}^{n}|p(i)-q_{\alpha}(i)|=\|p-q_{\alpha}\|_{1}.}$

For (ii), we have:

[TABLE]

We now prove (iii). If $q_{\alpha}(i)<1/n$ , then $q^{\prime}_{\alpha}(i,j)<1/n$ . If $q_{\alpha}(i)\geq 1/n$ , then $a_{i}\geq nq_{\alpha}(i)$ and hence $q^{\prime}_{\alpha}(i,j)\leq 1/n$ . Therefore, $q^{\prime}_{\alpha}(i,j)\leq 1/n$ for all $i,j$ . Since the domain size of $q^{\prime}_{\alpha}$ is at most $3n$ , $\|q^{\prime}_{\alpha}\|_{2}$ is at most $\sqrt{3/n}$ .

Finally, we show (iv). Since $p$ is a mixture distribution, there is an $\alpha^{*}\in[0,1]$ such that $p=(1-\alpha^{*})q_{1}+\alpha^{*}\,q_{2}$ . Also, we have that $q_{\alpha}$ has a mixture parameter $\alpha\leq\alpha^{*}$ . Furthermore, we also have $\|p-q_{\alpha}\|_{1}\leq\epsilon^{\prime}$ . Let $\beta=(\alpha^{*}-\alpha)/(1-\alpha)$ which is in $[0,1]$ . Observe that

$\displaystyle{(1-\beta)q_{\alpha}+\beta\,q_{2}=(1-\beta)(1-\alpha)q_{1}+\left(\alpha(1-\beta)+\beta\right)q_{2}=(1-\alpha^{*})q_{1}+\alpha^{*}\,q_{2}=p\,.}$

Thus, $p$ is a mixture of $q_{\alpha}$ and $q_{2}$ . For an element $(i,j)\in D$ , we can bound the difference of $p^{\prime}(i,j)$ and $q^{\prime}_{\alpha}(i,j)$ as follows, which finishes the proof.

[TABLE]

∎

4.3 The mixture testing algorithm

In this section, we use the learner and the reshaped distributions to obtain an identity tester for mixtures of two known distributions.

Theorem 4.3.

*Given a proximity parameter $\epsilon$, Algorithm 2 is an identity tester in the presence of known noise that uses $O(\sqrt{n}/\epsilon^{2})$ samples.*

Proof.

Let $\epsilon^{\prime}=\epsilon/6$ . In the completeness case, $p$ is a mixture distribution with parameter $\alpha^{*}$ . Therefore, with probability at least 5/6, $q_{\alpha}$ is a mixture distribution with parameter $\alpha$ for which $q_{\alpha}$ is $\epsilon^{\prime}$ -close to $p$ . Let $p^{\prime}$ and $q^{\prime}_{\alpha}$ be the reshaped distributions described in Section 4.2. By Lemma 4.2(ii), $|D|\leq 3n$ . Moreover, for any $(i,j)\in D$ , $|p^{\prime}(i,j)-q^{\prime}_{\alpha}(i,j)|\leq\epsilon^{\prime}/n$ , which implies that

[TABLE]

Conversely, if $p$ is $\epsilon$-far from being a mixture distribution, then it has to be $\epsilon$-far from $q_{\alpha}$. By Lemma 4.2(i), $p^{\prime}$ and $q^{\prime}_{\alpha}$ are $\epsilon$-far from each other. Therefore, by the Cauchy–Schwarz inequality, $\|p^{\prime}-q^{\prime}_{\alpha}\|_{2}\geq\epsilon/\sqrt{|D|}$. Using the identity tester (Identity-Tester) provided in [DK16] (see Remark 2.7 and Remark 2.8), there exists an algorithm that can distinguish the above cases with probability $5/6$ using $O(\sqrt{|D|}/\epsilon^{2})$ samples. Thus, with probability at least $2/3$, both the invoked learner and the tester return the right answer. Also, the sample complexity is $O(\sqrt{n}/\epsilon^{2}+1/\epsilon^{2})=O(\sqrt{n}/\epsilon^{2})$. Hence the proof is complete. ∎

5 Testing mixtures in the presence of noise that is accessible via samples

In this section, we provide an algorithm for testing closeness of distributions in the presence of noise that is accessible via samples. We assume we have sample access to three distributions $p$, $q_{1}$, and $q_{2}$ over $[n]$, and the goal is to test if $p$ is a mixture of $q_{1}$ and $q_{2}$. Our approach is first to learn $p$ in an indirect manner. Specifically, we design an algorithm that finds a candidate mixture distribution $q_{\alpha}\coloneqq(1-\alpha)q_{1}+\alpha\,q_{2}$ such that, with high probability, if $p$ is a mixture of $q_{1}$ and $q_{2}$, then $q_{\alpha}$ will be close to $p$. We claim that the answer to the test “is $p$ close to $q_{\alpha}$?” can be used to test if $p$ is close to a mixture of $q_{1}$ and $q_{2}$. Indeed, if $p$ is a mixture distribution, by the property of the learning algorithm, $q_{\alpha}$ is close to $p$ and hence the test will accept. Conversely, if $p$ is far from being a mixture, then $p$ is far from any mixture distribution including $q_{\alpha}$, and hence the test will reject.

In particular, the candidate $q_{\alpha}$ will be such that (i) if $p$ is a mixture, then $\|p-q_{\alpha}\|_{2}\leq c\epsilon/\sqrt{n}$ for a sufficiently small constant $c$ and (ii) if $p$ is $\epsilon$-far from being a mixture, then $\|p-q_{\alpha}\|_{2}\geq\epsilon/\sqrt{n}$. As we will see, the robust $\ell_{2}$-distance tester of [CDVV14] can efficiently distinguish these two cases. Since we are looking for $q_{\alpha}$ that is close to $p$ in $\ell_{2}$-distance, we study how we can estimate the $\ell_{2}$-distance between $p$ and a mixture distribution of $q_{1}$ and $q_{2}$. Let $s$ be the expected number of samples we draw; $s$ will be specified later. Assume we draw $\mathrm{Poi}(s)$ samples from each of $p$, $q_{1}$, and $q_{2}$, where $\mathrm{Poi}(s)$ is a Poisson random variable with parameter $s$. Let $X$, $Y$, and $Z$ denote the (multi)sets of samples from $p$, $q_{1}$, and $q_{2}$ respectively. Let $X_{i}$, $Y_{i}$, and $Z_{i}$ be the numbers of instances of element $i\in[n]$ in each sample set. The statistic below is motivated by the $\ell_{2}$-distance estimator proposed in [CDVV14], in which a set of samples is drawn from each of two distributions $p$ and $q$ and the statistic $\sum_{i=1}^{n}\left((X_{i}-Y_{i})^{2}-X_{i}-Y_{i}\right)$ is used, where $X_{i}$ (resp., $Y_{i}$) is the number of times $i\in[n]$ occurs in the samples from $p$ (resp., $q$). Consider the following statistic:

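$\displaystyle{f(\alpha)\coloneqq\sum\limits_{i=1}^{n}\left[(X_{i}-(1-\alpha)Y_{i}-\alpha Z_{i})^{2}-X_{i}-(1-\alpha)^{2}Y_{i}-\alpha^{2}Z_{i}\right].}$ (1)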

Note that if we fix the sample sets $X$ , $Y$ , and $Z$ , $f$ is a quadratic function of $\alpha$ . We show that the above statistic has the expected value of $s^{2}\|p-q_{\alpha}\|_{2}^{2}$ .

If $p$ is a mixture distribution with parameter $\alpha^{*}$, then $\mathbf{E}\left[f(\alpha^{*})\right]=0$, where the expectation is taken over the randomness of the samples. Hence, a natural candidate to approximate $\alpha^{*}$ is some $\alpha$ where $f$ achieves its (near-)minimum. To do this, we first show that if $p$ is a mixture, then we can choose a threshold parameter $T$ such that $|f(\alpha^{*})|\leq T$ with high probability. Then we pick $\alpha\in[0,1]$ that minimizes $f(\alpha)$ with the constraint that $|f(\alpha)|\leq T$. Since $f$ is a quadratic, let $\hat{\alpha}_{1}$ and $\hat{\alpha}_{2}$ be the solutions. We then show that if $p$ is a mixture of $q_{1}$ and $q_{2}$, with high probability, at least one of $\|p-q_{\hat{\alpha}_{1}}\|_{2}$ or $\|p-q_{\hat{\alpha}_{2}}\|_{2}$ is small (Section 5.1).

For the rest of the section, let $b$ be a parameter specified later for which $\|p\|_{2}^{2}$ , $\|q_{1}\|_{2}^{2}/2$ , and $\|q_{2}\|_{2}^{2}/2$ are bounded by $b$ . Thus for any mixture distribution $q_{\alpha}$ , we have $\|q_{\alpha}\|^{2}_{2}\leq b$ . Also, let $\gamma=\epsilon^{2}/(10n)$ , $s=c_{s}\cdot(\sqrt{b}/(\gamma/2))$ for a sufficiently large constant $c_{s}$ , and let $T=s^{2}\gamma$ . (All missing proofs are in Section 5.3.)

5.1 Finding candidates

In this section, we aim to learn a mixture distribution. More precisely, we are looking for $\alpha$ ’s such that if $p$ is a mixture distribution, then with high probability, $\|p-q_{\alpha}\|_{2}\leq\epsilon/(2\sqrt{n})$ .

Theorem 5.1.

Suppose $p$ is a mixture distribution. Given $X$ , $Y$ , and $Z$ , with probability $0.94$ , one can compute a candidate set $\mathcal{M},|\mathcal{M}|\leq 5$ , for which there exists $\alpha\in\mathcal{M}$ such that $\|p-q_{\alpha}\|_{2}^{2}\leq\frac{\epsilon^{2}}{4n}\,.$

Proof.

We consider two cases based on the $\ell_{2}$ -distance of $q_{1}$ and $q_{2}$ . Suppose $\|q_{1}-q_{2}\|_{2}^{2}\leq\epsilon^{2}/(4\,n)$ . If $p$ is a mixture distribution with parameter $\alpha^{*}$ , then we have:

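$\displaystyle{\|p-q_{1}\|_{2}^{2}=\|\alpha^{*}(q_{2}-q_{1})\|_{2}^{2}=(\alpha^{*})^{2}\,\|q_{1}-q_{2}\|_{2}^{2}\leq\frac{\epsilon^{2}}{4\,n}\,.}$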

Hence, $q_{1}$ , which is a mixture distribution with parameter $\alpha=0$ , is a candidate.

We now focus on the case $\|q_{1}-q_{2}\|_{2}^{2}\geq\epsilon^{2}/(4\,n)$ . Without loss of generality, $\alpha^{*}>1/2$ (otherwise swap $q_{1}$ and $q_{2}$ ). Fixing $X,Y,Z$ , we write (1) as $f(\alpha)=A\alpha^{2}+B\alpha+C$ , where

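$\displaystyle{A=\sum\limits_{i=1}^{n}\left((Y_{i}-Z_{i})^{2}-Y_{i}-Z_{i}\right),\quad B=2\sum\limits_{i=1}^{n}\left(Y_{i}+X_{i}Y_{i}+Y_{i}Z_{i}-Y_{i}^{2}-X_{i}Z_{i}\right),\quad C=\sum\limits_{i=1}^{n}\left((X_{i}-Y_{i})^{2}-X_{i}-Y_{i}\right).}$ (2)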

As explained earlier, the idea is to use $f(\alpha)$ to find a proper candidate $\alpha$ .
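The quadratic form can be checked numerically. The sketch below obtains the coefficients by expanding the square in the statistic $f$; the expression for $B$ agrees with the one recalled in the proof of Lemma 5.3, and $A$ is the closeness statistic on the $q_{1}$ and $q_{2}$ samples. The function names are illustrative.

```python
import numpy as np

def quadratic_coefficients(X, Y, Z):
    """Coefficients (A, B, C) with f(alpha) = A*alpha^2 + B*alpha + C."""
    X, Y, Z = (np.asarray(v, dtype=float) for v in (X, Y, Z))
    A = np.sum((Y - Z) ** 2 - Y - Z)                  # closeness statistic of q1, q2
    B = 2.0 * np.sum(Y + X * Y + Y * Z - Y**2 - X * Z)
    C = np.sum((X - Y) ** 2 - X - Y)                  # value of f at alpha = 0
    return float(A), float(B), float(C)

def statistic(X, Y, Z, alpha):
    """Direct evaluation of f(alpha) from the counts, for comparison."""
    X, Y, Z = (np.asarray(v, dtype=float) for v in (X, Y, Z))
    resid = X - (1.0 - alpha) * Y - alpha * Z
    return float(np.sum(resid**2 - X - (1.0 - alpha) ** 2 * Y - alpha**2 * Z))
```

For the counts `X=[2, 0]`, `Y=[1, 1]`, `Z=[0, 2]` this gives $A=-2$, $B=8$, $C=-2$, and both routes evaluate $f(0.5)$ to $1.5$.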

We now study the properties of $A$ and $B$ . It turns out that $A$ is the same as the statistic for testing closeness where $\mbox{\sl Poi}(s)$ samples are drawn from $q_{1}$ and $q_{2}$ . From [CDVV14],

[TABLE]

In the following lemma, we show how a statistic whose expected value and variance are similar to those of $A$ concentrates:

Lemma 5.2.

[Adapted from [CDVV14]] Assume a random variable, namely $R$ , has the following properties:

[TABLE]

where $c_{1}$, $c_{2}$, and $c_{3}$ are three positive constants, $s$ is an integer, $q_{1}$ and $q_{2}$ are two distributions over $[n]$, and $b$ is a real number which is greater than $\|q_{1}\|_{2}^{2}$ and $\|q_{2}\|_{2}^{2}$. If $s$ is at least $c\cdot\sqrt{b}/\tau$ for sufficiently large $c$, then with probability 0.99 the following is true:

•

If $\|q_{1}-q_{2}\|_{2}^{2}$ is at most $\tau$ , then $|R|$ is at most $2\,c_{1}\,\tau\,s^{2}$ .

•

If $\|q_{1}-q_{2}\|_{2}^{2}$ is at least $\tau$, then $R$ is between $0.9\cdot\mathbf{E}\left[R\right]$ and $1.1\cdot\mathbf{E}\left[R\right]$.

Thus, using the above lemma, we show that with probability 0.99, there is a constant $c_{A}\in[0.9,1.1]$ such that

[TABLE]

$B$ might not have a nice closed-form expression when $p$ is an arbitrary distribution, but when $p$ is a mixture, it has the following property.

Lemma 5.3.

Suppose $p$ is a mixture of $q_{1}$ and $q_{2}$ with parameter $\alpha^{*}\geq 1/2$ . Let $B$ be a function of the sample sets $X$ , $Y$ , and $Z$ as defined in (2) and let $\gamma<\|q_{1}-q_{2}\|_{2}^{2}$ . If the sample sets, $X$ , $Y$ , and $Z$ , each have $\Theta(\sqrt{b}/\gamma)$ samples, then with probability 0.99, there exists $c_{B}\in[0,1]$ such that

[TABLE]

We now analyze $f(\alpha)$ for a fixed $\alpha$ . (The following in fact holds for any distribution $p$ over $[n]$ .)

Lemma 5.4.

For a fixed $\alpha$ ,

[TABLE]

By Lemma 5.4 and Lemma 5.2, with probability 0.99, if $p=(1-\alpha^{*})q_{1}+\alpha^{*}q_{2}$ , then

[TABLE]

With probability 0.97, all of (4), (5), and (6) hold; we condition on this from now on.

Since $f(\alpha)$ is a quadratic and since $A>0$ from (4), let $\alpha_{\min}=-B/(2A)$ where $f$ achieves its minimum. We define $\hat{\alpha}_{1}$ and $\hat{\alpha}_{2}$ as follows:

[TABLE]

Note that (6) guarantees that either $\hat{\alpha}_{1}$ or $\hat{\alpha}_{2}$ exists depending on if $\alpha^{*}>\alpha_{\min}$ or not; they can also be found very efficiently by binary search. It remains to show that one of $q_{\hat{\alpha}_{1}}$ and $q_{\hat{\alpha}_{2}}$ is very close to $p$ in $\ell_{2}$ -distance.
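The text locates the two candidates by binary search; since $f$ is a quadratic, they can also be written in closed form, which is the route taken in the sketch below. It uses the definitions from the proof of Lemma 5.5 ($\hat{\alpha}_{1}$ is the smallest point of $[\alpha_{\min},1]$ with $|f|\leq T$, and $\hat{\alpha}_{2}$ the largest point of $[0,\alpha_{\min}]$ with $|f|\leq T$), additionally restricts both to $[0,1]$ (an assumption), and returns `None` when a candidate does not exist. The function name is hypothetical.

```python
import math

def find_candidates(A, B, C, T):
    """Return (alpha_hat_1, alpha_hat_2) for f(a) = A*a^2 + B*a + C, A > 0."""
    if A <= 0:
        raise ValueError("the sketch assumes A > 0")
    a_min = -B / (2.0 * A)                 # minimizer of f
    f_min = C - B * B / (4.0 * A)          # minimum value of f
    if f_min > T:                          # |f| <= T is infeasible everywhere
        return None, None
    # Distance from a_min at which f reaches a given level >= f_min.
    offset = lambda level: math.sqrt(max(level - f_min, 0.0) / A)
    lo_off = offset(-T) if f_min < -T else 0.0   # f >= -T beyond this distance
    hi_off = offset(T)                           # f <= T within this distance
    # Right branch: feasible set is [a_min + lo_off, a_min + hi_off].
    left, right = max(a_min + lo_off, 0.0), min(a_min + hi_off, 1.0)
    alpha_1 = left if left <= right else None
    # Left branch: feasible set is [a_min - hi_off, a_min - lo_off].
    left, right = max(a_min - hi_off, 0.0), min(a_min - lo_off, 1.0)
    alpha_2 = right if left <= right else None
    return alpha_1, alpha_2
```

For $f(\alpha)=4\alpha^{2}-4\alpha+0.5$ and $T=0.25$, the minimum $-0.5$ lies below $-T$, so the candidates are the two points where $f=-T$, namely $0.75$ and $0.25$.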

Lemma 5.5.

We have either $\|p-q_{\hat{\alpha}_{1}}\|_{2}^{2}\leq\frac{2\,T}{0.9\,s^{2}}$ or $\|p-q_{\hat{\alpha}_{2}}\|_{2}^{2}\leq\frac{2\,T}{0.9\,s^{2}}.$

Note that by the choice of our parameters, we have $2\,T/(0.9s^{2})<\epsilon^{2}/(4\,n)$. Hence, either $q_{\hat{\alpha}_{1}}$ or $q_{\hat{\alpha}_{2}}$ is a candidate. Thus our potential candidates so far are $\alpha=0$, $\hat{\alpha}_{1}$, and $\hat{\alpha}_{2}$. In addition, given our assumption that $\alpha^{*}>1/2$, we need to compute the corresponding ${\hat{\alpha}_{1}}$ and ${\hat{\alpha}_{2}}$ when $q_{1}$ and $q_{2}$ are swapped. Hence, we have at most five candidates for $\alpha$. ∎

5.2 Mixture closeness tester

In this section, we provide our algorithm and prove its correctness in the following theorem.

Theorem 5.6.

Given a proximity parameter $\epsilon$, Algorithm 3 is a closeness tester in the presence of noise that is accessible via samples, and it uses $\Theta(\sqrt{n}/\epsilon^{2}+n^{2/3}/\epsilon^{4/3})$ samples.

Proof.

We reduce the $\ell_{2}$-norm of the three input distributions via the reshaping technique proposed in [DK16]. Let $S$ be a multi-set consisting of $3\,k$ samples, where $k$ samples are chosen from each distribution $p$, $q_{1}$, and $q_{2}$. For $i\in[n]$, we assign $b_{i}$ buckets to element $i$, where $b_{i}$ is the number of instances of element $i$ in set $S$ plus one. For a distribution $d$ over $[n]$, we define $d^{\prime}$ to be a distribution over all the buckets, $D\coloneqq\{(i,j)\mid i\in[n]\mbox{ and }j\in[b_{i}]\}$. We generate a sample from $d^{\prime}$ via the following process: (i) draw a sample $i\sim d$, (ii) pick $j\in[b_{i}]$ uniformly at random, and (iii) output $(i,j)$. The probability of any element $(i,j)$ according to $d^{\prime}$ is $d(i)/b_{i}$. It is known that flattening does not change the $\ell_{1}$-distance between two distributions. Let $p^{\prime}$, $q^{\prime}_{1}$, and $q^{\prime}_{2}$ be the distributions $p$, $q_{1}$, and $q_{2}$ after flattening. We show that a mixture distribution will remain a mixture after flattening. More precisely, if $p$ is a mixture of $q_{1}$ and $q_{2}$ with parameter $\alpha^{*}$, then it is easy to see that $p^{\prime}$ is a mixture of distributions $q^{\prime}_{1}$ and $q^{\prime}_{2}$ with the same parameter $\alpha^{*}$. Thus, it suffices to test if $p^{\prime}$ is a mixture of $q^{\prime}_{1}$ and $q^{\prime}_{2}$.
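The claim that flattening keeps mixtures intact follows because every bucket of element $i$ receives the same fraction $1/b_{i}$ of that element's mass under all three distributions. A minimal numerical sketch, with 0-indexed elements and illustrative function names:

```python
import numpy as np

def bucket_counts(samples, n):
    """b_i = 1 + number of occurrences of element i in the multiset S."""
    return np.bincount(np.asarray(samples, dtype=int), minlength=n) + 1

def flatten(dist, b):
    """Flattened mass function listed bucket by bucket: d'(i, j) = d(i) / b_i."""
    return np.repeat(np.asarray(dist, dtype=float) / b, b)
```

If `p = (1 - a) * q1 + a * q2`, then `flatten(p, b)` equals `(1 - a) * flatten(q1, b) + a * flatten(q2, b)` entry by entry, since `flatten` is linear in `dist` for a fixed `b`. The same linearity shows that $\ell_{1}$-distances are preserved.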

By setting $k=\Theta(\min(n,n^{2/3}/\epsilon^{4/3}))$, according to [DK16, Lemma II.6] and Markov’s inequality, we can assume the $\ell_{2}$-norms of all three distributions $p^{\prime}$, $q^{\prime}_{1}$, and $q^{\prime}_{2}$ are at most $b$ with probability at least $0.99$, where we set $b=1/\min(n,n^{2/3}/\epsilon^{4/3})=\Theta(1/k)$. Also, note that $|D|=\Theta(n)$.

Given Theorem 5.1, one can find a set $\mathcal{M}$ of at most five candidates. If $p^{\prime}$ is a mixture of $q^{\prime}_{1}$ and $q^{\prime}_{2}$, then there is an $\alpha\in\mathcal{M}$ such that $\|q^{\prime}_{\alpha}-p^{\prime}\|_{2}\leq\epsilon/(2\sqrt{|D|})$. On the other hand, if $p^{\prime}$ is $\epsilon$-far from being a mixture, it is also $\epsilon$-far from $q^{\prime}_{\alpha}$ for every $\alpha\in\mathcal{M}$; using the Cauchy–Schwarz inequality, we have $\|q^{\prime}_{\alpha}-p^{\prime}\|_{2}\geq\epsilon/\sqrt{|D|}$. Note that [CDVV14] showed one can estimate the $\ell_{2}$-distance accurately using $\Theta(|D|\cdot\sqrt{b}/\epsilon^{2})$ samples and with probability 0.99 (see Lemma 5.7 in Section 5.3).

By a union bound, the probability that $\mathcal{M}$ does not contain the right $\alpha$ , the probability that the $\ell_{2}$ estimation fails, and the probability that $\mbox{\sl Poi}(\lambda)<100\lambda$ sum up to below $1/3$ . Hence, with probability 2/3, the algorithm outputs the right answer and the total number of samples is $\Theta(k+n\sqrt{b}/\epsilon^{2})=\Theta(\sqrt{n}/\epsilon^{2}+n^{2/3}/\epsilon^{4/3})$ . ∎
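The last equality can be verified by a short case analysis. As a sketch, write $m\coloneqq\min(n,n^{2/3}/\epsilon^{4/3})$ (a symbol introduced here only for convenience), so that $k=\Theta(m)$ and $b=1/m$. Then

$\displaystyle{\frac{n\sqrt{b}}{\epsilon^{2}}=\frac{n}{\epsilon^{2}\sqrt{m}}=\begin{cases}\dfrac{n\,\epsilon^{2/3}}{\epsilon^{2}\,n^{1/3}}=\dfrac{n^{2/3}}{\epsilon^{4/3}}&\text{if }n\geq\epsilon^{-4}\text{ (so }m=n^{2/3}/\epsilon^{4/3}\text{)},\\[6pt]\dfrac{n}{\epsilon^{2}\sqrt{n}}=\dfrac{\sqrt{n}}{\epsilon^{2}}&\text{if }n<\epsilon^{-4}\text{ (so }m=n\text{)}.\end{cases}}$

In the first case $k=\Theta(n^{2/3}/\epsilon^{4/3})$ matches the term above, and $\sqrt{n}/\epsilon^{2}\leq n^{2/3}/\epsilon^{4/3}$ because $n\geq\epsilon^{-4}$. In the second case $k=\Theta(n)=O(\sqrt{n}/\epsilon^{2})$ because $\sqrt{n}<\epsilon^{-2}$, and $n^{2/3}/\epsilon^{4/3}\leq\sqrt{n}/\epsilon^{2}$ for the same reason. In both cases $k+n\sqrt{b}/\epsilon^{2}=\Theta(\sqrt{n}/\epsilon^{2}+n^{2/3}/\epsilon^{4/3})$.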

5.3 Proofs for Section 5.1 and Section 5.2

In this section, we present the proofs of the lemmas presented earlier in Section 5.1 and Section 5.2.

Proof of Lemma 5.2.

We use Chebyshev’s inequality to prove the lemma. For the first case, by the $\ell_{p}$ -norms inequality, we have the following:

[TABLE]

where the last inequality is true when $s\geq\max(200\,c_{2}/c_{1}^{2},\sqrt{200\,c_{3}}/c_{1})\sqrt{b}/\tau$ .

For the second case, we have the following:

[TABLE]

where the last inequality is true when $s\geq\max(20000\,c_{2}/c_{1}^{2},\sqrt{20000\,c_{3}}/c_{1})\sqrt{b}/\tau$ . This completes the proof. ∎

Proof of Lemma 5.3.

Recall that $B$ is defined to be $2\sum_{i=1}^{n}\left(Y_{i}+X_{i}Y_{i}+Y_{i}Z_{i}-Y_{i}^{2}-X_{i}Z_{i}\right)$. To analyze the expected value and the variance of $B$, we consider each term in the sum. Let $B_{i}$ denote a single term in the sum, ignoring the constant factor 2:

[TABLE]

Note that via the Poissonization method, the $X_{i}$’s, the $Y_{i}$’s, the $Z_{i}$’s, and consequently the $B_{i}$’s are independent random variables. Note that if $x$ is a Poisson random variable with mean $\lambda$, then $\mathbf{E}\left[x^{2}\right]$ is $\lambda^{2}+\lambda$. Using this equation, we compute the expected value of $B_{i}$:

[TABLE]

Thus, the expected value of $-B$ is the following:

[TABLE]

where $2\alpha^{*}$ is a constant in $[1,2]$. Using the first four moments of the Poisson distribution and the fact that $\alpha\leq 1$, we have the following:

[TABLE]

Using the bound above and the Cauchy–Schwarz inequality, we bound the variance of $B$ as follows:

[TABLE]

Clearly, the variance of $-B$ is equal to the variance of $B$ , and it is bounded the same as above. Note that $\gamma$ is at most $\|q_{1}-q_{2}\|_{2}^{2}$ , and the sample sets, $X$ , $Y$ , and $Z$ , each have at least $\Theta(\sqrt{b}/\gamma)$ samples. By Lemma 5.2, there exists $c_{B}\in[0,1]$ such that

[TABLE]

with probability 0.99 which concludes the proof. ∎

Proof of Lemma 5.4.

In this proof, we adapt the proof of Proposition 3.1 from [CDVV14]. Recall that

[TABLE]

Via the Poissonization method, we can assume $X_{i}$ (similarly $Y_{i}$ and $Z_{i}$) is a random variable from $\mathrm{Poi}(s\,p(i))$ (similarly $\mathrm{Poi}(s\,q_{1}(i))$ and $\mathrm{Poi}(s\,q_{2}(i))$), which is drawn independently from the rest of the random variables. Note that if $x$ is a Poisson random variable with mean $\lambda$, then $\mathbf{E}\left[x^{2}\right]$ is $\lambda^{2}+\lambda$. Using this equation and the independence of the random variables, for a fixed $\alpha$, we have:

[TABLE]

Now, we bound the variance of $f(X,Y,Z,\alpha)$ for a fixed $\alpha$ . Let $W_{i}$ denote a single term in the summation:

[TABLE]

Using the moments of the Poisson distribution, we have

[TABLE]

Now, we bound the variance of $f$ which is the sum of $n$ independent terms, $W_{i}$ ’s. Using the Cauchy–Schwarz inequality, and the fact that $(p(i)+q_{\alpha}(i))^{2}$ is at most $2p(i)^{2}+2q_{\alpha}(i)^{2}$ , we have

[TABLE]

where $b$ is at least $\|p\|_{2}^{2}$ and $\|q_{\alpha}\|_{2}^{2}$ by the first condition of the theorem. ∎

Proof of Lemma 5.5.

Consider the statistic as a function of $\alpha$: $f(\alpha)=A\alpha^{2}+B\alpha+C$. Since $A$ is positive, $f$ takes its minimum at $\alpha_{\min}\coloneqq-B/(2A)$. By Equation 4 and Equation 5, $\alpha_{\min}$ is $c_{B}\alpha^{*}/c_{A}$, and for any $\alpha$, we have:

[TABLE]

Depending on whether $\alpha^{*}$ is larger than $\alpha_{\min}$ or not, we consider the following cases.

Case 1: $\boldsymbol{\alpha^{*}\geq\alpha_{\min}}$. Let $\hat{\alpha}_{1}$ be the smallest number in $[\alpha_{\min},1]$ for which $|f(\hat{\alpha}_{1})|$ is at most $T$. Clearly, $\hat{\alpha}_{1}$ exists since $\alpha^{*}$ is a potential solution, so the solution interval is not empty. Note that based on the way we pick $\hat{\alpha}_{1}$, the following are true: (i) $\hat{\alpha}_{1}$ is at most $\alpha^{*}$, (ii) $f(\hat{\alpha}_{1})$ is at least $-T$, and (iii) since $A$ is positive and $f$ is increasing over $[\alpha_{\min},1]$, $f(\hat{\alpha}_{1})$ is at most $f(\alpha^{*})$. Hence, by Equation 8, we have:

[TABLE]

If we replace ${\alpha^{*}}+\hat{\alpha}_{1}-2\alpha_{\min}$ by a smaller quantity, $\alpha^{*}-\hat{\alpha}_{1}$ , where both are positive, then we have:

[TABLE]

Case 2: $\boldsymbol{\alpha^{*}\leq\alpha_{\min}}$ . We replicate what we did in the previous case. Let $\hat{\alpha}_{2}$ be the largest number in $[0,\alpha_{\min}]$ for which $|f(\hat{\alpha}_{2})|$ is at most $T$ . Clearly, $\hat{\alpha}_{2}$ exists since $\alpha^{*}$ is a potential solution, so the solution interval is not empty. Note that based on the way we pick $\hat{\alpha}_{2}$ , the following are true: (i) $\hat{\alpha}_{2}$ is at least $\alpha^{*}$ , (ii) $f(\hat{\alpha}_{2})$ is at least $-T$ , and (iii) since $A$ is positive and $f$ is decreasing over $[0,\alpha_{\min}]$ , $f(\hat{\alpha}_{2})$ is at most $f(\alpha^{*})$ . Hence, by Equation 8, we have:

[TABLE]

If we replace $2\alpha_{\min}-{\alpha^{*}}-\hat{\alpha}_{2}$ by a smaller quantity, $\hat{\alpha}_{2}-\alpha^{*}$ , where both are positive, then we have:

[TABLE]

The left side of Equation 9 and Equation 10 are in the form of the $\ell_{2}$ -distance between two mixture distributions $p$ and $q_{\hat{\alpha}}$ due to the following:

[TABLE]

Note that we are either in case 1 or case 2. So, one of the two equations, Equation 9 or Equation 10, has to be true. By Equation 11, one of the following is true.

[TABLE]

which concludes the proof. ∎

Lemma 5.7.

[Restated from [CDVV14]] The procedure $\ell_{2}^{2}$ -Estimator $(b,\sigma,r_{1},r_{2})$ described in Algorithm 4, that uses $\Theta(\sqrt{b}/\sigma)$ samples, has the following property with probability 0.99:

•

If $\|r_{1}-r_{2}\|_{2}^{2}$ is at most $\sigma$ , then $|R|$ is at most $2\,\sigma\,s^{2}$ .

•

If $\|r_{1}-r_{2}\|_{2}^{2}$ is at least $\sigma$ , then $R$ is between $0.9\cdot\mathbf{E}[R]$ and $1.1\cdot\mathbf{E}[R]$ .

Proof.

We use the $\ell_{2}^{2}$ -distance estimator proposed in [CDVV14]. However, for the sake of completeness, we provide the process in Algorithm 4. $X$ and $Y$ are two sample sets each containing $s$ samples from $r_{1}$ and $r_{2}$ respectively. Let $X_{i}$ and $Y_{i}$ indicate the number of occurrences of element $i$ in $X$ and $Y$ respectively. The authors showed that the expected value of the statistic $\sum_{i=1}^{n}\left[(X_{i}-Y_{i})^{2}-X_{i}-Y_{i}\right]$ is $s^{2}\|r_{1}-r_{2}\|_{2}^{2}$ , and the variance is bounded by $8s^{3}\,\|r_{1}-r_{2}\|_{4}^{2}\,\sqrt{b}+8\,s^{2}\,b$ . By Lemma 5.2, if we draw $\Theta(\sqrt{b}/\sigma)$ samples, then the algorithm will have the desired property with probability 0.99. ∎
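To make the statistic concrete, the following sketch computes it from two count vectors and evaluates its exact expectation under Poissonization, using only the Poisson moments. It is an illustration under the assumption that $X_{i}\sim\mathrm{Poi}(s\,r_{1}(i))$ and $Y_{i}\sim\mathrm{Poi}(s\,r_{2}(i))$ independently; the function names are hypothetical and this is not Algorithm 4 itself.

```python
def l2_statistic(x_counts, y_counts):
    """Sum over the domain of (X_i - Y_i)^2 - X_i - Y_i."""
    return sum((x - y) ** 2 - x - y for x, y in zip(x_counts, y_counts))

def expected_l2_statistic(r1, r2, s):
    """Exact expectation of the statistic for independent Poisson counts.

    Assumes X_i ~ Poi(s * r1[i]) and Y_i ~ Poi(s * r2[i]).
    """
    total = 0.0
    for a, b in zip(r1, r2):
        lam, mu = s * a, s * b
        # E[(X - Y)^2] = E[X^2] - 2 E[X] E[Y] + E[Y^2], with E[X^2] = lam^2 + lam
        second_moment = (lam**2 + lam) - 2 * lam * mu + (mu**2 + mu)
        total += second_moment - lam - mu
    return total

r1 = [0.5, 0.3, 0.2]
r2 = [0.2, 0.3, 0.5]
s = 100
l2_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
# Both values should equal s^2 * ||r1 - r2||_2^2 up to rounding.
print(expected_l2_statistic(r1, r2, s), s**2 * l2_squared)
```

Each term has mean $(s\,r_{1}(i)-s\,r_{2}(i))^{2}$ because the linear corrections $-X_{i}-Y_{i}$ cancel the Poisson variances.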

6 Testing under $k$ -flat noise

We have so far considered the problems of identity testing and closeness testing in the presence of the noise that is directly accessible and proved these problems have the same sample complexity as their respective noise-free versions. These results raise the question of whether one can replace the requirement of access to the noise by an assumption that restricts the noise to be in a class of distributions and still achieve improved sample complexity compared to the near-linear lower bound we mentioned earlier. In this section we develop a tester for identity testing when the noise distribution belongs to the class of $k$ -flat distributions without any further information. This assumption means that the noise can be any $k$ -flat distribution, while the parameters of the $k$ -flat distribution are not known to the tester, nor given via samples.

6.1 Preliminaries

We begin by formally defining $k$ -flat distributions: We say $\mathcal{I}=\{I_{1},\ldots,I_{k}\}$ is a $k$ -segmentation of $[n]$ if and only if $I_{1},\ldots,I_{k}$ are $k$ disjoint intervals that cover $[n]$ . Also, we say a function $f:[n]\rightarrow\mathbb{R}$ is a $k$ -flat function if and only if there is a $k$ -segmentation of $[n]$ , namely $\mathcal{I}=\{I_{1},\ldots,I_{k}\}$ , such that for any two elements, $x$ and $y$ , in the same interval in $\mathcal{I}$ , $f(x)$ is equal to $f(y)$ . A distribution is a $k$ -flat distribution if and only if its probability mass function is a $k$ -flat function.
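The definition can be checked mechanically. The sketch below counts the maximal runs of equal consecutive values of a function given as a list; it assumes that a function with fewer than $k$ constant pieces also counts as $k$ -flat (its pieces can be subdivided when $k\leq n$ ). The names are illustrative.

```python
def num_flat_pieces(f):
    """Number of maximal runs of equal consecutive values in the list f."""
    if not f:
        return 0
    return 1 + sum(1 for a, b in zip(f, f[1:]) if a != b)

def is_k_flat(f, k):
    """True if f is constant on each interval of some segmentation into at most k intervals."""
    return num_flat_pieces(f) <= k

pmf = [0.1, 0.1, 0.3, 0.3, 0.2]
print(num_flat_pieces(pmf), is_k_flat(pmf, 3), is_k_flat(pmf, 2))
```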

We next define concepts that will be necessary for describing our algorithms. For any distribution $p$ and a partition $\mathcal{D}=\{D_{1},\ldots,D_{t}\}$ of its domain, the coarsening of $p$ over $\mathcal{D}$ , denoted by $p_{\langle\mathcal{D}\rangle}$ , is a distribution over the sets in $\mathcal{D}$ where the probability of each set $D_{i}$ is $\sum_{x\in D_{i}}p(x)$ . For a subset $D\subseteq[n]$ , we define the restriction of $p$ to $D$ , denoted by $p_{|D}$ , to be a distribution over $D$ for which the probability of $x\in D$ is equal to $p(x\mid x\in D)$ . Although the restriction is well-defined only when $p(D)$ is not zero, abusing notation, we define $\|p_{|D}-q_{|D}\|_{1}$ to be zero if $p(D)$ or $q(D)$ is zero.
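A minimal sketch of these two operations, assuming a distribution is stored as a list indexed by a 0-based domain and a partition as a list of index lists (both representation choices are ours):

```python
def coarsen(p, partition):
    """Coarsening of p: one probability per set of the partition."""
    return [sum(p[x] for x in part) for part in partition]

def restrict(p, subset):
    """Restriction of p to subset, or None when the subset has zero mass."""
    mass = sum(p[x] for x in subset)
    if mass == 0:
        return None  # the restriction is not well-defined in this case
    return {x: p[x] / mass for x in subset}

p = [0.1, 0.2, 0.3, 0.4]
partition = [[0, 1], [2, 3]]
print(coarsen(p, partition))
print(restrict(p, [0, 1]))
```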

Also, throughout this section, we study different schemes for partitioning the domain. In addition to $k$ -segmentation, which was defined earlier, two other schemes are defined as follows: Given a known distribution $q$ , Batu et al. in [BFF+01] provide a partitioning scheme, called bucketing, which places elements with similar probability in the same bucket. Note that, in contrast with $k$ -segmentation, this scheme does not necessarily place consecutive elements in the same bucket.

Definition 6.1 (Similar to [BFF+01]).

Assume we have a known distribution $q$ over $[n]$ . Given a parameter $\epsilon$ , we define the bucketing of the domain, Bucket $(q,n,\epsilon)$ , to be a set of $v$ subsets of the domain, $\mathcal{B}=\{B_{1},\ldots,B_{v}\}$ , where each subset is defined as below:

[TABLE]
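The following sketch illustrates one bucketing of this kind. The thresholds are an assumption made for illustration: the first bucket collects elements with $q(x)<\epsilon^{2}/n$ , and bucket $j>1$ collects elements with $q(x)$ in $[(1+\epsilon)^{j-2}\epsilon^{2}/n,\ (1+\epsilon)^{j-1}\epsilon^{2}/n)$ . This matches the two properties used later (elements of $B_{1}$ are light, and elements of any other bucket are within a $(1+\epsilon)$ factor of each other), but the exact constants of the definition may differ.

```python
def bucket(q, eps):
    """Group domain elements by probability into geometric buckets.

    Assumed thresholds: bucket 1 holds q(x) < eps^2 / n; bucket j > 1 holds
    eps^2/n * (1+eps)^(j-2) <= q(x) < eps^2/n * (1+eps)^(j-1).
    Returns a dict mapping bucket index j to a list of elements.
    """
    n = len(q)
    buckets = {}
    for x, qx in enumerate(q):
        j = 1
        upper = eps * eps / n  # exclusive upper end of bucket 1
        while qx >= upper:
            upper *= 1 + eps
            j += 1
        buckets.setdefault(j, []).append(x)
    return buckets

q = [0.4, 0.39, 0.2, 0.01, 0.0]
for j, members in sorted(bucket(q, 0.1).items()):
    print(j, members)
```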

We define the last partitioning scheme below. This partition is a refinement of the bucketing with respect to a $k$ -segmentation $\mathcal{I}$ .

Definition 6.2.

Assume $\mathcal{I}$ is a $k$ -segmentation of $[n]$ , and $\mathcal{B}$ is a bucketing of $[n]$ containing $v$ disjoint subsets. We define $\mathcal{D}(\mathcal{I},\mathcal{B})=\{D_{i,j,\ell}\}_{(i,j)\in[k]\times[v]}$ to be a division of the domain for which the $D_{i,j,\ell}$ ’s are the intersection of the $i$ th interval and the $j$ th bucket. Formally, $D_{i,j,\ell}$ is defined as:

[TABLE]
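As an illustration of this definition, the sketch below forms the non-empty intersections of intervals and buckets. It assumes intervals are given as inclusive pairs over a 0-based domain and buckets as a dictionary from bucket index to a list of elements; these representations and the function name are ours.

```python
def division(intervals, buckets):
    """Non-empty intersections of the i-th interval and the j-th bucket, keyed by (i, j)."""
    cells = {}
    for i, (lo, hi) in enumerate(intervals, start=1):
        for j, members in buckets.items():
            cell = [x for x in members if lo <= x <= hi]
            if cell:
                cells[(i, j)] = cell
    return cells

intervals = [(0, 2), (3, 5)]            # a 2-segmentation of {0, ..., 5}
buckets = {1: [0, 5], 2: [1, 2, 3, 4]}  # a bucketing with v = 2
print(division(intervals, buckets))
```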

The problem of testing identity in the presence of $k$ -flat noise.

Suppose we are given a known distribution $q$ , and sample access to a distribution $p$ both over the domain $[n]$ . Let $\mathcal{C}$ denote the class of all $k$ -flat distributions over $[n]$ . The problem of testing identity in the presence of $k$ -flat noise boils down to distinguishing the following cases with probability at least 2/3:

•

There exists a mixture parameter $\alpha^{*}$ and a $k$ -flat distribution $r^{*}$ over $[n]$ such that $p$ is a mixture of $q$ and $r^{*}$ with parameter $\alpha^{*}$ , i.e., $p=(1-\alpha^{*})q+\alpha^{*}\,r^{*}$ .

•

$p$ is $\epsilon$ -far from any distribution of the form $(1-\alpha)q+\alpha\,r$ where $r\in\mathcal{C}$ and $\alpha\in[0,1]$ .

6.2 The algorithm

We start by explaining the properties of the partitioning schemes we defined earlier. Let $\mathcal{B}=\textsc{Bucket}(q,n,\epsilon^{\prime})$ be the bucketing of the domain elements for a parameter $\epsilon^{\prime}\coloneqq\epsilon/14$ . The algorithm can obtain this bucketing since $q$ and $\epsilon$ are known to the algorithm. The bucketing scheme is designed such that the probabilities of the elements in a bucket are within a $(1+\epsilon^{\prime})$ -factor of each other (except for $B_{1}$ ). This property implies that the restriction of $q$ to any such bucket is extremely close to the uniform distribution.

Now, assume that $p$ is in fact a mixture of $q$ and a $k$ -flat distribution $r^{*}$ . We denote the $k$ -segmentation of $r^{*}$ by $\mathcal{I}^{*}=\{I_{1}^{*},\ldots,I_{k}^{*}\}$ (which is not known to the algorithm). By definition, the restriction of $r^{*}$ to any $I^{*}_{i}\in\mathcal{I}^{*}$ is a uniform distribution. Consider the division $\mathcal{D}(\mathcal{I}^{*},\mathcal{B})$ , described in Definition 6.2. Observe that $D_{i,j,\ell}\in\mathcal{D}$ is a subset of both $I^{*}_{i}$ and $B_{j}$ . One can show that the restriction of $r^{*}$ is uniform on $D_{i,j,\ell}$ , and the restriction of $q$ to $D_{i,j,\ell}$ is very close to the uniform distribution as well. Thus, $p$ , which is assumed to be the mixture of $q$ and $r^{*}$ , must be very close to the uniform distribution on $D_{i,j,\ell}$ . We formally prove this claim in Lemma 6.5.

Based on the above observation, our tester looks for two qualities in $p$ to assert that it is a mixture distribution: Given a division $\mathcal{D}(\mathcal{I},\mathcal{B})$ , (i) are the restrictions of $p$ to the $D_{i,j,\ell}$ ’s almost uniform and (ii) is the overall shape of $p$ over $D_{i,j,\ell}$ ’s (i.e., $p_{\langle\mathcal{D}\rangle}$ ) consistent with a mixture of $q$ and a $k$ -flat noise distribution? More specifically, our tester follows these steps. For every $k$ -segmentation $\mathcal{I}$ , the tester checks that the restriction of $p$ to each $D_{i,j,\ell}\in\mathcal{D}(\mathcal{I},\mathcal{B})$ is almost uniform. If this is not the case, it abandons the current segmentation and starts over with another one. If at some point the tester passes this step, it checks the overall shape of $p$ . It draws enough samples from $p$ and forms the empirical distribution $\hat{p}$ from the samples. Then it checks whether there exists a $k$ -flat function, $f$ , such that $\hat{p}_{\langle\mathcal{D}\rangle}$ is consistent with a mixture of $q$ and $f$ . If the tester finds a $k$ -segmentation such that the distribution passes the two steps above, then it asserts that $p$ is a mixture and outputs accept. Otherwise, it outputs reject.

Based on our first observation, one can expect the tester to accept a mixture distribution $p$ . However, the main challenge is to show that the tester rejects when $p$ is $\epsilon$ -far from being a mixture. To prove this fact, we also use the following observation. Suppose we have two distributions $p$ and $p^{\prime}$ . Let $\mathcal{P}$ be a partition of their domain. We prove that if $p$ and $p^{\prime}$ are $\epsilon$ -far from each other, there is a noticeable discrepancy between either their coarsening distributions over $\mathcal{P}$ or their restrictions to the subsets in $\mathcal{P}$ (Lemma 6.6). This observation implies that if $p$ is $\epsilon$ -far from being a mixture distribution, then at least one of the steps will fail. Hence, we distinguish the two cases with high probability.

We describe our tester in Algorithm 5 and show its correctness in Theorem 6.3. Later, we also discuss how to avoid trying all $\mathcal{I}$ ’s and achieve a polynomial time algorithm. All missing proofs in the rest of this section are in Section 6.3.

Theorem 6.3.

Algorithm 5 is an identity tester in the presence of $k$ -flat noise that uses $\widetilde{O}(\sqrt{nk}/\epsilon^{3.5})$ samples.

Proof.

We set $\epsilon^{\prime}=\epsilon/14$ . We denote the number of buckets in $\mathcal{B}=\textsc{Bucket}(q,n,\epsilon^{\prime})$ by $v$ . Let $t$ denote $k\cdot v$ . Without loss of generality we assume $t\leq n$ . Otherwise, one could learn the distribution $p$ up to error $\epsilon/2$ in $\ell_{1}$ -distance via $O(n/\epsilon^{2})$ samples, and trivially check if it is $\epsilon/2$ -close to a mixture of $q$ and a $k$ -flat distribution.

Consider a segmentation $\mathcal{I}$ , and a division $\mathcal{D}\coloneqq\mathcal{D}(\mathcal{I},\mathcal{B})$ . To obtain better sample complexity, we need to make sure that the size of each set in $\mathcal{D}$ is not greater than $\lceil{n/t}\rceil$ . If a set of size $z>\lceil{n/t}\rceil$ exists, we split it into $\ell_{i,j}\coloneqq\lfloor{z\cdot t/n}\rfloor+1$ sets of roughly the same size and denote them by $D_{i,j,\ell}$ for $\ell\in[\ell_{i,j}]$ . The new sets form a new partition of the domain. We call it a refined division, denoted $\widetilde{\mathcal{D}}\coloneqq\widetilde{\mathcal{D}}(\mathcal{I},\mathcal{B},n,t)$ . Note that this replacement will not asymptotically increase the total number of sets in the division, since $\widetilde{\mathcal{D}}$ has $\sum_{i,j}\ell_{i,j}\leq 2\,t$ many sets.
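The refinement step can be sketched as follows. The code assumes a division is stored as a dictionary from $(i,j)$ to a list of elements, and it splits an oversized set into consecutive chunks; the chunking rule and the names are illustrative, since the argument only requires pieces of roughly equal size.

```python
import math

def refine(cells, n, t):
    """Split every set larger than ceil(n/t) into floor(z*t/n) + 1 pieces.

    cells maps (i, j) to a list of elements; the result maps (i, j, l) to a chunk.
    """
    cap = math.ceil(n / t)
    refined = {}
    for (i, j), cell in cells.items():
        z = len(cell)
        pieces = (z * t) // n + 1 if z > cap else 1
        size = math.ceil(z / pieces)  # at most cap, since z / pieces < n / t
        for l in range(pieces):
            chunk = cell[l * size:(l + 1) * size]
            if chunk:  # rounding may leave the last piece empty
                refined[(i, j, l + 1)] = chunk
    return refined

n, t = 12, 4
cells = {(1, 1): list(range(8)), (2, 1): [8, 9], (2, 2): [10, 11]}
for key, chunk in refine(cells, n, t).items():
    print(key, chunk)
```

In this example the set of size 8 is split into three chunks of size at most $\lceil 12/4\rceil=3$ , and the refined division has 5 sets, which is below $2t=8$ .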

Now, we establish that for a sufficiently large number of samples, the three steps in the algorithm succeed with high probability. First, in the following lemma, we show that $O(t\cdot\log n/\epsilon^{2})$ samples are enough to obtain an empirical distribution $\hat{p}$ such that, for all the divisions $\mathcal{D}$ , $\hat{p}_{\langle\mathcal{D}\rangle}$ and $p_{\langle\mathcal{D}\rangle}$ are $\epsilon^{\prime}$ -close to each other with probability 0.9.

Lemma 6.4.

Assume $p$ is a distribution over $[n]$ . Let $\hat{p}$ be an empirical distribution formed by $\Theta(\min(n,kv\log n)\cdot(\log\delta^{-1})/\epsilon^{\prime 2})$ samples from $p$ . Fix a bucketing of the domain $\mathcal{B}=$ Bucket $(q,n,\epsilon^{\prime})$ . For every $k$ -segmentation $\mathcal{I}$ , and the corresponding refined division of the domain $\mathcal{D}=\mathcal{D}(\mathcal{I},\mathcal{B})$ , the coarsenings of $p$ and of the empirical distribution $\hat{p}$ over $\mathcal{D}$ are $\epsilon^{\prime}$ -close to each other with probability at least $1-\delta$ .

Second, we show that if $p(D_{i,j,\ell})$ , for a fixed $i,j,$ and $\ell$ , is at least $\epsilon^{\prime}/|\widetilde{\mathcal{D}}|=\Theta(\epsilon^{\prime}/t)$ , then $M_{i,j}$ contains at least $\epsilon^{\prime}/(4t)$ fraction of the samples with high probability. Note that there are at most $\Theta(n^{2}\cdot v)$ sets $D_{i,j,\ell}$ for a fixed $\mathcal{B}$ . Using the Chernoff bound, the claim is true for all $D_{i,j,\ell}$ ’s with probability 0.9 if we draw more than $\Theta(\log(n^{2}\cdot v)t/\epsilon^{\prime})$ samples.

Third, we show that if $M_{i,j}$ contains enough samples, then with high probability we can distinguish whether $\|p_{|D_{i,j,\ell}}-\mathcal{U}_{|D_{i,j,\ell}}\|_{2}^{2}$ is at most $\epsilon^{\prime 2}/{|D_{i,j,\ell}|}$ , or it is at least $2\epsilon^{\prime 2}/{|D_{i,j,\ell}|}$ : If we draw $\Theta(t/\epsilon^{\prime}\cdot(\log(n^{2}\cdot v)\cdot\sqrt{n/t}/\epsilon^{2}))$ samples, we receive $O(\log(n^{2}\cdot v)\cdot\sqrt{n/t}/\epsilon^{2})=O(\log(n^{2}\cdot v)\cdot\sqrt{{|D_{i,j,\ell}|}}/\epsilon^{2})$ samples from any set $D_{i,j,\ell}$ with $p(D_{i,j,\ell})\geq\epsilon^{\prime}/t$ . Based on [DGPP16, Theorem 1], with probability $2/3$ , we can distinguish whether $\|p_{|D_{i,j,\ell}}-\mathcal{U}_{|D_{i,j,\ell}}\|_{2}^{2}$ is at most $2\epsilon^{\prime 2}/|D_{i,j,\ell}|$ or at least $\epsilon^{2}/|D_{i,j,\ell}|$ using $\Theta(\sqrt{|D_{i,j,\ell}|}/\epsilon^{2})$ samples. By repeating this $\Theta(\log(n^{2}\cdot v))$ times and taking the majority answer, we can be assured to obtain the correct answer for the test on all the $D_{i,j,\ell}$ ’s with probability at least 0.9. Thus, we need $O(\sqrt{n\cdot t}\cdot(\log n+\log v)/\epsilon^{3})$ samples for this step.

In the above three steps, we need the following number of samples:

[TABLE]

By a union bound, the probability that any of the above steps goes wrong is at most 0.3. Hence, for the rest of the proof, we assume that the algorithm carries out the steps as expected, which happens with probability at least 2/3. Given this assumption, we show that in both the completeness case and the soundness case, the algorithm outputs the correct answer.

Completeness: In this case, there exist a $k$ -flat distribution $r$ over a segmentation $\mathcal{I}$ and a parameter $\alpha^{*}$ such that $p=(1-\alpha^{*})q+\alpha^{*}r$ . First, note that $p$ in each $D_{i,j,\ell}\in\widetilde{\mathcal{D}}$ is close to the uniform distribution. In particular, we have the following lemma.

Lemma 6.5.

Suppose $p$ is a mixture of $q$ and $r$ with parameter $\alpha$ . Let $\mathcal{I}^{*}$ , $\mathcal{B}$ , and $\widetilde{\mathcal{D}}$ , be the partitions we defined earlier. For any non-empty set, $D_{i,j,\ell}\in\mathcal{D}$ , if $j>1$ , then the restriction of $p$ to the set, $p_{|D_{i,j,\ell}}$ , is $\epsilon$ -close to the uniform distribution in $\ell_{1}$ -distance and $\epsilon/\sqrt{|D_{i,j,\ell}|}$ -close to the uniform distribution in $\ell_{2}$ -distance.

The lemma implies that for all the $D_{i,j,\ell}$ , $p_{|D_{i,j,\ell}}$ is close to the uniform distribution. Hence, while considering the segmentation $\mathcal{I}^{*}$ , the algorithm will not abandon it on the grounds that some $p_{|D_{i,j,\ell}}$ is far from uniform, and it will move on to the next step.

Also, we show that a $k$ -flat function, $f$ , exists, because $r$ is a solution itself. We have $\Omega(t/\epsilon^{\prime 2})$ samples, which is enough to learn the coarsening of $p$ over $\widetilde{\mathcal{D}}$ . Thus, the coarsening of the empirical distribution, $\hat{p}$ , is $\epsilon^{\prime}$ -close to the coarsening of $p$ over $\widetilde{\mathcal{D}}$ . There exists an iteration in the algorithm in which we try a parameter $\alpha$ such that $|\alpha-\alpha^{*}|$ is at most $\epsilon^{\prime}/2$ . Therefore, $r$ itself is a solution the algorithm is looking for:

[TABLE]

Hence, the algorithm will not output reject.

Soundness: In this case, $p$ is $\epsilon$ -far from any mixture distribution $q_{\alpha}=((1-\alpha)q_{\langle\widetilde{\mathcal{D}}\rangle}+\alpha\,r_{\langle\widetilde{\mathcal{D}}\rangle})$ for any $k$ -flat distribution $r$ and $\alpha\in[0,1]$ . We have the following structural lemma (similar to Lemma 6 in [BFF+01]) which bounds the distance between $p$ and $q_{\alpha}$ from above:

Lemma 6.6.

Assume $p$ and $q$ are two distributions on $[n]$ , and let $\widetilde{\mathcal{D}}$ be a refined division of the domain elements. Then, we have

[TABLE]

Since the distance between $p$ and $q_{\alpha}$ is at least $\epsilon$ , we can apply this lemma to obtain a lower bound for the two quantities in the right hand side of the equation above.

[TABLE]

At least one of the two terms on the right hand side above is greater than $7\epsilon^{\prime}$ . Next, we show that if the algorithm reaches the point where it forms the empirical distribution, then the second term is at most $7\epsilon^{\prime}$ . On the other hand, if the algorithm outputs accept, then the first term is at most $5\epsilon^{\prime}$ . Hence, these two events cannot happen at the same time while $|p-q|\geq\epsilon$ .

Formally, if there is no $D_{i,j,\ell}$ that causes the algorithm to move forward to the next segmentation, then for each $D_{i,j,\ell}$ either the weight of the set is not larger than $\epsilon^{\prime}/|\widetilde{\mathcal{D}}|$ , or the $\ell_{2}$ -distance between $p_{|D_{i,j,\ell}}$ and the uniform distribution is not more than $\sqrt{2\epsilon^{\prime 2}/|D_{i,j,\ell}|}$ . In the following lemma, we show that this situation implies that the second term in Equation 13 is at most $6.42\,\epsilon^{\prime}$ .

Lemma 6.7.

Suppose for every non-empty $D_{i,j,\ell}$ in the division $\widetilde{\mathcal{D}}=\widetilde{\mathcal{D}}(\mathcal{I},\mathcal{B})$ , either $p(D_{i,j,\ell})$ is at most $\epsilon^{\prime}/|\widetilde{\mathcal{D}}|$ , or $\|p_{|D_{i,j,\ell}}-\mathcal{U}_{|D_{i,j,\ell}}\|_{2}^{2}$ is at most $2\epsilon^{\prime 2}/|D_{i,j,\ell}|$ . Let $q_{\alpha}$ be a mixture of $q$ and a $k$ -flat distribution over $\mathcal{I}$ with an arbitrary $\alpha$ in $[0,1]$ . Then, the following holds

[TABLE]

On the other hand, if the algorithm outputs accept, it implies that there exist a function $f$ and a parameter $\alpha$ such that $\|\hat{p}-((1-\alpha)\,q+\alpha\,f)\|_{1}$ is at most $2\epsilon^{\prime}$ . In the following lemma, we show that this implies that there exists a $q_{\alpha}$ such that $\|p_{\langle\mathcal{D}\rangle}-(q_{\alpha})_{\langle\mathcal{D}\rangle}\|_{1}$ is at most $5\epsilon^{\prime}$ .

Lemma 6.8.

Assume $p$ , $\hat{p}$ , and $q$ are three distributions over $[n]$ , and $f:[n]\rightarrow\mathbb{R}^{+}$ is a $k$ -flat function over a $k$ -segmentation $\mathcal{I}$ . For a division $\mathcal{D}$ , suppose $\hat{p}_{\langle\mathcal{D}\rangle}$ is $\epsilon^{\prime}$ -close to $p_{\langle\mathcal{D}\rangle}$ , and there exists $\alpha\in[0,1]$ such that $\|\hat{p}-((1-\alpha)\,q+\alpha\,f)\|_{1}$ is at most $2\epsilon^{\prime}$ . Then, there exists a $k$ -flat distribution $r$ , such that $p$ is $5\epsilon^{\prime}$ -close to the mixture of $r$ and $q$ with parameter $\alpha$ .

Moreover, outputting accept means that the two terms in Equation 13 sum to at most $6.42\epsilon^{\prime}+5\epsilon^{\prime}<14\epsilon^{\prime}$ , which contradicts the fact that one of them has to be greater than $7\epsilon^{\prime}$ . Hence, the proof is complete. ∎

A faster algorithm.

In the interest of a simpler exposition, the algorithm described above tries all possible $k$ -segmentations. However, there are at most $O(n^{2}\cdot v)$ possible subsets that could appear as $D_{i,j,\ell}$ ’s. Hence, one can test uniformity of $p$ on each of them separately regardless of $\mathcal{I}$ . Moreover, finding a $k$ -flat function $f$ for which the $\ell_{1}$ -distance between $\hat{p}$ and the mixture of $q$ and $f$ is minimized, can be done via dynamic programming: we define $d[i,j]$ to be the smallest $\ell_{1}$ -distance between $\hat{p}$ and mixture of $q$ and any $j$ -flat distribution when we consider only the first $i$ elements of the domain. We compute $d[i,j]$ using the previously computed $d[i^{\prime},j-1]$ :

[TABLE]

where the cost $([i^{\prime},i])$ is defined as follows: We set the cost of an interval to infinity if any subset of $[i^{\prime},i]$ which would have appeared in the divisions (i.e., all subsets of the form $[i^{\prime},i]\cap B_{z}$ for $z=1,\ldots,v$ ) does not pass the uniformity test. Otherwise, cost $([i^{\prime},i])$ is the minimum $\ell_{1}$ -distance between $\hat{p}$ and a mixture of $q$ and a constant function for the elements in $[i^{\prime},i]$ . Since we are only looking for $k$ -flat functions rather than distributions, the updates can be computed locally and independently of the rest of the segments.
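The dynamic program can be sketched as follows, under several simplifying assumptions that are ours: the mixture parameter $\alpha$ is fixed, every interval is assumed to pass the uniformity test (so no cost is infinite), and the recurrence is taken to be $d[i,j]=\min_{i^{\prime}<i}\,d[i^{\prime},j-1]+\mathrm{cost}$ of the interval from $i^{\prime}+1$ to $i$ . For a fixed $\alpha$ , the best constant on an interval is the median of the residuals $\hat{p}(x)-(1-\alpha)q(x)$ , clipped at zero. All names are hypothetical.

```python
from statistics import median

def interval_cost(resid, lo, hi):
    """l1 cost of fitting one non-negative constant level to resid[lo:hi]."""
    seg = resid[lo:hi]
    level = max(0.0, median(seg))  # l1-optimal constant, constrained to be >= 0
    return sum(abs(r - level) for r in seg)

def best_k_flat_fit(p_hat, q, alpha, k):
    """Smallest l1 distance between p_hat and (1-alpha)*q + g, where g is a
    non-negative function with at most k flat pieces (g plays the role of alpha*f)."""
    n = len(p_hat)
    resid = [p_hat[x] - (1 - alpha) * q[x] for x in range(n)]
    inf = float("inf")
    # d[i][j]: best cost on the first i elements using exactly j pieces
    d = [[inf] * (k + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            d[i][j] = min(d[ip][j - 1] + interval_cost(resid, ip, i)
                          for ip in range(i))
    return min(d[n][1:])

q = [0.25, 0.25, 0.25, 0.25]
p_hat = [0.375, 0.375, 0.125, 0.125]  # equals 0.5*q + 0.5*r for r = (0.5, 0.5, 0, 0)
print(best_k_flat_fit(p_hat, q, 0.5, 2))
print(best_k_flat_fit(p_hat, q, 0.5, 1))
```

This sketch recomputes each interval cost from scratch, so it runs in time cubic in $n$ for each $k$ ; it is meant to show the structure of the recurrence rather than an efficient implementation.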

6.3 Proofs for Section 6.2

In this section, we provide the proof for the lemmas stated earlier in Section 6.2.

See 6.4

Proof.

Suppose we draw $s$ samples from $p$ . Let $n_{x}$ indicate the number of occurrences of $x$ among the samples. Let $\hat{p}$ be the empirical distribution formed by the $s$ samples, which means that $\hat{p}(x)\coloneqq n_{x}/s$ . The goal is to show that for every segmentation $\mathcal{I}$ , the coarsenings of $\hat{p}$ and $p$ over $\widetilde{\mathcal{D}}(\mathcal{I},\mathcal{B},n,t)$ are $\epsilon^{\prime}$ -close with probability at least $1-\delta$ . We build on the standard idea that is used to show that $O(n/\epsilon^{2})$ samples are sufficient to learn a distribution over $[n]$ within $\epsilon$ error in $\ell_{1}$ -distance. Consider $\widetilde{\mathcal{D}}$ which contains $\Theta(t)$ disjoint subsets of $[n]$ . The $\ell_{1}$ -distance between the coarsening of $p$ and the empirical distribution is defined as follows:

[TABLE]

We need to show that the above quantity is at most $\epsilon^{\prime}$ for any segmentation $\mathcal{I}$ and its corresponding division $\mathcal{D}(\mathcal{I},\mathcal{B})$ . However, we prove a stronger claim: Suppose we have a collection $C$ of vectors of length $n$ with entries in $\{+1,-1\}$ for which the following is true:

•

For every refined division $\widetilde{\mathcal{D}}$ , and every set $D_{i,j,\ell}\in\widetilde{\mathcal{D}}$ , there exists a vector $c\in C$ such that if $x$ is in $D_{i,j,\ell}$ , then $c_{x}=\mbox{sign}\left(\sum_{x\in D_{i,j,\ell}}p(x)-\sum_{x\in D_{i,j,\ell}}\hat{p}(x)\right)$ .

•

For all $c\in C$ , $\sum_{x\in[n]}c_{x}\cdot(p(x)-\hat{p}(x))$ is at most $\epsilon^{\prime}$ with probability at least $1-\delta$ .

The proof is complete if we establish this claim, so now we focus on proving that the collection $C$ exists. We first put a vector $c$ corresponding to each refined division. Then we show there is an upper bound for the size of the collection. Next, we show since there are not too many vectors in the collection, with high probability, $\sum_{x\in[n]}c_{x}\cdot(p(x)-\hat{p}(x))$ is at most $\epsilon^{\prime}$ for any $c\in C$ .

Clearly, there are no more than $2^{n}$ possible vectors. However, we get a better bound for the cases when $k$ is not arbitrarily large. We begin by considering a refined division $\widetilde{\mathcal{D}}$ . Fix a set $B\in\mathcal{B}$ . If two elements in $B$ are in the same interval $I_{i}$ for $i\in[k]$ , then they will have the same $c_{x}$ as well. Thus, if we sort elements in $B$ , and then write the corresponding $c_{x}$ ’s, then we get a sequence of $+1$ and $-1$ where the sign is changed in at most $\Theta(t)$ places. To uniquely represent the sequence, one can determine the indices where the sign changed and indicate whether the sequence starts with $+1$ or $-1$ . Thus, the total number of such sequences is:

[TABLE]

Note that we have at most $v$ such subsets of the domain $B\in\mathcal{B}$ . Thus, the total number of vectors in $C$ is at most $\min(2^{n},n^{\Theta(t\cdot v)})$ .

Next, we show that if we draw enough samples, the probability of $\sum_{x\in[n]}c_{x}\cdot(p(x)-\hat{p}(x))\geq\epsilon^{\prime}$ for any $c\in C$ is at most $\delta$ . Fix a vector $c=(c_{1},\ldots,c_{n})$ in $\{+1,-1\}^{n}$ . Consider the following random process: we draw a sample from $p$ , namely $x$ ; if $c_{x}$ is one, output one and otherwise output zero. In other words upon receiving sample $x$ , we output $(1+c_{x})/2$ . Assume $x_{1},\ldots,x_{s}$ are $s$ samples from $p$ that form the empirical distribution. Suppose that we generate $b_{1},\ldots,b_{s}$ according to the process using these samples from $p$ , i.e., $b_{j}=(1+c_{x_{j}})/2$ . The $b_{j}$ ’s are $s$ independent random variables with the following expected value.

[TABLE]

Clearly, the average of the $b_{j}$ ’s is close to its expectation with high probability. We use this fact to show that $\sum_{x}c_{x}\cdot(p(x)-\hat{p}(x))$ is close to zero as well. Recall $n_{x}$ is the number of occurrences of element $x$ in the sample set. Using the Hoeffding bound, we obtain:

[TABLE]

Therefore, by setting $s=\Theta(\min(n,t\cdot v\cdot\log n)\cdot(\log\delta^{-1})/\epsilon^{\prime 2})$ and using Equation 14 and a union bound, for every $c\in C$ , $\sum_{x\in[n]}c_{x}\cdot(p(x)-\hat{p}(x))$ is at most $\epsilon^{\prime}$ with probability $1-\delta$ . This completes the proof. ∎
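The reduction to a Hoeffding bound in the proof above rests on the identity $\sum_{x}c_{x}\cdot(p(x)-\hat{p}(x))=2\,(\mathbf{E}[b_{j}]-\frac{1}{s}\sum_{j}b_{j})$ , which follows from $b_{j}=(1+c_{x_{j}})/2$ and the fact that both $p$ and $\hat{p}$ sum to one. The following sketch checks this identity on simulated samples; the distribution, the sign vector, and the seed are arbitrary illustrative choices.

```python
import random

def signed_gap(p, p_hat, c):
    """Sum over x of c_x * (p(x) - p_hat(x))."""
    return sum(cx * (px - phx) for cx, px, phx in zip(c, p, p_hat))

def indicator_gap(p, samples, c):
    """Twice the gap between E[b] and the empirical mean of b_j = (1 + c_{x_j}) / 2."""
    b = [(1 + c[x]) / 2 for x in samples]
    expected_b = sum(px * (1 + cx) / 2 for px, cx in zip(p, c))
    return 2 * (expected_b - sum(b) / len(b))

rng = random.Random(0)
p = [0.1, 0.2, 0.3, 0.4]
c = [+1, -1, -1, +1]
samples = rng.choices(range(len(p)), weights=p, k=1000)
p_hat = [samples.count(x) / len(samples) for x in range(len(p))]
# The two quantities coincide up to floating-point error.
print(signed_gap(p, p_hat, c), indicator_gap(p, samples, c))
```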

See 6.5

Proof.

Fix a non-empty set $D_{i,j,\ell}$ in $\widetilde{\mathcal{D}}$ for some $j>1$ . To prove the lemma, we show that the ratio of the maximum and the minimum probability according to $p$ in $D_{i,j,\ell}$ is at most $1+\epsilon^{\prime}$ . Consider two elements in $D_{i,j,\ell}$ namely $x$ and $y$ (if there is only one element in $D_{i,j,\ell}$ the claim is apparent). Without loss of generality assume $q(x)\leq q(y)$ . By definition of $D_{i,j,\ell}$ , $x$ and $y$ are in the same interval of $\mathcal{I}$ , so $r(x)$ and $r(y)$ are equal. Thus, we have:

[TABLE]

The second to last inequality is true because we have $q(y)\geq q(x)>0$ . Also, the last inequality is true since both $x$ and $y$ are in $B_{j}$ . In the proof of Lemma 8 in [BFF+01], it is shown that if the ratio of the probabilities in a set, in our case $D_{i,j,\ell}$ , is bounded by $(1+\epsilon)$ , then for all $x\in D_{i,j,\ell}$ , $\left|p(x)-(1/|D_{i,j,\ell}|)\right|$ is at most $\epsilon/|D_{i,j,\ell}|$ . This completes the proof. ∎
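The key step of this proof is that adding the same non-negative amount $\alpha\,r$ to both probabilities can only bring their ratio closer to one: for $0<q(x)\leq q(y)$ , the ratio $p(y)/p(x)$ is at most $q(y)/q(x)$ . The sketch below checks this numerically over a small grid; the grid values are arbitrary and the function name is illustrative.

```python
def mixture_ratio(qx, qy, r, alpha):
    """Ratio p(y)/p(x) for p = (1 - alpha) * q + alpha * r, where r(x) = r(y) = r."""
    px = (1 - alpha) * qx + alpha * r
    py = (1 - alpha) * qy + alpha * r
    return py / px

checked = 0
for qx, qy in [(0.01, 0.0105), (0.02, 0.02), (0.001, 0.0011)]:
    for r in [0.0, 0.01, 0.2]:
        for alpha in [0.0, 0.3, 0.9]:
            ratio = mixture_ratio(qx, qy, r, alpha)
            # The mixture never increases the ratio beyond q(y)/q(x).
            assert 1.0 - 1e-12 <= ratio <= qy / qx + 1e-12
            checked += 1
print(checked, "grid points checked")
```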

See 6.6

Proof.

Fix a set in $\widetilde{\mathcal{D}}$ , namely $D$ , for which $p(D)$ and $q(D)$ are non-zero. We have the following:

[TABLE]

Therefore, we have:

[TABLE]

If we swap $p$ and $q$ in the above inequality, and replicate the equations, we have:

[TABLE]

Putting Equation 16 and Equation 17 together, we get:

[TABLE]

If at least one of $p(D)$ and $q(D)$ is zero, it implies:

[TABLE]

Hence, we have:

[TABLE]

∎
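As a numerical illustration of this type of decomposition, the sketch below checks one inequality of this form on random distributions: $\|p-q\|_{1}\leq\|p_{\langle\mathcal{D}\rangle}-q_{\langle\mathcal{D}\rangle}\|_{1}+\sum_{D}p(D)\,\|p_{|D}-q_{|D}\|_{1}$ , with the restriction distance taken to be zero when $p(D)$ or $q(D)$ is zero. The weights in the exact statement of Lemma 6.6 may differ; the partition, the seed, and the function names are illustrative assumptions.

```python
import random

def l1(a, b):
    """l1 distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

def decomposition_bound(p, q, partition):
    """Coarsening distance plus p-weighted distances between restrictions."""
    coarse = 0.0
    within = 0.0
    for part in partition:
        pd = sum(p[x] for x in part)
        qd = sum(q[x] for x in part)
        coarse += abs(pd - qd)
        if pd > 0 and qd > 0:  # restriction distance is defined as 0 otherwise
            within += pd * sum(abs(p[x] / pd - q[x] / qd) for x in part)
    return coarse + within

def random_distribution(n, rng):
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
partition = [[0, 1, 2], [3, 4], [5, 6, 7]]
for _ in range(100):
    p = random_distribution(8, rng)
    q = random_distribution(8, rng)
    assert l1(p, q) <= decomposition_bound(p, q, partition) + 1e-12
print("inequality held on all random trials")
```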

See 6.7

Proof.

We first consider a non-empty $D_{i,j,\ell}$ when $j=1$ . Since $j=1$ , $D_{i,j,\ell}$ is a subset of $B_{1}$ . For each $x\in D_{i,j,\ell}$ , $q(x)$ is at most $\epsilon^{\prime 2}/n$ . Also, $r$ is $k$ -flat on $\mathcal{I}$ , and since $D_{i,j,\ell}$ is a subset of $I_{i}$ , $r(x)$ is the same for all $x\in D_{i,j,\ell}$ . We denote this quantity, $r(x)$ , by $b$ . Here, we prove that either $q_{\alpha}(D_{i,j,\ell})$ is small, or $(q_{\alpha})_{|D_{i,j,\ell}}$ has to be close to uniform.

We have two cases. First, suppose $\alpha\cdot b$ is at most $\epsilon^{\prime}/n$ . In this case, $q_{\alpha}(D_{i,j,\ell})$ is at most $\epsilon^{\prime 2}|D_{i,j,\ell}|/n$ . Thus, the total weight of such sets, the sum of the $q_{\alpha}(D_{i,j,\ell})$ ’s, is at most $\epsilon^{\prime 2}$ . Second, assume $\alpha\cdot b$ is greater than $\epsilon^{\prime}/n$ . On the other hand, $q(x)$ is at most $\epsilon^{\prime 2}/n$ . These two facts imply for each $x$ in $D_{i,j,\ell}$ :

[TABLE]

Therefore, the $\ell_{2}^{2}$ -distance between $(q_{\alpha})_{|D_{i,j,\ell}}$ and the uniform distribution is bounded:

[TABLE]

Note that if $j$ is greater than one, the $\ell_{2}$ -distance between $(q_{\alpha})_{|D_{i,j,\ell}}$ and the uniform distribution is bounded by $\epsilon^{\prime}/\sqrt{|D_{i,j,\ell}|}$ as well. Therefore, if $p_{|D_{i,j,\ell}}$ is close to the uniform distribution, it has to be close to $(q_{\alpha})_{|D_{i,j,\ell}}$ as well. That is,

[TABLE]

Hence, given the discussion above, there are three possibilities for $D_{i,j,\ell}$ . (i) $p(D_{i,j,\ell})$ is at most $\epsilon^{\prime}/(k\cdot v)$ . Since the $\ell_{1}$ -distance between two distributions is at most $2$ , the total contribution of these sets in the sum below is at most $2\epsilon^{\prime}$ . (ii) $q_{\alpha}(D_{i,j,\ell})$ is at most $\epsilon^{\prime 2}|D_{i,j,\ell}|/n$ , so the total contribution of these sets is at most $2\epsilon^{\prime 2}$ . (iii) $\|p_{|D_{i,j,\ell}}-(q_{\alpha})_{|D_{i,j,\ell}}\|_{1}$ is at most $2.42\,\epsilon^{\prime}$ .

[TABLE]

Hence, the proof is complete. ∎

See 6.8

Proof.

First, consider a degenerate case. If $\alpha=0$ , then the claim is trivially true by the triangle inequality: $\|p-q\|_{1}\leq\|p-\hat{p}\|_{1}+\|\hat{p}-q\|_{1}\leq 3\epsilon^{\prime}\,.$ Thus, assume $\alpha>0$ .

For now, consider the case that there exists $x$ such that $f(x)$ is not zero, so $\sum_{x}f(x)$ is greater than zero. First, we show that since $\hat{p}$ is close to the mixture of $q$ and $f$ , the sum of the $f(x)$ ’s has to be close to one. That is,

[TABLE]

We define $r:[n]\rightarrow[0,1]$ to be the normalization of $f$ for which $r(x)=f(x)/\sum_{x}f(x)$ for all $x$ in the domain. If $f$ is a $k$ -flat function, then $r$ will be a $k$ -flat distribution. Now, we show that the mixture of $q$ and $f$ is close to the mixture of $q$ and $r$ with mixture parameter $\alpha$ .

[TABLE]

where the last inequality is due to Equation 18. Moreover, by the triangle inequality, we have:

[TABLE]

Now, assume $f(x)$ is zero for all $x$ in $[n]$ . We show that if we set $r$ to be the uniform distribution over $[n]$ , the same result holds. First, observe that the uniform distribution is a $k$ -flat distribution for any $k\geq 1$ . Then we show that $(1-\alpha)\,q+\alpha r$ is $4\epsilon^{\prime}$ -close to $\hat{p}$ . Since $r(x)=1/n$ for all $x$ in $[n]$ ,

[TABLE]

On the other hand, since $\hat{p}$ is $2\epsilon^{\prime}$ -close to $(1-\alpha)\,q$ , one can show that $\alpha$ is at most $\epsilon^{\prime}$ .

[TABLE]

Therefore, whether $\sum_{x}f(x)$ is zero or not, there exists a $k$ -flat distribution $r$ for which $\|\hat{p}-((1-\alpha)\,q+\alpha\,r)\|_{1}$ is at most $4\,\epsilon^{\prime}$ . Since $\hat{p}$ is $\epsilon^{\prime}$ -close to $p$ , by the triangle inequality, we have:

[TABLE]

which concludes the proof. ∎

7 Lower bounds

In this section, we present lower bounds for testing mixtures in different settings discussed earlier.

\LBBigness

Proof.

We prove the theorem by reducing the problem of testing the bigness property of distributions to mixture testing. A distribution is called $T$ -big if the probability of every domain element is at least $T$ [AGP+19]. In addition, the authors of [AGP+19] showed that there exist two constant parameters $\epsilon$ and $\beta$ and two families of distributions, namely $\mathcal{F}^{+}$ and $\mathcal{F}^{-}$ , such that the following hold:

• All distributions in $\mathcal{F}^{+}$ are $1/(\beta n)$ -big.

• All distributions in $\mathcal{F}^{-}$ are $\epsilon$ -far from being $1/(\beta n)$ -big. Moreover, under each of these distributions, the probability of every element is either zero or at least $1/(\beta n)$ .

• Using $o(n/\log n)$ samples from a distribution in one of the families, no algorithm can distinguish whether the distribution came from $\mathcal{F}^{+}$ or $\mathcal{F}^{-}$ with probability at least $2/3$ .

Let $\epsilon=1/\beta$ . We show that any algorithm that can test mixtures as described in the theorem can distinguish $\mathcal{F}^{+}$ from $\mathcal{F}^{-}$ with high probability.

First, we show that for any $1/(\beta n)$ -big distribution, denoted by $p^{+}$ , there exists a distribution $\eta$ such that $p^{+}$ is a mixture of $\eta$ and the uniform distribution, meaning $p^{+}=\alpha\,\eta+(1-\alpha)\,\mathcal{U}$ for $\alpha=1/\beta$ . Let $\eta$ assign the following probability to the $i$ th element of the domain:

[TABLE]

It is not hard to see that $\eta$ as defined above is a probability distribution. Since $p^{+}$ is a $1/(\beta n)$ -big distribution, all the $p^{+}(i)$ ’s are at least $1/(\beta n)$ , so all the $\eta(i)$ ’s are non-negative. Also, $\sum_{i}\eta(i)=1$ . Clearly, $p^{+}$ is a mixture of the form $\alpha\,\eta+(1-\alpha)\,\mathcal{U}$ for $\alpha=1/\beta=\epsilon$ .

Note that for any distribution $p^{-}$ in $\mathcal{F}^{-}$ there is at least one element (in fact many elements) that has probability zero. Otherwise, all elements would have probability at least $1/(\beta n)$ , and the distribution would be big. On the other hand, any distribution that is mixed with uniform with parameter $\alpha<1$ cannot have any zero probability element. Thus, $p^{-}$ is not a mixture of the form $\alpha\,\eta+(1-\alpha)\,\mathcal{U}$ when $\alpha\neq 1$ .
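Both halves of this argument can be illustrated with a small sketch. Here `w` denotes the coefficient on $\mathcal{U}$ in the mixture (written $1-\alpha$ in the text), and the sketch relies only on the elementary fact that a distribution in which every element has mass at least $w/n$ can be written as $(1-w)\,\eta+w\,\mathcal{U}$ . The function name `uniform_remainder` and the example distributions are assumptions made for illustration.

```python
from fractions import Fraction


def uniform_remainder(p, w):
    """Solve p = (1 - w) * eta + w * U for eta, where U is uniform on len(p) points.

    Raises ValueError when some element of p has mass below w / n, in which
    case no distribution eta can satisfy the equation.
    """
    n = len(p)
    if not 0 <= w < 1:
        raise ValueError("uniform weight must lie in [0, 1)")
    floor = Fraction(w) / n  # mass that the uniform component gives every element
    if any(pi < floor for pi in p):
        raise ValueError("some element has mass below w/n; no such mixture exists")
    return [(pi - floor) / (1 - w) for pi in p]


# Hypothetical big distribution: every element has mass at least 1/8 = w/n.
p_big = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
w = Fraction(1, 2)
eta = uniform_remainder(p_big, w)

# Hypothetical distribution with zero-mass elements: cannot contain a uniform part.
p_zero = [Fraction(1, 2), Fraction(1, 2), Fraction(0), Fraction(0)]
try:
    uniform_remainder(p_zero, w)
    zero_case_is_mixture = True
except ValueError:
    zero_case_is_mixture = False
```

The second call fails for every positive `w`, which mirrors the observation that a distribution with a zero-probability element cannot carry any positive weight on the uniform distribution.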

Thus, any algorithm that can test the mixture property as defined in the theorem has to accept $p^{+}$ and reject $p^{-}$ . However, we know this is not possible unless the algorithm gets $\Omega(n/\log n)$ samples. This completes the proof. ∎

Proposition 1.

When we have sample access to $q$ and $p$ , any closeness tester in the presence of uniform noise requires $\Omega\left(\max\left(n^{2/3}/\epsilon^{4/3},\sqrt{n}/\epsilon^{2}\right)\right)$ samples.

Proof.

First, note that one can reduce testing uniformity to this problem by setting $q$ equal to the uniform distribution. Therefore, it requires at least $\Omega(\sqrt{n}/\epsilon^{2})$ samples by the lower bound for uniformity testing shown in [Pan08].
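The reduction can be sketched as a thin wrapper. When $q=\mathcal{U}$ , every mixture of $q$ and $\mathcal{U}$ is again $\mathcal{U}$ , so a mixture closeness tester run with uniform samples in place of $q$ decides uniformity of $p$ . The interface `mixture_tester(sample_p, sample_q, n, eps)` is a hypothetical one chosen for this illustration, and `naive_mixture_tester` is only a sample-inefficient placeholder so that the sketch runs; it is not the tester from this paper.

```python
import random
from collections import Counter


def empirical(sample, n, m):
    """Empirical distribution over range(n) from m draws of sample()."""
    counts = Counter(sample() for _ in range(m))
    return [counts[x] / m for x in range(n)]


def naive_mixture_tester(sample_p, sample_q, n, eps, m=20000, grid=100):
    """Placeholder tester (not sample-efficient): accept iff the empirical p is
    within eps / 2 in l1 of (1 - alpha) * q_hat + alpha * U for some grid alpha."""
    p_hat = empirical(sample_p, n, m)
    q_hat = empirical(sample_q, n, m)
    for step in range(grid + 1):
        alpha = step / grid
        dist = sum(
            abs(p_hat[x] - ((1 - alpha) * q_hat[x] + alpha / n)) for x in range(n)
        )
        if dist <= eps / 2:
            return True
    return False


def uniformity_tester(mixture_tester, sample_p, n, eps, rng):
    """Reduction: supply uniform samples as q, so every mixture of q and U is U."""
    return mixture_tester(sample_p, lambda: rng.randrange(n), n, eps)


# Hypothetical demo on a domain of size 4 with a fixed seed.
rng = random.Random(0)
n = 4


def uniform_p():
    return rng.randrange(n)


def skewed_p():
    # Probabilities (0.7, 0.1, 0.1, 0.1): l1 distance 0.9 from uniform.
    return rng.choices(range(n), weights=[7, 1, 1, 1])[0]


accepts_uniform = uniformity_tester(naive_mixture_tester, uniform_p, n, 0.5, rng)
accepts_skewed = uniformity_tester(naive_mixture_tester, skewed_p, n, 0.5, rng)
```

Because the wrapper adds no samples of its own beyond simulating $q$ , any sample lower bound for uniformity testing carries over to the mixture tester it calls.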

Now, we establish that $\Omega(n^{2/3}/\epsilon^{4/3})$ samples are also required. Without loss of generality, assume $\epsilon\geq 4^{3/4}/n^{1/4}$ ; otherwise, $\sqrt{n}/\epsilon^{2}$ is the dominating term in the lower bound up to a constant factor. To prove the lower bound, we use two distributions (and any random relabeling of them) used in proving lower bounds for testing closeness of distributions [BFR+13, VV17b, VV17a, CDVV14]. More precisely, we define two distributions $p^{*}$ and $q^{*}$ such that distinguishing $(p^{*},q^{*})$ from $(q^{*},q^{*})$ (and any random relabeling of them) requires $\Omega(n^{2/3}/\epsilon^{4/3})$ samples. On the other hand, we show that any $(q^{*},\mathcal{U},\epsilon)$ -mixture tester has to distinguish $(p^{*},q^{*})$ from $(q^{*},q^{*})$ . The statement of the proposition follows.

Let $a=4\epsilon/n$ and $b=\epsilon^{4/3}/n^{2/3}$ . Consider three disjoint subsets of the domain $[n]$ , namely $A$ , $B$ , and $C$ , of sizes $(1-\epsilon)/b$ , $\epsilon/a$ , and $\epsilon/a$ respectively. Let $p^{*}$ and $q^{*}$ be the following distributions:

[TABLE]

Note that $p^{*}$ is $\epsilon$ -far from any mixture distribution of $q^{*}$ and $\mathcal{U}$ with parameter $\alpha\in[0,1]$ , since

[TABLE]

Clearly, in the case where $p=q^{*}$ and $q=q^{*}$ , $p$ is a mixture of $q$ and $\mathcal{U}$ with mixture parameter $\alpha=0$ , and in the case where $p=p^{*}$ and $q=q^{*}$ , $p$ is $\epsilon$ -far from any mixture distribution of $q^{*}$ and $\mathcal{U}$ . Thus, a $(q,\mathcal{U},\epsilon)$ -mixture tester has to distinguish between $(q^{*},q^{*})$ and $(p^{*},q^{*})$ . By Proposition 4.1 in [CDVV14], we know that this task requires $\Omega(n^{2/3}/\epsilon^{4/3})$ samples. ∎
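The distance claim can be sanity-checked for one concrete choice of parameters. The definitions of $p^{*}$ and $q^{*}$ are not reproduced above, so the sketch below ASSUMES the usual construction from the closeness-testing lower bounds: both distributions put mass $b$ on each element of $A$ , $p^{*}$ puts mass $a$ on each element of $B$ , and $q^{*}$ puts mass $a$ on each element of $C$ . The values $n=2^{24}$ and $\epsilon=1/8$ are also assumptions, chosen so that all set sizes are integers. This checks one instance on a grid of $\alpha$ and is not a proof.

```python
from fractions import Fraction


def l1_to_mixture(blocks, n, alpha):
    """l1 distance between p and (1 - alpha) * q + alpha * U.

    The domain is described by blocks (size, p_value, q_value) on which both
    p and q are constant, so the domain is never materialized element by element.
    """
    u = Fraction(1, n)
    return sum(
        size * abs(p_val - ((1 - alpha) * q_val + alpha * u))
        for size, p_val, q_val in blocks
    )


# Assumed parameters: n = 2**24 and eps = 1/8 satisfy eps >= 4**(3/4) / n**(1/4).
n = 2 ** 24
eps = Fraction(1, 8)
a = 4 * eps / n            # equals 2**-25
b = Fraction(1, 2 ** 20)   # eps**(4/3) / n**(2/3) = 2**-4 / 2**16 for these values

size_A = (1 - eps) / b
size_B = size_C = eps / a
rest = n - size_A - size_B - size_C  # elements outside A, B, and C

# Assumed construction (see the lead-in): (size, p*_value, q*_value) per block.
blocks = [(size_A, b, b), (size_B, a, 0), (size_C, 0, a), (rest, 0, 0)]

p_mass = sum(size * p_val for size, p_val, _ in blocks)
q_mass = sum(size * q_val for size, _, q_val in blocks)
distances = [l1_to_mixture(blocks, n, Fraction(k, 100)) for k in range(101)]
```

Under these assumptions, the distance at $\alpha=0$ is $\|p^{*}-q^{*}\|_{1}=2\epsilon$ , and every grid value of $\alpha$ gives a distance of at least $\epsilon$ .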

Bibliography


[ABR16] Maryam Aliakbarpour, Eric Blais, and Ronitt Rubinfeld. Learning and testing junta distributions. In COLT, pages 19–46, 2016.
[ADK15] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. In NIPS, pages 3591–3599, 2015.
[AGP+19] Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. Towards testing monotonicity of distributions over general posets. In Proceedings of the Thirty-Second Conference on Learning Theory, COLT, pages 34–82, 2019.
[AJOS14] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda T. Suresh. Sublinear algorithms for outlier detection and generalized closeness testing. In IEEE ISIT, pages 3200–3204, 2014.
[Bat01] Tugkan Batu. Testing Properties of Distributions. PhD thesis, Cornell University, 2001.
[BDKR02] Tugkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of approximating entropy. In STOC, pages 678–687, 2002.
[BFF+01] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In FOCS, pages 442–451, 2001.
[BFR+13] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing closeness of discrete distributions. JACM, 60(1):4:1–4:25, 2013.
