Property Testing of Joint Distributions using Conditional Samples
Rishiraj Bhattacharyya, Sourav Chakraborty

TL;DR
This paper introduces a subcube conditional sampling model for testing properties of joint distributions, achieving polynomial sample complexity in the dimension, unlike traditional models with exponential complexity.
Contribution
It proposes a new subcube conditional sampling framework and develops algorithms with polynomial sample complexity for property testing of joint distributions.
Findings
Polynomial sample complexity for identity testing algorithms
Efficient algorithms for testing against known and unknown distributions
Avoidance of the curse of dimensionality through a chain rule technique
| Problem | Conditional sampling: upper bound [this paper] | Conditional sampling: lower bound | Traditional sampling: upper and lower bound |
|---|---|---|---|
| Identity to the uniform distribution | | [CDKS17] | [Pan08] |
| Identity to a known distribution | | [CDKS17] | [VV14] |
| Identity between two unknown distributions | | [CDKS17] | [oCDVV14] |
| Identity to a product distribution | | [CDKS17] | [ACK15, DK16] |
Rishiraj Bhattacharyya, NISER Bhubaneswar, HBNI, India
Sourav Chakraborty, Chennai Mathematical Institute, Chennai, India, and CWI Amsterdam, The Netherlands
Abstract
In this paper, we consider the problem of testing properties of joint distributions under the Conditional Sampling framework. In the standard sampling model, the sample complexity of testing properties of joint distributions is exponential in the dimension, resulting in inefficient algorithms for practical use. While recent results achieve efficient algorithms for product distributions with significantly smaller sample complexity, no efficient algorithm is expected when the marginals are not independent.
We initiate the study of conditional sampling in the multidimensional setting. We propose a subcube conditional sampling model where the tester can condition on an (adaptively) chosen subcube of the domain. Due to its simplicity, this model is potentially implementable in many practical applications, particularly when the distribution is a joint distribution over for some set .
We present algorithms for various fundamental properties of distributions in the subcube-conditioning model and prove that the sample complexity is polynomial in the dimension (and not exponential as in the traditional model). We present an algorithm for testing identity to a known distribution using -subcube-conditional samples, an algorithm for testing identity between two unknown distributions using -subcube-conditional samples and an algorithm for testing identity to a product distribution using -subcube-conditional samples.
The central concept of our technique is an elegant chain rule, which can be proved using basic techniques of probability theory yet is powerful enough to avoid the curse of dimensionality.
1 Introduction
Property Testing of Distributions. The boom of Big Data Analytics has rejuvenated the well-studied area of hypothesis testing over unknown distributions. In Computer Science, the study of this type of problem was initiated by Batu, Fortnow, Rubinfeld, Smith, and White [BFR+13] under the framework of “Property Testing” [GGR98, RS96]. In this framework, the “tester” draws independent samples from the distribution, and decides whether the distribution satisfies a specific property (null hypothesis) or is far from any distribution that satisfies it (alternate hypothesis).
Several properties of probability distributions have been studied in this framework. Testing whether the distribution is uniform [BFF+01a, GR11, oCDVV14], testing identity between two unknown distributions (taking samples from both the distributions) [BFR+13, LRR13], testing independence of marginals of product distributions [BFF+01a], and estimating entropy [BDKR05] are a few of the numerous problems that have been studied in the literature. See [Can15b] for a survey on results related to distribution testing.
Unfortunately, from the modern data analytics point of view, the traditional framework of sampling yields impractical sample complexity. For example, testing if a distribution over a set of elements is uniform requires samples from the distribution. The other problems mentioned above have sample complexity at least this high and in some cases, almost linear in [RRSS09, VV11, Val11].
Conditional Sampling
To remedy this situation, Chakraborty et al. [CFGM16] and Canonne, Ron, and Servedio [CRS15] proposed a different model called conditional sampling, which has emerged as a powerful tool for testing properties of probability distributions. In this model, the testers are allowed to sample according to the distribution conditioned on any specific subset of the domain. If the distribution, , is over the domain , the tester can submit any subset and receive a sample with probability , where is the probability of occurring when a sample is drawn from the distribution .
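As a rough illustration of the conditional sampling model just described, the sketch below simulates a conditional oracle for a distribution that is written out explicitly as a dictionary. The function name `cond_sample` and the dictionary representation are assumptions made for this example; a real tester only has oracle access and never sees the probabilities.

```python
import random

def cond_sample(pmf, subset, rng=random):
    """Draw one sample from `pmf` conditioned on `subset`.

    pmf    : dict mapping each domain element to its probability
             (hypothetical explicit representation, for illustration only).
    subset : the conditioning set chosen by the tester.
    Each element x of the subset is returned with probability
    pmf[x] / (total probability of the subset).
    """
    elems = [x for x in pmf if x in subset]
    weights = [pmf[x] for x in elems]
    if not elems or sum(weights) == 0:
        raise ValueError("conditioning set has zero probability")
    return rng.choices(elems, weights=weights, k=1)[0]

pmf = {"a": 0.5, "b": 0.3, "c": 0.2}
# Conditioning on a pair of points: "b" is returned with probability 0.6, "c" with 0.4.
sample = cond_sample(pmf, {"b", "c"})
```

Conditioning on a two-element set, as in the last line, is the kind of query that lets a tester compare the relative weights of two points directly.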
[CFGM16, CRS15] proved that in the conditional sampling model, testing uniformity, testing identity to a known distribution, and testing any label-invariant property of distributions are easier than with the unconditional sampling model. Specifically, one can get an algorithm for testing uniformity using conditional samples (conditioning on arbitrary subsets of size ) [CRS15]. Falahatgar et al. [FJO+15], improving an upper bound of in [CRS15], showed that testing identity to a known distribution could also be done using conditional samples. They also showed that there exists an algorithm to test identity between two unknown distributions on using conditional samples. In [ACK15], Acharya, Canonne, and Kamath showed a lower bound of for testing the equivalence of two unknown distributions.
In the conditional sampling model, the sample complexity depends on the structure of the condition, i.e., the structure of the subsets (of the domain) on which the distribution is conditioned for drawing samples. Naturally, if there is no restriction on the condition, the tester can sample conditioned on arbitrary subsets, and the sample complexity improves. In [CRS15], the authors presented an algorithm for testing whether a distribution over is uniform, with sample complexity when conditioning on arbitrary subsets of size . However, when the condition set was structured and restricted to intervals, they proved a lower bound of sample complexity . In [Can15a], Canonne showed that conditioning on intervals improves the query complexity of monotonicity testing. Hence it is important to consider the plausible restrictions on the conditions arising from the structure of the domain.
While [CRS15] studied some of the restrictions of the conditions, there are many more restrictions, which arise from the structure of the domain and/or arise from other applications, which are yet to be studied. One such important case is when the domain is a Cartesian product of sets and one is allowed to condition on the Cartesian product of subsets, but not on arbitrary subsets of the domain.
Testing Joint Distributions: Subcube Conditioning
In practice, data are often multi-dimensional. In Cryptography, the keys are often defined over . Solutions to SAT formulae are over as well. On the other hand, lottery tickets are defined over for some (each ticket contains numbers, each from the set ). Data analysts often get data with millions of dimensions (features). With higher dimension comes the “curse of dimensionality.” The sample complexity of the testers is exponential in the dimension [ADK15, BFF+01b, DK16], prohibiting practical applications. Very recently, [DDK18] considered testing higher dimensional structured distributions modelled using Markov Random Fields and achieved polynomial (in the dimension) sample complexity under the Ising model. [CDKS17, DP17] considered testing properties of structured distributions using the probabilistic graphical model and achieved sublinear complexity for certain properties of Bayesian networks. However, all these results assume the distribution is structured and has certain properties. For arbitrary distributions, testing with practical complexity remains a big concern.
One can be hopeful that using conditional sampling, testing properties of arbitrary joint distributions with practical complexity can be achieved. In that case, the assumptions are imposed on the sampling model. Finding a correct and natural sampling model is a challenge in itself. While joint distributions can also be viewed as a distribution over a larger domain, the marginals’ domains may differ. Hence sampling conditioned on arbitrary subsets (as used in [CRS15, CFGM16]) may not be feasible in real life.
In [CRS15], the authors also considered structured conditioning, namely ICond (conditioning over an interval) and PCond (conditioning over a pair of points). ICond requires the domain to be well ordered. Moreover, for both cases, one should be able to sample from arbitrary intervals. For a joint distribution, the natural ordering of the domain is a pair; it involves ordering in the dimensions coupled with ordering in the individual domains. For such an ordering, an arbitrary interval, as required by the ICond tester, need not be succinctly encodable, so the model remains impractical.
1.1 Our Results
In this paper, we propose the subcube conditioning model and analyze property testing of joint distributions in that model.
Informally, the subcube conditioning model can be described in the following way. Let be the domain of the distribution . The Subcube Conditioning Oracle accepts and constructs as the condition set. The oracle returns a vector , where each , with probability . If , we assume the oracle returns an element from uniformly at random. We will call these kinds of samples subcube-conditional-samples and the corresponding sample complexity subcube-conditional-sample complexity. There is no restriction on the individual s. They may be unstructured or structured as pairs or intervals as used in [CRS15, CFGM16].
Motivation of SubCube conditioning
We believe the subcube conditional sampling model is mathematically interesting in itself. Every Boolean function can be modelled as a subgraph of a hypercube. Testing a property of a Boolean function translates to testing some property of the resulting subgraph. The conditional sampling model is equivalent to sampling over the edges of such subgraph, i.e., fixing some vertices, sampling over the edges, and checking the properties of the adjacent vertices. We argue sampling over the hypercube arises naturally in many areas.
Database Query. A typical “SELECT” query to a database often looks like SELECT field1 WHERE field2= cond1 and field2 = cond2. The response to such a query is all the tuples which satisfy cond1 and cond2. Sampling over such tuples is indeed conditional sampling.
Side Channel Cryptanalysis. In modern Cryptography, schemes are often “proven” secure (no efficient attack algorithm exists) under the assumption that the keys, internal randomness, and internal memory are inaccessible to the adversary. However, in practice, Cryptographic schemes are deployed in a wide variety of devices, specifically hand-held devices and smart cards. This situation leads to the “side channel attacks” where tampering with the keys or internal randomness is feasible. Specifically, the cryptanalytic techniques of fault attacks fix/modify some bits and test the resulting distributions. The subcube conditioning model captures this attack scenario (fixing some bits and testing on the resulting subcube).
Our results in this paper can be viewed as proof that “indistinguishability” from uniform (in fact, from any known distribution) cannot be proven if an adversary can tamper with the internal state. (Footnote 1: While this result is folklore in Cryptography, the subcube conditioning may be considered as the benchmark model while analyzing the efficiency of a fault attack.)
Verification of Random SAT solutions. In software verification and related areas, random solutions to SAT problems are often used as a backbone. However, testing whether the solution that one algorithm generates is indeed uniform is a very important problem. Unfortunately, the standard algorithms require impractical complexity. Recently, Chakraborty et al. [CM16] used the conditional sampling model to get a practically deployable solution. The model of subcube conditioning would be very effective for this problem, as one natural conditioning technique is to fix some variables of the SAT formula and then test the solution’s distribution.
Recently, [GTZ17] significantly improved the runtimes of sublinear algorithms for k-means clustering and weight estimation of minimum spanning trees using conditional samples. We believe that subcube conditioning can be used in this setting as well.
We remark that the idea of subcube conditioning has also been mentioned in the literature related to property testing. In fact, analysis of joint distributions using subcube conditioning was posed as a natural open problem in [CRS15].
Our Results
We focus on four fundamental properties of distributions: given two joint distributions and over we would like to test, using subcube-conditional-samples, if (a) is uniform, (b) is identical to (when is known in advance), (c) is identical to (when is not known in advance and has to be accessed using conditional samples), and (d) is a product distribution. We have the following four theorems:
Theorem 1.1. (Informal) Let be a probability distribution over . There exists an algorithm for testing if is uniform, using subcube-conditional-samples. (Footnote 2: hides a polynomial function of and .)
Theorem 1.2. (Informal) Let be a known probability distribution over the set . Let be an unknown distribution over . There exists an algorithm to test identity of with using subcube-conditional-samples. (See footnote 2.)
Theorem 1.3. (Informal) Let be unknown distributions over . There exists an algorithm to test if and are identical using subcube-conditional-samples from both and . (See footnote 2.)
Theorem 1.4. (Informal) Let be a probability distribution over the set . There exists an algorithm to test whether is a product distribution using subcube-conditional-samples. (See footnote 2.)
Comparison to Previous Results
While conditional sampling has been studied in a number of articles in the recent past, and although subcube conditioning is a very natural model (that is also discussed in [CRS15]), as far as we understand, this is the first formal study on subcube conditioning. One of the main reasons for the lack of literature in this area is that the classical setting was not well studied either until recently. In [CDKS17], Canonne et al. studied the problem of testing properties of joint distributions over the domain . For example, for the fundamental problem of testing if the distribution is uniform, they observed that if the distribution is a product distribution (that is, the marginals are independent), then one needs samples. But if the marginals are not independent, then in the worst case, samples are necessary.
In comparison, we show that only subcube-conditional samples suffice in the worst case, so we have an exponential improvement in the sample complexity. Also, it is interesting to note that the sample complexity for uniformity testing in the subcube model is independent of . This shows the power of subcube conditional samples and brings the query complexity to a more practical level. Also, from [CDKS17] we know that conditional samples are necessary since, in the case of product distributions, conditional samples give no additional power over standard samples.
A list of our results and a comparison to previous results on standard sampling algorithms is given in Table 1.
Overview of Our Technique
Let us start with the problem of testing if a given distribution is uniform. Let be a distribution over with marginals .
The simplest case is when is a product of independent distributions. That is, ’s are independent but not necessarily identical. But if is -far from uniform , one expects to find at least one which is -far from uniform. Then one can use any tester over if is far from uniform, which should make at most poly() traditional queries. In fact, when is a product distribution over , [CDKS17] show that the uniformity and identity can be tested using unconditional samples. As the marginals of are independent and over , subcube-conditional-sampling is equivalent to unconditional sampling followed by projections, and hence subcube-conditional samples do not give any additional power in this setting.
But if the ’s are not independent, then it is possible that all the individual marginals are uniform, but still, the is -far from uniform. As has been observed in [CDKS17], any algorithm (using unconditional sampling) requires queries. To circumvent this barrier, we need to use conditional samples. We define a notion of “conditional distance”. We show that there exists at least one such that the expected “conditional distance” of th marginal from uniform is more than $\epsilon/\mathrm{poly}(n)$. Thus it is enough to test for all if the th marginal is $\epsilon/\mathrm{poly}(n)$-far from uniform. We can use the testers from [CRS15, CFGM16] to test exactly that condition using poly() subcube-conditional samples. The central idea of the correctness of the algorithm is the correct definition of the “conditional distance” and the “chain rule” that proves that such an exists. Although the proof of the “chain rule” (given in Section 3) is simple in hindsight, it is a powerful tool that acts as the central backbone for all our upper-bound proofs. Moreover, it gives the flexibility of using an adaptive or non-adaptive tester over .
1.2 Organization of the paper
In Section 2, we define the notion of conditional distance and SubCube Conditioning. The chain rule is described in Section 3. In Section 4 we present the identity testers and the derived uniformity tester. In Section 5, the tester for testing identity between two unknown distributions is presented. In Section 6, the tester for the independence of marginals is described. In Appendix A we present a lower bound of for testing identity to the uniform distribution. This lower bound was proved independently of [CDKS17] and although our lower bound is weaker than their lower bound of , we feel that our techniques can be of independent interest.
2 Notations and Preliminaries
If is a set, denotes the size of the set. If is a vector of length , denotes the element of . denotes the substring of first elements of ; . We denote the -th harmonic number by .
For any set , we denote by the uniform distribution with support . In most cases, the support of the distribution would be clear from the context and in that case, we would drop the subscript and use as the uniform distribution over the support in question.
If is a distribution with support , for any , we will denote by the probability that occurs when a random sample is drawn from according to . If is a joint distribution, denotes the marginal distribution of .
If is a distribution over with the marginals and if the marginals are independent (that is, is a product distribution) then we would write .
Total Variation Distance. Let be two distributions with support . The variation distance between and denoted by is defined as
[TABLE]
We say and are -far (or is -far from ), when
If is a distribution with support and , then by , we denote the distribution over the support . For any , the probability that occurs when a random sample is drawn from (according to the distribution ) is given by
[TABLE]
Hellinger Distance. Let be two distributions with support . The Hellinger distance between and denoted by is defined as
[TABLE]
Hellinger distance has some nice properties and is useful for lower and upper bounding the variation distance.
[TABLE]
Also for any two product distributions and
[TABLE]
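The two distances defined above can be computed directly for small explicit distributions. The sketch below uses the standard textbook definitions: total variation as half the L1 distance, and Hellinger distance normalised by a factor of 1/√2 so that it lies in [0, 1]. Both the normalisation and the names `tv_distance` and `hellinger` are assumptions made for this example and may differ from the conventions used in the displayed formulas. Under this normalisation the standard sandwich H² ≤ d_TV ≤ √2·H holds, which the example checks numerically.

```python
import math

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between two pmfs (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def hellinger(p, q):
    """Hellinger distance, normalised to [0, 1] (assumed convention)."""
    support = set(p) | set(q)
    s = sum((math.sqrt(p.get(x, 0.0)) - math.sqrt(q.get(x, 0.0))) ** 2
            for x in support)
    return math.sqrt(s / 2.0)

p = {0: 0.5, 1: 0.5}
q = {0: 0.9, 1: 0.1}
tv = tv_distance(p, q)
h = hellinger(p, q)
# Standard relation under this normalisation: h^2 <= tv <= sqrt(2) * h
sandwich_holds = h * h <= tv <= math.sqrt(2) * h
```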
Conditional Distance. Let be two distributions over . Let . The variation distance between and conditioned on (denoted by ) is defined as
[TABLE]
We say and are -far, conditioned on , when
Subcube Conditioning. In this paper, we work with joint distributions; for some set . We consider conditional distance under the condition on where each .
Let be a distribution over and be a random variable distributed according to . denotes the distribution over where for every ,
[TABLE]
Let for some . denotes the marginal distribution when the first random variables are fixed to .
[TABLE]
Definition 2.1. Let be two distributions over . The conditional marginal distance of and conditioned on is given by
[TABLE]
The average conditional distance between and is defined by
[TABLE]
The SubCube Condition Model
Let be a distribution over . A subcube conditional oracle for , denoted , takes as input a sequence of sets , . Let be the product set . The oracle returns an element with probability independently of all previous calls to the oracle.
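A toy simulation of the subcube conditional oracle may help make the definition concrete. The sketch below assumes the distribution is given explicitly as a dictionary over tuples, and it enumerates the whole subcube, so it is only usable for tiny dimensions. The name `subcube_oracle` is hypothetical, and the uniform fallback for a zero-probability subcube follows the informal description in Section 1.1.

```python
import itertools
import random

def subcube_oracle(pmf, sets, rng=random):
    """Sample from `pmf` conditioned on the product set A_1 x ... x A_n.

    pmf  : dict mapping length-n tuples to probabilities (illustrative
           explicit representation; a tester would not have this).
    sets : list of n non-empty sets, one per coordinate.
    """
    # Enumerate the subcube A_1 x ... x A_n (exponential in n: toy use only).
    cube = list(itertools.product(*[sorted(a) for a in sets]))
    weights = [pmf.get(x, 0.0) for x in cube]
    if sum(weights) == 0:
        # Zero-probability subcube: return a uniformly random point of it.
        return rng.choice(cube)
    return rng.choices(cube, weights=weights, k=1)[0]

pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Fix the first coordinate to 0 and leave the second free:
# returns (0, 0) with probability 0.8 and (0, 1) with probability 0.2.
x = subcube_oracle(pmf, [{0}, {0, 1}], random.Random(0))
```

Each call draws fresh randomness, matching the requirement that answers are independent of all previous calls.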
An tester for a property with conditional sample complexity is a randomized algorithm, that receives , and oracle access to , and operates as follows.
1. In every iteration, the algorithm (possibly adaptively) generates a set , based on the transcript and its internal coin tosses, and calls the conditional oracle with to receive an element , drawn according to the distribution conditioned on .
2. Based on the received elements and its internal coin tosses, the algorithm accepts or rejects the distribution .
3. The algorithm makes at most queries to , where can depend on and .
If satisfies , then the algorithm must accept with probability at least , and if is -far from all distributions satisfying , then the algorithm must reject with probability at least .
We will call such a tester an -tester. For example an Uniformity-tester is an tester that tests if the given distribution is uniform, an Identity-tester is an tester that tests if the given distribution is identical to a known distribution and an Product-tester is an tester that tests if the given distribution is a product distribution or far from all the product distributions.
3 Chain Rule of Conditional Distances
Let and be two distributions over , and let and be the corresponding random variables. For any , we denote by and the distributions of the th marginals of and respectively.
Lemma 3.1 (Chain Rule of Conditional Distances). Let and be two distributions over , and let and be two random variables with distribution and respectively. Then the following holds.
[TABLE]
Proof of Lemma 3.1.
Let .
Let . Recall that denotes the substring of first elements of .
[TABLE]
Now, the second term reduces to,
[TABLE]
The second equality follows from the fact that for each , (see footnote 3). Hence,
(Footnote 3: If is outside of the support of , like in [CFGM16], we can define the conditional probability to be uniform over .)
[TABLE]
Solving the recursion, we get the lemma. ∎
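The chain rule can be sanity-checked numerically on small examples. The sketch below assumes the lemma takes the standard form: the total variation distance between two joint distributions is at most the sum, over coordinates, of the expected distance between the conditional marginals, where the prefix is drawn from the first distribution. That reading, and all function names, are assumptions made for this illustration. The random distributions are strictly positive, so every conditional marginal is well defined.

```python
import itertools
import random

def tv(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def random_joint(n, alphabet, rng):
    """Random strictly positive joint pmf over alphabet^n."""
    points = list(itertools.product(alphabet, repeat=n))
    w = [rng.random() + 1e-3 for _ in points]
    total = sum(w)
    return {x: wi / total for x, wi in zip(points, w)}

def prefix_prob(pmf, prefix):
    i = len(prefix)
    return sum(px for x, px in pmf.items() if x[:i] == prefix)

def cond_marginal(pmf, prefix, alphabet):
    """Distribution of the next coordinate given that the first ones equal `prefix`."""
    i = len(prefix)
    mass = {a: 0.0 for a in alphabet}
    for x, px in pmf.items():
        if x[:i] == prefix:
            mass[x[i]] += px
    total = sum(mass.values())
    return {a: m / total for a, m in mass.items()}

def chain_rule_rhs(p, q, n, alphabet):
    """Sum over coordinates of the expected conditional marginal distance."""
    total = 0.0
    for i in range(n):
        for prefix in itertools.product(alphabet, repeat=i):
            total += prefix_prob(p, prefix) * tv(
                cond_marginal(p, prefix, alphabet),
                cond_marginal(q, prefix, alphabet))
    return total

rng = random.Random(0)
alphabet, n = (0, 1), 3
p = random_joint(n, alphabet, rng)
q = random_joint(n, alphabet, rng)
lhs = tv(p, q)
rhs = chain_rule_rhs(p, q, n, alphabet)
```

In this form the inequality follows from a hybrid argument that swaps one conditional marginal at a time, which is the recursion that the proof solves.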
Arranging the marginals by the increasing order of the average conditional distance, we get the immediate corollary.
Lemma 3.2. If , then there exists a such that
[TABLE]
Proof of Lemma 3.2.
Without loss of generality let be indices such that
[TABLE]
We will need the following claim.
Claim 3.3. There exists such that
[TABLE]
Let be the index from Claim 3.3. We put to get . Clearly
[TABLE]
∎
Proof of Claim 3.3.
If no such exists, then
[TABLE]
which contradicts the distance assumption in Lemma 3.2. ∎
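The contradiction in Claim 3.3 appears to follow a standard harmonic-sum counting argument, which the sketch below illustrates on plain numbers. The specific threshold used here, ε/(j·H_n) for the j-th largest value, is an assumption made for this example, since the displayed bounds are not reproduced above. If non-negative values sum to at least ε, then some rank j must meet this threshold, because otherwise the total would be strictly less than (ε/H_n)·(1 + 1/2 + … + 1/n) = ε.

```python
def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def find_heavy_rank(values, eps):
    """Return the smallest j such that the j-th largest value is at least
    eps / (j * H_n), or None if no such j exists (possible only when the
    values sum to less than eps)."""
    n = len(values)
    h_n = harmonic(n)
    ordered = sorted(values, reverse=True)
    for j, v in enumerate(ordered, start=1):
        if v >= eps / (j * h_n):
            return j
    return None

# Values summing to 0.5: the largest one already exceeds 0.5 / H_4 = 0.24.
rank_a = find_heavy_rank([0.05, 0.3, 0.02, 0.13], 0.5)
# Four equal values summing to 1: rank 1 fails (0.25 < 0.48), rank 2 succeeds (0.25 >= 0.24).
rank_b = find_heavy_rank([0.25, 0.25, 0.25, 0.25], 1.0)
```

When rank j succeeds, at least j of the values meet the threshold, so a tester that samples indices at random has a reasonable chance of hitting one of them.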
4 Testing Identity with a known distribution
In this section, we present an identity tester with sample complexity . We recall the following result proved in [FJO+15].
Lemma 4.1. [FJO+15] Let be a known distribution over . Given and and a distribution over there is an adaptive -SubCond Identity Tester with conditional sample complexity . In other words, there is a tester that draws conditional samples and
- if , then the tester will accept with probability , and
- if then the tester will reject with probability .
Let be a known distribution over , be an unknown distribution over that can be accessed via oracle, and be the target distance. The following algorithm tests the identity of with . We use the identity tester BasicIDTester over guaranteed by Lemma 4.1 as a subroutine.
Theorem 4.2. Given any , Algorithm 1 is an -SubCond Identity Tester for joint distributions with conditional sample complexity of where hides a polynomial function of .
Note 4.3. For any , one can obtain an -SubCond Identity Tester by standard techniques of error reduction. The query complexity would increase by a factor of .
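One standard error-reduction technique is to repeat the tester independently and take a majority vote. The sketch below assumes a base tester that errs with probability at most 1/3 on each run; that constant, the name `amplify`, and the callable interface are assumptions made for this example. By Hoeffding's inequality, the majority of k runs is wrong with probability at most exp(-k/18), so a number of repetitions logarithmic in 1/δ is enough.

```python
import math
import random

def amplify(base_tester, delta, rng=random):
    """Majority vote over independent runs of `base_tester`.

    base_tester : callable taking a random source and returning True
                  (accept) or False (reject); assumed to err with
                  probability at most 1/3 per run.
    delta       : target error probability, 0 < delta < 1.
    """
    # An odd number of runs, at least 18 * ln(1/delta), so that
    # exp(-k / 18) <= delta by Hoeffding's inequality.
    k = 2 * math.ceil(18 * math.log(1.0 / delta)) + 1
    accepts = sum(1 for _ in range(k) if base_tester(rng))
    return accepts > k // 2

# A noisy tester that accepts with probability 0.9 on each run.
noisy = lambda r: r.random() < 0.9
verdict = amplify(noisy, 0.01, random.Random(1))
```

The constant 18 is not optimised; it only makes the Hoeffding calculation easy to follow.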
4.1 Proof of Theorem 4.2
Fix . In Algorithm 1, Step 14 queries BasicIDTester. BasicIDTester needs conditional samples for testing whether . To answer a conditional query with condition for the distribution , we set for , , and for , and query the SubCond oracle with the condition . This correctly simulates the conditional oracle required by the underlying identity tester. Thus Algorithm 1 is a SubCond Tester.
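The simulation step described above can be sketched as a small helper that builds the subcube condition for a query on one coordinate: coordinates before it are fixed to the sampled prefix, the coordinate itself is restricted to the requested subset, and all later coordinates are left unrestricted. The function name `marginal_condition` and its argument layout are assumptions made for this illustration.

```python
def marginal_condition(prefix, subset, alphabet, n):
    """Build the sets A_1, ..., A_n for a conditional query on the
    coordinate that follows `prefix`.

    prefix   : tuple of already fixed values for the first coordinates.
    subset   : conditioning set requested by the underlying tester.
    alphabet : the full per-coordinate domain.
    n        : dimension of the joint distribution.
    """
    i = len(prefix)
    if i >= n:
        raise ValueError("prefix must be shorter than the dimension")
    fixed = [{v} for v in prefix]                      # earlier coordinates: singletons
    free = [set(alphabet) for _ in range(n - i - 1)]   # later coordinates: unrestricted
    return fixed + [set(subset)] + free

# Query the third coordinate with condition {0}, given the prefix (1, 0), in dimension 4.
condition = marginal_condition((1, 0), {0}, {0, 1}, 4)
```

Only the coordinate at position `len(prefix)` of the oracle's answer is then passed back to the underlying tester.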
4.1.1 Sample Complexity of Algorithm 1
By Lemma 4.1, a query to requires samples. Here hides polylogarithmic factors of including the factors due to .
For each index in , the sample complexity is
[TABLE]
Here hides some polylogarithmic function of and . As , the expression can be bounded as
[TABLE]
The last equality holds true as .
The size of is . Adding over all possible , we get the total sample complexity
[TABLE]
4.1.2 Correctness of Algorithm 1
Completeness. We will show that if , the algorithm will reject with probability at most .
Algorithm 1 rejects if there exist and a sampled for which the underlying Identity Tester rejects in Step 14.
Suppose and are identical. Then for all , is identical to . For each query, BasicIDTester will reject in Step 14 with probability at most . By union bound, the probability that the algorithm will reject is at most
[TABLE]
Soundness. Now, we prove the soundness of Algorithm 1. Let be a distribution over and . We shall show that Algorithm 1 rejects with a probability of at least .
Let
[TABLE]
Let be the integer guaranteed by Lemma 3.2, such that . Note, . For each , for each define
[TABLE]
We require the following lemma based on Levin’s economical work investment strategy [Gol17].
Lemma 4.4. Let be a distribution over , and is -far from uniform. Let be a random variable with distribution . Let be a random sample drawn from according to the distribution . Let and .
Then for all , there exists ,
[TABLE]
Proof of Lemma 4.4.
From Lemma 3.2, for all index
[TABLE]
Fix . Let us define
[TABLE]
By construction, . We shall prove that there exists such that . Suppose, towards contradiction, for all , . Then
[TABLE]
In the last inequality we used the fact that which is less than .
∎
By Lemma 4.4, there exists , such that,
[TABLE]
Let be the set of indices sampled in Step 3 in the iteration. If Algorithm 1 fails to reject , one of the following three cases happens.
1. No index from was sampled in . Specifically, . The probability of this event is
[TABLE]
2. For every index , for each , all the sampled ’s are from the set . The probability of this event is
[TABLE]
3. For every index , for each , for all the sampled , the underlying identity tester fails to reject. The probability of such an event is at most , which is less than for .
Hence, the probability that Algorithm 1 fails to reject is at most .
This completes the proof of Theorem 4.2. ∎
4.2 Uniformity Tester for Arbitrary Joint Distribution
If we set to be the uniform distribution, then Algorithm 1 gives us a Uniformity Tester. Hence, we get the following as a corollary of Theorem 4.2.
Theorem 4.5. Given any , there exists an -SubCond Uniformity Tester for any joint distribution with conditional sample complexity of where hides a polynomial function of .
5 Identity Testing between Unknown Joint Distributions
In this section, we present Algorithm 2 to test identity when both and are unknown. The first change we need to make from Algorithm 1 is in Step 12. In this case, we can no longer sample on our own. However, we can query to get . Secondly, instead of Algorithm BasicIDTester, we need to use Algorithm BasicUnknown guaranteed by the following lemma.
Lemma 5.1. [FJO+15] Given and and distributions over there is an -Identity Tester with conditional sample complexity . In other words, there is a tester that draws independent conditional samples and
- if , then the tester will accept with probability , and
- if then the tester will reject with probability .
To prove the correctness of Algorithm 2, we note that, in the chain rule, the expectation is over only one distribution. Hence it is sufficient to (unconditionally) query only to get , and apply Lemma 3.2. The rest of the proof is exactly the same as in Section 4.
Sample Complexity of Algorithm 2
By Lemma 5.1, each invocation of BasicUnknown with parameter , requires samples. As in the case for Algorithm 1, for each index in , the sample complexity is . Hence, the total sample complexity of Algorithm 2 is
[TABLE]
Theorem 5.2. Given , Algorithm 2 is an -SubCond Identity Tester for two unknown joint distributions with conditional sample complexity of where hides a polynomial function of .
6 Testing Independence of Marginals
Let be a probability distribution over . In this section, we present an algorithm to test whether is a product distribution; i.e., whether all the marginals of are independent or is far from all the product distributions.
Define to be the product of marginals of .
[TABLE]
By definition, the marginal distributions are exactly the marginal distributions . If is -far from all the product distributions, it is -far from . Using the chain rule (Lemma 3.1),
[TABLE]
Therefore, we need to test whether there exists , such that the marginal distribution is far (on average) from the conditional marginal distribution . As both and are distributed over , we can again use the BasicUnknown tester from [FJO+15], where identity between two unknown distributions is tested using sample complexity. The only thing left is to sample according to . Such a can be sampled by taking an unconditionally sampled string and selecting the first bit of that string. The rest of the algorithm is exactly the same as in Algorithm 2.
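Sampling from a marginal by projection can be sketched as follows: draw one unconditional sample from the joint distribution and keep a single coordinate. The explicit dictionary representation and the names `draw_unconditional` and `sample_marginal` are assumptions made for this example; in the oracle model, an unconditional draw corresponds to a subcube query in which every coordinate is left unrestricted.

```python
import random

def draw_unconditional(pmf, rng=random):
    """One unconditional sample from a joint pmf given as a dict over tuples."""
    points = list(pmf)
    return rng.choices(points, weights=[pmf[x] for x in points], k=1)[0]

def sample_marginal(pmf, i, rng=random):
    """One sample from the marginal of coordinate i (0-based): draw a full
    sample and project it onto that coordinate."""
    return draw_unconditional(pmf, rng)[i]

# A perfectly correlated pair of bits: each marginal is uniform on {0, 1}.
pmf = {(0, 0): 0.5, (1, 1): 0.5}
rng = random.Random(0)
draws = [sample_marginal(pmf, 1, rng) for _ in range(1000)]
```

The example distribution has uniform marginals but is far from any product distribution, which is exactly the kind of case where comparing the marginal with the conditional marginal is informative.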
Theorem 6.1. For any , there exists an -SubCond Product Tester for joint distributions with conditional sample complexity of , where hides a polynomial function of .
The proof of Theorem 6.1 follows directly from Theorem 5.2, and the observation that in this particular case, the (conditional) samples for can be produced by conditioning only on the index of .
7 Conclusion
In this paper, we analyzed property testing of joint distributions in the conditional sampling model. We considered the natural subcube conditioning and presented testers for uniformity, identity with a known distribution, identity with an unknown distribution, and independence of marginals. All of these testers have query complexity polynomial in the dimension, thus avoiding the curse of dimensionality.
Acknowledgements
The authors would like to thank the anonymous reviewers for their insightful suggestions and comments, which significantly improved the paper. In particular, the authors would like to thank the first reviewer of the ToCT submission for suggesting the use of Levin’s economic work strategy, which resulted in a speedup of all our algorithms by a factor of .
Rishiraj is supported by SERB ECR/2017/001974.
Appendix A A Weaker Lower Bound with Simple Proof
Theorem A.1.
For any , any Uniformity-Tester has subcube-conditional sample complexity . The lower bound holds even for the case when the domain is and the given distribution is a product of independent (though not necessarily identical) distributions.
Proof.
Let be a product distribution over the domain with marginals . So . Note that since the are independent, if then conditioning on does not affect the samples we get from . Also, since the are all distributions over a two-element set (namely ), conditioning on any subset of is also of no use. Thus drawing subcube-conditional samples from is as good as drawing samples (without any conditioning) from .
So it is sufficient for us to prove that for any , any Uniformity-Tester has sample complexity , when the domain is and the given distributions are product distributions.
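The observation that conditioning on other coordinates is useless for a product distribution can be checked exactly on a small instance. The sketch below assumes a binary domain and uses arbitrary illustrative biases; `product_pmf` and `conditional_prob_one` are hypothetical helper names, not part of the paper.

```python
from fractions import Fraction
from itertools import product


def product_pmf(biases):
    """Joint pmf of independent bits, where biases[i] = Pr[x_i = 1]."""
    pmf = {}
    for x in product((0, 1), repeat=len(biases)):
        p = Fraction(1)
        for bit, b in zip(x, biases):
            p *= b if bit == 1 else 1 - b
        pmf[x] = p
    return pmf


def conditional_prob_one(pmf, i, fixed):
    """Exact Pr[x_i = 1 | x_j = fixed[j] for every j in fixed]."""
    consistent = {x: p for x, p in pmf.items()
                  if all(x[j] == v for j, v in fixed.items())}
    total = sum(consistent.values())
    return sum(p for x, p in consistent.items() if x[i] == 1) / total


# Assumed example biases; any values strictly between 0 and 1 would do.
biases = [Fraction(1, 3), Fraction(3, 5), Fraction(1, 2)]
pmf = product_pmf(biases)
unconditional = conditional_prob_one(pmf, 0, {})
conditioned = [conditional_prob_one(pmf, 0, {1: a, 2: b})
               for a in (0, 1) for b in (0, 1)]
```

Every entry of `conditioned` equals `unconditional`, so fixing the remaining coordinates leaves the distribution of the first coordinate unchanged.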
The main idea of the proof is to use a standard technique from property testing, captured by the following theorem, which has been rewritten in the language and context of this paper. A proof of the general statement can be found in [Fis04, FNS04].
Theorem A.2.
Let be a property of distributions over that we want to test. Suppose is a distribution over all the distributions that satisfy the given property , and let be a distribution over all distributions that are -far from satisfying the property . Let be the distribution over outcomes of samples when the samples are drawn from a distribution that is drawn according to . Similarly, let be the distribution over outcomes of samples when the samples are drawn from a distribution , that is drawn according to the . If the variation distance between and is less than , then any -Tester for the property will have sample complexity more than .
In the context of our theorem, the property is “Uniformity”. So the distribution is the uniform distribution over the domain . Now let us define the distribution :
Let be the distribution over where is produced with probability and [math] is produced with probability . And let be the distribution over where is produced with probability and [math] is produced with probability .
Consider the set of distributions over which are a product of distributions, each of which is either or . That is,
[TABLE]
Claim A.3.
Any is -far from uniform. That is, for any we have
[TABLE]
From Claim A.3 we see that all the distributions in are -far from uniform. Thus we can take the distribution as our distribution . If a distribution is drawn from or , samples from the distribution will give many -strings of length . Note that if a distribution is drawn from (that is, the distribution is the uniform distribution over ), then the distribution of the outcomes of samples is a uniform distribution over . So, by Theorem A.2, it is enough to show that if is drawn from then the distribution of the outcomes (as a distribution over ) is -close to uniform.
Note that if is a distribution drawn from , we can think of as where each is independently and uniformly chosen from the set . Let be the distribution over when samples are drawn from . The following lemma now completes the proof of Theorem A.1.
Lemma A.4.
If then
[TABLE]
∎
A.1 Proof of Claim A.3
Let . Without loss of generality, we will assume that all the ’s are the distribution . That is, is produced with probability and [math] is produced with probability . To simplify notation, we will assume is produced with probability and [math] is produced with probability .
Since we know , it is enough for us to prove . For any let be the probability of getting when drawn from . Note that the probability of getting when drawn from is .
By definition we have
[TABLE]
Now note that if has 1’s and 0’s then . So we have
[TABLE]
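The grouping of strings by their number of 1's makes the distance from uniform easy to evaluate exactly for small dimensions. The following sketch assumes each coordinate is 1 with probability 1/2 + delta, where `delta` is a hypothetical stand-in for the paper's bias parameter, and it reports total variation distance (half the ℓ1 distance). The function name `tv_from_uniform` is likewise an illustrative choice.

```python
from fractions import Fraction
from math import comb


def tv_from_uniform(n, delta):
    """Exact total variation distance between the product of n independent
    Bernoulli(1/2 + delta) bits and the uniform distribution on {0,1}^n."""
    p = Fraction(1, 2) + delta
    uniform_mass = Fraction(1, 2 ** n)
    total = Fraction(0)
    for k in range(n + 1):
        # Every string with exactly k ones has the same probability.
        mass = p ** k * (1 - p) ** (n - k)
        total += comb(n, k) * abs(mass - uniform_mass)
    return total / 2


# Assumed illustrative parameter; the distance grows with the dimension n.
delta = Fraction(1, 4)
distances = [tv_from_uniform(n, delta) for n in range(1, 6)]
```

For `delta = 1/4` the first two values are 1/4 and 5/16, showing how the per-coordinate bias accumulates across coordinates.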
Now since for all so,
[TABLE]
The last inequality follows from the fact that . Now putting all the things together, we have
[TABLE]
If then from the above inequality, and the fact that , we have .
A.2 Proof of Lemma A.4
Let us start with a claim. We defer the proof of the claim to the end of this section.
Claim A.5.
If and are two distributions over and for all we have
[TABLE]
then we have
[TABLE]
Claim A.5 helps to upper bound the Hellinger distance in terms of the distance. Now let . And let be the distribution on that is obtained by drawing samples from . Clearly, . To prove that the variation distance of from uniform is less than , we will first show that the distance of from uniform is small; then, using Claim A.5, we get that the Hellinger distance of from uniform is small. Finally, we show that if all the have a small Hellinger distance from uniform, then has a small Hellinger distance from uniform, which gives an upper bound on the variation distance of from uniform.
Now the following claim upper bounds the distance of from uniform.
Claim A.6.
For all and for all
[TABLE]
Or, in other words, for all if
[TABLE]
then
By definition of Hellinger distance and variation distance, we have
[TABLE]
Again we know that for any two product distributions and
[TABLE]
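Two standard facts about the Hellinger distance of the kind used in this argument can be checked numerically. This is a sketch rather than a restatement of the displayed equations: it assumes the normalisation H² = 1 − Σ√(p(x)q(x)), which may differ from the paper's by a constant factor, and the helper names are hypothetical. Under this normalisation, 1 − H² is multiplicative over product distributions, and the variation distance is at most √2·H.

```python
from itertools import product
from math import isclose, prod, sqrt


def hellinger_sq(p, q):
    """Squared Hellinger distance, normalised as 1 - sum_x sqrt(p(x) q(x))."""
    return 1.0 - sum(sqrt(p[x] * q[x]) for x in p)


def variation(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


def tensor(dists):
    """Product distribution over tuples built from per-coordinate pmfs."""
    keys = product(*[d.keys() for d in dists])
    return {x: prod(d[xi] for d, xi in zip(dists, x)) for x in keys}


# Assumed example marginals, compared against uniform bits.
ps = [{0: 0.6, 1: 0.4}, {0: 0.3, 1: 0.7}, {0: 0.55, 1: 0.45}]
qs = [{0: 0.5, 1: 0.5}] * 3
P, Q = tensor(ps), tensor(qs)

lhs = 1.0 - hellinger_sq(P, Q)
rhs = prod(1.0 - hellinger_sq(p, q) for p, q in zip(ps, qs))
tv, h = variation(P, Q), sqrt(hellinger_sq(P, Q))
```

Here `lhs` and `rhs` agree up to floating-point error, and `tv` does not exceed `sqrt(2) * h`.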
Thus we have
[TABLE]
From Equation 3 and Claim A.5 we have
[TABLE]
where, . So . From Claim A.6 we have that . So we have
[TABLE]
Thus if we have which is less than
A.2.1 Proof of Claim A.5
Let and . By definition
[TABLE]
Now . Now it is easy to verify that for all such that , we have
[TABLE]
So, from the above observation,
[TABLE]
Now since and so we have
[TABLE]
A.2.2 Proof of Claim A.6
Suppose has ’s and [math]’s. Since is either the distribution with probability or the distribution with probability , the probability of appearing, when drawn from , is
[TABLE]
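The probability of a fixed outcome under the mixture, and its pointwise ratio to the uniform probability, can be computed exactly. The sketch below assumes the two candidate coins have bias 1/2 + delta and 1/2 − delta and are chosen with probability 1/2 each (matching the uniform choice described above); `delta` and the function names are hypothetical stand-ins for the paper's notation.

```python
from fractions import Fraction
from math import comb


def mixture_prob(k, q, delta):
    """Probability of one fixed string of q samples containing k ones when
    the coin is Bernoulli(1/2 + delta) or Bernoulli(1/2 - delta), each
    chosen with probability 1/2."""
    p = Fraction(1, 2) + delta
    return (p ** k * (1 - p) ** (q - k) + (1 - p) ** k * p ** (q - k)) / 2


def max_ratio_deviation(q, delta):
    """Largest value of |Pr[x] / 2^{-q} - 1| over all outcomes x."""
    uniform = Fraction(1, 2 ** q)
    return max(abs(mixture_prob(k, q, delta) / uniform - 1)
               for k in range(q + 1))


# Assumed illustrative parameters.
delta = Fraction(1, 4)
total_mass = sum(comb(2, k) * mixture_prob(k, 2, delta) for k in range(3))
deviation = max_ratio_deviation(2, delta)
```

With a single sample the mixture is exactly uniform, and with two samples and `delta = 1/4` the pointwise deviation is 1/4, that is 4·delta², which illustrates why the deviation is second order in the bias.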
Using the inequality (holds for and ), we have
[TABLE]
The right-hand side of the above inequality is equal to . Thus we have
[TABLE]
For the upper bound, we shall use the following inequality. Let be such that . It holds that
[TABLE]
The above inequality can be easily proved using the following facts.
1. When and :
(a) it holds that .
(b) as , it holds that .
2. When , it holds that (can be proved using induction on ).
Since and , . Hence, for all ,
[TABLE]
[TABLE]
and thus
[TABLE]
Since and so we have
[TABLE]
And thus, we have
[TABLE]
References
- [ACK15] Jayadev Acharya, Clément L. Canonne, and Gautam Kamath. A chasm between identity and equivalence testing with conditional queries. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015, August 24-26, 2015, Princeton, NJ, USA, pages 449–466, 2015.
- [ADK15] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3591–3599, 2015.
- [BDKR05] Tuǧkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of approximating the entropy. SIAM J. Comput., 35(1):132–150, 2005.
- [BFF+01a] Tuǧkan Batu, Lance Fortnow, Eldar Fischer, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Bob Werner, editor, Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (FOCS-01), pages 442–451, Los Alamitos, CA, October 14–17 2001.
- [BFF+01b] Tuǧkan Batu, Lance Fortnow, Eldar Fischer, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In 42nd Annual Symposium on Foundations of Computer Science, FOCS 2001, pages 442–451, 2001.
- [BFR+13] Tuǧkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing closeness of discrete distributions. Journal of the ACM, 60(1):4:1–4:25, February 2013.
- [Can15a] Clément L. Canonne. Big data on the rise? - testing monotonicity of distributions. In Automata, Languages, and Programming - 42nd International Colloquium, ICALP 2015, Kyoto, Japan, July 6-10, 2015, Proceedings, Part I, pages 294–305, 2015.
- [Can15b] Clément L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.
