Minimax Estimation of the $L_1$ Distance
Jiantao Jiao, Yanjun Han, Tsachy Weissman

TL;DR
This paper develops minimax optimal estimators for the $L_1$ distance between two discrete probability measures, achieving near-optimal performance with fewer samples, and reveals the effective sample size enlargement phenomenon.
Contribution
It introduces new techniques for constructing minimax rate-optimal estimators for $L_1$ distance, extending previous approximation-based methods and analyzing both known and unknown $Q$ scenarios.
Findings
Minimax estimators achieve performance comparable to MLE with fewer samples.
The uniform distribution case is the hardest for estimation.
Effective sample size enlargement phenomenon is confirmed in both known and unknown $Q$ cases.
Jiantao Jiao, Yanjun Han, and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University, CA, USA. Email: {jiantao, yjhan, tsachy}@stanford.edu. This work was supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. The material in this paper was presented in part at the 2016 IEEE International Symposium on Information Theory, Barcelona, Spain.
Abstract
We consider the problem of estimating the $L_1$ distance between two discrete probability measures $P$ and $Q$ from empirical data in a nonasymptotic and large alphabet setting. When $Q$ is known and one obtains $n$ samples from $P$, we show that for every $Q$, the minimax rate-optimal estimator with $n$ samples achieves performance comparable to that of the maximum likelihood estimator (MLE) with $n\ln n$ samples. When both $P$ and $Q$ are unknown, we construct minimax rate-optimal estimators whose worst case performance is essentially that of the known $Q$ case with $Q$ being uniform, implying that $Q$ being uniform is essentially the most difficult case. The effective sample size enlargement phenomenon, identified in Jiao et al. (2015), holds both in the known $Q$ case for every $Q$ and the unknown $Q$ case. However, the construction of optimal estimators for $\|P-Q\|_1$ requires new techniques and insights beyond the approximation-based method of functional estimation in Jiao et al. (2015).
Index Terms:
Divergence estimation, total variation distance, multivariate approximation theory, functional estimation, optimal classification error, high-dimensional statistics
I Introduction
I-A Problem formulation
Statistical functionals are usually used to quantify the fundamental limits of data processing tasks such as data compression (e.g., Shannon entropy [1]), data transmission (e.g., mutual information [1]), estimation and testing (e.g., Kullback–Leibler divergence [2, Thm. 11.8.3], $L_1$ distance [3, Chap. 13]), etc. They measure the difficulties of the corresponding data processing tasks and provide benchmarks for constructive algorithms. In this sense, it is of great value to obtain accurate estimates of these functionals in various problems.
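In this paper, we consider estimating the $L_1$ distance between two discrete distributions $P = (p_1, \ldots, p_S)$ and $Q = (q_1, \ldots, q_S)$, which is defined as:
$$\|P - Q\|_1 \triangleq \sum_{i=1}^{S} |p_i - q_i|.$$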
Throughout we use the squared error loss, i.e., the risk function for an estimator is defined as
[TABLE]
where . The maximum risk of an estimator , and the minimax risk in estimating are defined as
[TABLE]
respectively, where are given collections (uncertainty sets) of probability measures and , respectively, and the infimum is taken over all estimators that are functions of the empirical observations.
The $L_1$ distance is closely related to the Bayes error, i.e., the fundamental limit, in classification problems. Specifically, for a two-class classification problem, if the prior probabilities for each class are equal, then the minimum probability of error achieved using the optimal classifier is given by
[TABLE]
where indicates the class, and are the class-conditional distributions. Hence, the problem of estimating the Bayes error in this classification problem is reduced to estimating the $L_1$ distance between the two class-conditional distributions from the empirical data. In the statistical learning theory literature, most work on Bayes classification error estimation deals with the case where the class-conditional distributions are continuous, and it turns out that it is very difficult to estimate this quantity in the general continuous case. Indeed, we know from [4, Section 8.5] the negative result that for every sample size $n$, any estimate $\hat{L}_n$ of the Bayes error $L_B$, and any $\epsilon > 0$, there exist some class-conditional distributions such that $\mathbb{E}|\hat{L}_n - L_B| \geq \frac{1}{4} - \epsilon$.
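As a quick numerical check of the relation between the Bayes error and the $L_1$ distance, the sketch below compares the Bayes error of a two-class problem with equal priors against $\frac{1}{2} - \frac{1}{4}\|P-Q\|_1$, a standard identity when the $L_1$ distance is the sum of absolute differences. The function names and the example distributions are illustrative assumptions, not taken from the paper.

```python
def l1_distance(p, q):
    """L1 distance between two discrete distributions on the same alphabet."""
    return sum(abs(a - b) for a, b in zip(p, q))


def bayes_error_equal_priors(p, q):
    """Bayes error with prior 1/2 on each class.

    The optimal rule picks the class with the larger likelihood, so the
    error on symbol i is (1/2) * min(p_i, q_i).
    """
    return 0.5 * sum(min(a, b) for a, b in zip(p, q))


# Hypothetical class-conditional distributions on three symbols.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

lhs = bayes_error_equal_priors(p, q)
rhs = 0.5 - 0.25 * l1_distance(p, q)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

Estimating the Bayes error in this discrete setting therefore amounts to estimating the $L_1$ distance and applying an affine map.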
This negative result shows that one needs to look at special classes of the class-conditional distributions in order to obtain meaningful and consistent estimates. In the discrete setting, the seminal work of Valiant and Valiant [5] deserves special mention. They constructed an estimator for and showed that when , it achieves error , and it takes at least samples to achieve consistent estimation of . Valiant and Valiant [6] constructed another estimator of using linear programming which achieves the error when . We argue in this paper that the simplest estimator for , namely plugging in the empirical distribution and obtaining achieves error rate for . In this sense, the optimal estimator seems to enlarge the sample size to in the error rate expression. This phenomenon was termed the effective sample size enlargement in [7].
I-B Approximation-based method
We emphasize that the observed effective sample size enlargement here is another manifestation of the recently discovered phenomenon in functional estimation of high dimensional objects. There has been a recent wave of study on functional estimation of high dimensional parameters [6, 7, 8, 9], and it was shown in Jiao et al. [7] that for a wide class of functional estimation problems (including Shannon entropy, power sums, and mutual information), there exists a general approximation-based method that can be applied to design minimax rate-optimal estimators whose performance with $n$ samples is essentially that of the MLE (maximum likelihood estimator, or the plug-in estimator) with $n\ln n$ samples.
The general approximation-based method in [7] is as follows. Consider estimating a functional $G(\theta)$ of a parameter $\theta \in \Theta$ for an experiment $\{P_\theta : \theta \in \Theta\}$, with a consistent estimator $\hat{\theta}_n$ for $\theta$, where $n$ is the number of observations. Suppose the functional $G(\theta)$ is analytic everywhere except at $\theta \in \Theta_0$. (A function $f$ is analytic at a point $x_0$ if and only if its Taylor series about $x_0$ converges to $f$ in some neighborhood of $x_0$.) A natural estimator for $G(\theta)$ is $G(\hat{\theta}_n)$. In the estimation of functionals of discrete distributions, $\Theta$ is the $S$-dimensional probability simplex, and a natural candidate for $\hat{\theta}_n$ is the empirical distribution, which is unbiased for any $\theta \in \Theta$.
We propose to conduct the following two-step procedure in estimating $G(\theta)$.
1. Classify the Regime: Compute $\hat{\theta}_n$, and declare that we are in the “non-smooth” regime if $\hat{\theta}_n$ is “close” enough to $\Theta_0$. Otherwise declare we are in the “smooth” regime;
2. Estimate:
   - If $\hat{\theta}_n$ falls in the “smooth” regime, use an estimator “similar” to $G(\hat{\theta}_n)$ to estimate $G(\theta)$;
   - If $\hat{\theta}_n$ falls in the “non-smooth” regime, replace the functional $G(\theta)$ in the “non-smooth” regime by an approximation $G_{\text{appr}}(\theta)$ (another functional) which can be estimated without bias, then apply an unbiased estimator for the functional $G_{\text{appr}}(\theta)$.
Approaches of this nature appeared before [7] in Lepski, Nemirovski, and Spokoiny [10], Cai and Low [11], Vinck et al. [12], Valiant and Valiant [5]. It was developed independently for entropy estimation by Wu and Yang [8], and the ideas proved to be very fruitful in Acharya et al. [9], Wu and Yang [13], Orlitsky, Suresh, and Wu [14], Wu and Yang [15]. However, we emphasize that in all the examples above except for the distance estimator in Valiant and Valiant [5], the functionals considered all take the form or , where is a univariate density or function, and each . In particular, the functions considered are everywhere analytic except at zero, e.g., for and . Most of these features are violated in the distance estimation problem. If we write with , then we have:
1. a bivariate function in the sum;
2. a function which is analytic except on a segment .
As discussed in Jiao et al. [7], approximation of multivariate functions is much more involved than that of univariate functions, and the fact that the “non-smooth” regime is around a line segment here makes the application of the approximation-based method quite difficult: what shape should we use to specify the “non-smooth” regime? We provide a comprehensive answer to this problem in this paper, thereby substantially generalizing the applicability of the approximation-based method and demonstrating the intricacy of functional estimation problems in high dimensions. Our recent work [16] presents the most up-to-date version of the general approximation-based method, which is applied to construct minimax rate-optimal estimators for the KL divergence (also see Bu et al. [17]), $\chi^2$-divergence, and the squared Hellinger distance. The effective sample size enlargement phenomenon holds in all these cases as well.
We emphasize that the complications triggered by the bivariate function make the distance estimation problem highly challenging. Indeed, prior to our work, the only known estimators that require sublinear samples were in [5, 6], which achieved error in the regime of but not the regime , and the lower bound was proved for the regime , i.e., when the optimal error is a constant. The complete characterization of the minimax rates and the estimator that achieves the minimax rates were unknown prior to this work.
Our main contributions in this paper are the following:
1. We apply the approximation-based method to construct minimax rate-optimal estimators with computational complexity for $\|P-Q\|_1$ when $Q$ is known, and show that for any fixed $Q$, our estimator performs with $n$ samples at least as well as the plug-in estimator with $n\ln n$ samples. Precisely, the performance of the plug-in estimator for any fixed $Q$ is dictated by the functional , while that of the minimax rate-optimal estimator is dictated by the functional . Furthermore, we show that any plug-in estimator would not achieve the same performance as our algorithm does. As we argue in Lemma 8, for estimating $\|P-Q\|_1$ with known $Q$, for any distribution estimate $\hat{P}$ constructed from the samples from $P$, the estimator $\|\hat{P}-Q\|_1$ does not achieve the minimax rates in the worst case if $\hat{P}$ does not depend on $Q$. Concretely, the performance of any plug-in rule behaves essentially as the MLE in the worst case.
2. We generalize the approximation-based method in [7] to construct a minimax rate-optimal estimator for $\|P-Q\|_1$ when both $P$ and $Q$ are unknown with computational complexity . We illustrate the novelty of our scheme via the following results:
   (a) The performance of our estimator with $n$ samples is essentially that of the MLE with $n\ln n$ samples.
   (b) Any algorithm that only conducts approximation around the origin does not achieve the minimax rates. Indeed, as we argue in Lemma 5, any algorithm that employs the MLE when cannot achieve the minimax rates when . The reason why the estimator of Valiant and Valiant [5] cannot achieve the minimax rates when is that [5] did not conduct approximation when and are large. One of our key contributions is to figure out how to conduct approximation when and achieve the minimax rates when .
   (c) Best polynomial approximation is not sufficient for achieving minimax rate-optimality in this problem. As we argue in Lemma 6, any one-dimensional polynomial that achieves the best approximation error rate cannot be used in constructing the optimal estimator, and it is necessary to use a multivariate polynomial with certain pointwise error guarantees. One of our key contributions is to construct a proper multivariate polynomial with desired pointwise approximation error.
   (d) Approximation over the union of the “nonsmooth” regime may not work. As we show in Lemma 7, there does not exist a single multivariate polynomial that achieves the desired approximation error over the whole “nonsmooth” regime. Instead, in our approach, we construct polynomial approximations of the function over a random regime that is determined by empirical data. To our knowledge, it is the first time that a random approximation regime approach appears in the functional estimation literature.
   (e) Our estimator is agnostic to the potentially unknown support size $S$, but behaves as well as the minimax rate-optimal estimator that knows the support size $S$.
The rest of the paper is organized as follows. In Sections II and III, we present a thorough performance analysis of the MLE and explicitly construct the minimax rate-optimal estimators, where Section II covers the known $Q$ case and Section III generalizes to the case of unknown $Q$. Discussions in Section IV highlight the significance and novelty of our approaches by reviewing several other approaches which are shown to be suboptimal. Section V presents the experimental results comparing our schemes with existing approaches. The auxiliary lemmas used throughout this paper are collected in Appendix A. Appendix B contains proofs of the main theorems. Proofs of all the lemmas in the main text, and of those used in the proofs of the main theorems, can be found in Appendix C, and proofs of all the auxiliary lemmas are collected in Appendix D.
Notation: for non-negative sequences , we use the notation to denote that there exists a constant that only depends on such that , and is equivalent to . When the constant is universal we do not write subscripts for and . Notation is equivalent to and . Notation means that , and is equivalent to . We write and . Moreover, denotes the set of all -variate polynomials of degree of each variable no more than , and denotes the distance of the function to the space in the uniform norm on . The space is also abbreviated as . All logarithms are in the natural base. The notation , where is a real number and is a set of real numbers, is equivalent to for all .
Throughout this paper, we utilize the Poisson sampling model instead of the binomial model, whose minimax risks can be shown to be closely related, as in [7, Lemma 16].
II Divergence Estimation with Known $Q$
First we consider the case where is known while is an unknown distribution with support . In other words, and . We analyze the performance of the MLE in this case, and construct the approximation-based minimax rate-optimal estimator.
We utilize the Poisson sampling model, in which we observe a Poisson random vector
$$\mathbf{X} = [X_1, X_2, \ldots, X_S],$$
where the coordinates of $\mathbf{X}$ are mutually independent, and $X_i \sim \mathsf{Poi}(n p_i)$. We define $\hat{p}_i = X_i / n$ as the empirical probabilities.
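The Poisson sampling model can be simulated with the standard library alone. The sketch below assumes the usual form of the model, in which symbol $i$ is observed a Poisson number of times with mean $n p_i$ and the empirical probability is the count divided by $n$; the helper names and the example distribution are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Only suitable for moderate lam, since exp(-lam) must not underflow.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_sample_counts(p, n, rng):
    """Independent counts X_i ~ Poi(n * p_i), one per symbol."""
    return [sample_poisson(n * p_i, rng) for p_i in p]


def empirical_probabilities(counts, n):
    """Empirical probabilities X_i / n under the Poisson model."""
    return [x / n for x in counts]


rng = random.Random(0)      # fixed seed for reproducibility
p = [0.5, 0.3, 0.2]         # hypothetical true distribution
n = 200
counts = poisson_sample_counts(p, n, rng)
p_hat = empirical_probabilities(counts, n)
print(counts, p_hat)
```

Because the total number of observations is itself random under this model, the empirical probabilities need not sum to exactly one.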
II-A Performance of the MLE
The MLE serves as a natural estimator for the $L_1$ distance which can be expressed as , where is the empirical distribution. Since we are using the Poisson sampling model, we have .
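A minimal sketch of the plug-in (MLE) rule for the known $Q$ case: the empirical probabilities are substituted for the unknown distribution in the $L_1$ distance. The function name and the toy counts are illustrative assumptions.

```python
def plugin_l1_known_q(counts, n, q):
    """Plug-in estimate: L1 distance between the empirical distribution
    (counts divided by n) and the known distribution q."""
    return sum(abs(x / n - q_i) for x, q_i in zip(counts, q))


# Hypothetical observed counts with n = 10 and a known Q on three symbols.
counts_example = [5, 3, 2]
q_known = [0.2, 0.3, 0.5]
estimate = plugin_l1_known_q(counts_example, 10, q_known)
print(estimate)  # |0.5 - 0.2| + |0.3 - 0.3| + |0.2 - 0.5|, i.e. about 0.6
```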
We obtain the upper and lower bounds for the mean squared error of in the following theorem.
Theorem 1.
The maximum likelihood estimator satisfies
[TABLE]
We can also lower bound the worst case mean squared error as
[TABLE]
The following corollary is straightforward since when .
Corollary 1.
If , we have
[TABLE]
Hence, it is necessary and sufficient for the MLE to have samples to be consistent in terms of the worst case mean squared error.
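To get a feel for why the plug-in rule needs many samples on a large alphabet, the sketch below computes its exact expectation under the Poisson model when both the true and the known distribution are uniform on $S$ symbols. The true distance is then zero, so the expectation is pure bias. This is an illustration only, not the bound of Theorem 1; the function name and the truncation rule for the Poisson sum are assumptions.

```python
import math


def expected_plugin_uniform(S, n):
    """Exact E[sum_i |X_i/n - 1/S|] with X_i ~ Poi(n/S) i.i.d. (P = Q uniform).

    The Poisson sum is truncated far in the upper tail, where the remaining
    mass is negligible.
    """
    lam = n / S
    k_max = int(lam + 20.0 * math.sqrt(lam) + 50)
    pmf = math.exp(-lam)            # P(X = 0)
    per_symbol = 0.0
    for k in range(k_max + 1):
        per_symbol += pmf * abs(k / n - 1.0 / S)
        pmf *= lam / (k + 1)        # recursion P(X = k+1) = P(X = k) * lam / (k+1)
    return S * per_symbol


bias_n_equals_S = expected_plugin_uniform(100, 100)       # equals 2/e
bias_n_100_times_S = expected_plugin_uniform(100, 10000)  # much smaller
print(bias_n_equals_S, bias_n_100_times_S)
```

With one expected observation per symbol the plug-in estimate is about 0.74 on average even though the true distance is zero; the bias only becomes small once the sample size is a large multiple of the alphabet size.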
II-B Construction of the optimal estimator
We apply our general recipe to construct the minimax rate-optimal estimator. For simplicity of analysis, we conduct the classical “splitting” operation [18] on the Poisson random vector , and obtain two independent identically distributed random vectors , such that each component in has distribution , and all coordinates in are independent. For each coordinate , the splitting process generates a random sequence , , such that , and assigns for . All the random variables are conditionally independent given our observation . The “split” empirical probabilities are defined as . To simplify notation, we redefine as to ensure that . We emphasize that the sample splitting approach is not conducted in the implementation of the estimator.
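The splitting operation can be sketched as follows, assuming each observation of a symbol is assigned to one of two halves by an independent fair coin, which is one standard way to realize it. By the Poisson thinning property, a Poisson count with mean $\lambda$ split this way yields two independent Poisson counts with mean $\lambda/2$. The function name is hypothetical.

```python
import random


def split_counts(counts, rng):
    """Split each count into two halves using independent fair coin flips.

    Returns two lists whose entries add up, coordinate by coordinate,
    to the original counts.
    """
    first, second = [], []
    for x in counts:
        heads = sum(1 for _ in range(x) if rng.random() < 0.5)
        first.append(heads)
        second.append(x - heads)
    return first, second


rng_split = random.Random(0)
original = [7, 0, 12]  # hypothetical observed counts
half_1, half_2 = split_counts(original, rng_split)
print(half_1, half_2)
```

One half can then be used to classify the regime and the other to estimate, which keeps the two steps independent in the analysis.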
We construct two set functions with variable as input defined as:
[TABLE]
Here are constants that will be determined later. The set is constructed to satisfy the following property:
Lemma 1.
Suppose . Then,
[TABLE]
where the set function is defined in (10).
It is clear that for any . The constants will be chosen later to make sure that the following three “good” events have overwhelming probability:
[TABLE]
Here represents the logical implication operation that is equivalent to . The intuitions behind the constructions of these “good” events are as follows. Since we use the first half of the samples to classify regime, and would later use three different estimators depending on whether lies to the left, to the right, or inside , it is desirable that we can infer the relationship between and based on the location of . The reason why these events can be controlled to have high probabilities is that we have specifically designed to make it a strict subset of the set , and the sets are designed to satisfy Lemma 1, which ensures that the size of is essentially the length of the confidence interval when the empirical probability is observed.
We have the following lemma controlling the probability of these events.
Lemma 2.
Denote the overall “good” event , where are defined in (13),(14),(15). Then,
[TABLE]
where
[TABLE]
Now we construct the estimator. In the “smooth” regime, i.e., , we simply employ the plug-in estimator to estimate . In the “non-smooth” regime, i.e., , we need to approximate by another functional which can be estimated without bias. We consider the best polynomial approximation of on , which is defined as
[TABLE]
where denotes the set of polynomials with degree no more than . Once we obtain , we can use an unbiased estimate such that for . As a result, the absolute value of the bias of the estimator in the “non-smooth” regime is exactly the approximation error of in approximating on , which can be significantly smaller than that of the MLE.
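To illustrate how a polynomial can track the kink of $|x - q|$ on an interval, the sketch below uses Chebyshev interpolation as a simple stand-in. It is not the best polynomial approximation used in the construction, and the interval, the value of $q$, and all function names are assumptions chosen for the example.

```python
import math


def chebyshev_coefficients(f, a, b, degree):
    """Coefficients of the Chebyshev interpolant of f on [a, b].

    Uses degree + 1 Chebyshev nodes of the first kind. This is a stand-in
    for the best uniform approximation, not the best approximation itself.
    """
    m = degree + 1
    nodes = [math.cos(math.pi * (k + 0.5) / m) for k in range(m)]
    values = [f(0.5 * (b - a) * t + 0.5 * (b + a)) for t in nodes]
    return [
        (2.0 / m) * sum(v * math.cos(math.pi * j * (k + 0.5) / m)
                        for k, v in enumerate(values))
        for j in range(m)
    ]


def evaluate_chebyshev(coeffs, a, b, x):
    """Evaluate c_0/2 + sum_{j>=1} c_j T_j(t), with t the image of x in [-1, 1]."""
    t = max(-1.0, min(1.0, (2.0 * x - (a + b)) / (b - a)))
    theta = math.acos(t)
    return 0.5 * coeffs[0] + sum(
        c * math.cos(j * theta) for j, c in enumerate(coeffs) if j >= 1
    )


def max_error(q, a, b, degree, grid=2001):
    """Largest error of the interpolant of |x - q| on a uniform grid of [a, b]."""
    f = lambda x: abs(x - q)
    coeffs = chebyshev_coefficients(f, a, b, degree)
    xs = [a + (b - a) * i / (grid - 1) for i in range(grid)]
    return max(abs(f(x) - evaluate_chebyshev(coeffs, a, b, x)) for x in xs)


err_low = max_error(0.5, 0.0, 1.0, 10)
err_high = max_error(0.5, 0.0, 1.0, 80)
print(err_low, err_high)  # the higher degree gives the smaller error
```

The error shrinks with the interval length as well as with the degree, which is why the approximation is carried out on a small neighborhood of the non-analytic point rather than on the whole unit interval.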
Estimator Construction 1.
We use the first half of the samples to classify regimes and the second half of the samples for estimation. Denote
[TABLE]
and define
[TABLE]
where and are given by (10), (11), , and are properly chosen constants.
The performance of this estimator is presented in the following theorem.
Theorem 2.
Suppose there exist two constants such that . Then, there exist constants depending only on in Construction 1 such that
[TABLE]
In particular, if , we have
[TABLE]
Remark 1.
When we consider the worst case of , Theorem 2 assumes that the sample size cannot be too big (). It is obvious that an upper bound on the sample size is needed for the statement to be valid: indeed, if no upper bound on the sample size is imposed then in the asymptotic regime ( fixed, ) the convergence rate is faster than the parametric rate , which is impossible. However, we are not sure that the current upper bound is tight. The reason why we introduced this upper bound is that it is needed to control the variance of our estimator, but the variance bound we have may not be tight.
Compared to existing literature, the schemes by Valiant and Valiant [5, 6] achieved mean squared error only in the regime of but not the regime . The main reason is that [5, 6] did not conduct approximation when . As our work shows, the key reason behind whether one should conduct approximation or not is not whether the probability is close to zero or not, but whether the functional has a non-analytic point or not. As we show in Lemma 5 in Section IV, any approach that only conducts approximation when is small cannot achieve the minimax rates for in general.
II-C Minimax lower bound
It was shown in Valiant and Valiant [5] that if is the uniform distribution, when , the minimax risk of estimating is a constant. We prove a minimax lower bound for every , and show that the performance achieved by our estimator in Theorem 2 is minimax rate-optimal for every fixed .
Theorem 3.
Suppose there exists a constant such that . Then, there exists a constant that only depends on such that if , then
[TABLE]
where the infimum is taken over all possible estimators.
In particular, if there exist constants such that , then
[TABLE]
Combining Theorem 2 and Theorem 3, we have the following theorem.
Theorem 4.
Suppose there exist constants such that . Then,
[TABLE]
In particular, if , then
[TABLE]
The estimator in Construction 1 achieves the minimax rates for every fixed .
III Divergence Estimation with Unknown $Q$
Now we consider the general case where both and are unknown to us, i.e., .
We utilize the Poisson sampling model, in which we observe two Poisson random vectors
[TABLE]
where all the coordinates of and are mutually independent, and . We introduce the empirical probabilities .
III-A Performance of the MLE
In this case, the MLE is expressed as . Since by the triangle inequality, and by the conditional Jensen’s inequality, Theorem 1 can again be applied here to give the performance of the MLE.
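One way to spell out the triangle-inequality step is the following sketch, where $\hat{p}_i$ and $\hat{q}_i$ denote the empirical probabilities of symbol $i$ (assumed notation). It shows that the error of the plug-in rule with both distributions unknown is controlled by two errors of the kind analyzed in the known $Q$ case.

```latex
\begin{align*}
\bigl|\,|\hat{p}_i - \hat{q}_i| - |p_i - q_i|\,\bigr|
  &\le \bigl|(\hat{p}_i - p_i) - (\hat{q}_i - q_i)\bigr| \\
  &\le |\hat{p}_i - p_i| + |\hat{q}_i - q_i| ,
\end{align*}
% Summing over all symbols i gives
\begin{equation*}
\bigl|\,\|\hat{P} - \hat{Q}\|_1 - \|P - Q\|_1\,\bigr|
  \le \|\hat{P} - P\|_1 + \|\hat{Q} - Q\|_1 .
\end{equation*}
```

The first line is the reverse triangle inequality applied to $a = \hat{p}_i - \hat{q}_i$ and $b = p_i - q_i$; the second is the ordinary triangle inequality.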
Theorem 5.
If , the MLE satisfies
[TABLE]
Hence, the MLE achieves the mean squared error , and requires samples to be consistent.
III-B Construction of the optimal estimator
Again we apply our general recipe to construct the optimal estimator, but encounter several new difficulties: is non-analytic on a segment, and both the uncertainty set and the polynomial approximation need to be generalized to the 2D case. We will overcome these obstacles step by step.
For simplicity of analysis, we conduct the classical “splitting” operation [18] on the Poisson random vector , and obtain two independent identically distributed random vectors , such that each component in has distribution , and all coordinates in are independent. For each coordinate , the splitting process generates a random sequence such that , and assigns for . All the random variables are conditionally independent given our observation . The splitting operation is similarly conducted for the Poisson random vector independently. The “split” empirical probabilities are defined as . To simplify notation, we redefine as to ensure that . We emphasize that the sample splitting approach is not needed for the actual estimator construction.
As usual, first we classify “smooth” and “non-smooth” regimes. Since the function is non-analytic on the segment , we are looking for the “uncertainty set” containing this segment such that any can be “localized” in the previous sense. We have the following lemma.
Lemma 3.
The two-dimensional set defined as
[TABLE]
satisfies
[TABLE]
where is given by (10).
We design another set as follows:
[TABLE]
where . Clearly . We choose the constants and later to ensure that the following four events happen with high probability:
[TABLE]
We have the following lemma controlling the probability of these events happening simultaneously.
Lemma 4.
Denote the overall “good” event , where are defined in (33),(34),(35), (36). Then, assuming ,
[TABLE]
where the constant is given by
[TABLE]
It is evident that we can make in (38) arbitrarily large by taking large and keeping a small constant. Clearly, if the true parameters , the MLE would be a decent estimator. It suffices to construct estimators when the true parameters . The known case seems to suggest that we consider the best polynomial approximation of on . However, this will not work for two reasons:
1. the entire 2D stripe is too large for the polynomial approximation error to vanish at the correct rate;
2. best polynomial approximation in the 2D case is not unique, and may not achieve the desired pointwise error.
We will explore these reasons in detail in Section IV. To solve the first problem, we remark that although is the set such that its element can be localized within , a specific element can be localized in a much smaller subset , where is given by (10). Hence, the approximation regime should be dependent on the empirical observations to fully utilize the available information.
For the second problem, we need to design a specific polynomial with satisfactory pointwise approximation properties. Our approximation recipe is the following. Take .
1. Over the square : we consider the decomposition and introduce the following two bivariate polynomials and to uniformly approximate and , respectively. Specifically, we have
[TABLE]
Then, denoting , we use the polynomial
[TABLE]
to approximate over the square . The polynomial satisfies . We remove the constant term in the definition of to guarantee that the estimator we construct is agnostic to the unknown support size . In practice, and can be replaced by the efficiently implementable lowpass filtered Chebyshev expansion [19], which achieves the same uniform error rate as the best polynomial approximation.
Remark 2.
We would like to discuss the intuitions behind our construction of the polynomials . One observation is that best approximation, which aims at approximating the bivariate function over the square under the supremum norm, may not work. Indeed, consider the segment over , and the function over this segment can be viewed as a univariate function, whose best approximation error using degree is lower bounded by within a constant factor [20, Chap. 9, Thm. 3.3], which is of order . Hence, the accumulated bias is at least , which results in a worse critical scaling rather than the critical scaling we aim for. The key idea that enabled us to achieve worst case accumulated bias is that $P$ and $Q$ are probability measures satisfying . Hence, it suffices to prove a pointwise bound for each individual symbol . However, to our knowledge, the study of pointwise bounds for multivariate approximation theory has been limited. The decomposition is translating the problem of obtaining pointwise bounds to the problem of obtaining uniform bounds. Indeed, the uniform error of approximating and over with degree are both of order (Lemma 11), and the finite-difference formula precisely gives us the desired pointwise bound.
2. Once we can assert with high probability , we utilize the best approximation polynomial of on with order . Denote it as
[TABLE]
we have
[TABLE]
where . It is the best approximation polynomial of over interval .
Remark 3.
We discuss the reason why we cannot apply the best approximation polynomial of over the square . Note that the approximation width is at least of order since . However, for the square , we easily have , but is the minimum width which ensures concentration properties (Lemma 3). Indeed, as we show in Lemma 6, any 1D approximation polynomial fails to achieve the pointwise error bound we discussed in Remark 2 over .
Finally, we use the second part of the samples to construct the unbiased estimators for defined in (41) and defined in (44). Concretely, we introduce the estimators and such that
[TABLE]
These unbiased estimators are easy to construct since for any , we have [21, Ex. 2.8]
[TABLE]
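The unbiased estimation of monomials rests on the Poisson factorial-moment identity: if $X \sim \mathsf{Poi}(np)$, then $\mathbb{E}[X(X-1)\cdots(X-k+1)] = (np)^k$, so the falling factorial divided by $n^k$ is unbiased for $p^k$. The sketch below checks this numerically by summing over the Poisson probability mass function; the function names and the truncation point are assumptions.

```python
import math


def unbiased_power_estimate(x, n, k):
    """Falling factorial x (x-1) ... (x-k+1) divided by n^k.

    Unbiased for p^k when x is a Poisson count with mean n * p.
    """
    value = 1.0
    for j in range(k):
        value *= (x - j)
    return value / n ** k


def expectation_under_poisson(p, n, k, x_max=400):
    """E[unbiased_power_estimate(X, n, k)] for X ~ Poi(n * p), by direct
    summation truncated at x_max (negligible tail for moderate n * p)."""
    lam = n * p
    pmf = math.exp(-lam)            # P(X = 0)
    total = 0.0
    for x in range(x_max + 1):
        total += pmf * unbiased_power_estimate(x, n, k)
        pmf *= lam / (x + 1)        # recursion for the Poisson pmf
    return total


mean_estimate = expectation_under_poisson(0.3, 20, 3)
print(mean_estimate, 0.3 ** 3)  # both are approximately 0.027
```

By linearity, any polynomial in the unknown probabilities can be estimated without bias by applying this rule term by term.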
The final estimator is presented as follows.
Estimator Construction 2.
As before, use sample splitting to obtain and . Denote
[TABLE]
and define
[TABLE]
Here is given by (30), is defined in (32), the estimators and are defined in (45) and (46) , and are properly chosen constants, .
A pictorial explanation of the estimator construction is given in Fig 1. Concretely, we use the first sample to classify into four regimes, and in each regime we do the following operations:
1. Regime I: plug-in:
2. Regime II: plug-in:
3. Regime III: 2D polynomial approximation of
4. Regime IV: 1D polynomial approximation of where with width
The next theorem presents the performance of .
Theorem 6.
Suppose there exists a constant such that . Then, there exist constants that only depend on in Construction 2 such that
[TABLE]
We note that the lower bound for the known $Q$ case also serves as a lower bound for the unknown $Q$ case. Indeed, when $Q$ is known, we can then produce i.i.d. samples from $Q$ and feed them into any algorithm that handles the unknown $Q$ case. Hence, Theorem 3 and Theorem 6 yield that the estimator in Construction 2 is minimax rate-optimal. Note that it achieves the minimax rate without knowing the support size a priori. Moreover, the effective sample size enlargement effect holds again: the performance of the optimal estimator with $n$ samples is essentially that of the MLE with $n\ln n$ samples.
IV Comparison with Other Approaches
In this section, we review some other possible approaches to estimating the $L_1$ distance, and apply approximation theory to argue the strict suboptimality of some approaches.
IV-A Approximation only around the origin
In the previous papers [5, 6, 7, 8, 9] on estimating entropy, power sum, mutual information, etc., approximation is conducted only around the origin. However, we remark that this is insufficient in estimating the $L_1$ distance. We have the following result.
Lemma 5.
Let denote an estimator of that satisfies the following:
[TABLE]
where the estimator is a bounded function that satisfies when , . Suppose . Then,
[TABLE]
Lemma 5 explains the reason why the estimator of Valiant and Valiant [5] can only achieve the optimal error rate when , but ours achieves the optimal error rate for a much larger set of parameter configurations.
IV-B One-dimensional approximation in the 2D case
In the construction of , we split into two cases when , i.e., 1D approximation of via the substitution if , and the decomposition of into otherwise. Can we always do 1D approximation of with to achieve the desired approximation error, i.e., propose some with and for any ? We have the following lemma regarding the approximation of .
Lemma 6.
If is even with , and achieves the best uniform error rate , we have
[TABLE]
Now we apply Lemma 6 to the hypothetical polynomial . Doing parameter substitution , by assumption we have for any ,
[TABLE]
where . It follows from Jensen’s inequality that
[TABLE]
Define . It is clear that satisfies the assumptions in Lemma 6. Hence,
[TABLE]
However, it contradicts the upper bound (55) when . Hence, any 1D approximation does not achieve the error rate that is achieved by our 2D approximation approach.
IV-C Approximation on the entire 2D stripe
In the unknown case we have decomposed the stripe into subsets where polynomial approximations take place. Is it possible that we use a single polynomial of degree to approximate such that for any ? We prove that the answer is negative even for and any .
Lemma 7.
If , , we have
[TABLE]
Lemma 7 shows that for a set that is too large (e.g., ), every polynomial fails to achieve the desired approximation error bound . Hence, it is necessary to make the approximation regime random and dependent on the empirical observations.
IV-D The failure of any plug-in estimator
It is evident that the optimal distance estimators we constructed heavily exploit the interactions of and . For example, in the known case, the estimator for is not of the form , where is an arbitrary function of the empirical distribution of that is independent of .
We show that for any estimator of the distribution , the plug-in estimator does not achieve the minimax rates in estimating when one considers the worst cases among all .
Lemma 8.
Consider the known case. Suppose is an arbitrary function of the empirical distribution , and does not depend on . Then, if ,
[TABLE]
Lemma 8 shows that since the plug-in estimator does not explicitly exploit the nonsmoothness of the function , in the worst case it behaves essentially like the maximum likelihood estimator as shown in Corollary 1.
V Experimental Results
In this section, we compare the empirical performances of our algorithms with the following approaches:
- maximum likelihood estimator (MLE): the approach of plugging the empirical distributions obtained from the samples into the functional. As shown in Sections II and III, it does not in general achieve the minimax rates in estimating in either the known or the unknown case.
- Valiant–Valiant estimator [6]: the authors of [6] released Matlab code for their estimator of , which is proved to achieve the minimax rates when , i.e., when the optimal error is a constant. Here denotes the uniform distribution with support size .
- approximate profile maximum likelihood estimator (APML) [22]: the APML estimator is an approximate solution of the profile maximum likelihood estimator [23], which can be applied to estimate , and when both and are unknown. It was shown in [22] that the APML estimator generally exhibits good empirical performance, although its theoretical properties are not yet well understood.
In the sequel, for each true distribution pair , we fix the parameters in our estimators and vary the sample sizes to compare the estimation performances. We use the root mean squared error (RMSE) as the evaluation criterion.
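The evaluation protocol above can be sketched for the plug-in (MLE) baseline alone. The following is a minimal illustration, not the code used for the figures: the function names, the alphabet size, the sample sizes, and the number of Monte Carlo trials are all hypothetical choices, and multinomial sampling is assumed.

```python
import numpy as np

def l1_plugin(counts_p, counts_q):
    """Plug-in (MLE) estimate of the L1 distance from two count vectors."""
    p_hat = counts_p / counts_p.sum()
    q_hat = counts_q / counts_q.sum()
    return float(np.abs(p_hat - q_hat).sum())

def rmse_of_plugin(p, q, n, trials=200, seed=0):
    """Monte Carlo RMSE of the plug-in estimator with n samples from each of p and q."""
    rng = np.random.default_rng(seed)
    truth = np.abs(p - q).sum()
    errors = []
    for _ in range(trials):
        counts_p = rng.multinomial(n, p)
        counts_q = rng.multinomial(n, q)
        errors.append(l1_plugin(counts_p, counts_q) - truth)
    return float(np.sqrt(np.mean(np.square(errors))))

# Illustrative setting (an assumption): both distributions uniform on S symbols,
# so the true distance is zero and the RMSE is driven by the plug-in bias.
S = 1000
u = np.full(S, 1.0 / S)
for n in (500, 2000, 8000):
    print(n, rmse_of_plugin(u, u, n))
```

With both distributions uniform, the RMSE of the plug-in estimator decays slowly while the sample size is comparable to the alphabet size, which is the regime in which the comparison with the other estimators is of interest.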
Figure 2 compares the four approaches mentioned above in estimating , which is also called “distance to uniformity”. We see that our algorithm is consistently better than the maximum likelihood estimator, and is competitive with the VV estimator [6] and APML estimator [22]. Our estimator has computational complexity . Indeed, in the worst case, we may need to evaluate a polynomial with degree for each sample, which results in an overall computational complexity.
Figure 3 compares the performances of the maximum likelihood estimator (MLE), our estimator, and APML in estimating when both and are unknown. Note that we did not compare with [6] since there is no code available for their algorithm in the unknown setting. We find that our algorithm performs consistently better than the maximum likelihood estimator, and is particularly competitive when the distributions and are quite different from each other. Our estimator has computational complexity in the unknown setting. In the worst case, we may need to evaluate a bivariate polynomial with degree in each variable for each sample, which results in an overall computational complexity.
VI Acknowledgements
We are grateful to Vilmos Totik for discussing multivariate approximation theory, and for the insights that motivated the proof of Lemma 7. We would like to thank Gregory Valiant for discussing the estimator in [5]. We are grateful to the associate editor and the anonymous reviewers for constructive comments that have helped significantly improve the presentation of the paper. We thank Irena Fischer-Hwang and Banghua Zhu for their help in preparing the experimental results in Section V.
Appendix A Auxiliary Lemmas
The first-order symmetric difference of a function is given by
[TABLE]
while the second order symmetric difference is given by
[TABLE]
Analogously, the -th order symmetric difference can be defined, and it is zero when or are not inside the domain of .
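As a small illustration of these definitions, the sketch below evaluates the symmetric difference of arbitrary order. It assumes the standard form $\Delta_h^r f(x) = \sum_{k=0}^{r} (-1)^k \binom{r}{k} f\big(x + (r/2 - k)h\big)$ on an interval domain, which reduces to $f(x+h/2) - f(x-h/2)$ for $r=1$ and to $f(x+h) - 2f(x) + f(x-h)$ for $r=2$; the function name and the default domain $[0,1]$ are hypothetical.

```python
from math import comb

def symmetric_difference(f, x, h, r, lo=0.0, hi=1.0):
    """r-th order symmetric difference of f at x with step h on [lo, hi].

    By convention the difference is zero when x - r*h/2 or x + r*h/2
    falls outside the domain.
    """
    left, right = x - r * h / 2, x + r * h / 2
    if left < lo or right > hi:
        return 0.0
    return sum((-1) ** k * comb(r, k) * f(x + (r / 2 - k) * h)
               for k in range(r + 1))

square = lambda x: x * x
print(symmetric_difference(square, 0.5, 0.2, 1))  # f(0.6) - f(0.4)
print(symmetric_difference(square, 0.5, 0.1, 2))  # f(0.6) - 2 f(0.5) + f(0.4)
```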
For function with domain , , the first-order Ditzian–Totik modulus of smoothness is defined as
[TABLE]
and the second-order Ditzian–Totik modulus of smoothness is defined as
[TABLE]
Similarly, we can also define the -th order Ditzian–Totik modulus of smoothness for a function with domain :
[TABLE]
where denotes the symmetric difference with respect to the -th coordinate.
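For intuition, the one-dimensional second-order modulus can be approximated numerically on a grid. The sketch below assumes the usual definition on $[0,1]$, namely $\omega_\varphi^2(f,t) = \sup_{0<h\le t} \sup_x |\Delta^2_{h\varphi(x)} f(x)|$ with $\varphi(x) = \sqrt{x(1-x)}$; the function name and the grid sizes are hypothetical, and a grid search only gives a lower approximation of the supremum in general.

```python
import numpy as np

def dt_modulus2(f, t, n_x=1001, n_h=200):
    """Grid approximation of the second-order Ditzian-Totik modulus on [0, 1]."""
    x = np.linspace(0.0, 1.0, n_x)
    phi = np.sqrt(x * (1.0 - x))
    best = 0.0
    for h in np.linspace(t / n_h, t, n_h):
        step = h * phi
        # The symmetric difference is taken to be zero outside the domain.
        inside = (x - step >= 0.0) & (x + step <= 1.0)
        diff = np.abs(f(x + step) - 2.0 * f(x) + f(x - step))
        best = max(best, float(np.max(np.where(inside, diff, 0.0))))
    return best

kink = lambda x: np.abs(x - 0.5)   # nonsmooth at 1/2
smooth = lambda x: x * x
for t in (0.05, 0.1, 0.2):
    print(t, dt_modulus2(kink, t), dt_modulus2(smooth, t))
```

For the kink function the modulus scales linearly in $t$, while for the smooth quadratic it scales as $t^2$; this gap is what makes nonsmooth functions harder to approximate by polynomials.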
The next lemma upper bounds the best polynomial approximation error by the Ditzian–Totik moduli.
Lemma 9.
[24, Thm. 7.2.1, Thm. 12.1.1] There exists a constant such that for any function ,
[TABLE]
where denotes the distance of the function to the space in the uniform norm on . Moreover, if , we have
[TABLE]
for any , where is independent of and , and denotes the distance of the function to the space in the uniform norm on .
The modulus is computed for a variety of functions in the following lemma.
Lemma 10.
[24, Chap. 3.4] Suppose . Then,
[TABLE]
where is defined in (61), is defined in (62).
Lemma 11.
Suppose , and is a parameter. Then,
[TABLE]
The next lemma computes the Ditzian–Totik modulus for the function .
Lemma 12.
Suppose . Then, for any integer ,
[TABLE]
where is defined in (62).
Lemma 13 (Markov’s inequality).
[20, Chap. 4, Thm. 1.4] Suppose is defined on . Then,
[TABLE]
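Markov's inequality can be checked numerically. The sketch below assumes the classical statement on $[-1,1]$, that is, $\max|p'| \le n^2 \max|p|$ for every polynomial $p$ of degree at most $n$, with equality for the Chebyshev polynomial $T_n$; the version used in the paper may be stated on a general interval. The helper name is hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def markov_ratio(coeffs, grid=20001):
    """max|p'| / (n^2 max|p|) on [-1, 1], for p given by Chebyshev coefficients."""
    n = len(coeffs) - 1
    x = np.linspace(-1.0, 1.0, grid)
    p = C.chebval(x, coeffs)
    dp = C.chebval(x, C.chebder(coeffs))
    return float(np.max(np.abs(dp)) / (n ** 2 * np.max(np.abs(p))))

# T_5 attains the bound: its derivative equals 25 at the endpoint x = 1.
print(markov_ratio([0, 0, 0, 0, 0, 1]))

# A random degree-6 polynomial stays below the bound.
rng = np.random.default_rng(0)
print(markov_ratio(rng.normal(size=7)))
```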
Lemma 14.
[24, Thm. 7.3.1] For the best -th degree polynomial approximation to in and an integer we have
[TABLE]
where and is independent of and .
The next lemma shows that a polynomial on nearly attains its supremum norm in a slightly smaller interval contained in .
Lemma 15.
[24, Thm. 8.4.8] Suppose is a constant, defined on , . Then, there exists a constant that does not depend on and such that
[TABLE]
Lemma 16.
Suppose is the best approximation polynomial with order of function defined as
[TABLE]
Then, the best approximation polynomial with order of function is given by .
The following lemma characterizes the upper bounds of the coefficients of a bounded real polynomial.
Lemma 17.
[16] Let be a polynomial of degree at most such that for . Then
1. If , then
[TABLE]
2. If , then
[TABLE]
The following lemma gives an upper bound for the second moment of the unbiased estimate of in Poisson model.
Lemma 18.
Suppose . Then, the estimator
[TABLE]
is the unique uniformly minimum variance unbiased estimator for , and its second moment is given by
[TABLE]
where stands for the Laguerre polynomial with order , which is defined as:
[TABLE]
If , we have
[TABLE]
When , .
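The unbiasedness claim can be checked by exact summation. The sketch below is written under assumptions, because the displayed formula is not reproduced here: it takes $nX \sim \mathrm{Poi}(np)$, the target $(p-q)^j$, and an estimator of falling-factorial form $\sum_{k=0}^{j} \binom{j}{k} (-q)^{j-k} \prod_{h=0}^{k-1} (X - h/n)$, which is unbiased because the $k$-th factorial moment of a $\mathrm{Poi}(\lambda)$ variable equals $\lambda^k$. All function names are hypothetical.

```python
from math import comb, exp, lgamma, log

def g(j, q, x, n):
    """Falling-factorial estimator of (p - q)^j evaluated at x = count / n."""
    total = 0.0
    for k in range(j + 1):
        prod = 1.0
        for h in range(k):
            prod *= x - h / n
        total += comb(j, k) * (-q) ** (j - k) * prod
    return total

def poisson_pmf(y, lam):
    """Poisson probability mass, computed in log space for stability."""
    return exp(-lam + y * log(lam) - lgamma(y + 1))

def expectation(j, q, p, n, cutoff=400):
    """E[g(X)] for nX ~ Poi(np), truncating the sum at a large cutoff."""
    lam = n * p
    return sum(poisson_pmf(y, lam) * g(j, q, y / n, n) for y in range(cutoff))

p, q, n = 0.3, 0.1, 50
for j in (1, 2, 3, 4):
    print(j, expectation(j, q, p, n), (p - q) ** j)
```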
We construct the unbiased estimator of when both and are unknown as in the following lemma.
Lemma 19.
Suppose . Then, the following estimator using is the unique uniformly minimum variance unbiased estimator for :
[TABLE]
Furthermore,
[TABLE]
The following lemma characterizes the behavior of the central moments of Poisson distributions.
Lemma 20.
Suppose . Then, for any integer , there exist constants that are independent of , such that
[TABLE]
Furthermore,
[TABLE]
Consequently, there exists a constant depending only on satisfying such that for any an even integer,
[TABLE]
For odd integer, we have
[TABLE]
We emphasize that the scaling is consistent with the general moment bounds in [25]. However, the results in [25] do not directly apply here. Furthermore, Lemma 20 provides bounds on each individual , which is not obtainable from a general moment bound.
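The polynomial structure of the Poisson central moments can be illustrated by exact summation. The sketch below compares the numerically computed central moments of $\mathrm{Poi}(\lambda)$ with the closed forms $\mu_2 = \lambda$, $\mu_3 = \lambda$, $\mu_4 = 3\lambda^2 + \lambda$, $\mu_5 = 10\lambda^2 + \lambda$, and $\mu_6 = 15\lambda^3 + 25\lambda^2 + \lambda$, which follow from the fact that all cumulants of a Poisson variable equal $\lambda$. The function name and the truncation rule are hypothetical choices.

```python
from math import exp, lgamma, log

def poisson_central_moment(m, lam):
    """m-th central moment of Poi(lam), by truncated exact summation."""
    cutoff = int(lam + 40 * lam ** 0.5 + 60)
    return sum(exp(-lam + y * log(lam) - lgamma(y + 1)) * (y - lam) ** m
               for y in range(cutoff))

closed_forms = {
    2: lambda lam: lam,
    3: lambda lam: lam,
    4: lambda lam: 3 * lam ** 2 + lam,
    5: lambda lam: 10 * lam ** 2 + lam,
    6: lambda lam: 15 * lam ** 3 + 25 * lam ** 2 + lam,
}

lam = 7.5
for m, formula in closed_forms.items():
    print(m, poisson_central_moment(m, lam), formula(lam))
```

In each case the $m$-th central moment is a polynomial in $\lambda$ of degree $\lfloor m/2 \rfloor$, so it grows like $\lambda^{m/2}$ for even $m$ when $\lambda$ is large.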
The next lemma controls the moments of , where .
Lemma 21.
Suppose . Then, for any integer , there exists a constant depending only on such that
[TABLE]
One may take .
The following lemma gives well-known tail bounds for Poisson and binomial random variables.
Lemma 22.
[26, Exercise 4.7] If or , then for any , we have
[TABLE]
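The tail bounds can be compared with exact Poisson tail probabilities. The sketch below assumes the standard multiplicative Chernoff form for $X \sim \mathrm{Poi}(\lambda)$, namely $P(X \ge (1+\delta)\lambda) \le \big(e^{\delta}/(1+\delta)^{1+\delta}\big)^{\lambda}$ and $P(X \le (1-\delta)\lambda) \le \big(e^{-\delta}/(1-\delta)^{1-\delta}\big)^{\lambda} \le e^{-\delta^2\lambda/2}$ for $0<\delta<1$; the exact statement in the lemma is not reproduced here, and the function names are hypothetical.

```python
from math import ceil, exp, floor, lgamma, log

def poisson_pmf(y, lam):
    return exp(-lam + y * log(lam) - lgamma(y + 1))

def upper_tail(lam, delta, terms=2000):
    """Exact P(X >= (1 + delta) * lam), truncated after many terms."""
    start = ceil((1 + delta) * lam)
    return sum(poisson_pmf(y, lam) for y in range(start, start + terms))

def lower_tail(lam, delta):
    """Exact P(X <= (1 - delta) * lam)."""
    return sum(poisson_pmf(y, lam) for y in range(floor((1 - delta) * lam) + 1))

def chernoff_upper(lam, delta):
    return exp(-lam * ((1 + delta) * log(1 + delta) - delta))

def chernoff_lower(lam, delta):
    return exp(-lam * ((1 - delta) * log(1 - delta) + delta))

lam, delta = 20.0, 0.5
print(upper_tail(lam, delta), chernoff_upper(lam, delta))
print(lower_tail(lam, delta), chernoff_lower(lam, delta), exp(-delta ** 2 * lam / 2))
```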
The following lemma presents the Hoeffding bound.
Lemma 23.
[27] Let be independent random variables such that takes its value in almost surely for all . Let , we have for any ,
[TABLE]
The following lemma provides sharp estimates of , where , which can be viewed as an analog of the binomial case studied in [28].
Lemma 24.
Suppose . Then,
[TABLE]
Hence,
[TABLE]
Lemma 25.
Suppose . Then, for any ,
[TABLE]
Further,
[TABLE]
The next lemma upper bounds the variance of .
Lemma 26.
Suppose . Then, for any ,
[TABLE]
Appendix B Proofs of main theorems
B-A Proof of Theorem 1
We have
[TABLE]
Hence,
[TABLE]
where we applied Lemma 25.
To analyze the variance, due to the mutual independence of , we have
[TABLE]
where we used Lemma 26 in the second step.
The proof of the upper bound is complete. Regarding the lower bound, setting , we have
[TABLE]
B-B Proof of Theorem 2
The following lemma gives the bias and variance bound of .
Lemma 27.
For with , there exists a universal constant such that
[TABLE]
where is the unique uniformly minimum variance unbiased estimate of defined in (18), is defined in (10) and .
Proof.
Recall the “good” events defined in (13),(14),(15) and define . We have
[TABLE]
where we have applied Lemma 2.
Define the random variables
[TABLE]
where the random index sets are defined as
[TABLE]
The indices are independent of the random variables . Since
[TABLE]
it follows from Cauchy’s inequality that
[TABLE]
where is defined in (17).
It follows from the law of total variance that
[TABLE]
where we have used the fact that with probability one and Lemma 26. Similarly we have .
Regarding , it follows from Lemma 27 and the mutual independence of that
[TABLE]
where .
Hence,
[TABLE]
where is defined in (17) and .
If , one may choose large enough and to ensure that . When , one may choose small enough to ensure that . The worst case of result is proved upon noting that
[TABLE]
In the worst case of we no longer need the condition since we can ensure if we take large enough and .
∎
B-C Proof of Theorem 3
The main tool we employ is the so-called method of two fuzzy hypotheses presented in Tsybakov [29]. Suppose we observe a random vector which has distribution where . Let and be two prior distributions supported on . Write for the marginal distribution of when the prior is for . Let be an arbitrary estimator of a function based on . We have the following general minimax lower bound.
Lemma 28.
[29, Thm. 2.15] Given the setting above, suppose there exist such that
[TABLE]
If , then
[TABLE]
where are the marginal distributions of when the priors are , respectively.
Here is the total variation distance between two probability measures on the measurable space . Concretely, we have
[TABLE]
where , and is a dominating measure so that .
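For discrete distributions the two descriptions of the total variation distance can be compared directly. The sketch below assumes the usual definitions, the supremum of $|P(A) - Q(A)|$ over events $A$ and half of the $L_1$ distance between the probability mass functions, and checks that they agree on a small alphabet by enumerating all events. The function names are hypothetical.

```python
from itertools import combinations

def tv_half_l1(p, q):
    """Total variation distance as half the L1 distance between mass functions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_sup_over_events(p, q):
    """Total variation distance as the largest |P(A) - Q(A)| over all events A."""
    best = 0.0
    for size in range(len(p) + 1):
        for event in combinations(range(len(p)), size):
            best = max(best, abs(sum(p[i] - q[i] for i in event)))
    return best

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(tv_half_l1(p, q), tv_sup_over_events(p, q))
```

The enumeration over events is exponential in the alphabet size and is only meant as a sanity check; the half-$L_1$ form is the one used in computations.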
The following lemma was shown in Cai and Low [11]:
Lemma 29.
For any given even integer , there exist two probability measures and on that satisfy the following conditions:
1. and are symmetric around [math];
2. , for ;
3. ,
where is the distance in the uniform norm on from the absolute value function to the space .
It is known that , where is the Bernstein constant [30].
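The $1/L$ decay of the approximation error of the absolute value function can be observed with a simple, non-optimal polynomial. The sketch below uses the truncated Chebyshev series of $|x|$ on $[-1,1]$, which is not the best approximation and therefore only gives an upper bound of the same order; its error multiplied by the degree approaches $2/\pi \approx 0.637$, whereas the best approximation attains the Bernstein constant, approximately $0.2801$. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def abs_cheb_coeffs(L):
    """Chebyshev coefficients of the degree-L truncated series of |x| on [-1, 1]."""
    c = np.zeros(L + 1)
    c[0] = 2.0 / np.pi
    for k in range(1, L // 2 + 1):
        c[2 * k] = (4.0 / np.pi) * (-1) ** (k + 1) / (4 * k * k - 1)
    return c

def sup_error(L, grid=20001):
    """Uniform error of the truncated series, evaluated on a fine grid."""
    x = np.linspace(-1.0, 1.0, grid)
    return float(np.max(np.abs(np.abs(x) - C.chebval(x, abs_cheb_coeffs(L)))))

for L in (10, 20, 40, 80):
    err = sup_error(L)
    print(L, err, L * err)
```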
The following lemma deals with the approximation theoretic properties of function .
Lemma 30.
For any function , there exists a universal constant such that
[TABLE]
where denotes the distance in the uniform norm on interval from the function to the space .
Similar to Lemma 29, the next lemma constructs two measures for the function . The proof is essentially identical to that of Lemma 29.
Lemma 31.
For any and positive integer , , there exist two probability measures on such that
1. , for all ;
2. ,
where is the distance in the uniform norm on from the function to the space .
The next lemma is an extension of [8, Lemma 3].
Lemma 32.
Suppose are two random variables supported on , where are constants. Suppose . Denote the marginal distribution of where as . If , then
[TABLE]
where is the total variation distance defined in (132).
We consider the set of approximate probability vectors
[TABLE]
with some constant . We further define the minimax risk under the Poisson sampling model with respect to with a fixed as
[TABLE]
The following lemma relates to .
Lemma 33.
For any and any distribution , we have
[TABLE]
Now we are ready to prove our main minimax lower bound.
Proof.
Fix the distribution . Without loss of generality we assume that . We construct two probability measures on the distribution that will later be used in Lemma 28. Concretely, we use an independent prior generation, and set
[TABLE]
In other words, we assign independent priors to each symbol , and assign a delta mass at to the symbol . The constant will later be set to
[TABLE]
where is the universal constant in Lemma 30, and is a constant.
Now we construct for a generic . We consider two different cases.
1. , where is a constant. We first construct two new probability measures from the two probability measures constructed in Lemma 31. For , the restriction of is absolutely continuous with respect to , with the Radon–Nikodym derivative given by
[TABLE]
and . Hence, are probability measures on , with the following properties:
(a) ;
(b) , for all ;
(c) .
The construction of the Radon–Nikodym derivatives is inspired by Wu and Yang [8]. Define
[TABLE]
where is the universal constant in Lemma 30 and is a constant. It follows from the assumption that . Let and let be the measures on defined by for . It then follows that
[TABLE]
[TABLE]
[TABLE]
2. , where is a constant. Define the function , where . Let be the two measures constructed in Lemma 29. We define two new measures by . Let
[TABLE]
It then follows that
[TABLE]
[TABLE]
[TABLE]
Since we have set , where is defined in (140), it is clear that
[TABLE]
Now the construction of the two priors and is complete. In light of Lemma 33, it suffices to lower bound in order to lower bound .
Let
[TABLE]
We know from (146) and (151) that
[TABLE]
since we have assumed that .
For , introduce the events
[TABLE]
It follows from the union bound that
[TABLE]
Introduce
[TABLE]
It follows from the Hoeffding inequality in Lemma 23 that
[TABLE]
The last step follows from the arguments below. Note that we assumed . We have
[TABLE]
Hence, it suffices to take large enough to ensure that .
Denote by the conditional distribution defined as
[TABLE]
Now consider as two priors and denote the corresponding marginal distributions on the observations as . Note that . Setting
[TABLE]
we have in Lemma 28. The total variation distance is then upper bounded as
[TABLE]
where is the marginal distribution of the observations under priors . It follows from Lemma 32 and the fact that that
[TABLE]
since we have assumed , and we ensure by taking large enough.
It follows from Lemma 28 and Markov’s inequality that
[TABLE]
which together with Lemma 33 implies that
[TABLE]
as long as we choose the constants large enough to guarantee that .
∎
B-D Proof of Theorem 6
We first present the performance of the estimator when .
Lemma 34.
Suppose , . Then,
[TABLE]
for some constant . The estimator is introduced in (45), and .
We then analyze the estimator when .
Lemma 35.
Suppose , where the set is defined in (30). Suppose . Then,
[TABLE]
for some constant . The estimator is introduced in (46), and .
Proof.
Recall the “good” events defined in (33),(34),(35),(36) and introduce . We have
[TABLE]
where we have applied Lemma 4 and the constant is defined in (38).
Define the random variables
[TABLE]
where the random index sets are defined as
[TABLE]
The index sets are independent of the random variables and . It follows from the definition of the ’s that
[TABLE]
Hence, it follows from the Cauchy–Schwarz inequality that
[TABLE]
It follows from the law of total variance that
[TABLE]
where we have used the fact that with probability one, the independence of and , and Lemma 26. Similarly we have .
Regarding , it follows from Lemma 34 and the mutual independence of and that
[TABLE]
where .
Regarding , it follows from the bias-variance decomposition and Lemma 35 that
[TABLE]
where the constant is the one in Lemma 35. Taking expectations with respect to , we have
[TABLE]
where .
Combining everything together, we have
[TABLE]
where , and the constant is the larger constant between the one in Lemma 34 and Lemma 35. The constant is in (38).
If , we can take small enough and large enough to guarantee that . Upon noting that , we have
[TABLE]
∎
Appendix C Proofs of main lemmas
C-A Proof of Lemma 1
We first consider the case of . In this case,
[TABLE]
where we used Lemma 22 in the last step. When , we have
[TABLE]
where we applied Lemma 22 again.
C-B Proof of Lemma 2
Since
[TABLE]
it suffices to control separately. We have
[TABLE]
Note that if , then it follows from Lemma 22 that
[TABLE]
If , then it follows from Lemma 22 that
[TABLE]
Hence,
[TABLE]
Analogously, when , and when ,
[TABLE]
Hence,
[TABLE]
As for , when ,
[TABLE]
When ,
[TABLE]
[TABLE]
Consequently,
[TABLE]
C-C Proof of Lemma 3
It is clear that the square . To see how we obtained the whole expression of , for any , we study the envelope of the parametrized extremal points , where the other curve can be dealt with analogously.
For , we have
[TABLE]
Hence,
[TABLE]
We have that for all points ,
[TABLE]
where we used the inequality in the last step.
C-D Proof of Lemma 4
It follows from the union bound that
[TABLE]
Hence, it suffices to analyze each .
1. Analysis of :
[TABLE]
It follows from Lemma 3 that the set . Hence,
[TABLE]
It follows from Lemma 1 that
[TABLE]
2. Analysis of : following similar steps as in the analysis of , we have .
3. Analysis of :
[TABLE]
where we have used the fact that and Lemma 22.
4. Analysis of :
[TABLE]
We have
[TABLE]
and
[TABLE]
It suffices to show that there exists some constant such that
[TABLE]
where is defined in (10). Indeed, in this case it follows from Lemma 1 that
[TABLE]
Now we work to prove (279). Without loss of generality we assume satisfies and the constant . Under this assumption we have . We will show that for any point , we have , thereby proving (279).
If , we have for any ,
[TABLE]
where in the second step we used the fact that the function is monotonically increasing when . Hence, we need to guarantee that
[TABLE]
which can be reduced to the quadratic inequality:
[TABLE]
One can easily verify that satisfies this inequality since .
Now we consider the case of . Then, for any ,
[TABLE]
Further, since ,
[TABLE]
where we used the fact that is a monotonically increasing function of when , and the function is a monotonically decreasing function of when . To guarantee that , we need
[TABLE]
which is equivalent to
[TABLE]
with the constraint that .
C-E Proof of Lemma 5
We consider two different parameter settings.
1. : In this case, we construct the distribution as follows (technically, the distribution has support no more than ; however, a standard continuity argument implies that the same conclusion holds):
[TABLE]
where is a constant that will be chosen later, and . Without loss of generality we assume is an integer. We now argue that for each index ,
[TABLE]
It follows from Lemma 22 that , where . Note that can be made arbitrarily large by taking the constant large. Define . We have
[TABLE]
Since , we have
[TABLE]
It follows from the triangle inequality that
[TABLE]
It follows from the conditional version of Jensen’s inequality that , and by Lemma 24 we have
[TABLE]
Since for , we conclude that (300) is true. Hence, the total bias of is at least since .
2. : In this case, we construct to be uniform distributions with support size . Since , it follows from arguments analogous to those above that the squared bias of is at least of order .
C-F Proof of Lemma 6
Since , it suffices to show that there exists a universal constant such that
[TABLE]
for . Define . Since is even, it follows that , where is a polynomial. The polynomial satisfies the following:
[TABLE]
It suffices to show that . Let denote the best approximation polynomial of the function on with order no more than . It follows from Lemma 9 and Lemma 10 that . It follows from the triangle inequality that
[TABLE]
It follows from the Markov inequality (Lemma 13) that . Since for any ,
[TABLE]
it suffices to show .
It follows from Lemma 14 and Lemma 10 that
[TABLE]
Hence, it follows from Lemma 15 that
[TABLE]
Hence,
[TABLE]
The proof is complete.
C-G Proof of Lemma 7
We prove the lemma by contradiction. Assume the contrary; then there exist universal constants and a polynomial of degree such that
[TABLE]
where . Now for any , we have , and plugging in this pair yields
[TABLE]
Similarly, for we also have
[TABLE]
Now consider
[TABLE]
it is easy to see that is a polynomial of , and . Moreover, adding the previous two inequalities together, by triangle inequality we obtain
[TABLE]
Since , we have . Define for , (328) becomes
[TABLE]
Moreover, . Now let be the best degree- approximating polynomial of in the uniform norm on . Using the second-order Ditzian–Totik modulus of smoothness (Lemma 9), we arrive at
[TABLE]
Furthermore, following the proof of Lemma 6 we can prove that
[TABLE]
As a result, by triangle inequality we have
[TABLE]
Since is also a polynomial of degree , by Markov’s inequality (Lemma 13)
[TABLE]
and finally by triangle inequality again
[TABLE]
Now we are about to arrive at the desired contradiction. Choosing and in (329), we have
[TABLE]
with a suitable universal constant appearing in the RHS of (329). As a result,
[TABLE]
and by the mean value theorem we conclude that there exists some such that
[TABLE]
where the last inequality follows from the fact that . However, this inequality contradicts our previous result (336), which completes the proof. ∎
C-H Proof of Lemma 8
We have
[TABLE]
where the last step follows from the result of minimax risk for estimating the discrete distribution under loss in [31, Cor. 4].
C-I Proof of Lemma 27
To simplify the notation we denote . We split the proof into two cases: and .
1. The case . In this case, it follows from (18) that is the best approximation polynomial of the function over . Define and introduce the function
[TABLE]
Define the best approximation polynomial of with order as
[TABLE]
It follows from Lemma 9 and 12 that there exists a universal constant such that
[TABLE]
Since the approximation performance of is at least as good as that of a constant, and , we know that there exists another universal constant such that
[TABLE]
Denoting and using , we know
[TABLE]
It follows from Lemma 18 that is the unique uniformly minimum variance unbiased estimator of when . Hence,
[TABLE]
and . Since is a polynomial with degree no more than and satisfies
[TABLE]
It follows from Lemma 17 that for all ,
[TABLE]
Now we prove the variance properties of . We have
[TABLE]
where we have applied Lemma 18 with since we have assumed .
Since , we have
[TABLE]
where is some universal constant.
2. The case . In this case, it follows from (18) that is the best approximation polynomial of the function over . Denote the best approximation polynomial of on with order as
[TABLE]
Using , we have
[TABLE]
It is well known [20, Chap. 9, Thm. 3.3] that there exists a universal constant such that
[TABLE]
Consequently, for ,
[TABLE]
It follows from Lemma 18 that defined as
[TABLE]
is the unique uniformly minimum variance unbiased estimator for . Hence,
[TABLE]
It was shown in Cai and Low [11, Lemma 2] that . We study the variance properties of as follows.
Define . Note that if the variance of this is zero. We now consider . Applying Lemma 18 and the fact that the standard deviation of a sum of random variables is upper bounded by the sum of standard deviations of corresponding random variables, we have
[TABLE]
where , and is some universal constant. Recall that . It suffices to show to complete the proof. Indeed, we have
[TABLE]
C-J Proof of Lemma 30
It is clear that . Introduce
[TABLE]
where . We have . Recall the second-order Ditzian–Totik modulus of smoothness given in (62)
[TABLE]
where .
We deal with the two cases separately.
1. : Denote . It is easy to verify that if , then , which ensures that . We lower bound for as follows:
[TABLE]
Since
[TABLE]
we have
[TABLE]
The relationship between and was established in [24, Thm. 7.2.4]: there exists a universal positive constant such that
[TABLE]
Utilizing the non-increasing property of with respect to yields
[TABLE]
Now we work out an upper bound on . It follows from Lemma 9 that there exists a universal constant such that
[TABLE]
where , where . It follows from straightforward algebra that . Hence,
[TABLE]
Since , for large enough, we know and there exist two universal constants such that
[TABLE]
where we used the fact that for . Hence,
[TABLE]
when is large enough.
2. : for we have
[TABLE]
where .
Since
[TABLE]
it suffices to lower bound to lower bound . Note that the function is a non-decreasing function.
We have for large enough, and
[TABLE]
where we used the fact that the function is a non-increasing function for .
Hence, we have shown that
[TABLE]
Following (393), we have that
[TABLE]
when is large enough. Here we used the fact that .
C-K Proof of Lemma 32
We have
[TABLE]
For each , we introduce function
[TABLE]
where . We introduce . It follows from the assumptions that .
We write the series expansion of as follows:
[TABLE]
Hence,
[TABLE]
where we used the fact that .
It follows from the Leibniz formula for derivatives of products of functions that
[TABLE]
Hence,
[TABLE]
Construct random variable . Then,
[TABLE]
where .
Consequently,
[TABLE]
where is the estimator introduced in Lemma 18 for the case of .
It follows from Lemma 18 that
[TABLE]
Hence, it follows from that
[TABLE]
It follows from the assumptions that
[TABLE]
Consequently,
[TABLE]
C-L Proof of Lemma 33
We define the minimax risk under the multinomial sampling model for a fixed as
[TABLE]
Fix . Let be a near-minimax estimator of under the multinomial model for every sample size , which means that for every sample size ,
[TABLE]
Here the random vector follows multinomial distribution parametrized by , and the estimator obtains the number of samples from this random vector.
Now we consider the Poisson sampling model, where ’s are mutually independent with marginal distributions . Let . We use the estimator to estimate under the Poisson sampling model. For any under the Poisson sampling model, we have
[TABLE]
where we used the fact that for any , and the fact that if , then
[TABLE]
Then,
[TABLE]
where we used the fact that conditioned on , follows multinomial distribution parametrized by , the monotonicity of as a function of , , and Lemma 22.
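The conditioning fact used above, that independent Poisson counts conditioned on their total follow a multinomial distribution, can be verified exactly on a small example. The sketch below assumes a probability vector that sums to one, so that the total count is $\mathrm{Poi}(n)$; for an approximate probability vector the same identity holds after normalizing the vector. The function names are hypothetical.

```python
from math import exp, factorial, isclose, prod

def poisson_pmf(y, lam):
    return exp(-lam) * lam ** y / factorial(y)

def multinomial_pmf(counts, probs):
    """Multinomial probability of a count vector with sum(counts) trials."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)   # exact integer division at every step
    return coef * prod(p ** c for p, c in zip(probs, counts))

def joint_poisson(counts, probs, n):
    """Joint probability of independent counts X_i ~ Poi(n * p_i)."""
    return prod(poisson_pmf(c, n * p) for c, p in zip(counts, probs))

def poisson_times_multinomial(counts, probs, n):
    """P(total = N) under Poi(n), times the multinomial probability given N."""
    return poisson_pmf(sum(counts), n) * multinomial_pmf(counts, probs)

counts, probs, n = (3, 1, 2), (0.5, 0.2, 0.3), 5.0
print(joint_poisson(counts, probs, n), poisson_times_multinomial(counts, probs, n))
```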
Taking supremum of over and using the arbitrariness of , we have
[TABLE]
which is equivalent to
[TABLE]
It follows from [7, Lemma 16] that . Hence,
[TABLE]
C-M Proof of Lemma 34
We first analyze the bias. To simplify the notation we denote . It follows from the definition of that for ,
[TABLE]
where , and and satisfy (1).
We first argue that there exists a universal constant such that . Indeed,
[TABLE]
It follows from Lemma 9 and Lemma 11 that the best polynomial approximation error of and over the unit square are both of order . Hence,
[TABLE]
which implies that there exists another constant such that
[TABLE]
Denote , we have
[TABLE]
We now analyze the variance. Express the polynomial explicitly as
[TABLE]
For any fixed value of , is a polynomial of with degree no more than that is uniformly bounded by a universal constant on . It follows from Lemma 17 that for any fixed ,
[TABLE]
which, together with Lemma 17, implies that
[TABLE]
Since is the unbiased estimator of , we know
[TABLE]
where is the unbiased estimator for introduced in Lemma 18.
Denote and . Using the triangle inequality of the norm and Lemma 18, we know
[TABLE]
Since for any ,
[TABLE]
we know
[TABLE]
for some constant . Hence,
[TABLE]
C-N Proof of Lemma 35
We first analyze the bias. It follows from the definition of that
[TABLE]
where .
Since , we know
[TABLE]
where we have used the fact that and the assumption that .
Hence, it follows from the property that the best degree- polynomial approximation error of over is [20, Chap. 9, Thm. 3.3] that
[TABLE]
Then we analyze the variance. It was shown in Cai and Low [11, Lemma 2] that . Denote the unbiased estimator of by and introduce the norm . It follows from the triangle inequality of the norm and the fact that constants have zero variance that
[TABLE]
It follows from Lemma 19 that
[TABLE]
Hence,
[TABLE]
where
[TABLE]
Consequently,
[TABLE]
where is a constant.
Appendix D Proofs of auxiliary lemmas
D-A Proof of Lemma 11
We split the analysis of into two cases:
1. or : in this case,
[TABLE]
where we have used the fact that and .
2. : in this case
[TABLE]
D-B Proof of Lemma 12
It follows from taking derivatives that for convex function , the function is a nondecreasing function of . Since is a convex function, it follows from straightforward algebra that
[TABLE]
where
[TABLE]
We break the proof into three parts.
1. We first prove that when , the maximum of is achieved by at .
Consider first the case and function . If , without loss of generality we can assume , since otherwise . Then,
[TABLE]
Taking derivative with respect to , it suffices to show this derivative is non-positive when . We have the derivative expressed as
[TABLE]
Since is a convex function, it achieves its maximum at the endpoints. When we set and , it is negative in both cases. Similar arguments work for the case of . Hence, we conclude that when ,
[TABLE]
Consider the case and the function . It suffices to assume since otherwise . In this case
[TABLE]
which is a decreasing function in , implying . Similar arguments work for the and case.
2. We now prove that when , the maximum is achieved by at .
In this case, it suffices to consider . Indeed, if , then the second order difference in the non-zero case is given by (1), which is shown to be a decreasing function when . Now consider . We discuss three cases separately:
(a) : in this case, .
(b) : in this case,
[TABLE]
which is an increasing function of . It implies that in this regime one should take . The resulting is .
(c) : in this case, the second order difference is
[TABLE]
which is independent of .
Hence, we have shown that for , the maximum is achieved by and
[TABLE]
3. The case of can be dealt with in a fashion similar to the case of , resulting in
[TABLE]
D-C Proof of Lemma 16
It suffices to show that for any polynomial ,
[TABLE]
Define
[TABLE]
It is clear that is an even function, is an odd function, and . We have
[TABLE]
where we have used the fact that and the convexity of the function .
There exists another polynomial such that . Hence, for any ,
[TABLE]
where we used the definition of . The proof is complete.
D-D Proof of Lemma 18
The Charlier polynomial is defined as follows:
[TABLE]
where is the falling factorial. It satisfies the following generating function relation [32]:
[TABLE]
Substituting by , we have
[TABLE]
Note that we have
[TABLE]
which is well defined even for . If , then may be defined as
[TABLE]
We note that relation (544) is true also when . Indeed, the case reduces to the relation:
[TABLE]
Assuming , replacing with random variable in (544) and taking expectation on both sides, we have
[TABLE]
Note that does not depend on . Hence we know
[TABLE]
Thus, if , we have
[TABLE]
Expanding implies that it is equal to defined in Lemma 18. The estimator being the unique uniformly minimum variance unbiased estimator of follows from the Lehmann–Scheffe Theorem [33, Chap. 2, Thm. 1.11] and the complete sufficiency of in model ([33, Chap. 1, Thm. 6.22]).
Now we proceed to bound the second moment of . It follows from (544) that for any ,
[TABLE]
which implies that
[TABLE]
It follows from coefficient matching that
[TABLE]
which simplifies to
[TABLE]
Now assume . Taking and dividing both sides by , we have
[TABLE]
The Charlier polynomials are orthogonal with respect to the Poisson measure. Concretely, for [32],
[TABLE]
For , we have
[TABLE]
which is also true for .
Applying the orthogonal property of Charlier polynomials and assuming , we have
[TABLE]
where stands for the Laguerre polynomial with order , which is defined as:
[TABLE]
If we further assume , we have
[TABLE]
D-E Proof of Lemma 19
It follows from the fact that for [21, Ex. 2.8] that is unbiased for estimating . It follows from the Lehmann–Scheffe Theorem [33, Chap. 2, Thm. 1.11] and the complete sufficiency of ([33, Chap. 1, Thm. 6.22]) that is the unique uniformly minimum variance unbiased estimator for .
We now work out a different form of . It follows from the binomial theorem that for any fixed ,
[TABLE]
Clearly, the following estimator is also unbiased for estimating :
[TABLE]
where and are the unique uniformly minimum variance unbiased estimators for and introduced in Lemma 18, respectively. It follows from the uniqueness of that
[TABLE]
Using and the triangle inequality for the norm , we have
[TABLE]
where we have used the independence of and in the last step.
Define , , and set . Define . It follows from Lemma 18 that
[TABLE]
D-F Proof of Lemma 20
Equation (84) follows from [24, Lemma 9.5.5]. Now we prove the bound on the magnitude of . Note that the moment generating function of is given by
[TABLE]
Written as a formal power series in , the previous identity becomes
[TABLE]
Hence, by comparing the coefficients of on both sides, we obtain
[TABLE]
Moreover,
[TABLE]
Then,
[TABLE]
Since , we have
[TABLE]
Now we consider the maximization problem . It follows from taking derivatives that this function attains its unique maximum at the point which satisfies the following:
[TABLE]
Recalling that the Lambert function is defined over by the equation , we know that
[TABLE]
The following upper bound on was proved in [34]: for any ,
[TABLE]
where we have used the fact that .
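For readers who wish to evaluate the Lambert function numerically, the following minimal Python sketch computes the principal branch $W(x)$ for $x \ge 0$ as the nonnegative solution of $w e^{w} = x$ using Newton's method. The function name `lambert_w`, the starting point $\ln(1+x)$, and the stopping rule are choices made for this illustration; the sketch does not implement the specific bound from [34].

```python
import math


def lambert_w(x, tol=1e-14, max_iter=100):
    """Principal branch W(x) for x >= 0: the solution w >= 0 of w * exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting point; (1 + x) ln(1 + x) >= x, so w >= W(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        # Newton step for f(w) = w e^w - x, with f'(w) = e^w (w + 1)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w
```

Since $W(x) \ge 1$ for $x \ge e$, the defining equation gives $e^{W(x)} = x / W(x) \le x$, so the elementary bound $W(x) \le \ln x$ holds on that range and can serve as a quick sanity check on the output.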
Hence, for any ,
[TABLE]
which also holds for since .
D-G Proof of Lemma 21
It is clear that when , the statement is true. It suffices to consider the case of . Introduce the function as follows:
[TABLE]
It is evident that and
[TABLE]
We have
[TABLE]
Since the function is continuously differentiable on , we have
[TABLE]
where we applied Lemma 22 in the last step. Hence,
[TABLE]
where in the last step we used the assumption that . Consequently,
[TABLE]
where we have used the fact that for any ,
[TABLE]
D-H Proof of Lemma 24
The following upper bound is straightforward:
[TABLE]
Regarding the other upper bound and the lower bound, we utilize the exact analytic expression [35] for for . It follows from [35] that for random variable ,
[TABLE]
where $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$.
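The exact expression can be checked against direct summation. The sketch below assumes that the expression from [35] is De Moivre's closed form for the binomial mean absolute deviation, $\mathbb{E}|X - np| = 2 \nu \binom{n}{\nu} p^{\nu} (1-p)^{n-\nu+1}$ with $\nu = \lfloor np \rfloor + 1$ and $X \sim \mathsf{B}(n, p)$. That identification and the function names are assumptions of this illustration. Exact rational arithmetic via `fractions.Fraction` avoids rounding issues when $np$ is an integer.

```python
import math
from fractions import Fraction


def binomial_mad_bruteforce(n, p):
    """E|X - n p| for X ~ Binomial(n, p) by direct summation over the support."""
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k) * abs(k - n * p)
        for k in range(n + 1)
    )


def binomial_mad_closed_form(n, p):
    """Assumed closed form: 2 nu C(n, nu) p^nu (1 - p)^(n - nu + 1), nu = floor(n p) + 1."""
    nu = math.floor(n * p) + 1
    # math.comb(n, nu) is 0 when nu > n, which covers the degenerate case p = 1.
    return 2 * nu * math.comb(n, nu) * p ** nu * (1 - p) ** (n - nu + 1)
```

Passing `p` as a `Fraction`, for example `Fraction(3, 10)`, makes both functions return exact rationals that can be compared with `==`. Dividing either value by `n` gives $\mathbb{E}|X/n - p|$, which by the Cauchy–Schwarz inequality is at most $\sqrt{p(1-p)/n}$.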
When , we have
[TABLE]
which implies that if ,
[TABLE]
Regarding the final lower bound, it suffices to show that for , we have
[TABLE]
Hence, it suffices to show
[TABLE]
for all . It is equivalent to
[TABLE]
for for all the integers .
Since the function is monotonically increasing for , and monotonically decreasing for , it suffices to consider integers . Hence, it suffices to show for any integer ,
[TABLE]
which is equivalent to
[TABLE]
It follows from [36] that for any positive integer ,
[TABLE]
which implies (624) since for all positive integers.
D-I Proof of Lemma 25
We first assume . Applying the relation
[TABLE]
where , we have
[TABLE]
Construct random variable such that is on the same probability space as , with the relationship , where is independent of and . Hence, with probability one. We have
[TABLE]
where we applied Lemma 24 in the last step. The case of can be proved analogously.
Regarding the lower bound, we have
[TABLE]
where we lower bound via taking and using Lemma 24.
D-J Proof of Lemma 26
For ,
[TABLE]
where we used the fact that .
References
- [1] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
- [2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley, 2006.
- [3] E. L. Lehmann and J. P. Romano, Testing statistical hypotheses. Springer, 2005.
- [4] L. Devroye, L. Györfi, and G. Lugosi, “A probabilistic theory of pattern recognition,” 1996.
- [5] G. Valiant and P. Valiant, “The power of linear estimators,” in Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on. IEEE, 2011, pp. 403–412.
- [6] P. Valiant and G. Valiant, “Estimating the unseen: improved estimators for entropy and other properties,” in Advances in Neural Information Processing Systems, 2013, pp. 2157–2165.
- [7] J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distributions,” Information Theory, IEEE Transactions on, vol. 61, no. 5, pp. 2835–2885, 2015.
- [8] Y. Wu and P. Yang, “Minimax rates of entropy estimation on large alphabets via best polynomial approximation,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3702–3720, 2016.
