Minimax Estimation of the $L_1$ Distance

Jiantao Jiao; Yanjun Han; Tsachy Weissman

arXiv:1705.00807·math.ST·June 26, 2018

TL;DR

This paper develops minimax optimal estimators for the $L_1$ distance between two discrete probability measures, achieving near-optimal performance with fewer samples, and reveals the effective sample size enlargement phenomenon.

Contribution

It introduces new techniques for constructing minimax rate-optimal estimators for $L_1$ distance, extending previous approximation-based methods and analyzing both known and unknown $Q$ scenarios.

Findings

01

Minimax estimators achieve performance comparable to MLE with fewer samples.

02

The uniform distribution case is the hardest for estimation.

03

Effective sample size enlargement phenomenon is confirmed in both known and unknown $Q$ cases.

Abstract

We consider the problem of estimating the $L_{1}$ distance between two discrete probability measures $P$ and $Q$ from empirical data in a nonasymptotic and large alphabet setting. When $Q$ is known and one obtains $n$ samples from $P$, we show that for every $Q$, the minimax rate-optimal estimator with $n$ samples achieves performance comparable to that of the maximum likelihood estimator (MLE) with $n\ln n$ samples. When both $P$ and $Q$ are unknown, we construct minimax rate-optimal estimators whose worst case performance is essentially that of the known $Q$ case with $Q$ being uniform, implying that $Q$ being uniform is essentially the most difficult case. The effective sample size enlargement phenomenon, identified in Jiao et al. (2015), holds both in the known $Q$ case for every $Q$ and the $Q$ unknown case. However, the construction of optimal estimators for…

Equations

$$\|P-Q\|_{1}\triangleq\sum_{i=1}^{S}|p_{i}-q_{i}|.$$

$$R(P,Q;\hat{L})\triangleq\mathbb{E}\left|\hat{L}(X^{n},Y^{n})-\|P-Q\|_{1}\right|^{2},$$

$$R_{\text{maximum}}(\mathcal{P},\mathcal{Q};\hat{L})$$

$$R_{\text{minimax}}(\mathcal{P},\mathcal{Q})$$

$$L^{*}=\frac{1}{2}-\frac{1}{4}\|P_{X|Y=1}-P_{X|Y=0}\|_{1},$$

$$\mathbf{X}=[X_{1},X_{2},\ldots,X_{S}],$$

$$\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\|P_{n}-Q\|_{1}-\|P-Q\|_{1}\right|^{2}\leq 4\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n}}\right)^{2}+\frac{1}{n}.$$

$$\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\|P_{n}-Q\|_{1}-\|P-Q\|_{1}\right|^{2}\geq\frac{1}{2}\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n}}\right)^{2}.$$

$$\sup_{P,Q\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\|P_{n}-Q\|_{1}-\|P-Q\|_{1}\right|^{2}\asymp\frac{S}{n}.$$

$$U(q;c_{1})$$

$$U_{1}(q)$$

$$\mathbb{P}(\hat{q}\notin U(q;c_{1}))\leq\frac{2}{n^{c_{1}/3}},$$

$$\mathbb{P}(E^{c})\leq\frac{3S}{n^{\beta}},$$

$$\beta=\min\left\{\frac{c_{3}^{2}}{3c_{1}},\frac{(c_{1}-c_{3})^{2}}{4c_{1}},\frac{(c_{1}-c_{3})^{2}}{3}\right\}.$$

$$P_{K}(x;q)=\arg\min_{P\in\mathsf{poly}_{K}}\max_{z\in U(q;c_{1})}|f(z,q)-P(z)|$$

$$\tilde{L}_{1}$$

$$+(q_{i}-\hat{p}_{i,2})\mathbbm{1}(\hat{p}_{i,1}<U_{1}(q_{i}))$$

$$+\tilde{P}_{K}(\hat{p}_{i,2};q_{i})\mathbbm{1}(\hat{p}_{i,1}\in U_{1}(q_{i}))]$$

$$\hat{L}^{(1)}=0\vee(\tilde{L}_{1}\wedge 2),$$

$$\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}^{(1)}-\|P-Q\|_{1}\right|^{2}\lesssim_{c,C}\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n\ln n}}\right)^{2}.$$

$$\sup_{P,Q\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}^{(1)}-\|P-Q\|_{1}\right|^{2}\lesssim_{C}\frac{S}{n\ln n}.$$

$$\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}\gtrsim_{C}\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n\ln n}}\right)^{2},$$

$$\sup_{Q\in\mathcal{M}_{S}}\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}\gtrsim_{c,C}\frac{S}{n\ln n}.$$

$$\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}$$

$$\sup_{Q\in\mathcal{M}_{S}}\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}$$

$$\sup_{P,Q\in\mathcal{M}_{S}}\mathbb{E}\left|\|P_{n}-Q_{n}\|_{1}-\|P-Q\|_{1}\right|^{2}\asymp\frac{S}{n}.$$

$$p\in[0,1],q\in[0,1]\Bigg\}$$

$$U\supset\cup_{x\in[0,1]}U(x;c_{1})\times U(x;c_{1}),$$

Full text

Minimax Estimation of the $L_{1}$ Distance

Jiantao Jiao, Yanjun Han, and Tsachy Weissman

Jiantao Jiao, Yanjun Han, and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University, CA, USA. Email: {jiantao, yjhan, tsachy}@stanford.edu. This work was supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. The material in this paper was presented in part at the 2016 IEEE International Symposium on Information Theory, Barcelona, Spain.

Abstract

We consider the problem of estimating the $L_{1}$ distance between two discrete probability measures $P$ and $Q$ from empirical data in a nonasymptotic and large alphabet setting. When $Q$ is known and one obtains $n$ samples from $P$ , we show that for every $Q$ , the minimax rate-optimal estimator with $n$ samples achieves performance comparable to that of the maximum likelihood estimator (MLE) with $n\ln n$ samples. When both $P$ and $Q$ are unknown, we construct minimax rate-optimal estimators whose worst case performance is essentially that of the known $Q$ case with $Q$ being uniform, implying that $Q$ being uniform is essentially the most difficult case. The effective sample size enlargement phenomenon, identified in Jiao et al. (2015), holds both in the known $Q$ case for every $Q$ and the $Q$ unknown case. However, the construction of optimal estimators for $\|P-Q\|_{1}$ requires new techniques and insights beyond the approximation-based method of functional estimation in Jiao et al. (2015).

Index Terms:

Divergence estimation, total variation distance, multivariate approximation theory, functional estimation, optimal classification error, high-dimensional statistics

I Introduction

I-A Problem formulation

Statistical functionals are usually used to quantify the fundamental limits of data processing tasks such as data compression (e.g. Shannon entropy [1]), data transmission (e.g. mutual information [1]), estimation and testing (e.g. Kullback–Leibler divergence [2, Thm. 11.8.3], $L_{1}$ distance [3, Chap. 13]), etc. They measure the difficulties of the corresponding data processing tasks and provide benchmarks for constructive algorithms. In this sense, it is of great value to obtain accurate estimates of these functionals in various problems.

In this paper, we consider estimating the $L_{1}$ distance between two discrete distributions $P=(p_{1},p_{2},\ldots,p_{S}),Q=(q_{1},q_{2},\ldots,q_{S})$ , which is defined as:

$$\|P-Q\|_{1}\triangleq\sum_{i=1}^{S}|p_{i}-q_{i}|.$$

Throughout we use the squared error loss, i.e., the risk function for an estimator $\hat{L}$ is defined as

$$R(P,Q;\hat{L})\triangleq\mathbb{E}\left|\hat{L}(X^{n},Y^{n})-\|P-Q\|_{1}\right|^{2},$$

where $(X_{i},Y_{i})\stackrel{{\scriptstyle\text{i.i.d.}}}{{\sim}}P\times Q$ . The maximum risk of an estimator $\hat{L}$ , and the minimax risk in estimating $\|P-Q\|_{1}$ are defined as

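$$R_{\text{maximum}}(\mathcal{P},\mathcal{Q};\hat{L})\triangleq\sup_{P\in\mathcal{P},Q\in\mathcal{Q}}R(P,Q;\hat{L}),$$

$$R_{\text{minimax}}(\mathcal{P},\mathcal{Q})\triangleq\inf_{\hat{L}}\sup_{P\in\mathcal{P},Q\in\mathcal{Q}}R(P,Q;\hat{L}),$$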

respectively, where $\mathcal{P},\mathcal{Q}$ are given collections (uncertainty sets) of probability measures $P$ and $Q$ , respectively, and the infimum is taken over all estimators $\hat{L}$ that are functions of the empirical observations.

The $L_{1}$ distance is closely related to the Bayes error, i.e., the fundamental limit, in classification problems. Specifically, for a two-class classification problem, if the prior probabilities for each class are equal, then the minimum probability of error achieved using the optimal classifier is given by

$$L^{*}=\frac{1}{2}-\frac{1}{4}\|P_{X|Y=1}-P_{X|Y=0}\|_{1},$$

where $Y\in\{0,1\}$ indicates the class, and $P_{X|Y}$ are the class-conditional distributions. Hence, the problem of estimating $L^{*}$ in this classification problem is reduced to estimating the $L_{1}$ distance between the two class-conditional distributions $P_{X|Y=1},P_{X|Y=0}$ from the empirical data. In the statistical learning theory literature, most work on Bayes classification error estimation deals with the case that $P_{X|Y=1}$ and $P_{X|Y=0}$ are continuous distributions, and it turns out that it is very difficult to estimate this quantity in the general continuous case. Indeed, we know from [4, Section 8.5] the negative result that for every sample size $n$ , any estimate of the Bayes error $\hat{L}_{n}$ , and any $\epsilon>0$ , there exist some class-conditional distributions such that $\mathbb{E}|\hat{L}_{n}-L^{*}|\geq\frac{1}{4}-\epsilon$ .

This negative result shows that one needs to look at special classes of the class-conditional distributions in order to obtain meaningful and consistent estimates. In the discrete setting, the seminal work of Valiant and Valiant [5] deserves special mention. They constructed an estimator for $\|P-Q\|_{1}$ and showed that when $S/\ln S\lesssim n\lesssim S$, it achieves $L_{1}$ error $\sqrt{S/(n\ln n)}$, and it takes at least $n\gg\frac{S}{\ln S}$ samples to achieve consistent estimation of $\|P-Q\|_{1}$. Valiant and Valiant [6] constructed another estimator of $\|P-Q\|_{1}$ using linear programming which achieves the $L_{1}$ error $\sqrt{\frac{S}{n\ln n}}$ when $n\asymp\frac{S}{\ln S}$. We argue in this paper that the simplest estimator for $\|P-Q\|_{1}$, namely plugging in the empirical distributions $P_{n},Q_{n}$ and obtaining $\|P_{n}-Q_{n}\|_{1}$, achieves $L_{1}$ error rate $\sqrt{S/n}$ for $n\gtrsim S$. In this sense, the optimal estimator seems to enlarge the sample size $n$ to $n\ln n$ in the error rate expression. This phenomenon was termed the effective sample size enlargement in [7].
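The plug-in rule and its link to the Bayes error can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the counts are normalised by their observed totals (a multinomial-style convention assumed here for simplicity, whereas the paper's analysis uses Poisson sampling with $\hat{p}_{i}=X_{i}/n$).

```python
def l1_distance(p, q):
    """Exact L1 distance sum_i |p_i - q_i| between two probability vectors."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def plugin_l1(x_counts, y_counts):
    """Plug-in estimate ||P_n - Q_n||_1 from per-symbol counts.

    Assumption of this sketch: counts are normalised by their observed totals.
    """
    n_x, n_y = sum(x_counts), sum(y_counts)
    p_hat = [x / n_x for x in x_counts]
    q_hat = [y / n_y for y in y_counts]
    return l1_distance(p_hat, q_hat)


def bayes_error_equal_priors(l1):
    """Bayes error L* = 1/2 - (1/4) * L1 for two equiprobable classes."""
    return 0.5 - 0.25 * l1
```

For example, `plugin_l1([3, 1], [1, 3])` compares the empirical distributions (0.75, 0.25) and (0.25, 0.75), and the result can be passed to `bayes_error_equal_priors` to obtain the corresponding plug-in estimate of $L^{*}$.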

I-B Approximation-based method

We emphasize that the observed effective sample size enlargement here is another manifestation of the recently discovered phenomenon in functional estimation of high dimensional objects. There has been a recent wave of study on functional estimation of high dimensional parameters [6, 7, 8, 9], and it was shown in Jiao et al. [7] that for a wide class of functional estimation problems (including Shannon entropy $H(P)=\sum_{i=1}^{S}-p_{i}\ln p_{i}$ , $F_{\alpha}\triangleq\sum_{i=1}^{S}p_{i}^{\alpha}$ , and mutual information), there exists a general approximation-based method that can be applied to design minimax rate-optimal estimators whose performance with $n$ samples is essentially that of the MLE (maximum likelihood estimator, or the plug-in estimator) with $n\ln n$ samples.

The general approximation-based method in [7] is as follows. Consider estimating $G(\theta)$ of a parameter $\theta\in\Theta\subset\mathbb{R}^{p}$ for an experiment $\{P_{\theta}:\theta\in\Theta\}$, with a consistent estimator $\hat{\theta}_{n}$ for $\theta$, where $n$ is the number of observations. Suppose the functional $G(\theta)$ is analytic everywhere except at $\theta\in\Theta_{0}$. (A function $f$ is analytic at a point $x_{0}$ if and only if its Taylor series about $x_{0}$ converges to $f$ in some neighborhood of $x_{0}$.) A natural estimator for $G(\theta)$ is $G(\hat{\theta}_{n})$. In the estimation of functionals of discrete distributions, $\Theta$ is the $S$-dimensional probability simplex, and a natural candidate for $\hat{\theta}_{n}$ is the empirical distribution, which is unbiased for any $\theta\in\Theta$.

We propose to conduct the following two-step procedure in estimating $G(\theta)$ .

1. Classify the Regime: Compute $\hat{\theta}_{n}$, and declare that we are in the “non-smooth” regime if $\hat{\theta}_{n}$ is “close” enough to $\Theta_{0}$. Otherwise declare we are in the “smooth” regime;

2. Estimate:

• If $\hat{\theta}_{n}$ falls in the “smooth” regime, use an estimator “similar” to $G(\hat{\theta}_{n})$ to estimate $G(\theta)$;

• If $\hat{\theta}_{n}$ falls in the “non-smooth” regime, replace the functional $G(\theta)$ in the “non-smooth” regime by an approximation $G_{\text{appr}}(\theta)$ (another functional) which can be estimated without bias, then apply an unbiased estimator for the functional $G_{\text{appr}}(\theta)$.

Approaches of this nature appeared before [7] in Lepski, Nemirovski, and Spokoiny [10], Cai and Low [11], Vinck et al. [12], Valiant and Valiant [5]. It was developed independently for entropy estimation by Wu and Yang [8], and the ideas proved to be very fruitful in Acharya et al. [9], Wu and Yang [13], Orlitsky, Suresh, and Wu [14], Wu and Yang [15]. However, we emphasize that in all the examples above except for the $L_{1}$ distance estimator in Valiant and Valiant [5], the functionals considered all take the form $G(\sum_{i=1}^{p}f(\theta_{i}))$ or $G(\int f(p(x))dx)$ , where $p(x)$ is a univariate density or function, and each $\theta_{i}\in\mathbb{R}$ . In particular, the functions $f(\cdot)$ considered are everywhere analytic except at zero, e.g., $x^{\alpha},|x|^{\alpha}$ for $\alpha>0$ and $x\ln x$ . Most of these features are violated in the $L_{1}$ distance estimation problem. If we write $\|P-Q\|_{1}=\sum_{i=1}^{S}f(p_{i},q_{i})$ with $f(x,y)=|x-y|\in C([0,1]^{2})$ , then we have:

1. a bivariate function $f(x,y)$ in the sum;

2. a function $f(x,y)$ which is analytic except on a segment $x=y\in[0,1]$.

As discussed in Jiao et al. [7], approximation of multivariate functions is much more involved than that of univariate functions, and the fact that the “non-smooth” regime is around a line segment here makes the application of the approximation-based method quite difficult: what shape should we use to specify the “non-smooth” regime? We provide a comprehensive answer to this problem in this paper, thereby substantially generalizing the applicability of the approximation-based method and demonstrating the intricacy of functional estimation problems in high dimensions. Our recent work [16] presents the most up-to-date version of the general approximation-based method, which is applied to construct minimax rate-optimal estimators for the KL divergence (also see Bu et al. [17]), $\chi^{2}$-divergence, and the squared Hellinger distance. The effective sample size enlargement phenomenon holds in all these cases as well.

We emphasize that the complications triggered by the bivariate function $f(x,y)=|x-y|$ make the $L_{1}$ distance estimation problem highly challenging. Indeed, prior to our work, the only known estimators that require sublinear samples were in [5, 6], which achieved $L_{1}$ error $\sqrt{\frac{S}{n\ln n}}$ in the regime of $\frac{S}{\ln S}\lesssim n\lesssim S$ but not the regime $n\gg S$ , and the lower bound was proved for the regime $n\asymp\frac{S}{\ln S}$ , i.e., when the optimal error is a constant. The complete characterization of the minimax rates and the estimator that achieves the minimax rates were unknown prior to this work.

Our main contributions in this paper are the following:

1. We apply the approximation-based method to construct minimax rate-optimal estimators with computational complexity $O(n\ln n)$ for $\|P-Q\|_{1}$ when $Q$ is known, and show that for any fixed $Q$, our estimator performs with $n$ samples at least as well as the plug-in estimator with $n\ln n$ samples. Precisely, the performance of the plug-in estimator for any fixed $Q$ is dictated by the functional $\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n}}$, while that of the minimax rate-optimal estimator is dictated by the functional $\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n\ln n}}$. Furthermore, we show that plug-in estimators cannot match the performance of our algorithm. As we argue in Lemma 8, for estimating $\|P-Q\|_{1}$ with known $Q$, for any distribution estimate $\hat{P}$ constructed from the samples from $P$, the estimator $\|\hat{P}-Q\|_{1}$ does not achieve the minimax rates in the worst case if $\hat{P}$ does not depend on $Q$. Concretely, any such plug-in rule performs essentially like the MLE in the worst case.

2. We generalize the approximation-based method in [7] to construct a minimax rate-optimal estimator for $\|P-Q\|_{1}$ when both $P$ and $Q$ are unknown with computational complexity $O(n\ln^{2}n)$. We illustrate the novelty of our scheme via the following results:

(a) The performance of our estimator with $n$ samples is essentially that of the MLE with $n\ln n$ samples.

(b) Any algorithm that only conducts approximation around the origin does not achieve the minimax rates. Indeed, as we argue in Lemma 5, any algorithm that employs the MLE when $\hat{p}\gtrsim\frac{\ln n}{n},\hat{q}\gtrsim\frac{\ln n}{n}$ cannot achieve the minimax rates when $n\gg S$. The reason why the estimator of Valiant and Valiant [5] cannot achieve the minimax rates when $n\gg S$ is that [5] did not conduct approximation when $p$ and $q$ are large. One of our key contributions is to figure out how to conduct approximation when $\hat{p}\gtrsim\frac{\ln n}{n},\hat{q}\gtrsim\frac{\ln n}{n}$ and achieve the minimax rates when $n\gg S$.

(c) Best polynomial approximation is not sufficient for achieving minimax rate-optimality in this problem. As we argue in Lemma 6, any one-dimensional polynomial that achieves the best approximation error rate cannot be used in constructing the optimal estimator, and it is necessary to use a multivariate polynomial with certain pointwise error guarantees. One of our key contributions is to construct a proper multivariate polynomial with desired pointwise approximation error.

(d) Approximation over the union of the “nonsmooth” regime may not work. As we show in Lemma 7, there does not exist a single multivariate polynomial that achieves the desired approximation error over the whole “nonsmooth” regime. Instead, in our approach, we construct polynomial approximations of the function $f(p,q)=|p-q|$ over a random regime that is determined by empirical data. To our knowledge, this is the first time that a random approximation regime approach appears in the functional estimation literature.

(e) Our estimator is agnostic to the potentially unknown support size $S$, but behaves as well as the minimax rate-optimal estimator that knows the support size $S$.

The rest of the paper is organized as follows. In Sections II and III, we present a thorough performance analysis of the MLE and explicitly construct the minimax rate-optimal estimators, where Section II covers the known $Q$ case and Section III generalizes to the case of unknown $Q$. Discussions in Section IV highlight the significance and novelty of our approaches by reviewing several other approaches which are shown to be suboptimal. Section V presents the experimental results comparing our schemes with existing approaches. The auxiliary lemmas used throughout this paper are collected in Appendix A. Appendix B contains proofs of the main theorems. Proofs of all the lemmas in the main text, and of those used in the proofs of the main theorems, can be found in Appendix C, while proofs of all the auxiliary lemmas are collected in Appendix D.

Notation: for non-negative sequences $a_{\gamma},b_{\gamma}$ , we use the notation $a_{\gamma}\lesssim_{\alpha}b_{\gamma}$ to denote that there exists a constant $C$ that only depends on $\alpha$ such that $\sup_{\gamma}\frac{a_{\gamma}}{b_{\gamma}}\leq C$ , and $a_{\gamma}\gtrsim b_{\gamma}$ is equivalent to $b_{\gamma}\lesssim a_{\gamma}$ . When the constant $C$ is universal we do not write subscripts for $\lesssim$ and $\gtrsim$ . Notation $a_{\gamma}\asymp b_{\gamma}$ is equivalent to $a_{\gamma}\lesssim b_{\gamma}$ and $b_{\gamma}\lesssim a_{\gamma}$ . Notation $a_{\gamma}\gg b_{\gamma}$ means that $\liminf_{\gamma}\frac{a_{\gamma}}{b_{\gamma}}=\infty$ , and $a_{\gamma}\ll b_{\gamma}$ is equivalent to $b_{\gamma}\gg a_{\gamma}$ . We write $a\wedge b=\min\{a,b\}$ and $a\vee b=\max\{a,b\}$ . Moreover, $\mathsf{poly}_{n}^{d}$ denotes the set of all $d$ -variate polynomials of degree of each variable no more than $n$ , and $E_{n}[f;I]$ denotes the distance of the function $f$ to the space $\mathsf{poly}_{n}^{d}$ in the uniform norm $\|\cdot\|_{\infty,I}$ on $I\subset\mathbb{R}^{d}$ . The space $\mathsf{poly}_{n}^{1}$ is also abbreviated as $\mathsf{poly}_{n}$ . All logarithms are in the natural base. The notation $x\geq\mathcal{Y}$ , where $x$ is a real number and $\mathcal{Y}$ is a set of real numbers, is equivalent to $x\geq y$ for all $y\in\mathcal{Y}$ .

Throughout this paper, we utilize the Poisson sampling model instead of the binomial model. The minimax risks under the two models can be shown to be closely related, as in [7, Lemma 16].

II Divergence Estimation with Known $Q$

First we consider the case where $Q=(q_{1},\cdots,q_{S})$ is known while $P$ is an unknown distribution with support $\mathcal{S}=\{1,2,\cdots,S\}$ . In other words, $\mathcal{P}=\mathcal{M}_{S}$ and $\mathcal{Q}=\{Q\}$ . We analyze the performance of the MLE in this case, and construct the approximation-based minimax rate-optimal estimator.

We utilize the Poisson sampling model, in which we observe a Poisson random vector

$$\mathbf{X}=[X_{1},X_{2},\ldots,X_{S}],$$

where the coordinates of $\mathbf{X}$ are mutually independent, and $X_{i}\sim\mathsf{Poi}(np_{i})$ . We define $\hat{p}_{i}=\frac{X_{i}}{n}$ as the empirical probabilities.
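To make the sampling model concrete, the following sketch draws independent counts $X_{i}\sim\mathsf{Poi}(np_{i})$ and forms $\hat{p}_{i}=X_{i}/n$. It is an illustration only: the helper names are hypothetical, and the Poisson sampler uses Knuth's multiplication method, which is an implementation choice of this sketch and is only suitable for moderate means.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Suitable for moderate lam only (exp(-lam) underflows for very large lam).
    """
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def poisson_counts(p, n, seed=0):
    """Independent X_i ~ Poi(n * p_i) and empirical probabilities X_i / n."""
    rng = random.Random(seed)
    counts = [sample_poisson(n * pi, rng) for pi in p]
    p_hat = [x / n for x in counts]
    return counts, p_hat
```

Note that under this model the total number of observations $\sum_{i}X_{i}$ is itself random, so the empirical probabilities returned by `poisson_counts` need not sum to one.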

II-A Performance of the MLE

The MLE serves as a natural estimator for the $L_{1}$ distance which can be expressed as $\|P_{n}-Q\|_{1}=\sum_{i=1}^{S}|\hat{p}_{i}-q_{i}|$, where $P_{n}=\mathbf{X}/n=(\hat{p}_{1},\hat{p}_{2},\cdots,\hat{p}_{S})$ is the empirical distribution. Since we are using the Poisson sampling model, we have $n\hat{p}_{i}\sim\mathsf{Poi}(np_{i})$.

We obtain the upper and lower bounds for the mean squared error of $\|P_{n}-Q\|_{1}$ in the following theorem.

Theorem 1.

The maximum likelihood estimator $\|P_{n}-Q\|_{1}$ satisfies

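$$\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\|P_{n}-Q\|_{1}-\|P-Q\|_{1}\right|^{2}\leq 4\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n}}\right)^{2}+\frac{1}{n}.$$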

We can also lower bound the worst case mean squared error as

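$$\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\|P_{n}-Q\|_{1}-\|P-Q\|_{1}\right|^{2}\geq\frac{1}{2}\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n}}\right)^{2}.$$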

The following corollary is straightforward since $\sup_{Q\in\mathcal{M}_{S}}\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n}}\asymp\sqrt{\frac{S}{n}}$ when $n\gtrsim S$ .

Corollary 1.

If $n\gtrsim S$ , we have

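$$\sup_{P,Q\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\|P_{n}-Q\|_{1}-\|P-Q\|_{1}\right|^{2}\asymp\frac{S}{n}.$$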

Hence, it is necessary and sufficient for the MLE to have $n\gg S$ samples to be consistent in terms of the worst case mean squared error.
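As a quick numerical illustration of the quantities involved, the sketch below evaluates the functional $\sum_{i=1}^{S}q_{i}\wedge\sqrt{q_{i}/n}$ that governs the MLE, together with its counterpart in which $n$ is replaced by $n\ln n$ (the functional stated in the introduction for the optimal estimator). The function names and the values of `S` and `n` are illustrative choices, not taken from the paper. For uniform $Q$ and $n\geq S$ each term equals $1/\sqrt{Sn}$, so the first functional reduces to $\sqrt{S/n}$.

```python
import math


def mle_functional(q, n):
    """sum_i min(q_i, sqrt(q_i / n)), which governs the plug-in (MLE) error."""
    return sum(min(qi, math.sqrt(qi / n)) for qi in q)


def optimal_functional(q, n):
    """Same functional with n replaced by n * ln(n) (requires n > 1)."""
    return mle_functional(q, n * math.log(n))


# Illustrative values: uniform Q on S symbols with n >= S samples.
S, n = 1000, 5000
uniform = [1.0 / S] * S
print(mle_functional(uniform, n), math.sqrt(S / n))
print(optimal_functional(uniform, n), math.sqrt(S / (n * math.log(n))))
```

Each printed pair should agree up to floating-point rounding, and the second pair is smaller than the first by a factor of $\sqrt{\ln n}$.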

II-B Construction of the optimal estimator

We apply our general recipe to construct the minimax rate-optimal estimator. For simplicity of analysis, we conduct the classical “splitting” operation [18] on the Poisson random vector $\mathbf{X}$, and obtain two independent identically distributed random vectors $\mathbf{X}_{j}=[X_{1,j},X_{2,j},\ldots,X_{S,j}]^{T},j\in\{1,2\}$, such that each component $X_{i,j}$ in $\mathbf{X}_{j}$ has distribution $\mathsf{Poi}(np_{i}/2)$, and all coordinates in $\mathbf{X}_{j}$ are independent. For each coordinate $i$, the splitting process generates a random sequence $\{T_{ik}\}_{k=1}^{X_{i}}$, $T_{ik}\in\{1,2\}$, such that $\{T_{ik}\}_{k=1}^{X_{i}}|\mathbf{X}\sim\mathsf{multinomial}(X_{i};(1/2,1/2))$, and assigns $X_{i,j}=\sum_{k=1}^{X_{i}}\mathbbm{1}(T_{ik}=j)$ for $j\in\{1,2\}$. All the random variables $\{\{T_{ik}\}_{k=1}^{X_{i}}:1\leq i\leq S\}$ are conditionally independent given our observation $\mathbf{X}$. The “split” empirical probabilities are defined as $\hat{p}_{i,j}=X_{i,j}/(n/2)$. To simplify notation, we redefine $n/2$ as $n$ to ensure that $n\hat{p}_{i,j}\sim\mathsf{Poi}(np_{i}),j=1,2$. We emphasize that the sample splitting approach is not conducted in the implementation of the estimator.
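The splitting operation can be sketched as follows. This is a minimal illustration of the analysis device only (the text notes that splitting is not used in the implemented estimator); the function names are hypothetical. Each of the $X_{i}$ observations of symbol $i$ is assigned to one of two halves by a fair coin flip, which is the standard Poisson thinning construction.

```python
import random


def split_counts(counts, seed=0):
    """Split each count X_i into (X_{i,1}, X_{i,2}) by fair coin flips.

    Given X_i, X_{i,1} ~ Binomial(X_i, 1/2); if X_i ~ Poi(n p_i) then the two
    halves are independent Poi(n p_i / 2) variables (Poisson thinning).
    """
    rng = random.Random(seed)
    first, second = [], []
    for x in counts:
        x1 = sum(1 for _ in range(x) if rng.random() < 0.5)
        first.append(x1)
        second.append(x - x1)
    return first, second


def split_empirical(counts, n, seed=0):
    """Empirical probabilities of each half, normalised by n / 2."""
    first, second = split_counts(counts, seed)
    half = n / 2
    return [x / half for x in first], [x / half for x in second]
```

The two halves always add back up to the original counts, so no observation is discarded; the first half is later used to classify the regime and the second half to estimate.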

We construct two set functions with variable $q$ as input defined as:

[TABLE]

Here $c_{1}>0,c_{1}>c_{3}>0$ are constants that will be determined later. The set $U(q;c_{1})$ is constructed to satisfy the following property:

Lemma 1.

Suppose $n\hat{q}\sim\mathsf{Poi}(nq)$ . Then,

$$\mathbb{P}(\hat{q}\notin U(q;c_{1}))\leq\frac{2}{n^{c_{1}/3}},$$

where the set function $U(q;c_{1})$ is defined in (10).

It is clear that for any $q\in[0,1],U_{1}(q)\subset U(q;c_{1})$ . The constants $c_{1}>0,c_{1}>c_{3}>0$ will be chosen later to make sure that the following three “good” events have overwhelming probability:

[TABLE]

Here $A\Rightarrow B$ represents the logical implication operation that is equivalent to $A^{c}\cup B$. The intuitions behind the constructions of these “good” events are as follows. Since we use the first half of the samples $\hat{p}_{i,1}$ to classify the regime, and would later use three different estimators depending on whether $\hat{p}_{i,1}$ lies to the left, to the right, or inside $U_{1}(q_{i})$, it is desirable that we can infer the relationship between $p_{i}$ and $q_{i}$ based on the location of $\hat{p}_{i,1}$. The reason why these events can be controlled to have high probabilities is that we have specifically designed $U_{1}(q)$ to make it a strict subset of the set $U(q;c_{1})$, and the sets $U(q;c_{1})$ are designed to satisfy Lemma 1, which ensures that the size of $U(q;c_{1})$ is essentially the length of the confidence interval when the empirical probability $\hat{q}$ is observed.

We have the following lemma controlling the probability of these events.

Lemma 2.

Denote the overall “good” event $E=E_{1}\cap E_{2}\cap E_{3}$ , where $E_{1},E_{2},E_{3}$ are defined in (13),(14),(15). Then,

$$\mathbb{P}(E^{c})\leq\frac{3S}{n^{\beta}},$$

where

[TABLE]

Now we construct the estimator. In the “smooth” regime, i.e., $\hat{p}\notin U_{1}(q)$ , we simply employ the plug-in estimator to estimate $f(p,q)$ . In the “non-smooth” regime, i.e., $\hat{p}\in U_{1}(q)$ , we need to approximate $f(p,q)$ by another functional which can be estimated without bias. We consider the best polynomial approximation of $f(x,q)$ on $U(q;c_{1})\supset U_{1}(q)$ , which is defined as

$$P_{K}(x;q)=\arg\min_{P\in\mathsf{poly}_{K}}\max_{z\in U(q;c_{1})}|f(z,q)-P(z)|,$$

where $\mathsf{poly}_{K}$ denotes the set of polynomials with degree no more than $K$ . Once we obtain $P_{K}(x;q)$ , we can use an unbiased estimate $\tilde{P}_{K}(\hat{p};q)$ such that $\mathbb{E}\tilde{P}_{K}(\hat{p};q)=P_{K}(p;q)$ for $n\hat{p}\sim\mathsf{Poi}(np)$ . As a result, the absolute value of the bias of the estimator $\tilde{P}_{K}(\hat{p};q)$ in the “non-smooth” regime is exactly the approximation error of $P_{K}(x;q)$ in approximating $f(x,q)=|x-q|$ on $U(q;c_{1})$ , which can be significantly smaller than that of the MLE.
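The unbiased estimate of a polynomial under Poisson sampling can be sketched as below. The construction relies on the standard factorial-moment identity $\mathbb{E}[X(X-1)\cdots(X-k+1)]=(np)^{k}$ for $X\sim\mathsf{Poi}(np)$; whether the paper builds $\tilde{P}_{K}$ in exactly this way is not stated in this section, so treat it as an assumption. The function names are hypothetical, and the coefficients passed in are arbitrary rather than those of a best polynomial approximation.

```python
import math


def falling_factorial(x, k):
    """x (x-1) ... (x-k+1), with the empty product equal to 1 for k = 0."""
    result = 1
    for j in range(k):
        result *= x - j
    return result


def unbiased_poly_estimate(coeffs, x, n):
    """Unbiased estimate of sum_k coeffs[k] * p**k from X = x ~ Poi(n p).

    Uses E[X (X-1) ... (X-k+1)] = (n p)**k for Poisson X.
    """
    return sum(a * falling_factorial(x, k) / n ** k
               for k, a in enumerate(coeffs))


def expected_value(coeffs, p, n, terms=200):
    """Numerically average the estimator over a truncated Poi(n p) pmf."""
    lam = n * p
    pmf = math.exp(-lam)          # P(X = 0)
    total = 0.0
    for x in range(terms):
        total += pmf * unbiased_poly_estimate(coeffs, x, n)
        pmf *= lam / (x + 1)      # P(X = x + 1) from P(X = x)
    return total
```

Comparing `expected_value(coeffs, p, n)` with the polynomial evaluated at `p` gives a direct numerical check of unbiasedness for moderate values of `n * p`.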

Estimator Construction 1.

We use the first half samples to classify regimes and the second half samples for estimation. Denote

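$$\tilde{L}_{1}=\sum_{i=1}^{S}\Big[(\hat{p}_{i,2}-q_{i})\mathbbm{1}(\hat{p}_{i,1}>U_{1}(q_{i}))+(q_{i}-\hat{p}_{i,2})\mathbbm{1}(\hat{p}_{i,1}<U_{1}(q_{i}))+\tilde{P}_{K}(\hat{p}_{i,2};q_{i})\mathbbm{1}(\hat{p}_{i,1}\in U_{1}(q_{i}))\Big]$$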

and define

$$\hat{L}^{(1)}=0\vee(\tilde{L}_{1}\wedge 2),$$

where $U(q_{i};c_{1})$ and $U_{1}(q_{i})$ are given by (10), (11), $K=c_{2}\ln n$ , and $c_{1},c_{2}>0,c_{3}>0$ are properly chosen constants.
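The overall shape of Construction 1 can be sketched as a per-symbol three-way rule followed by clipping to $[0,2]$. This is a hedged illustration rather than the authors' implementation: `interval` and `poly_estimate` are hypothetical callables standing in for the set $U_{1}(q_{i})$ and the unbiased polynomial estimate $\tilde{P}_{K}$, whose exact definitions are not reproduced here, and the right-hand branch is written by symmetry with the left-hand one, which is an assumption of this sketch.

```python
def construction_one(p_hat_1, p_hat_2, q, interval, poly_estimate):
    """Sketch of the three-way rule, clipped to the range [0, 2].

    p_hat_1, p_hat_2: empirical probabilities from the two sample halves.
    interval(q_i) -> (lo, hi): hypothetical stand-in for the set U_1(q_i).
    poly_estimate(p_hat, q_i): hypothetical stand-in for the unbiased
        polynomial estimate used inside the "non-smooth" regime.
    """
    total = 0.0
    for p1, p2, qi in zip(p_hat_1, p_hat_2, q):
        lo, hi = interval(qi)
        if p1 > hi:                 # first half lies to the right of U_1(q_i)
            total += p2 - qi
        elif p1 < lo:               # first half lies to the left of U_1(q_i)
            total += qi - p2
        else:                       # "non-smooth" regime
            total += poly_estimate(p2, qi)
    return max(0.0, min(total, 2.0))
```

The design point worth noting is that the regime is decided from the first half of the samples only, while the value added to the sum uses the second half, so the two steps are independent given the true distribution.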

The performance of this estimator is presented in the following theorem.

Theorem 2.

Suppose there exist two constants $c,C$ such that $c\ln S\leq\ln n\leq C\ln\left(\sum_{i=1}^{S}\sqrt{q_{i}}\wedge q_{i}\sqrt{n\ln n}\right)$. Then, there exist constants $c_{1},c_{2},c_{3}$ depending only on $c,C$ in Construction 1 such that

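$$\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}^{(1)}-\|P-Q\|_{1}\right|^{2}\lesssim_{c,C}\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n\ln n}}\right)^{2}.$$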

In particular, if $\ln n\leq C\ln S$ , we have

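$$\sup_{P,Q\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}^{(1)}-\|P-Q\|_{1}\right|^{2}\lesssim_{C}\frac{S}{n\ln n}.$$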

Remark 1.

When we consider the worst case of $Q$ , Theorem 2 assumes that the sample size cannot be too big ( $\ln n\leq C\ln S$ ). It is obvious that an upper bound on the sample size is needed for the statement to be valid: indeed, if no upper bound on the sample size is imposed then in the asymptotic regime ( $S$ fixed, $n\to\infty$ ) the convergence rate is faster than the parametric rate $\frac{1}{n}$ , which is impossible. However, we are not sure that the current upper bound is tight. The reason why we introduced this upper bound is that it is needed to control the variance of our estimator, but the variance bound we have may not be tight.

Compared to existing literature, the schemes by Valiant and Valiant [5, 6] achieved mean squared error $\frac{S}{n\ln n}$ only in the regime of $\frac{S}{\ln S}\lesssim n\lesssim S$ but not the regime $n\gg S$ . The main reason is that [5, 6] did not conduct approximation when $p\geq\frac{\ln n}{n}$ . As our work shows, the key reason behind whether one should conduct approximation or not is not whether the probability $p$ is close to zero or not, but whether the functional has a non-analytic point or not. As we show in Lemma 5 in Section IV, any approach that only conducts approximation when $p$ is small cannot achieve the minimax rates for $n\gg S$ in general.

II-C Minimax lower bound

It was shown in Valiant and Valiant [5] that if $Q$ is the uniform distribution, when $n\asymp\frac{S}{\ln S}$ , the minimax risk of estimating $\|P-Q\|_{1}$ is a constant. We prove a minimax lower bound for every $Q$ , and show that the performance achieved by our estimator in Theorem 2 is minimax rate-optimal for every fixed $Q$ .

Theorem 3.

Suppose there exists a constant $C>0$ such that $\ln n\geq C\ln S,S\geq 2$ . Then, there exists a constant $C^{\prime}>0$ that only depends on $C$ such that if $\sum_{j=1}^{S}q_{j}\wedge\sqrt{\frac{q_{j}}{n\ln n}}\geq C^{\prime}\left(\sqrt{\frac{\ln n}{n}}+\frac{\sqrt{S}\ln n}{n}\right)$ , then

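$$\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}\gtrsim_{C}\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n\ln n}}\right)^{2},$$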

where the infimum is taken over all possible estimators.

In particular, if there exist constants $c>0,C>0$ such that $n\geq c\frac{S}{\ln S},\ln n\leq C\ln S$, then

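$$\sup_{Q\in\mathcal{M}_{S}}\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}\gtrsim_{c,C}\frac{S}{n\ln n}.$$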

Combining Theorem 2 and Theorem 3, we have the following theorem.

Theorem 4.

Suppose there exist constants $c>0,C>0$ such that $c\ln S\leq\ln n\leq C\ln\left(\sum_{i=1}^{S}\sqrt{q_{i}}\wedge q_{i}\sqrt{n\ln n}\right),S\geq 2$ . Then,

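$$\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}\asymp_{c,C}\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n\ln n}}\right)^{2}.$$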

In particular, if $n\geq c\frac{S}{\ln S},\ln n\leq C\ln S$ , then

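$$\sup_{Q\in\mathcal{M}_{S}}\inf_{\hat{L}}\sup_{P\in\mathcal{M}_{S}}\mathbb{E}_{P}\left|\hat{L}-\|P-Q\|_{1}\right|^{2}\asymp_{c,C}\frac{S}{n\ln n}.$$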

The estimator in Construction 1 achieves the minimax rates for every fixed $Q$ .

III Divergence Estimation with Unknown $Q$

Now we consider the general case where both $P$ and $Q$ are unknown to us, i.e., $\mathcal{P}=\mathcal{Q}=\mathcal{M}_{S}$ .

We utilize the Poisson sampling model, in which we observe two Poisson random vectors

$$\mathbf{X}=[X_{1},X_{2},\ldots,X_{S}],\qquad\mathbf{Y}=[Y_{1},Y_{2},\ldots,Y_{S}],$$

where all the coordinates of $\mathbf{X}$ and $\mathbf{Y}$ are mutually independent, and $X_{i}\sim\mathsf{Poi}(np_{i}),Y_{i}\sim\mathsf{Poi}(nq_{i})$ . We introduce the empirical probabilities $\hat{p}_{i}=\frac{X_{i}}{n},\hat{q}_{i}=\frac{Y_{i}}{n}$ .

III-A Performance of the MLE

In this case, the MLE is expressed as $\|P_{n}-Q_{n}\|_{1}=\sum_{i=1}^{S}|\hat{p}_{i}-\hat{q}_{i}|$ . Since $|\|P_{n}-Q_{n}\|_{1}-\|P-Q\|_{1}|\leq\|P_{n}-P\|_{1}+\|Q_{n}-Q\|_{1}$ by the triangle inequality, and $\mathbb{E}|\hat{p}_{i}-\hat{q}_{i}|\geq\mathbb{E}|\hat{p}_{i}-q_{i}|$ by the conditional Jensen’s inequality, Theorem 1 can again be applied here to give the performance of the MLE.

Theorem 5.

If $n\gtrsim S$ , the MLE satisfies

[TABLE]

Hence, the MLE achieves the mean squared error $S/n$ , and requires $n\gg S$ samples to be consistent.
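As a minimal sketch of the plug-in (MLE) estimator under the Poisson sampling model, the following Python snippet draws independent counts $X_{i}\sim\mathsf{Poi}(np_{i})$ and $Y_{i}\sim\mathsf{Poi}(nq_{i})$ and evaluates $\sum_{i}|\hat{p}_{i}-\hat{q}_{i}|$ . The function names, the example distributions, and the use of Knuth's multiplication method for Poisson draws are illustrative assumptions, not part of the paper.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def poisson_counts(dist, n, rng):
    """Independent counts X_i ~ Poi(n * p_i), as in the Poisson sampling model."""
    return [poisson(n * p, rng) for p in dist]

def mle_l1(x_counts, y_counts, n):
    """Plug-in estimate sum_i |X_i/n - Y_i/n| of the L1 distance."""
    return sum(abs(x - y) for x, y in zip(x_counts, y_counts)) / n

rng = random.Random(0)
S, n = 50, 2000                                          # hypothetical sizes
P = [1.0 / S] * S                                        # uniform pmf
Q = [2.0 * (i + 1) / (S * (S + 1)) for i in range(S)]    # linearly increasing pmf
true_l1 = sum(abs(p - q) for p, q in zip(P, Q))
estimate = mle_l1(poisson_counts(P, n, rng), poisson_counts(Q, n, rng), n)
print(true_l1, estimate)
```

Repeating the last lines for $n$ comparable to or smaller than $S$ illustrates why the plug-in approach needs $n\gg S$ to be consistent.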

III-B Construction of the optimal estimator

Again we apply our general recipe to construct the optimal estimator, but encounter several new difficulties: $f(x,y)=|x-y|$ is non-analytic on a segment, and both the uncertainty set and the polynomial approximation need to be generalized to the 2D case. We will overcome these obstacles step by step.

For simplicity of analysis, we conduct the classical “splitting” operation [18] on the Poisson random vector $\mathbf{X}$ , and obtain two independent identically distributed random vectors $\mathbf{X}_{j}=[X_{1,j},X_{2,j},\ldots,X_{S,j}]^{T},j\in\{1,2\}$ , such that each component $X_{i,j}$ in $\mathbf{X}_{j}$ has distribution $\mathsf{Poi}(np_{i}/2)$ , and all coordinates in $\mathbf{X}_{j}$ are independent. For each coordinate $i$ , the splitting process generates a random sequence $\{T_{ik}\}_{k=1}^{X_{i}}$ such that $\{T_{ik}\}_{k=1}^{X_{i}}|\mathbf{X}\sim\mathsf{multinomial}(X_{i};(1/2,1/2))$ , and assigns $X_{i,j}=\sum_{k=1}^{X_{i}}\mathbbm{1}(T_{ik}=j)$ for $j\in\{1,2\}$ . All the random variables $\{\{T_{ik}\}_{k=1}^{X_{i}}:1\leq i\leq S\}$ are conditionally independent given our observation $\mathbf{X}$ . The splitting operation is similarly conducted for the Poisson random vector $\mathbf{Y}$ independently. The “split” empirical probabilities are defined as $\hat{p}_{i,j}=X_{i,j}/(n/2),\hat{q}_{i,j}=Y_{i,j}/(n/2)$ . To simplify notation, we redefine $n/2$ as $n$ to ensure that $n\hat{p}_{i,j}\sim\mathsf{Poi}(np_{i}),j=1,2$ . We emphasize that the sample splitting approach is not needed for the actual estimator construction.
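The splitting operation can be sketched in a few lines of Python: each of the $X_{i}$ observations of symbol $i$ is sent to the first or the second half by an independent fair coin flip, which is the thinning property that yields two independent $\mathsf{Poi}(np_{i}/2)$ counts. The function name and the example counts are hypothetical.

```python
import random

def split_counts(counts, rng):
    """Split each count X_i into (X_{i,1}, X_{i,2}) using one fair coin flip per observation."""
    first, second = [], []
    for x in counts:
        heads = sum(1 for _ in range(x) if rng.random() < 0.5)  # observations with T_ik = 1
        first.append(heads)
        second.append(x - heads)                                # observations with T_ik = 2
    return first, second

rng = random.Random(1)
X = [7, 0, 3, 12]          # hypothetical Poisson counts
X1, X2 = split_counts(X, rng)
print(X1, X2)
```

In the construction that follows, the first half is used only to decide which regime a symbol falls in, and the second half is used to evaluate the estimator in that regime.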

As usual, first we classify “smooth” and “non-smooth” regimes. Since the function $f(x,y)=|x-y|\in C([0,1]^{2})$ is non-analytic on the segment $x=y\in[0,1]$ , we are looking for the “uncertainty set” $U$ containing this segment such that any $(p,q)\in U$ can be “localized” in the previous sense. We have the following lemma.

Lemma 3.

The two-dimensional set $U\subset[0,1]^{2}$ defined as

[TABLE]

satisfies

[TABLE]

where $U(x;c_{1})$ is given by (10).

We design another set $U_{1}$ as follows:

[TABLE]

where $0<c_{3}<c_{1}$ . Clearly $U_{1}\subset U$ . We choose the constants $c_{1}$ and $c_{3}$ later to ensure that the following four events happen with high probability:

[TABLE]

We have the following lemma controlling the probability of these events happening simultaneously.

Lemma 4.

Denote the overall “good” event $E=E_{1}\cap E_{2}\cap E_{3}\cap E_{4}$ , where $E_{1},E_{2},E_{3},E_{4}$ are defined in (33),(34),(35), (36). Then, assuming $\frac{c_{3}}{c_{1}}<\frac{8}{(\sqrt{2}+1)^{2}}-1\approx 0.373$ ,

[TABLE]

where the constant $\beta$ is given by

[TABLE]

It is evident that we can make $\beta$ in (38) arbitrarily large by taking $c_{1}$ large and keeping $c_{3}/c_{1}$ a small constant. Clearly, if the true parameters $(p,q)\notin U$ , the MLE would be a decent estimator. It suffices to construct estimators when the true parameters $(p,q)\in U$ . The known $Q$ case seems to suggest that we consider the best polynomial approximation of $f(x,y)=|x-y|$ on $U$ . However, this will not work for two reasons:

1. the entire 2D stripe $U$ is too large for the polynomial approximation error to vanish at the correct rate;

2. best polynomial approximation in the 2D case is not unique, and may not achieve the desired pointwise error.

We will explore these reasons in detail in Section IV. To solve the first problem, we remark that although $U$ is the set whose elements can be localized within $U$ , a specific element $(x,y)\in U$ can be localized in a much smaller subset $U(x;c_{1})\times U(y;c_{1})\subset U$ , where $U(x;c_{1})$ is given by (10). Hence, the approximation regime should be dependent on the empirical observations to fully utilize the available information.

For the second problem, we need to design a specific polynomial with satisfactory pointwise approximation properties. Our approximation recipe is the following. Take $K=c_{2}\ln n$ .

Over the square $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ : we consider the decomposition $|x-y|=(\sqrt{x}+\sqrt{y})|\sqrt{x}-\sqrt{y}|$ and introduce the following two bivariate polynomials $u_{K}(x,y)$ and $v_{K}(x,y)$ to uniformly approximate $\sqrt{x}+\sqrt{y}$ and $|\sqrt{x}-\sqrt{y}|$ , respectively. Specifically, we have

[TABLE]

Then, denote $h_{2K}(x,y)=u_{K}(x,y)v_{K}(x,y)-u_{K}(0,0)v_{K}(0,0)$ , we use the polynomial

[TABLE]

to approximate $|x-y|$ over the square $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ . The polynomial $P_{K}^{(1)}(x,y)$ satisfies $P_{K}^{(1)}(0,0)=0$ . We remove the constant term in the definition of $P_{K}^{(1)}$ to guarantee that the estimator we construct is agnostic to the unknown support size $S$ . In practice, $u_{K}$ and $v_{K}$ can be replaced by the efficiently implementable lowpass filtered Chebyshev expansion [19], which achieves the same uniform error rate as the best polynomial approximation.

Remark 2.

We would like to discuss the intuitions behind our construction of the polynomials $u_{K},v_{K}$ . One observation is that best approximation, which aims at approximating the bivariate function $|p-q|$ over the square $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ under the supremum norm, may not work. Indeed, consider the segment $p+q=\frac{2c_{1}\ln n}{n}$ over $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ , and the function $|p-q|$ over this segment can be viewed as a univariate function, whose best approximation error using degree $K=c_{2}\ln n$ is lower bounded by $\frac{1}{K}\frac{2c_{1}\ln n}{n}$ within a constant factor [20, Chap. 9, Thm. 3.3], which is of order $\frac{1}{n}$ . Hence, the accumulated bias is at least $\frac{S}{n}$ , which results in a worse critical scaling $n\gg S$ rather than the $n\gg\frac{S}{\ln S}$ critical scaling we aim for. The key idea that enabled us to achieve worst case accumulated bias $\sqrt{\frac{S}{n\ln n}}$ is that $P$ and $Q$ are probability measures satisfying $\sum_{i}p_{i}=\sum_{i}q_{i}=1$ . Hence, it suffices to prove a pointwise bound for each individual symbol $\sqrt{\frac{p_{i}+q_{i}}{n\ln n}}+\frac{1}{n\ln n}$ . However, to our knowledge, the study of pointwise bounds for multivariate approximation theory has been limited. The decomposition $|x-y|=|\sqrt{x}-\sqrt{y}|(\sqrt{x}+\sqrt{y})$ translates the problem of obtaining pointwise bounds to the problem of obtaining uniform bounds. Indeed, the uniform errors of approximating $|\sqrt{x}-\sqrt{y}|$ and $\sqrt{x}+\sqrt{y}$ over $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ with degree $K=c_{2}\ln n$ are both of order $\frac{1}{\sqrt{n\ln n}}$ (Lemma 11), and the finite-difference formula $\Delta(ab)=a\Delta b+b\Delta a+(\Delta a)(\Delta b)$ precisely gives us the desired pointwise bound.
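The last step of Remark 2 can be spelled out as follows. This is a sketch under the assumption that $u_{K}$ and $v_{K}$ approximate $a=\sqrt{x}+\sqrt{y}$ and $b=|\sqrt{x}-\sqrt{y}|$ uniformly on the square within some $\epsilon\lesssim\frac{1}{\sqrt{n\ln n}}$ ; the symbols $a$ , $b$ and $\epsilon$ are introduced here only for illustration.

```latex
\begin{align*}
|u_K v_K - ab|
  &= |a\,(v_K - b) + b\,(u_K - a) + (u_K - a)(v_K - b)| \\
  &\le (a + b)\,\epsilon + \epsilon^2 \\
  &\le 2(\sqrt{x} + \sqrt{y})\,\epsilon + \epsilon^2
     && \text{since } b \le a \\
  &\le 2\sqrt{2(x + y)}\,\epsilon + \epsilon^2
   \;\lesssim\; \sqrt{\frac{x + y}{n \ln n}} + \frac{1}{n \ln n}.
\end{align*}
```

At the origin $a=b=0$ , so $|u_{K}(0,0)v_{K}(0,0)|\leq\epsilon^{2}$ , and subtracting this constant term changes the bound by at most another $\epsilon^{2}\lesssim\frac{1}{n\ln n}$ .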

Once we can assert with high probability $(p,q)\in U,p+q\geq\frac{c_{1}\ln n}{2n}$ , we utilize the best approximation polynomial of $|t|$ on $[-1,1]$ with order $K$ . Denote it as

[TABLE]

we have

[TABLE]

where $W=\sqrt{\frac{8c_{1}\ln n}{n}}(\sqrt{(\hat{p}_{i,1}+\hat{q}_{i,1})\vee\frac{1}{n}})$ . It is the best approximation polynomial of $|t|$ over interval $[-W,W]$ .

Remark 3.

We discuss the reason why we cannot apply the best approximation polynomial of $|t|$ over the square $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ . Note that the approximation width $W$ is at least of order $\frac{\ln n}{n}$ since $\hat{p}_{i,1}+\hat{q}_{i,1}\gtrsim\frac{\ln n}{n}$ . However, for the square $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ , we easily have $W\ll\frac{\ln n}{n}$ , but $\frac{\ln n}{n}$ is the minimum width which ensures concentration properties (Lemma 3). Indeed, as we show in Lemma 6, any 1D approximation polynomial fails to achieve the pointwise error bound we discussed in Remark 2 over $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ .
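The role of the width $W$ can be illustrated numerically: if a polynomial $R(t)$ approximates $|t|$ on $[-1,1]$ with uniform error $e$ , then $W\cdot R(t/W)$ approximates $|t|$ on $[-W,W]$ with uniform error $W\cdot e$ . The sketch below uses the Chebyshev interpolant of $|t|$ as a stand-in for the best approximation polynomial (a swapped-in technique chosen because it needs no Remez iteration); the degree and the value of `W` are arbitrary choices.

```python
import math

def cheb_coeffs(f, K):
    """Coefficients c_0..c_K of the Chebyshev interpolant of f at K+1 Chebyshev nodes on [-1, 1]."""
    N = K + 1
    nodes = [math.cos(math.pi * (k + 0.5) / N) for k in range(N)]
    vals = [f(t) for t in nodes]
    coeffs = []
    for j in range(N):
        s = sum(v * math.cos(math.pi * j * (k + 0.5) / N) for k, v in enumerate(vals))
        coeffs.append((1.0 if j == 0 else 2.0) * s / N)
    return coeffs

def cheb_eval(coeffs, t):
    """Evaluate sum_j c_j T_j(t) by Clenshaw's recurrence."""
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * t * b1 - b2 + c, b1
    return t * b1 - b2 + coeffs[0]

def abs_approx_on_width(coeffs, t, W):
    """Rescaled polynomial W * R(t / W), approximating |t| on [-W, W]."""
    return W * cheb_eval(coeffs, t / W)

K = 20                                   # illustrative degree
coeffs = cheb_coeffs(abs, K)
grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
err_unit = max(abs(cheb_eval(coeffs, t) - abs(t)) for t in grid)
W = 0.01                                 # illustrative width
err_W = max(abs(abs_approx_on_width(coeffs, W * t, W) - abs(W * t)) for t in grid)
print(err_unit, err_W / W)               # the two numbers agree up to rounding
```

Because the error scales linearly with the width, a data-dependent $W$ of order $\sqrt{\frac{\ln n}{n}(\hat{p}_{i,1}+\hat{q}_{i,1})}$ gives a smaller error for symbols with small empirical mass than a single fixed interval would.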

Finally, we use the second part of the samples to construct the unbiased estimators for $P_{K}^{(1)}(x,y)$ defined in (41) and $P_{K}^{(2)}(x,y;\hat{p}_{i,1},\hat{q}_{i,1})$ defined in (44). Concretely, we introduce the estimators $\tilde{P}_{K}^{(1)}(\hat{p}_{i,2},\hat{q}_{i,2})$ and $\tilde{P}_{K}^{(2)}(\hat{p}_{i,2},\hat{q}_{i,2};\hat{p}_{i,1},\hat{q}_{i,1})$ such that

[TABLE]

These unbiased estimators are easy to construct since for any $r,s\geq 1,r,s\in\mathbb{Z},(n\hat{p},n\hat{q})\sim\mathsf{Poi}(np)\times\mathsf{Poi}(nq)$ , we have [21, Ex. 2.8]

[TABLE]
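The underlying fact is the falling-factorial moment identity for Poisson variables: if $n\hat{p}\sim\mathsf{Poi}(np)$ , then $\mathbb{E}\prod_{k=0}^{r-1}(\hat{p}-\frac{k}{n})=p^{r}$ , and independence of $\hat{p}$ and $\hat{q}$ lets the expectation of a product factor. The following sketch checks this numerically by summing the Poisson probability mass function; the helper names, parameter values and truncation point `kmax` are assumptions made for the illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poi(lam), computed in log space for stability."""
    if lam == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def falling_product(x_hat, r, n):
    """prod_{k=0}^{r-1} (x_hat - k/n); the empty product (r = 0) is 1."""
    out = 1.0
    for k in range(r):
        out *= x_hat - k / n
    return out

def expected_monomial_estimate(p, q, r, s, n, kmax=400):
    """E[falling_product(p_hat, r) * falling_product(q_hat, s)] under independent Poisson counts."""
    ep = sum(poisson_pmf(k, n * p) * falling_product(k / n, r, n) for k in range(kmax))
    eq = sum(poisson_pmf(k, n * q) * falling_product(k / n, s, n) for k in range(kmax))
    return ep * eq  # independence lets the expectation factor

n, p, q = 100, 0.05, 0.12
value = expected_monomial_estimate(p, q, 3, 2, n)
print(value, p**3 * q**2)
```

Replacing each monomial $x^{r}y^{s}$ of a polynomial by the corresponding product of falling factorials therefore gives an unbiased estimator of the polynomial evaluated at $(p,q)$ .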

The final estimator is presented as follows.

Estimator Construction 2.

As before, use sample splitting to obtain $(\hat{p}_{i,1},\hat{q}_{i,1})$ and $(\hat{p}_{i,2},\hat{q}_{i,2})$ . Denote

[TABLE]

and define

[TABLE]

Here $U$ is given by (30), $U_{1}$ is defined in (32), the estimators $\tilde{P}_{K}^{(1)}$ and $\tilde{P}_{K}^{(2)}$ are defined in (45) and (46), $K=c_{2}\ln n$ , and $c_{1}>c_{3}>c_{2}>0$ are properly chosen constants with $\frac{c_{3}}{c_{1}}<\frac{8}{(\sqrt{2}+1)^{2}}-1\approx 0.373$ .

A pictorial explanation of the estimator construction is given in Fig 1. Concretely, we use the first sample to classify into four regimes, and in each regime we do the following operations:

1. Regime I: plug-in: $\hat{p}_{2}-\hat{q}_{2}$

2. Regime II: plug-in: $\hat{q}_{2}-\hat{p}_{2}$

3. Regime III: 2D polynomial approximation of $|p-q|$

4. Regime IV: 1D polynomial approximation of $|t|$ where $t=p-q$ with width $\sqrt{\frac{8c_{1}\ln n}{n}}\sqrt{\left(\hat{p}_{1}+\hat{q}_{1}\right)\vee\frac{1}{n}}$

The next theorem presents the performance of $\hat{L}^{(2)}$ .

Theorem 6.

Suppose there exists a constant $C>0$ such that $\ln n\leq C\ln S$ . Then, there exist constants $c_{1},c_{2},c_{3}$ that only depend on $C$ in Construction 2 such that

[TABLE]

We note that the lower bound for the known $Q$ case also serves as a lower bound for the unknown $Q$ case. Indeed, when $Q$ is known, we can produce $n$ i.i.d. samples from $Q$ and feed them into any algorithm that handles the unknown $Q$ case. Hence, Theorem 3 and Theorem 6 yield that $\hat{L}^{(2)}$ is minimax rate-optimal. Note that $\hat{L}^{(2)}$ achieves the minimax rate without knowing the support size $S$ a priori. Moreover, the effective sample size enlargement effect holds again: the performance of the optimal estimator with $n$ samples is essentially that of the MLE with $n\ln n$ samples.
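The effective sample size enlargement can be made concrete with a small calculation. The sketch below compares the order $S/n$ of the MLE mean squared error with the order $S/(n\ln n)$ of the optimal worst-case mean squared error, with all constants dropped; the function names and the values of `S` and `n` are illustrative assumptions.

```python
import math

def mle_rate(S, n):
    """Order of the MLE mean squared error, S / n, constants dropped."""
    return S / n

def optimal_rate(S, n):
    """Order of the optimal worst-case mean squared error, S / (n ln n), constants dropped."""
    return S / (n * math.log(n))

S, n = 10_000, 5_000
print(mle_rate(S, n))                    # order of the MLE error with n samples
print(optimal_rate(S, n))                # order of the optimal error with n samples
print(mle_rate(S, n * math.log(n)))      # MLE with n ln n samples: same order as the line above
```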

IV Comparison with Other Approaches

In this section, we review some other possible approaches to estimating the $L_{1}$ distance, and apply approximation theory to argue the strict suboptimality of some approaches.

IV-A Approximation only around the origin

In the previous papers [5, 6, 7, 8, 9] on estimating entropy, power sums, mutual information, etc., approximation is conducted only around the origin. However, we remark that this is insufficient for estimating the $L_{1}$ distance. We have the following result.

Lemma 5.

Let $\hat{L}$ denote an estimator of $\|P-Q\|_{1}$ that satisfies the following:

[TABLE]

where the estimator $g(\hat{p}_{i},\hat{q}_{i})\in[-B,B]$ is a bounded function that satisfies $g(\hat{p}_{i},\hat{q}_{i})=|\hat{p}_{i}-\hat{q}_{i}|$ when $(\hat{p}_{i},\hat{q}_{i})\notin\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ , $g(0,0)=0$ . Suppose $n\gg S$ . Then,

[TABLE]

Lemma 5 explains the reason why the estimator of Valiant and Valiant [5] can only achieve the optimal error rate when $n\lesssim S\lesssim n\ln n$ , but ours achieves the optimal error rate for a much larger set of parameter configurations.

IV-B One-dimensional approximation in the 2D case

In the construction of $\hat{L}^{(2)}$ , we split into two cases when $(\hat{p},\hat{q})\in U_{1}$ , i.e., 1D approximation of $|t|$ via the substitution $t=x-y$ if $\hat{p}+\hat{q}>c_{1}\ln n/n$ , and the decomposition of $|x-y|$ into $(\sqrt{x}+\sqrt{y})|\sqrt{x}-\sqrt{y}|$ otherwise. Can we always do 1D approximation of $|t|$ with $t=x-y$ to achieve the desired approximation error, i.e., propose some $P(t)\in\mathsf{poly}_{K}$ with $K\asymp\ln n$ and $|P(t)-|t||\lesssim\sqrt{|t|/(n\ln n)}+\frac{1}{n\ln n}$ for any $|t|\leq c_{1}\ln n/n$ ? We have the following lemma regarding the approximation of $|t|$ .

Lemma 6.

If $Q_{K}(t)\in\mathsf{poly}_{K}$ is even with $Q_{K}(0)=0$ , and achieves the best uniform error rate $\max_{t\in[-1,1]}|Q_{K}(t)-|t||\lesssim 1/K$ , we have

[TABLE]

Now we apply Lemma 6 to the hypothetical polynomial $P(t)$ . Doing parameter substitution $t=\frac{c_{1}\ln n}{n}y,y\in[-1,1]$ , by assumption we have for any $y\in[-1,1]$ ,

[TABLE]

where $K\asymp\ln n$ . It follows from Jensen’s inequality that

[TABLE]

Define $Q(y)=\frac{n}{c_{1}\ln n}\left(P\left(\frac{c_{1}\ln n}{n}y\right)+P\left(-\frac{c_{1}\ln n}{n}y\right)\right)/2$ . It is clear that $Q(y)$ satisfies the assumptions in Lemma 6. Hence,

[TABLE]

However, this contradicts the upper bound (55) when $\frac{1}{K^{2}}\ll|y|\ll\frac{1}{K}$ . Hence, any 1D approximation does not achieve the error rate that is achieved by our 2D approximation approach.

IV-C Approximation on the entire 2D stripe

In the unknown $Q$ case we have decomposed the stripe $U$ into subsets where polynomial approximations take place. Is it possible that we use a single polynomial $P(x,y)\in\mathsf{poly}_{K}^{2}$ of degree $K\asymp\ln n$ to approximate $|x-y|$ such that $|P(x,y)-|x-y||\lesssim\sqrt{(x+y)/(n\ln n)}$ for any $(x,y)\in U$ ? We prove that the answer is negative even for $U^{\prime}=\cup_{x\in[c_{1}\ln n/n,t_{n}]}U(x;c_{1})\times U(x;c_{1})\subset U$ and any $t_{n}\gg(\ln n)^{3}/n$ .

Lemma 7.

If $(\ln n)^{3}/n\ll t_{n}\leq 1/2$ , $K\asymp\ln n$ , we have

[TABLE]

Lemma 7 shows that for too large a set $U^{\prime}$ (e.g., $U^{\prime}=U$ ), every polynomial fails to achieve the desired approximation error bound $\sqrt{(x+y)/(n\ln n)}$ . Hence, it is necessary to make the approximation regime random and dependent on the empirical observations.

IV-D The failure of any plug-in estimator

It is evident that the optimal $L_{1}$ distance estimators we constructed heavily exploit the interactions of $P$ and $Q$ . For example, in the known $Q$ case, the estimator for $\|P-Q\|_{1}$ is not of the form $\|g(P_{n})-Q\|_{1}$ , where $g(\cdot)$ is an arbitrary function of the empirical distribution of $P$ that is independent of $Q$ .

We show that for any estimator $g(P_{n})$ of the distribution $P$ , the plug-in estimator $\|g(P_{n})-Q\|_{1}$ does not achieve the minimax rates in estimating $\|P-Q\|_{1}$ when one considers the worst cases among all $P,Q\in\mathcal{M}_{S}$ .

Lemma 8.

Consider the known $Q$ case. Suppose $g(P_{n})\in\mathbb{R}^{S}$ is an arbitrary function of the empirical distribution $P_{n}$ , and $g(\cdot)$ does not depend on $Q$ . Then, if $n\gtrsim S$ ,

[TABLE]

Lemma 8 shows that since the plug-in estimator $\|g(P_{n})-Q\|_{1}$ does not explicitly exploit the nonsmoothness of the function $\|P-Q\|_{1}$ , in the worst case it behaves essentially like the maximum likelihood estimator as shown in Corollary 1.

V Experimental Results

In this section, we compare the empirical performances of our algorithms with the following approaches:

•

maximum likelihood estimator (MLE): it is the approach of plugging the empirical distributions obtained through samples into the functional. As shown in Sections II and III, it does not achieve the minimax rates in estimating $\|P-Q\|_{1}$ in general in both the known $Q$ and unknown $Q$ cases.

•

Valiant–Valiant estimator [6]: [6] released Matlab code corresponding to their estimator of $\|P-U_{S}\|_{1}$ , which is proved to achieve the minimax rates when $n\asymp\frac{S}{\ln S}$ , i.e., when the optimal error is a constant. Here $U_{S}$ denotes the uniform distribution with support size $S$ .

•

approximate profile maximum likelihood estimator (APML) [22]: the APML estimator is an approximate solution of the profile maximum likelihood estimator [23], which can be applied to estimate $\|P-U_{S}\|_{1}$ , and $\|P-Q\|_{1}$ when both $P$ and $Q$ are unknown. It was shown in [22] that the APML estimator exhibits generally good empirical performance, although its theoretical properties are not yet well understood.

In the sequel, for each true distribution pair $(P,Q)$ , we fix the parameters in our estimators and vary the sample sizes to compare the estimation performances. We use the root mean squared error (RMSE) as the evaluation criterion.
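The evaluation protocol can be sketched as follows: for each sample size, repeat the experiment several times and report the RMSE of the estimates around the true distance. The snippet uses the plug-in estimator of the distance to uniformity with i.i.d. sampling as a placeholder; the number of trials, the support size and the sample sizes are arbitrary choices and do not reproduce the settings of the figures.

```python
import math
import random

def rmse(estimates, truth):
    """Root mean squared error of repeated estimates around a known true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

def mle_distance_to_uniform(samples, S):
    """Plug-in estimate of ||P - U_S||_1 from i.i.d. samples taking values in range(S)."""
    counts = [0] * S
    for x in samples:
        counts[x] += 1
    n = len(samples)
    return sum(abs(c / n - 1.0 / S) for c in counts)

rng = random.Random(0)
S, trials = 200, 50
P = [1.0 / S] * S  # the true distribution is uniform, so the true distance is 0
for n in (100, 400, 1600):
    ests = [mle_distance_to_uniform(rng.choices(range(S), weights=P, k=n), S)
            for _ in range(trials)]
    print(n, rmse(ests, 0.0))
```

Any other estimator with the same signature can be substituted for `mle_distance_to_uniform` to produce a comparable RMSE curve.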

Figure 2 compares the four approaches mentioned above in estimating $\|P-U_{S}\|_{1}$ , which is also called “distance to uniformity”. We see that our algorithm is consistently better than the maximum likelihood estimator, and is competitive with the VV estimator [6] and APML estimator [22]. Our estimator has computational complexity $O(n\ln n)$ . Indeed, in the worst case, we may need to evaluate a polynomial with degree $\ln n$ for each sample, which results in an overall $O(n\ln n)$ computational complexity.

Figure 3 compares the performances of the maximum likelihood estimator (MLE), our estimator, and APML in estimating $\|P-Q\|_{1}$ when both $P$ and $Q$ are unknown. Note that we did not choose to compare with [6] since there is no code available for their algorithm in the unknown $Q$ setting. We find that our algorithm performs consistently better than the maximum likelihood estimator, and is particularly competitive when the distributions $P$ and $Q$ are quite different from each other. Our estimator has computational complexity $O(n\ln^{2}n)$ in the $Q$ unknown setting. In the worst case, we may need to evaluate a bivariate polynomial with degree $\ln n$ in each variable for each sample, which results in an overall $O(n\ln^{2}n)$ computational complexity.
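The per-evaluation cost behind this count can be seen in a nested Horner scheme: a bivariate polynomial with degree $K$ in each variable has $(K+1)^{2}$ coefficients, and each is touched once. The following sketch is illustrative only; the coefficient layout `coeffs[i][j]` for the monomial $x^{i}y^{j}$ is an assumption, not the authors' implementation.

```python
def horner_2d(coeffs, x, y):
    """Evaluate sum_{i,j} coeffs[i][j] * x**i * y**j with nested Horner steps."""
    total = 0.0
    for row in reversed(coeffs):      # outer Horner recursion in x
        inner = 0.0
        for c in reversed(row):       # inner Horner recursion in y
            inner = inner * y + c
        total = total * x + inner
    return total

# 1 + 2*y + 3*x + 4*x*y evaluated at (x, y) = (2, 3)
print(horner_2d([[1.0, 2.0], [3.0, 4.0]], 2.0, 3.0))
```

With $K\asymp\ln n$ this is $O(\ln^{2}n)$ work per evaluation, and at most $O(n)$ evaluations are needed.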

VI Acknowledgements

We are grateful to Vilmos Totik for discussing multivariate approximation theory, and for the insights that motivated the proof of Lemma 7. We would like to thank Gregory Valiant for discussing the estimator in [5]. We are grateful to the associate editor and the anonymous reviewers for constructive comments that have helped significantly improve the presentation of the paper. We thank Irena Fischer-Hwang and Banghua Zhu for the help in preparing the experimental results in Section V.

Appendix A Auxiliary Lemmas

The first-order symmetric difference of a function $f$ is given by

[TABLE]

while the second order symmetric difference is given by

[TABLE]

Analogously, the $r$ -th order symmetric difference can be defined, and it is zero when $[x,x+rh]$ or $[x-rh,x]$ are not inside the domain of $f$ .

For function $f(x)$ with domain $[0,1]$ , $\varphi(x)=\sqrt{x(1-x)}$ , the first-order Ditzian–Totik modulus of smoothness is defined as

[TABLE]

and the second-order Ditzian–Totik modulus of smoothness is defined as

[TABLE]

Similarly, we can also define the $r$ -th order Ditzian–Totik modulus of smoothness for a function $f(x)$ with domain $[0,1]^{2}$ :

[TABLE]

where $\Delta_{i,h}$ denotes the symmetric difference with respect to the $i$ -th coordinate.

The next lemma upper bounds the best polynomial approximation error by the Ditzian-Totik moduli.

Lemma 9.

[24, Thm. 7.2.1, Thm. 12.1.1.] There exists a constant $M(r)>0$ such that for any function $f\in C[0,1]$ ,

[TABLE]

where $E_{n}[f;I]$ denotes the distance of the function $f$ to the space $\mathsf{poly}_{n}$ in the uniform norm $\|\cdot\|_{\infty,I}$ on $I\subset\mathbb{R}$ . Moreover, if $f(x):[0,1]^{2}\mapsto\mathbb{R}$ , we have

[TABLE]

for any $r<n$ , where $M$ is independent of $f$ and $n$ , and $E_{n}[f;[0,1]^{2}]$ denotes the distance of the function $f$ to the space $\mathsf{poly}_{n}^{2}$ in the uniform norm on $[0,1]^{2}$ .

The modulus $\omega_{\varphi}^{2}(f,t)$ is computed for a variety of functions in the following lemma.

Lemma 10.

[24, Chap. 3.4] Suppose $f(x)=x^{\delta},0<\delta<1,x\in[0,1]$ . Then,

[TABLE]

where $\omega_{\varphi}^{1}(f,t)$ is defined in (61), $\omega_{\varphi}^{2}(f,t)$ is defined in (62).

Lemma 11.

Suppose $f(x;a)=|\sqrt{x}-\sqrt{a}|,x\in[0,1]$ , and $a\in[0,1]$ is a parameter. Then,

[TABLE]

The next lemma computes the Ditzian–Totik modulus for the function $f(x)=|2x\Delta-q|,x\in[0,1]$ .

Lemma 12.

Suppose $f(x)=|2x\Delta-q|,\Delta>0,0\leq q\leq 2\Delta,x\in[0,1]$ . Then, for any integer $K\geq 1$ ,

[TABLE]

where $\omega_{\varphi}^{2}(f,t)$ is defined in (62).

Lemma 13 (Markov’s inequality).

[20, Chap. 4, Thm. 1.4] Suppose $P_{n}\in\mathsf{poly}_{n}$ is defined on $[-1,1]$ . Then,

[TABLE]
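Assuming the display above is the classical form of Markov's inequality, $\max_{x\in[-1,1]}|P_{n}^{\prime}(x)|\leq n^{2}\max_{x\in[-1,1]}|P_{n}(x)|$ , the bound is attained by the Chebyshev polynomial $T_{n}$ , for which $\max|T_{n}|=1$ and $|T_{n}^{\prime}(\pm 1)|=n^{2}$ . The sketch below checks this numerically; the function name and the grid are illustrative.

```python
def chebyshev_T_and_derivative(n, x):
    """Return (T_n(x), T_n'(x)) using T_n' = n * U_{n-1} and the three-term recurrences."""
    if n == 0:
        return 1.0, 0.0
    t_prev, t_cur = 1.0, x          # T_0, T_1
    u_prev, u_cur = 0.0, 1.0        # U_{-1}, U_0
    for _ in range(n - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
        u_prev, u_cur = u_cur, 2.0 * x * u_cur - u_prev
    return t_cur, n * u_cur

n = 7
grid = [-1.0 + 2.0 * i / 4000 for i in range(4001)]   # includes both endpoints
sup_T = max(abs(chebyshev_T_and_derivative(n, x)[0]) for x in grid)
sup_dT = max(abs(chebyshev_T_and_derivative(n, x)[1]) for x in grid)
print(sup_T, sup_dT, n ** 2)
```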

Lemma 14.

[24, Thm. 7.3.1.] For $P_{n}$ the best $n$ -th degree polynomial approximation to $f$ in $[0,1]$ and an integer $r\in\{1,2\}$ we have

[TABLE]

where $\varphi(x)=\sqrt{x(1-x)}$ and $M$ is independent of $n$ and $f$ .

The next lemma shows that a polynomial on $[-1,1]$ nearly attains its supremum norm in a slightly smaller interval contained in $[-1,1]$ .

Lemma 15.

[24, Thm. 8.4.8.] Suppose $c>0$ is a constant, $P_{n}\in\mathsf{poly}_{n}$ defined on $[-1,1]$ , $n^{2}>c$ . Then, there exists a constant $M(c)>0$ that does not depend on $n$ and $P_{n}$ such that

[TABLE]

Lemma 16.

Suppose $P_{K}(x)$ is the best approximation polynomial with order $K$ of function $f(x)\in C[0,1]$ defined as

[TABLE]

Then, the best approximation polynomial with order $2K$ of function $f(z^{2}),z\in[-1,1]$ is given by $P_{K}(z^{2})$ .

The following lemma characterizes the upper bounds of the coefficients of a bounded real polynomial.

Lemma 17.

[16] Let $p_{n}(x)=\sum_{\nu=0}^{n}a_{\nu}x^{\nu}$ be a polynomial of degree at most $n$ such that $|p_{n}(x)|\leq A$ for $x\in[a,b]$ . Then

1. If $a+b\neq 0$ , then

[TABLE]

2. If $a+b=0$ , then

[TABLE]

The following lemma gives an upper bound for the second moment of the unbiased estimate of $(p-q)^{j}$ in the Poisson model.

Lemma 18.

Suppose $nX\sim\mathsf{Poi}(np),p\geq 0,q\geq 0$ . Then, the estimator

[TABLE]

is the unique uniformly minimum variance unbiased estimator for $(p-q)^{j},j\geq 0,j\in\mathbb{N}$ , and its second moment is given by

[TABLE]

where $L_{m}(x)$ stands for the Laguerre polynomial with order $m$ , which is defined as:

[TABLE]

If $M\geq\max\left\{\frac{n(p-q)^{2}}{p},j\right\}$ , we have

[TABLE]

When $k=0,\prod_{h=0}^{k-1}\left(X-\frac{h}{n}\right)\triangleq 1.$ When $p=0$ , $g_{j,q}(X)\equiv(-q)^{j},\mathbb{E}[g_{j,q}(X)]^{2}\equiv q^{2j}$ .
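As a hedged illustration of Lemma 18, the sketch below assumes that the elided estimator has the binomial-expansion form $g_{j,q}(X)=\sum_{k=0}^{j}\binom{j}{k}(-q)^{j-k}\prod_{h=0}^{k-1}\left(X-\frac{h}{n}\right)$ , which is consistent with the two conventions stated above, and checks its unbiasedness for $(p-q)^{j}$ by summing over the Poisson probability mass function. The function names, the parameter values and the truncation `kmax` are assumptions.

```python
import math

def g(j, q, x_hat, n):
    """Assumed form: sum_k C(j,k) (-q)^(j-k) prod_{h<k} (x_hat - h/n)."""
    total = 0.0
    for k in range(j + 1):
        prod = 1.0                      # empty product when k = 0
        for h in range(k):
            prod *= x_hat - h / n
        total += math.comb(j, k) * (-q) ** (j - k) * prod
    return total

def expectation_of_g(j, q, p, n, kmax=400):
    """E[g(j, q, X, n)] for nX ~ Poi(n p) with p > 0, by direct summation of the pmf."""
    lam = n * p
    return sum(
        math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1)) * g(j, q, m / n, n)
        for m in range(kmax)
    )

n, p, q, j = 100, 0.05, 0.08, 4
print(expectation_of_g(j, q, p, n), (p - q) ** j)
```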

We construct the unbiased estimator of $(p-q)^{j},j\geq 0$ when both $p$ and $q$ are unknown as in the following lemma.

Lemma 19.

Suppose $(n\hat{p},n\hat{q})\sim\mathsf{Poi}(np)\times\mathsf{Poi}(nq)$ . Then, the following estimator using $(\hat{p},\hat{q})$ is the unique uniformly minimum variance unbiased estimator for $(p-q)^{j},j\geq 0,j\in\mathbb{Z}$ :

[TABLE]

Furthermore,

[TABLE]

The following lemma characterizes the behavior of the central moments of Poisson distributions.

Lemma 20.

Suppose $n\hat{p}\sim\mathsf{Poi}(np)$ . Then, for any integer $s\geq 2$ , there exist $\lfloor s/2\rfloor$ constants $h_{j,s}$ that are independent of $n$ , such that

[TABLE]

Furthermore,

[TABLE]

Consequently, there exists a constant $C_{s}>0$ depending only on $s$ satisfying $(C_{s})^{1/s}\lesssim\frac{s}{\ln s}$ such that for any $s\geq 2$ an even integer,

[TABLE]

For an odd integer $s\geq 1$ , we have

[TABLE]

We emphasize that the scaling $\left(\frac{s}{\ln s}\right)^{s}$ is consistent with the general moment bounds in [25]. However, the results in [25] do not directly apply here. Furthermore, Lemma 20 provides bounds on each individual $h_{j,s}$ , which is not obtainable from a general moment bound.

The next lemma controls the moments of $\frac{1}{\hat{p}\vee 1/n}$ , where $n\hat{p}\sim\mathsf{Poi}(np)$ .

Lemma 21.

Suppose $n\hat{p}\sim\mathsf{Poi}(np)$ . Then, for any integer $j\geq 0$ , there exists a constant $B_{j}$ depending only on $j$ such that

[TABLE]

One may take $B_{j}=j\left(\frac{j}{e}\right)^{j}+1+j2^{j+1}+j\left(\frac{16(j+1)}{e}\right)^{j+1}$ .

The following lemma gives well-known tail bounds for Poisson and binomial random variables.

Lemma 22.

[26, Exercise 4.7] If $X\sim\mathsf{Poi}(\lambda)$ or $X\sim\mathsf{B}(n,\frac{\lambda}{n})$ , then for any $\delta>0$ , we have

[TABLE]
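Assuming the display above contains the standard multiplicative Chernoff bound $\mathbb{P}(X\geq(1+\delta)\lambda)\leq\left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\lambda}$ (the exact form is elided here, so this is an assumption), the following sketch compares the bound with the exact Poisson upper tail for a few illustrative values of $\lambda$ and $\delta$ .

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poi(lam) with lam > 0, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def upper_tail(lam, delta, kmax=2000):
    """P(X >= (1 + delta) * lam), truncating the sum at kmax (negligible mass beyond it here)."""
    start = math.ceil((1.0 + delta) * lam)
    return sum(poisson_pmf(k, lam) for k in range(start, kmax))

def chernoff_upper(lam, delta):
    """Chernoff bound (e^delta / (1 + delta)^(1 + delta))^lam."""
    return math.exp(lam * (delta - (1.0 + delta) * math.log1p(delta)))

for lam in (1.0, 10.0, 50.0):
    for delta in (0.5, 1.0, 2.0):
        print(lam, delta, upper_tail(lam, delta), chernoff_upper(lam, delta))
```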

The following lemma presents the Hoeffding bound.

Lemma 23.

[27] Let $X_{1},X_{2},\ldots,X_{n}$ be independent random variables such that $X_{i}$ takes its value in $[a_{i},b_{i}]$ almost surely for all $i\leq n$ . Let $S_{n}=\sum_{i=1}^{n}X_{i}$ , we have for any $t>0$ ,

[TABLE]

The following lemma provides sharp estimates of $\mathbb{E}|\hat{q}-q|$ , where $n\hat{q}\sim\mathsf{Poi}(nq)$ , which can be viewed as an analog of the binomial case studied in [28].

Lemma 24.

Suppose $n\hat{q}\sim\mathsf{Poi}(nq)$ . Then,

[TABLE]

Hence,

[TABLE]

Lemma 25.

Suppose $n\hat{q}\sim\mathsf{Poi}(nq)$ . Then, for any $p\geq 0$ ,

[TABLE]

Further,

[TABLE]

The next lemma upper bounds the variance of $|\hat{q}-p|,n\hat{q}\sim\mathsf{Poi}(nq)$ .

Lemma 26.

Suppose $n\hat{q}\sim\mathsf{Poi}(nq)$ . Then, for any $p\geq 0$ ,

[TABLE]

Appendix B Proofs of main theorems

B-A Proof of Theorem 1

We have

[TABLE]

Hence,

[TABLE]

where we applied Lemma 25.

To analyze the variance, due to the mutual independence of $\{\hat{p}_{i},1\leq i\leq S\}$ , we have

[TABLE]

where we used Lemma 26 in the second step.

The proof of the upper bound is complete. Regarding the lower bound, setting $P=Q$ , we have

[TABLE]

B-B Proof of Theorem 2

The following lemma gives the bias and variance bound of $\tilde{P}_{K}(\hat{p};q)$ .

Lemma 27.

For $n\hat{p}\sim\mathsf{Poi}(np)$ with $p\in U(q;c_{1})$ , there exists a universal constant $B>0$ such that

[TABLE]

where $\tilde{P}_{K}(\hat{p};q)$ is the unique uniformly minimum variance unbiased estimate of $P_{K}(x;q)$ defined in (18), $U(q;c_{1})$ is defined in (10) and $K=c_{2}\ln n,c_{2}<c_{1}$ .

Proof.

Recall the “good” events $E_{1},E_{2},E_{3}$ defined in (13),(14),(15) and define $E=E_{1}\cap E_{2}\cap E_{3}$ . We have

[TABLE]

where we have applied Lemma 2.

Define the random variables

[TABLE]

where the random index sets $I_{1},I_{2},I_{3}$ are defined as

[TABLE]

The indices $I_{1},I_{2},I_{3}$ are independent of the random variables $\{\hat{p}_{i,2}:1\leq i\leq S\}$ . Since

[TABLE]

it follows from Cauchy’s inequality that

[TABLE]

where $\beta$ is defined in (17).

It follows from the law of total variance that

[TABLE]

where we have used the fact that $\mathbb{E}[\mathcal{E}_{1}|I_{1}]=0$ with probability one and Lemma 26. Similarly we have $\mathbb{E}\mathcal{E}_{2}^{2}\leq n^{-1}$ .

Regarding $\mathbb{E}\mathcal{E}_{3}^{2}$ , it follows from Lemma 27 and the mutual independence of $\{\hat{p}_{i,2}:1\leq i\leq S\}$ that

[TABLE]

where $\epsilon=c_{2}\ln B$ .

Hence,

[TABLE]

where $\beta$ is defined in (17) and $\epsilon=c_{2}\ln B$ .

If $\ln n\gtrsim\ln S$ , one may choose $c_{1}$ large enough and $c_{3}=c_{1}/2$ to ensure that $\frac{S}{n^{\beta}}\lesssim\frac{\ln n}{n^{1-\epsilon}}$ . When $\ln n\lesssim\ln\left(\sum_{i=1}^{S}\sqrt{q_{i}}\wedge q_{i}\sqrt{n\ln n}\right)$ , one may choose $c_{2}$ small enough to ensure that $\frac{\ln n}{n^{1-\epsilon}}\lesssim\left(\sum_{i=1}^{S}q_{i}\wedge\sqrt{\frac{q_{i}}{n\ln n}}\right)^{2}$ . The worst case of $Q$ result is proved upon noting that

[TABLE]

In the worst case of $Q$ we no longer need the condition $\ln n\gtrsim\ln S$ since we can ensure $\frac{S}{n^{\beta}}\lesssim\frac{S}{n\ln n}$ if we take $c_{1}$ large enough and $c_{3}=c_{1}/2$ .

∎

B-C Proof of Theorem 3

The main tool we employ is the so-called method of two fuzzy hypotheses presented in Tsybakov [29]. Suppose we observe a random vector ${\bf Z}\in(\mathcal{Z},\mathcal{A})$ which has distribution $P_{\theta}$ where $\theta\in\Theta$ . Let $\sigma_{0}$ and $\sigma_{1}$ be two prior distributions supported on $\Theta$ . Write $F_{i}$ for the marginal distribution of $\mathbf{Z}$ when the prior is $\sigma_{i}$ for $i=0,1$ . Let $\hat{T}=\hat{T}({\bf Z})$ be an arbitrary estimator of a function $T(\theta)$ based on $\bf Z$ . We have the following general minimax lower bound.

Lemma 28.

[29, Thm. 2.15] Given the setting above, suppose there exist $\zeta\in\mathbb{R},s>0,0\leq\beta_{0},\beta_{1}<1$ such that

[TABLE]

If $\mathsf{TV}(F_{1},F_{0})\leq\eta<1$ , then

[TABLE]

where $F_{i},i=0,1$ are the marginal distributions of $\mathbf{Z}$ when the priors are $\sigma_{i},i=0,1$ , respectively.

Here $\mathsf{TV}(P,Q)$ is the total variation distance between two probability measures $P,Q$ on the measurable space $(\mathcal{Z},\mathcal{A})$ . Concretely, we have

[TABLE]

where $p=\frac{dP}{d\nu},q=\frac{dQ}{d\nu}$ , and $\nu$ is a dominating measure so that $P\ll\nu,Q\ll\nu$ .

The following lemma was shown in Cai and Low [11]:

Lemma 29.

For any given even integer $L>0$ , there exist two probability measures $\nu_{0}$ and $\nu_{1}$ on $[-1,1]$ that satisfy the following conditions:

1. $\nu_{0}$ and $\nu_{1}$ are symmetric around [math];

2. $\int t^{l}\nu_{1}(dt)=\int t^{l}\nu_{0}(dt)$ , for $l=0,1,2,\ldots,L$ ;

3. $\int|t|\nu_{1}(dt)-\int|t|\nu_{0}(dt)=2E_{L}[|t|;[-1,1]]$ ,

where $E_{L}[|t|;[-1,1]]$ is the distance in the uniform norm on $[-1,1]$ from the absolute value function $|t|$ to the space $\mathsf{poly}_{L}$ .

It is known that $E_{L}[|t|;[-1,1]]=\beta_{*}L^{-1}(1+o(1))$ , where $\beta_{*}\approx 0.2802$ is the Bernstein constant [30].

The following lemma deals with the approximation theoretic properties of function $\frac{|x-a|-a}{x}$ .

Lemma 30.

For any function $f(x;a)=\frac{|x-a|-a}{x},x\in[0,1]$ , there exists a universal constant $D>0$ such that

[TABLE]

where $E_{L}[f;I]$ denotes the distance in the uniform norm on interval $I$ from the function $f$ to the space $\mathsf{poly}_{L}$ .

Similar to Lemma 29, the next lemma constructs two measures for the function $f(x;a)=\frac{|x-a|-a}{x}$ . The proof is essentially identical to that of Lemma 29.

Lemma 31.

For any $0<\eta<1$ and positive integer $L>0$ , $f(x;a)=\frac{|x-a|-a}{x},a\in[0,1]$ , there exist two probability measures $\nu_{1}^{\eta,a},\nu_{0}^{\eta,a}$ on $[\eta,1]$ such that

1. $\int t^{l}\nu_{1}^{\eta,a}(dt)=\int t^{l}\nu_{0}^{\eta,a}(dt)$ , for all $l=0,1,2,\ldots,L$ ;

2. $\int f(t;a)\nu_{1}^{\eta,a}(dt)-\int f(t;a)\nu_{0}^{\eta,a}(dt)=2E_{L}[f(x;a);[\eta,1]]$ ,

where $E_{L}[f(x;a);[\eta,1]]$ is the distance in the uniform norm on $[\eta,1]$ from the function $f(x;a)$ to the space $\mathsf{poly}_{L}$ .

The next lemma is an extension of [8, Lemma 3].

Lemma 32.

Suppose $U_{0},U_{1}$ are two random variables supported on $[a-M,a+M]$ , where $a\geq M\geq 0$ are constants. Suppose $\mathbb{E}[U_{0}^{j}]=\mathbb{E}[U_{1}^{j}],0\leq j\leq L$ . Denote the marginal distribution of $X$ where $X|\lambda\sim\mathsf{Poi}(\lambda),\lambda\sim U_{i}$ as $F_{i}$ . If $L+1\geq(2eM)^{2}/a$ , then

[TABLE]

where $\mathsf{TV}(F_{0},F_{1})$ is the total variation distance defined in (132).

We consider the set of approximate probability vectors

[TABLE]

with some constant $\epsilon>0$ . We further define the minimax risk under the Poisson sampling model with respect to $\mathcal{M}_{S}(\epsilon)$ with a fixed $Q$ as

[TABLE]

The following lemma relates $R_{P}(S,n,Q,\epsilon)$ to $R_{P}(S,n,Q,0)$ .

Lemma 33.

For any $S,n\in\mathbb{N}_{+},0<\epsilon<1$ and any distribution $Q\in\mathcal{M}_{S}$ , we have

[TABLE]

Now we are ready to prove our main minimax lower bound.

Proof.

Fix the distribution $Q\in\mathcal{M}_{S}$ . Without loss of generality we assume that $q_{S}=\min_{1\leq j\leq S}q_{j}$ . We construct two probability measures $\bm{\mu}_{0},\bm{\mu}_{1}$ on the distribution $P$ that will later be used in Lemma 28. Concretely, we use an independent prior generation, and set

[TABLE]

In other words, we assign independent priors $\mu_{i}^{(q_{j})}$ to each symbol $p_{j},1\leq j\leq S-1$ , and assign a delta mass at $1-\gamma$ to the symbol $p_{S}$ . The constant $\gamma$ will later be set to

[TABLE]

where $D$ is the universal constant in Lemma 30, and $c\in(0,1)$ is a constant.

Now we construct $\mu_{i}^{(q)},i\in\{0,1\}$ for a generic $q\in(0,1)$ . We consider two different cases.

$0<q\leq\frac{c\ln n}{n}$ , where $c\in(0,1)$ is a constant. We first construct two new probability measures $\tilde{\nu}_{i}^{\eta,a},i=0,1$ from the two probability measures constructed in Lemma 31. For $i=0,1$ , the restriction of $\tilde{\nu}_{i}^{\eta,a}$ to $[\eta,1]$ is absolutely continuous with respect to $\nu_{i}^{\eta,a}$ , with the Radon–Nikodym derivative given by

[TABLE]

and $\tilde{\nu}_{i}^{\eta,a}(\{0\})=1-\tilde{\nu}_{i}^{\eta,a}([\eta,1])\geq 0$ . Hence, $\tilde{\nu}_{i}^{\eta,a},i=0,1$ are probability measures on $[0,1]$ , with the following properties:

(a) $\int t\tilde{\nu}_{1}^{\eta,a}(dt)=\int t\tilde{\nu}_{0}^{\eta,a}(dt)=\eta$ ;

(b) $\int t^{l}\tilde{\nu}_{1}^{\eta,a}(dt)=\int t^{l}\tilde{\nu}_{0}^{\eta,a}(dt)$ , for all $l=2,3,\ldots,L+1$ ;

(c) $\int|t-a|\tilde{\nu}_{1}^{\eta,a}(dt)-\int|t-a|\tilde{\nu}_{0}^{\eta,a}(dt)=2\eta E_{L}[f(x;a);[\eta,1]]$ .

The construction of the Radon–Nikodym derivatives is inspired by Wu and Yang [8]. Define

[TABLE]

where $D$ is the universal constant in Lemma 30 and $d_{2}>1$ is a constant. It follows from the assumption that $0<a\leq\frac{1}{2}$ . Let $g(x)=Mx$ and let $\mu_{i}^{(q)}$ be the measures on $[0,M]$ defined by $\mu^{(q)}_{i}(A)=\tilde{\nu}_{i}^{\eta,a}(g^{-1}(A))$ for $i=0,1$ . It then follows that

[TABLE]

[TABLE]

$q>\frac{c\ln n}{n}$ , where $c\in(0,1)$ is a constant. Define function $g(x)=q+\sqrt{\frac{cq\ln n}{n}}x$ , where $x\in[-1,1]$ . Let $\nu_{i},i=0,1$ be the two measures constructed in Lemma 29. We define two new measures $\mu^{(q)}_{i},i=0,1$ by $\mu^{(q)}_{i}(A)=\nu_{i}(g^{-1}(A))$ . Let

[TABLE]

It then follows that

[TABLE]

Since we have set $p_{S}=1-\gamma$ , where $\gamma$ is defined in (140), it is clear that

[TABLE]

Now the construction of the two priors $\bm{\mu_{0}}$ and $\bm{\mu_{1}}$ is complete. In light of Lemma 33, it suffices to lower bound $R_{P}(S,n,Q,\epsilon)$ in order to lower bound $R_{P}(S,n,Q,0)$ .

Let

[TABLE]

We know from (146) and (151) that

[TABLE]

since we have assumed that $q_{S}=\min_{1\leq j\leq S}q_{j}$ .

For $i=0,1$ , introduce the events

[TABLE]

It follows from the union bound that

[TABLE]

Introduce

[TABLE]

It follows from the Hoeffding inequality in Lemma 23 that

[TABLE]

The last step follows from the arguments below. Note that we assumed $c\in(0,1),d_{2}>1,\sum_{j=1}^{S}q_{j}\wedge\sqrt{\frac{q_{j}}{n\ln n}}\geq C^{\prime}\left(\sqrt{\frac{\ln n}{n}}+\frac{\sqrt{S}\ln n}{n}\right)$ . We have

[TABLE]

Hence, it suffices to take $C^{\prime}$ large enough to ensure that $\bm{\mu}_{i}[(E_{i})^{c}]\to 0,i=0,1$ .

Denote by $\pi_{i}$ the conditional distribution defined as

[TABLE]

Now consider $\pi_{0},\pi_{1}$ as two priors and denote the corresponding marginal distributions on the observations $(X_{1},X_{2},\ldots,X_{S})$ as $F_{0},F_{1}$ . Note that $X_{j}\sim\mathsf{Poi}(np_{j})$ . Setting

[TABLE]

we have $\beta_{0}=\beta_{1}=0$ in Lemma 28. The total variation distance is then upper bounded as

[TABLE]

where $G_{i}$ is the marginal distribution of the observations under priors $\bm{\mu}_{i}$ . It follows from Lemma 32 and the fact that $\mathsf{TV}(\otimes_{i=1}^{S}P_{i},\otimes_{i=1}^{S}Q_{i})\leq\sum_{i=1}^{S}\mathsf{TV}(P_{i},Q_{i})$ that

[TABLE]

since we have assumed $\ln n\geq C\ln S$ , and we ensure $\mathsf{TV}(G_{0},G_{1})$ by taking $d_{2}$ large enough.
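The tensorization step used here, $\mathsf{TV}(\otimes_{i}P_{i},\otimes_{i}Q_{i})\leq\sum_{i}\mathsf{TV}(P_{i},Q_{i})$ , can be sanity-checked on small finite alphabets. The sketch below assumes the normalization $\mathsf{TV}(P,Q)=\frac{1}{2}\sum|p-q|$ (the definition in (132) is not reproduced in this section, and the inequality holds under either normalization). The marginals and helper names are hypothetical and chosen only for illustration.

```python
import itertools
import math

def tv(p, q):
    """Total variation between two pmfs on the same finite set (half the L1 norm)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def product_pmf(marginals):
    """Flatten the product measure of several finite pmfs into one list."""
    return [math.prod(combo) for combo in itertools.product(*marginals)]

# Hypothetical marginals (not taken from the paper).
P = [[0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
Q = [[0.4, 0.6], [0.25, 0.75], [0.6, 0.4]]

tv_product = tv(product_pmf(P), product_pmf(Q))
tv_sum = sum(tv(p, q) for p, q in zip(P, Q))
print(tv_product, tv_sum)  # the first value should not exceed the second
```

In the proof the same bound is applied coordinate-wise to the Poisson mixtures, so a per-symbol bound from Lemma 32 is simply multiplied by the number of symbols.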

It follows from Lemma 28 and Markov’s inequality that

[TABLE]

which together with Lemma 33 implies that

[TABLE]

as long as we choose the constant $d_{2}$ large enough to guarantee that $\chi\leq 5$ .

∎

B-D Proof of Theorem 6

We first present the performance of the estimator $\tilde{P}_{K}^{(1)}(\hat{p}_{i,2},\hat{q}_{i,2})$ when $(p,q)\in\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ .

Lemma 34.

Suppose $(p,q)\in\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}$ , $(n\hat{p},n\hat{q})\sim\mathsf{Poi}(np)\times\mathsf{Poi}(nq)$ . Then,

[TABLE]

for some constant $B>0$ . The estimator $\tilde{P}_{K}^{(1)}$ is introduced in (45), and $K=c_{2}\ln n,c_{2}<c_{1}$ .

We then analyze the estimator $\tilde{P}_{K}^{(2)}(\hat{p}_{i,2},\hat{q}_{i,2};\hat{p}_{i,1},\hat{q}_{i,1})$ when $(p,q)\in U,p+q\geq\frac{c_{1}\ln n}{2n}$ .

Lemma 35.

Suppose $(p,q)\in U,p+q\geq\frac{c_{1}\ln n}{2n},x+y\geq\frac{p+q}{2},x\in[0,1],y\in[0,1]$ , where the set $U$ is defined in (30). Suppose $(n\hat{p},n\hat{q})\sim\mathsf{Poi}(np)\times\mathsf{Poi}(nq)$ . Then,

[TABLE]

for some constant $B>0$ . The estimator $\tilde{P}_{K}^{(2)}$ is introduced in (46), and $K=c_{2}\ln n,c_{2}<c_{1}$ .

Proof.

Recall the “good” events $E_{1},E_{2},E_{3},E_{4}$ defined in (33),(34),(35),(36) and introduce $E=E_{1}\cap E_{2}\cap E_{3}\cap E_{4}$ . We have

[TABLE]

where we have applied Lemma 4 and the constant $\beta$ is defined in (38).

Define the random variables

[TABLE]

where the random index sets $I_{1},I_{2},I_{3},I_{4}$ are defined as

[TABLE]

The index sets $I_{1},I_{2},I_{3},I_{4}$ are independent of the random variables $\{\hat{p}_{i,2}:1\leq i\leq S\}$ and $\{\hat{q}_{i,2}:1\leq i\leq S\}$ . It follows from the definition of the $E_{i}$ ’s that

[TABLE]

Hence, it follows from the Cauchy–Schwarz inequality that

[TABLE]

It follows from the law of total variance that

[TABLE]

where we have used the fact that $\mathbb{E}[\mathcal{E}_{1}|I_{1}]=0$ with probability one, the independence of $\hat{p}_{i,2}$ and $\hat{q}_{i,2}$ , and Lemma 26. Similarly we have $\mathbb{E}\mathcal{E}_{2}^{2}\leq\frac{2}{n}$ .

Regarding $\mathbb{E}\mathcal{E}_{3}^{2}$ , it follows from Lemma 34 and the mutual independence of $\{\hat{p}_{i,2}:1\leq i\leq S\}$ and $\{\hat{q}_{i,2}:1\leq i\leq S\}$ that

[TABLE]

where $\epsilon=c_{2}\ln B$ .

Regarding $\mathbb{E}\left(\mathcal{E}_{4}^{2}\right)$ , it follows from the bias-variance decomposition and Lemma 35 that

[TABLE]

where the constant $B$ is the one in Lemma 35. Taking expectations with respect to $\{(\hat{p}_{i,1},\hat{q}_{i,1}):1\leq i\leq S\}$ , we have

[TABLE]

where $\epsilon=c_{2}\ln B$ .

Combining everything together, we have

[TABLE]

where $\epsilon=c_{2}\ln B$ , and the constant $B$ is the larger constant between the one in Lemma 34 and Lemma 35. The constant $\beta$ is in (38).

If $\ln n\lesssim\ln S$ , we can take $c_{2}$ small enough and $c_{1},c_{3}$ large enough to guarantee that $\frac{S}{n^{\beta}}\lesssim\frac{S}{n\ln n},\frac{\ln^{5}n}{n^{1-\epsilon}}\lesssim\frac{S}{n\ln n}$ . Upon noting that $\hat{L}^{(2)}\in[0,2]$ , we have

[TABLE]

∎

Appendix C Proofs of main lemmas

C-A Proof of Lemma 1

We first consider the case of $q\leq\frac{c_{1}\ln n}{n}$ . In this case,

[TABLE]

where we used Lemma 22 in the last step. When $q>\frac{c_{1}\ln n}{n}$ , we have

[TABLE]

where we applied Lemma 22 again.

C-B Proof of Lemma 2

Since

[TABLE]

it suffices to control $\mathbb{P}(E_{i}^{c}),i=1,2,3$ separately. We have

[TABLE]

Note that if $q_{i}\leq\frac{c_{1}\ln n}{n}$ , then it follows from Lemma 22 that

[TABLE]

If $q_{i}>\frac{c_{1}\ln n}{n}$ , then it follows from Lemma 22 that

[TABLE]

Hence,

[TABLE]

Analogously, $\mathbb{P}(\hat{p}_{i,1}<U_{1}(q_{i}),p_{i}>q_{i})=0$ when $q_{i}\leq\frac{c_{1}\ln n}{n}$ , and when $q_{i}>\frac{c_{1}\ln n}{n}$ ,

[TABLE]

Hence,

[TABLE]

As for $\mathbb{P}(E_{3}^{c})$ , when $q_{i}\leq\frac{c_{1}\ln n}{n}$ ,

[TABLE]

When $q_{i}>\frac{c_{1}\ln n}{n}$ ,

[TABLE]

Consequently,

[TABLE]

C-C Proof of Lemma 3

It is clear that the square $\left[0,\frac{2c_{1}\ln n}{n}\right]^{2}\subset U$ . To see how we obtained the whole expression of $U$ , for any $x>\frac{c_{1}\ln n}{n}$ , we study the envelope of the parametrized extremal points $\left(x-\sqrt{\frac{c_{1}x\ln n}{n}},x+\sqrt{\frac{c_{1}x\ln n}{n}}\right)$ , where the other curve $\left(x+\sqrt{\frac{c_{1}x\ln n}{n}},x-\sqrt{\frac{c_{1}x\ln n}{n}}\right)$ can be dealt with analogously.

For $p=x-\sqrt{\frac{c_{1}x\ln n}{n}},q=x+\sqrt{\frac{c_{1}x\ln n}{n}}$ , we have

[TABLE]

Hence,

[TABLE]

We have that for all points $(p,q)\in\cup_{x\in[0,1]}U(x;c_{1})\times U(x;c_{1})$ ,

[TABLE]

where we used the inequality $\sqrt{p+q}\leq\sqrt{p}+\sqrt{q}$ in the last step.

C-D Proof of Lemma 4

It follows from the union bound that

[TABLE]

Hence, it suffices to analyze each $\mathbb{P}(E_{i}^{c}),i=1,2,3,4$ .

Analysis of $\mathbb{P}(E_{1}^{c})$ :

[TABLE]

It follows from Lemma 3 that the set $U(p_{i};(c_{1}+c_{3})/2)\times U(p_{i};(c_{1}+c_{3})/2)\subset U_{1}$ . Hence,

[TABLE]

It follows from Lemma 1 that

[TABLE]

Analysis of $\mathbb{P}(E_{2}^{c})$ : following similar steps as in the analysis of $\mathbb{P}(E_{1}^{c})$ , we have $\mathbb{P}(E_{2}^{c})\leq\frac{4S}{n^{\frac{c_{1}+c_{3}}{6}}}$ .

Analysis of $\mathbb{P}(E_{3}^{c})$ :

[TABLE]

where we have used the fact that $n\hat{p}_{i,1}+n\hat{q}_{i,1}\sim\mathsf{Poi}(np+nq)$ and Lemma 22.

Analysis of $\mathbb{P}(E_{4}^{c})$ :

[TABLE]

We have

[TABLE]

and

[TABLE]

It suffices to show that there exists some constant $c>0$ such that

[TABLE]

where $U(\cdot;c)$ is defined in (10). Indeed, in this case it follows from Lemma 1 that

[TABLE]

Now we work to prove (279). Without loss of generality we assume $(p,q)$ satisfies $\sqrt{q}-\sqrt{p}\geq\sqrt{\frac{2c_{1}\ln n}{n}}$ and the constant $c<c_{1}$ . Under this assumption we have $q\geq\frac{2c_{1}\ln n}{n}$ . We will show that for any point $(x,y)\in U(p;c)\times U(q;c)$ , we have $\sqrt{y}-\sqrt{x}\geq\sqrt{\frac{(c_{1}+c_{3})\ln n}{n}}$ , thereby proving (279).

If $p\leq\frac{c\ln n}{n}$ , we have for any $(x,y)\in U(p;c)\times U(q;c)$ ,

[TABLE]

where in the second step we used the fact that the function $x-\sqrt{ax},a>0$ is monotonically increasing when $x\geq a/4$ . Hence, we need to guarantee that

[TABLE]

which can be reduced to the quadratic inequality:

[TABLE]

One can easily verify that $c=\frac{(c_{1}-c_{3})^{2}}{32c_{1}}$ satisfies this inequality since $0<c_{3}<c_{1}$ .

Now we consider the case of $p>\frac{c\ln n}{n}$ . Then, for any $(x,y)\in U(p;c)\times U(q;c)$ ,

[TABLE]

Further, since $p>\frac{c\ln n}{n}$ ,

[TABLE]

where we used the fact that $\frac{x+\sqrt{p}}{x+\sqrt{2p}}$ is a monotonically increasing function of $x$ when $x\geq 0$ , and the function $\frac{2x+a}{(\sqrt{2}+1)x+a}$ is a monotonically decreasing function of $x$ when $a>0,x>0$ . To guarantee that $\sqrt{y}-\sqrt{x}\geq\sqrt{\frac{(c_{1}+c_{3})\ln n}{n}}$ , we need

[TABLE]

which is equivalent to

[TABLE]

with the constraint that $\frac{c_{3}}{c_{1}}<\frac{8}{(\sqrt{2}+1)^{2}}-1\approx 0.373$ .
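The numerical constant and the three monotonicity facts invoked in this proof are easy to confirm on a grid. The following sketch is only a sanity check under hypothetical parameter values ( `p = 0.3` , `a = 0.7` ); it is not part of the argument.

```python
import math

# The constant in the constraint on c3 / c1.
ratio_cap = 8 / (math.sqrt(2) + 1) ** 2 - 1
print(round(ratio_cap, 3))  # 0.373

def f1(x, p):
    """(x + sqrt(p)) / (x + sqrt(2p)), claimed increasing in x >= 0."""
    return (x + math.sqrt(p)) / (x + math.sqrt(2 * p))

def f2(x, a):
    """(2x + a) / ((sqrt(2) + 1) x + a), claimed decreasing in x > 0."""
    return (2 * x + a) / ((math.sqrt(2) + 1) * x + a)

def f3(x, a):
    """x - sqrt(a x), claimed increasing for x >= a / 4."""
    return x - math.sqrt(a * x)

p, a = 0.3, 0.7  # hypothetical values for the check
xs = [0.01 * i for i in range(1, 500)]
f1_increasing = all(f1(u, p) < f1(v, p) for u, v in zip(xs, xs[1:]))
f2_decreasing = all(f2(u, a) > f2(v, a) for u, v in zip(xs, xs[1:]))
ys = [a / 4 + 0.01 * i for i in range(500)]
f3_increasing = all(f3(u, a) <= f3(v, a) for u, v in zip(ys, ys[1:]))
print(f1_increasing, f2_decreasing, f3_increasing)
```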

C-E Proof of Lemma 5

We consider two different parameter settings.

$S\ll n\lesssim S\ln S$ : In this case, we construct the distribution $P$ as follows. (Footnote 2: Technically, the distribution $P$ has support no more than $S$ . However, a standard continuity argument implies that the same conclusion holds.)

[TABLE]

where $c>2c_{1}$ is a constant that will be chosen later, and $Q=P$ . Without loss of generality we assume $\frac{n}{c\ln n}$ is an integer. We now argue that for each index $1\leq i\leq\frac{n}{c\ln n}$ ,

[TABLE]

It follows from Lemma 22 that $\mathbb{P}\left(\hat{p}_{i}\leq\frac{2c_{1}\ln n}{n}\right)\leq e^{-\frac{1}{2}(1-2c_{1}/c)^{2}c\ln n}=n^{-\beta}$ , where $\beta=\frac{c}{2}\left(1-\frac{2c_{1}}{c}\right)^{2}$ . Note that $\beta$ can be made arbitrarily large by taking the constant $c$ large. Define $E=\left\{\hat{p}_{i}\geq\frac{2c_{1}\ln n}{n},\hat{q}_{i}\geq\frac{2c_{1}\ln n}{n}\right\}$ . We have

[TABLE]

Since $|g|\leq B$ , we have

[TABLE]

It follows from the triangle inequality that

[TABLE]

It follows from the conditional version of Jensen’s inequality that $\mathbb{E}|\hat{p}_{i}-\hat{q}_{i}|\geq\mathbb{E}|\hat{p}_{i}-p_{i}|$ , and by Lemma 24 we have

[TABLE]

Since $\frac{\sqrt{\ln n}}{n}\gg\frac{2(B+1)}{n^{\beta}}$ for $\beta>1$ , we conclude that (300) is true. Hence, the squared bias of $\hat{L}$ is at least of order $\left(\frac{n}{c\ln n}\frac{\sqrt{\ln n}}{n}\right)^{2}\asymp\frac{1}{\ln n}\gg\frac{S}{n\ln n}$ since $S\ll n$ .

$n\gg S\ln S$ : In this case, we construct $P,Q$ to be uniform distributions with support size $S$ . Since $\frac{1}{S}\gg\frac{\ln n}{n}$ , it follows from arguments analogous to those above that the squared bias of $\hat{L}$ is at least of order $\left(S\sqrt{\frac{1}{2Sn}}\right)^{2}=\frac{S}{2n}\gg\frac{S}{n\ln n}$ .
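The tail estimate $\mathbb{P}(\hat{p}_{i}\leq\frac{2c_{1}\ln n}{n})\leq n^{-\beta}$ from the first case can be compared with the exact Poisson probability. The sketch below assumes that $n\hat{p}_{i}\sim\mathsf{Poi}(c\ln n)$ (read off from the exponent in the displayed bound) and that Lemma 22 supplies the standard Chernoff bound $\mathbb{P}(X\leq(1-\delta)\lambda)\leq e^{-\delta^{2}\lambda/2}$ for $X\sim\mathsf{Poi}(\lambda)$ . The values of `c1` and `c` are hypothetical.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poi(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def lower_tail_and_bound(n, c1, c):
    """Exact P(n * p_hat <= 2 c1 ln n) and the bound n^(-beta)."""
    lam = c * math.log(n)             # assumed Poisson mean, c ln n
    threshold = 2 * c1 * math.log(n)  # n times (2 c1 ln n / n)
    exact = poisson_cdf(math.floor(threshold), lam)
    beta = 0.5 * c * (1 - 2 * c1 / c) ** 2
    return exact, n ** (-beta)

for n in (100, 1000, 10000):
    exact, bound = lower_tail_and_bound(n, c1=1.0, c=6.0)
    print(n, exact, bound)
```

Increasing `c` relative to `c1` raises $\beta$ , which is the sense in which $\beta$ can be made arbitrarily large in the proof.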

C-F Proof of Lemma 6

Since $\left||a|-|b|\right|\leq|a-b|$ , it suffices to show that there exists a universal constant $C>0$ such that

[TABLE]

for $|t|\leq 1$ . Define $\sqrt{x}=|t|$ . Since $Q_{K}(t)$ is even, it follows that $Q_{K}(t)=R(t^{2})$ , where $R\in\mathsf{poly}_{K}$ is a polynomial. The polynomial $R$ satisfies the following:

[TABLE]

It suffices to show that $|R(x)|\leq CKx$ . Let $T(x)\in\mathsf{poly}_{K}$ denote the best approximation polynomial of the function $\sqrt{x}$ on $[0,1]$ with order no more than $K$ . It follows from Lemma 9 and Lemma 10 that $\sup_{x\in[0,1]}|T(x)-\sqrt{x}|\lesssim\frac{1}{K}$ . It follows from the triangle inequality that

[TABLE]

It follows from the Markov inequality (Lemma 13) that $\sup_{x\in[0,1]}|R^{\prime}(x)-T^{\prime}(x)|\lesssim K$ . Since for any $0\leq x\leq 1$ ,

[TABLE]

it suffices to show $\left|\int_{0}^{x}T^{\prime}(u)du\right|\lesssim Kx$ .

It follows from Lemma 14 and Lemma 10 that

[TABLE]

Hence, it follows from Lemma 15 that

[TABLE]

Hence,

[TABLE]

The proof is complete.

C-G Proof of Lemma 7

We prove the lemma by contradiction. Assume the contrary. Then there exist universal constants $c,C>0$ and a polynomial $P(x,y)\in\mathsf{poly}_{K}^{2}$ of degree $K=c\ln n$ such that

[TABLE]

where $U^{\prime}=\cup_{x\in[\frac{c_{1}\ln n}{n},t_{n}]}U(x;c_{1})\times U(x;c_{1})$ . Now for any $t\in[\frac{c_{1}\ln n}{n},t_{n}]$ , we have $(t-\frac{1}{2}\sqrt{\frac{c_{1}t\ln n}{n}},t+\frac{1}{2}\sqrt{\frac{c_{1}t\ln n}{n}})\in U^{\prime}$ , and plugging in this pair yields

[TABLE]

Similarly, for $(t+\frac{1}{2}\sqrt{\frac{c_{1}t\ln n}{n}},t-\frac{1}{2}\sqrt{\frac{c_{1}t\ln n}{n}})\in U^{\prime}$ we also have

[TABLE]

Now consider

[TABLE]

It is easy to see that $Q(t)$ is a polynomial in $t$ with $\deg Q\leq 2K$ . Moreover, adding the previous two inequalities together, by the triangle inequality we obtain

[TABLE]

Since $t_{n}\gg\frac{(\ln n)^{3}}{n}$ , we have $\eta_{n}\triangleq\frac{c_{1}\ln n}{nt_{n}}\ll\frac{1}{K^{2}}$ . Defining $R(t)=t_{n}^{-\frac{1}{2}}Q(t_{n}\cdot t)$ for $t\in[\eta_{n},1]$ , (328) becomes

[TABLE]

Moreover, $\deg R\leq 2K$ . Now let $S$ be the best degree- $2K$ approximating polynomial of $\sqrt{t}$ in the uniform norm on $[\eta_{n},1]$ . Using the second-order Ditzian–Totik modulus of smoothness (Lemma 9) and $\eta_{n}\ll\frac{1}{K^{2}}$ , we arrive at

[TABLE]

Furthermore, following the proof of Lemma 6 we can prove that

[TABLE]

As a result, by triangle inequality we have

[TABLE]

Since $R(t)-S(t)$ is also a polynomial of degree $\leq 2K$ , by Markov’s inequality (Lemma 13)

[TABLE]

and finally by triangle inequality again

[TABLE]

Now we are about to arrive at the desired contradiction. Choosing $t=\eta_{n}$ and $t=2\eta_{n}$ in (329), we have

[TABLE]

with $D>0$ a suitable universal constant appearing in the RHS of (329). As a result,

[TABLE]

and by the mean value theorem we conclude that there exists some $\xi\in[\eta_{n},2\eta_{n}]$ such that

[TABLE]

where the last inequality follows from the fact that $\eta_{n}\ll\frac{1}{K^{2}}$ . However, this inequality contradicts our previous result (336), and thus we are done. ∎

C-H Proof of Lemma 8

We have

[TABLE]

where the last step follows from the result of minimax risk for estimating the discrete distribution $P$ under $\ell_{1}$ loss in [31, Cor. 4].

C-I Proof of Lemma 27

To simplify the notation we denote $\Delta=\frac{c_{1}\ln n}{n}$ . We split the proof into two cases: $q\leq\Delta$ and $q>\Delta$ .

The case $q\leq\Delta,p\in U(q;c_{1})=[0,2\Delta]$ . In this case, it follows from (18) that $P_{K}(x;q)$ is the best approximation polynomial of function $|x-q|$ over $x\in[0,2\Delta]$ . Define $y=\frac{x}{2\Delta}$ and introduce function

[TABLE]

Define the best approximation polynomial of $g(y)\in C[0,1]$ with order $K$ as

[TABLE]

It follows from Lemma 9 and Lemma 12 that there exists a universal constant $M>0$ such that

[TABLE]

Since the approximation performance of $H_{K}(y)$ is at least as good as that of a constant, and $\max_{y\in[0,1]}|g(y)|\lesssim\Delta$ , we know that there exists another universal constant $M_{1}>0$ such that

[TABLE]

Denoting $H_{K}(y)=\sum_{j=0}^{K}a_{j}y^{j}$ and using $x=2\Delta y$ , we know

[TABLE]

It follows from Lemma 18 that $\prod_{k=0}^{j-1}\left(\hat{p}-\frac{k}{n}\right)$ is the unique uniformly minimum variance unbiased estimator of $p^{j}$ when $n\hat{p}\sim\mathsf{Poi}(np)$ . Hence,

[TABLE]

and $\mathbb{E}\tilde{P}_{K}(\hat{p};q)=P_{K}(p;q)$ . Since $H_{K}(z^{2})=\sum_{j=0}^{K}a_{j}z^{2j}$ is a polynomial with degree no more than $2K$ and satisfies

[TABLE]

it follows from Lemma 17 that for all $0\leq j\leq K$ ,

[TABLE]

Now we prove the variance properties of $\tilde{P}_{K}(\hat{p};q)$ . We have

[TABLE]

where we have applied Lemma 18 with $M=\max\{2n\Delta,K\}=2n\Delta$ since we have assumed $c_{2}<c_{1},K=c_{2}\ln n$ .

Since $p\leq 2\Delta$ , we have

[TABLE]

where $B>0$ is some universal constant.

The case $q>\Delta$ . In this case, it follows from (18) that $P_{K}(x;q)$ is the best approximation polynomial of function $|x-q|$ over $x\in[q-\sqrt{q\Delta},q+\sqrt{q\Delta}]$ . Denote the best approximation polynomial of $|y|$ on $[-1,1]$ with order $K$ as

[TABLE]

Using $x=q+y\sqrt{q\Delta}$ , we have

[TABLE]

It is well known [20, Chap. 9, Thm. 3.3] that there exists a universal constant $M_{3}$ such that

[TABLE]

Consequently, for $p\in[q-\sqrt{q\Delta},q+\sqrt{q\Delta}]$ ,

[TABLE]

It follows from Lemma 18 that $g_{j,q}(\hat{p}),n\hat{p}\sim\mathsf{Poi}(np)$ defined as

[TABLE]

is the unique uniformly minimum variance unbiased estimator for $(p-q)^{j},j\geq 0,j\in\mathbb{N}$ . Hence,

[TABLE]

It was shown in Cai and Low [11, Lemma 2] that $|r_{j}|\leq 2^{3K},0\leq j\leq K$ . We study the variance properties of $\tilde{P}_{K}(\hat{p};q)$ as follows.

Define $M_{4}\triangleq\max\{K,\frac{n(p-q)^{2}}{p}\}$ . Note that if $p=0$ the variance of this $\tilde{P}_{K}(\hat{p};q)$ is zero. We now consider $p\neq 0$ . Applying Lemma 18 and the fact that the standard deviation of a sum of random variables is upper bounded by the sum of standard deviations of corresponding random variables, we have

[TABLE]

where $c=\max\{\sqrt{2},2\sqrt{c_{2}/c_{1}}\}$ , and $B>0$ is some universal constant. Recall that $K=c_{2}\ln n,\Delta=\frac{c_{1}\ln n}{n}$ . It suffices to show $\sqrt{\frac{2M_{4}p}{nq\Delta}}\leq c$ to complete the proof. Indeed, we have

[TABLE]

C-J Proof of Lemma 30

It is clear that $\sup_{x\in[0,1]}|f(x;a)|\leq 1$ . Introduce

[TABLE]

where $\eta=\frac{a}{D},D>1$ . We have $E_{L}[f(x;a);[\eta,1]]=E_{L}[f_{\eta}(x;a);[0,1]]$ . Recall the second-order Ditzian–Totik modulus of smoothness given in (62)

[TABLE]

where $\varphi=\sqrt{x(1-x)},\Delta_{h\varphi}^{2}f(x)=f(x+h\varphi)+f(x-h\varphi)-2f(x)$ .
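To make the second-order difference concrete, the sketch below evaluates it on a grid for the kink function $f(x)=|x-a|$ . It assumes the usual form $\omega_{\varphi}^{2}(f,t)=\sup_{0<h\leq t}\sup_{x}|\Delta_{h\varphi}^{2}f(x)|$ with the supremum restricted to $x\pm h\varphi(x)\in[0,1]$ (the display (62) is not reproduced in this section). The values `a = 0.25` and `t = 0.1` and the grid sizes are hypothetical.

```python
import math

def phi(x):
    """Step weight phi(x) = sqrt(x (1 - x))."""
    return math.sqrt(x * (1 - x))

def second_difference(f, x, step):
    """f(x + step) + f(x - step) - 2 f(x)."""
    return f(x + step) + f(x - step) - 2 * f(x)

def dt_modulus2(f, t, grid=400, h_steps=50):
    """Grid approximation of the second-order Ditzian-Totik modulus on [0, 1]."""
    best = 0.0
    for j in range(1, h_steps + 1):
        h = t * j / h_steps
        for i in range(grid + 1):
            x = i / grid
            step = h * phi(x)
            if x - step < 0 or x + step > 1:
                continue  # keep both evaluation points inside [0, 1]
            best = max(best, abs(second_difference(f, x, step)))
    return best

a, t = 0.25, 0.1  # hypothetical kink location and scale
omega = dt_modulus2(lambda x: abs(x - a), t)
print(omega, 2 * t * phi(a))
```

For this function the second difference with step $s$ equals $2\max\{0,s-|x-a|\}$ , so the grid value is attained at the kink $x=a$ with $h=t$ , which matches the strategy of evaluating the difference at the kink when lower bounding $\omega_{\varphi}^{2}$ .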

We deal with the two cases separately.

$\frac{1}{L^{2}}\leq a\leq\frac{1}{2}$ : Denote $\delta=\frac{1}{DL}\varphi\left(\frac{a-\eta}{1-\eta}\right)$ . It is easy to verify that if $\frac{1}{L^{2}}\leq a\leq\frac{1}{2},D\geq 3$ , then $\frac{1}{1+(DL)^{2}}\leq\frac{a-\eta}{1-\eta}\leq\frac{(DL)^{2}}{1+(DL)^{2}}$ , which ensures that $\frac{a-\eta}{1-\eta}\pm\delta\in[0,1]$ . We lower bound $\omega_{\varphi}^{2}(f,t)$ for $f_{\eta}$ as follows:

[TABLE]

Since

[TABLE]

we have

[TABLE]

The relationship between $\omega_{\varphi}^{2}(f,\frac{1}{n})$ and $E_{n}[f;[0,1]]$ is given in [24, Thm. 7.2.4]: there exists a universal positive constant $M_{2}$ such that

[TABLE]

Utilizing the non-increasing property of $E_{n}[f_{\eta};[0,1]]$ with respect to $n$ yields

[TABLE]

Now we work out an upper bound on $E_{k}[f_{\eta};[0,1]]$ . It follows from Lemma 9 that there exists a universal constant $M_{1}$ such that

[TABLE]

where $\omega_{\varphi}^{1}(f,t)=\sup_{0<h\leq t}\sup_{x}|\Delta_{h\varphi}^{1}f(x)|$ and $\Delta_{h\varphi}^{1}f(x)=f(x+h\varphi/2)-f(x-h\varphi/2)$ . It follows from straightforward algebra that $\omega_{\varphi}^{1}(f_{\eta},\frac{1}{k})\lesssim\frac{1}{k\sqrt{a}}$ . Hence,

[TABLE]

Since $\delta=\frac{1}{DL}\varphi\left(\frac{a-\eta}{1-\eta}\right),\eta=\frac{a}{D},\frac{1}{L^{2}}\leq a\leq\frac{1}{2}$ , for $D$ large enough, we know $\delta\gtrsim\frac{\sqrt{a}}{DL}$ and there exist two universal constants $M_{3}>0,M_{4}>0$ such that

[TABLE]

where we used the fact that $\frac{\sqrt{a}}{DL}\leq a$ for $a\geq\frac{1}{L^{2}},D\geq 1$ . Hence,

[TABLE]

when $D$ is large enough.

$0<a<\frac{1}{L^{2}}$ : for $D>1$ we have

[TABLE]

where $\epsilon=\min\left\{\frac{1}{DL}\varphi\left(\frac{a-\eta}{1-\eta}\right),\frac{a-\eta}{1-\eta}\right\}$ .

Since

[TABLE]

it suffices to lower bound $f_{\eta}\left(\frac{a-\eta}{1-\eta}+\epsilon\right)$ to lower bound $\omega_{\varphi}^{2}(f_{\eta},(DL)^{-1})$ . Note that the function $f_{\eta}(\cdot)$ is a non-decreasing function.

We have $\frac{1}{DL}\varphi\left(\frac{a-\eta}{1-\eta}\right)\gtrsim\frac{\sqrt{a}}{DL}$ for $D$ large enough, and

[TABLE]

where we used the fact that the function $\frac{\sqrt{x}}{\sqrt{x}+DLx}$ is a non-increasing function for $x\geq 0$ .

Hence, we have shown that

[TABLE]

Following (393), we have that

[TABLE]

when $D$ is large enough. Here we used the fact that $E_{k}[f_{\eta};[0,1]]\leq 1$ .

C-K Proof of Lemma 32

We have

[TABLE]

For each $j\geq 0,j\in\mathbb{Z}$ , we introduce function

[TABLE]

where $x\in[-1,1]$ . We introduce $a+MX_{i}=U_{i},i=0,1$ . It follows from the assumptions that $\mathbb{E}[X_{0}^{j}]=\mathbb{E}[X_{1}^{j}],0\leq j\leq L$ .

We write the series expansion of $f_{j}(x)$ as follows:

[TABLE]

Hence,

[TABLE]

where we used the fact that $X_{i}\in[-1,1],i=0,1$ .

It follows from the Leibniz formula for derivatives of products of functions that

[TABLE]

Hence,

[TABLE]

Construct random variable $Z\sim\mathsf{Poi}(a)$ . Then,

[TABLE]

where $(j)_{m}=j(j-1)\cdots(j-m+1)$ .

Consequently,

[TABLE]

where $g_{k,q}(Z)$ is the estimator introduced in Lemma 18 for the case of $n=1$ .

It follows from Lemma 18 that

[TABLE]

Hence, it follows from $k!\geq\frac{k^{k}}{e^{k}}$ that

[TABLE]

It follows from the assumptions that

[TABLE]

Consequently,

[TABLE]

C-L Proof of Lemma 33

We define the minimax risk under the multinomial sampling model for a fixed $Q$ as

[TABLE]

Fix $\delta>0$ . Let $\hat{L}=\hat{L}(X_{1},X_{2},\ldots,X_{S})$ be a near-minimax estimator of $\|P-Q\|_{1}$ under the multinomial model for every sample size $n$ , which means that for every sample size $n$ ,

[TABLE]

Here the random vector $(X_{1},X_{2},\ldots,X_{S})$ follows multinomial distribution parametrized by $n,P$ , and the estimator $\hat{L}$ obtains the number of samples $n$ from this random vector.

Now we consider the Poisson sampling model, where $X_{i}$ ’s are mutually independent with marginal distributions $X_{i}\sim\mathsf{Poi}(np_{i})$ . Let $n^{\prime}=\sum_{i=1}^{S}X_{i}\sim\mathsf{Poi}(n\sum_{i=1}^{S}p_{i})$ . We use the estimator $\hat{L}(X_{1},X_{2},\ldots,X_{S})$ to estimate $\|P-Q\|_{1}$ under the Poisson sampling model. For any $P\in\mathcal{M}_{S}(\epsilon)$ under the Poisson sampling model, we have

[TABLE]

where we used the fact that $(a+b)^{2}\leq 2a^{2}+2b^{2}$ for any $a,b\in\mathbb{R}$ , and the fact that if $\sum_{i=1}^{S}p_{i}=A$ , then

[TABLE]

Then,

[TABLE]

where we used the fact that conditioned on $n^{\prime}=m$ , $(X_{1},X_{2},\ldots,X_{S})$ follows multinomial distribution parametrized by $\left(m,\frac{P}{\sum_{i=1}^{S}p_{i}}\right)$ , the monotonicity of $R(S,m,Q)$ as a function of $m$ , $R(S,m,Q)\leq 1$ , and Lemma 22.
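The conditioning fact used above can be verified exactly on a small example: for independent $X_{i}\sim\mathsf{Poi}(np_{i})$ with $\sum_{i}p_{i}=A$ , the law of $(X_{1},\ldots,X_{S})$ given $\sum_{i}X_{i}=m$ is multinomial with parameters $(m,P/A)$ . The sketch below uses a hypothetical approximate probability vector with $A=0.9$ ; the function names are illustrative only.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poi(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def multinomial_pmf(counts, probs):
    """Multinomial pmf with sum(counts) trials and cell probabilities probs."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    value = float(coef)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value

def conditional_poisson_pmf(counts, n, p):
    """P(X = counts | sum X = m) under independent X_i ~ Poi(n p_i)."""
    joint = math.prod(poisson_pmf(c, n * pi) for c, pi in zip(counts, p))
    total = poisson_pmf(sum(counts), n * sum(p))
    return joint / total

p = [0.2, 0.3, 0.4]   # hypothetical vector with sum A = 0.9
counts = [2, 1, 3]
A = sum(p)
lhs = conditional_poisson_pmf(counts, n=7, p=p)
rhs = multinomial_pmf(counts, [pi / A for pi in p])
print(lhs, rhs)
```

The normalization by $A$ is what allows an estimator designed for the multinomial model to be applied to vectors in $\mathcal{M}_{S}(\epsilon)$ whose entries do not sum exactly to one.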

Taking supremum of $\mathbb{E}_{P}\left(\hat{L}-\|P-Q\|_{1}\right)^{2}$ over $\mathcal{M}_{S}(\epsilon)$ and using the arbitrariness of $\delta$ , we have

[TABLE]

which is equivalent to

[TABLE]

It follows from [7, Lemma 16] that $R(S,n,Q)\leq 2R_{P}(S,n/2,Q,0)$ . Hence,

[TABLE]

C-M Proof of Lemma 34

We first analyze the bias. To simplify the notation we denote $\Delta=\frac{c_{1}\ln n}{n}$ . It follows from the definition of $\tilde{P}_{K}^{(1)}$ that for $(p,q)\in\left[0,2\Delta\right]^{2}$ ,

[TABLE]

where $h_{2K}(x,y)=u_{K}(x,y)v_{K}(x,y)-u_{K}(0,0)v_{K}(0,0)$ , and $u_{K}(x,y)$ and $v_{K}(x,y)$ satisfy (1).

We first argue that there exists a universal constant $M>0$ such that $\sup_{(x,y)\in[0,1]^{2}}|u_{K}(x,y)v_{K}(x,y)-|x-y||\leq M\left(\frac{\sqrt{x}+\sqrt{y}}{K}+\frac{1}{K^{2}}\right)$ . Indeed,

[TABLE]

It follows from Lemma 9 and Lemma 11 that the best polynomial approximation error of $(\sqrt{x}+\sqrt{y})$ and $|\sqrt{x}-\sqrt{y}|$ over the unit square are both of order $\frac{1}{K}$ . Hence,

[TABLE]

which implies that there exists another constant $M>0$ such that

[TABLE]

Denoting $x=\frac{p}{2\Delta},y=\frac{q}{2\Delta}$ , we have

[TABLE]

We now analyze the variance. Express the polynomial $h_{2K}(x,y)\in\mathsf{poly}_{2K}^{2}$ explicitly as

[TABLE]

For any fixed value of $y$ , $h_{2K}(x^{2},y^{2})$ is a polynomial of $x$ with degree no more than $4K$ that is uniformly bounded by a universal constant on $[-1,1]$ . It follows from Lemma 17 that for any fixed $y\in[-1,1]$ ,

[TABLE]

which, together with Lemma 17, implies that

[TABLE]

Since $\tilde{P}_{K}^{(1)}$ is the unbiased estimator of $2\Delta h_{2K}\left(\frac{p}{2\Delta},\frac{q}{2\Delta}\right)$ , we know

[TABLE]

where $g_{j,q}(\hat{p})$ is the unbiased estimator for $(p-q)^{j}$ introduced in Lemma 18.

Denote $\|X\|_{2}=\sqrt{\mathbb{E}(X-\mathbb{E}X)^{2}}$ and $M_{1}=2K\vee 2n\Delta$ . Using the triangle inequality of the norm $\|\cdot\|_{2}$ and Lemma 18, we know

[TABLE]

Since for any $x\in[0,1],y\in[0,1]$ ,

[TABLE]

we know

[TABLE]

for some constant $B>0$ . Hence,

[TABLE]

C-N Proof of Lemma 35

We first analyze the bias. It follows from the definition of $\tilde{P}_{K}^{(2)}$ that

[TABLE]

where $W=\sqrt{\frac{8c_{1}\ln n}{n}}\sqrt{(x+y)\vee\frac{1}{n}}$ .

Since $(p,q)\in U$ , we know

[TABLE]

where we have used the fact that $\sqrt{p}+\sqrt{q}\leq\sqrt{2(p+q)}$ and the assumption that $p+q\leq 2(x+y)$ .

Hence, it follows from the property that the best degree- $K$ polynomial approximation error of $|t|$ over $[-1,1]$ is $\Theta(\frac{1}{K})$ [20, Chap. 9, Thm. 3.3] that

[TABLE]
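The $\frac{1}{K}$ rate for approximating $|t|$ on $[-1,1]$ can be seen from an explicit polynomial. The sketch below does not compute the best approximation; it uses the truncated Chebyshev series $|t|=\frac{2}{\pi}+\frac{4}{\pi}\sum_{k\geq 1}\frac{(-1)^{k+1}}{4k^{2}-1}T_{2k}(t)$ , whose uniform error at even degree $K$ equals $\frac{2}{\pi(K+1)}$ and therefore upper bounds the best degree- $K$ error. The matching lower bound is the content of the cited theorem and is not checked here.

```python
import math

def truncated_chebyshev_abs(t, K):
    """Chebyshev series of |t| on [-1, 1], truncated at even degree K."""
    theta = math.acos(t)
    total = 2 / math.pi
    for k in range(1, K // 2 + 1):
        # T_{2k}(t) = cos(2k * arccos t)
        total += (4 / math.pi) * (-1) ** (k + 1) * math.cos(2 * k * theta) / (4 * k * k - 1)
    return total

def sup_error(K, grid=2000):
    """Maximum error over a uniform grid on [-1, 1] that contains t = 0."""
    worst = 0.0
    for i in range(grid + 1):
        t = -1 + 2 * i / grid
        worst = max(worst, abs(truncated_chebyshev_abs(t, K) - abs(t)))
    return worst

for K in (4, 8, 16, 32):
    print(K, sup_error(K), 2 / (math.pi * (K + 1)))
```

The error is largest at the kink $t=0$ , where all the omitted series terms have the same sign.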

Then we analyze the variance. It was shown in Cai and Low [11, Lemma 2] that $|r_{j}|\leq 2^{3K},0\leq j\leq K$ . Denote the unbiased estimator of $(p-q)^{j}$ by $\hat{A}_{j}(\hat{p},\hat{q})$ and introduce the norm $\|X\|_{2}=\sqrt{\mathbb{E}(X-\mathbb{E}[X])^{2}}$ . It follows from the triangle inequality of the norm $\|X\|_{2}$ and the fact that constants have zero variance that

[TABLE]

It follows from Lemma 19 that

[TABLE]

Hence,

[TABLE]

where

[TABLE]

Consequently,

[TABLE]

where $B$ is a constant.

Appendix D Proofs of auxiliary lemmas

D-A Proof of Lemma 11

We split the analysis of $|f(x+h\varphi/2)-f(x-h\varphi/2)|$ into two cases:

$x-\frac{h\varphi}{2}\geq a$ or $x+\frac{h\varphi}{2}\leq a$ : in this case,

[TABLE]

where we have used the fact that $\sqrt{x}+\sqrt{y}\geq\sqrt{x+y}$ and $0<h\leq t$ .

$x-\frac{h\varphi}{2}<a<x+\frac{h\varphi}{2}$ : in this case

[TABLE]

D-B Proof of Lemma 12

It follows from taking derivatives that for convex function $f(x)$ , the function $f(x-t)-2f(x)+f(x+t)$ is a nondecreasing function of $t$ . Since $f(x)=|2x\Delta-q|$ is a convex function, it follows from straightforward algebra that

[TABLE]

where

[TABLE]

We break the proof into three parts.

We first prove that when $\frac{1}{1+K^{2}}\leq\frac{q}{2\Delta}\leq\frac{K^{2}}{1+K^{2}}$ , the maximum is achieved by $A_{1}(z)$ at $z=\frac{q}{2\Delta}$ .

Consider first the case $\frac{1}{1+K^{2}}\leq z\leq\frac{K^{2}}{1+K^{2}}$ and function $A_{1}(z)$ . If $z>\frac{q}{2\Delta}$ , without loss of generality we can assume $z-\frac{\sqrt{z(1-z)}}{K}<\frac{q}{2\Delta}$ , since otherwise $A_{1}(z)=0$ . Then,

[TABLE]

Taking derivative with respect to $z$ , it suffices to show this derivative is non-positive when $\frac{1}{1+K^{2}}\leq\frac{q}{2\Delta}\leq\frac{K^{2}}{1+K^{2}},z\geq\frac{q}{2\Delta},z-\frac{\sqrt{z(1-z)}}{K}<\frac{q}{2\Delta}$ . We have the derivative expressed as

[TABLE]

Since $1-2z-2K\sqrt{z(1-z)}$ is a convex function, it achieves its maximum at the endpoints, and it is negative at both $z=\frac{1}{1+K^{2}}$ and $z=1$ . Similar arguments work for the case of $z<\frac{q}{2\Delta}$ . Hence, we conclude that when $\frac{1}{1+K^{2}}\leq\frac{q}{2\Delta}\leq\frac{K^{2}}{1+K^{2}}$ ,

[TABLE]

Consider the case $z>\frac{K^{2}}{1+K^{2}}$ and the function $A_{3}(z)$ . It suffices to assume $2z-1\leq\frac{q}{2\Delta}$ since otherwise $A_{3}(z)=0$ . In this case

[TABLE]

which is a decreasing function in $z$ , implying $\max_{z>\frac{K^{2}}{1+K^{2}}}A_{3}(z)\leq\max_{\frac{1}{1+K^{2}}\leq z\leq\frac{K^{2}}{1+K^{2}}}A_{1}(z)$ . Similar arguments work for the $z<\frac{1}{1+K^{2}}$ and $A_{2}(z)$ case.

We now prove that when $\frac{q}{2\Delta}\leq\frac{1}{1+K^{2}}$ , the maximum is achieved by $A_{2}(z)$ at $\frac{q}{2\Delta}$ .

In this case, it suffices to consider $z\leq\frac{1}{1+K^{2}}$ . Indeed, if $z>\frac{1}{1+K^{2}}$ , then the second order difference in the non-zero case is given by (1), which is shown to be a decreasing function when $z>\frac{1}{1+K^{2}}$ . Now consider $z\leq\frac{1}{1+K^{2}}$ . We discuss three cases separately:

(a) $2z\leq\frac{q}{2\Delta}$ : in this case, $A_{2}(z)=0$ .

(b) $z\leq\frac{q}{2\Delta}\leq 2z$ : in this case,

[TABLE]

which is an increasing function of $z$ . It implies that in this regime one should take $z=\frac{q}{2\Delta}$ . The resulting $A_{2}(z)$ is $2q$ .

(c) $\frac{q}{2\Delta}\leq z$ : in this case, the second order difference is

[TABLE]

which is independent of $z$ .

Hence, we have shown that for $\frac{q}{2\Delta}\leq\frac{1}{1+K^{2}}$ , the maximum is achieved by $A_{2}(z)$ and

[TABLE]

The case of $2\Delta\geq\frac{q}{2\Delta}\geq\frac{K^{2}}{1+K^{2}}$ can be dealt with in a fashion similar to the case of $\frac{q}{2\Delta}\leq\frac{1}{1+K^{2}}$ , resulting in

[TABLE]

D-C Proof of Lemma 16

It suffices to show that for any polynomial $Q\in\mathsf{poly}_{2K}$ ,

[TABLE]

Define

[TABLE]

It is clear that $e(z)$ is an even function, $o(z)$ is an odd function, and $Q(z)=e(z)+o(z)$ . We have

[TABLE]

where we have used the fact that $\max\{a,b\}\geq\frac{a+b}{2}$ and the convexity of the function $|z|$ .

There exists another polynomial $U_{K}(z)\in\mathsf{poly}_{K}$ such that $U_{K}(z^{2})=e(z)$ . Hence, for any $Q\in\mathsf{poly}_{2K}$ ,

[TABLE]

where we used the definition of $P_{K}(z)$ . The proof is complete.

D-D Proof of Lemma 18

The Charlier polynomial $c_{n}(x,a),a>0$ is defined as follows:

[TABLE]

where $(x)_{r}=x\cdot(x-1)\cdot\cdots\cdot(x-r+1)$ is the falling factorial. It satisfies the following generating function relation [32]:

[TABLE]

Substituting $t$ by $at$ , we have

[TABLE]

Note that we have

[TABLE]

which is well defined even for $a=0$ . If $a=0$ , then $a^{n}c_{n}(x,a)$ may be defined as

[TABLE]

We note that relation (544) is true also when $a=0$ . Indeed, the case $a=0$ reduces to the relation:

[TABLE]

Assuming $Y\sim\mathsf{Poi}(\lambda)$ , replacing $x$ with random variable $Y$ in (544) and taking expectation on both sides, we have

[TABLE]

Note that $\mathbb{E}a^{n}c_{n}(Y,a)$ does not depend on $t$ . Hence we know

[TABLE]

Thus, if $nX\sim\mathsf{Poi}(np),a=nq$ , we have

[TABLE]

Expanding $q^{j}c_{j}(nX,nq)$ shows that it is equal to $g_{j,q}(X)$ defined in Lemma 18. The estimator $g_{j,q}(X)$ being the unique uniformly minimum variance unbiased estimator of $(p-q)^{j}$ follows from the Lehmann–Scheffé Theorem [33, Chap. 2, Thm. 1.11] and the complete sufficiency of $X$ in model $nX\sim\mathsf{Poi}(np)$ ([33, Chap. 1, Thm. 6.22]).
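The unbiasedness claim can be checked by direct summation against the Poisson pmf. The sketch below assumes that $g_{j,q}$ has the binomial form $g_{j,q}(X)=\sum_{k=0}^{j}\binom{j}{k}(-q)^{j-k}\prod_{h=0}^{k-1}\left(X-\frac{h}{n}\right)$ , which is what expanding $q^{j}c_{j}(nX,nq)$ with the falling-factorial definition of the Charlier polynomial gives (the display in Lemma 18 is not reproduced in this section). The values of `n` , `p` , `q` are hypothetical.

```python
import math

def falling_product(x_hat, k, n):
    """prod_{h=0}^{k-1} (x_hat - h / n); equals 1 for k = 0."""
    out = 1.0
    for h in range(k):
        out *= x_hat - h / n
    return out

def g(j, q, x_hat, n):
    """Assumed form of g_{j,q}: binomial expansion of (p - q)^j, term by term."""
    return sum(
        math.comb(j, k) * (-q) ** (j - k) * falling_product(x_hat, k, n)
        for k in range(j + 1)
    )

def expectation_of_g(j, p, q, n, terms=400):
    """E[g_{j,q}(X)] for n X ~ Poi(n p), truncating the Poisson sum."""
    lam = n * p
    pmf = math.exp(-lam)
    total = 0.0
    for x in range(terms):
        total += pmf * g(j, q, x / n, n)
        pmf *= lam / (x + 1)
    return total

n, p, q = 50, 0.10, 0.15  # hypothetical values
for j in range(5):
    print(j, expectation_of_g(j, p, q, n), (p - q) ** j)
```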

Now we proceed to bound the second moment of $g_{j,q}(X)$ . It follows from (544) that for any $a+b\geq 0$ ,

[TABLE]

which implies that

[TABLE]

It follows from coefficient matching that

[TABLE]

which simplifies to

[TABLE]

Now assume $nX\sim\mathsf{Poi}(np)$ . Taking $a+b=nq,a=np$ and dividing both sides by $n^{j}$ , we have

[TABLE]

The Charlier polynomials are orthogonal with respect to the Poisson measure. Concretely, for $Y\sim\mathsf{Poi}(\lambda)$ [32],

[TABLE]
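As a sanity check of the orthogonality relation, the following sketch evaluates the Poisson inner product of two Charlier polynomials by direct summation. It assumes the convention $c_{n}(x,a)=\sum_{r=0}^{n}\binom{n}{r}(-1)^{n-r}a^{-r}(x)_{r}$, which matches the generating function $e^{-t}(1+t/a)^{x}$; under this convention the squared norm is $n!/a^{n}$, and a global sign change $(-1)^{n}$ would not affect it. The function names and the value of $\lambda$ are hypothetical.

```python
from math import comb, exp, factorial


def falling(x, r):
    """Falling factorial (x)_r = x (x - 1) ... (x - r + 1)."""
    out = 1
    for i in range(r):
        out *= x - i
    return out


def charlier(n, x, a):
    """Charlier polynomial c_n(x, a) under the assumed sign convention."""
    return sum(comb(n, r) * (-1) ** (n - r) * falling(x, r) / a ** r
               for r in range(n + 1))


def poisson_inner(n, m, lam, terms=200):
    """E[c_n(Y, lam) c_m(Y, lam)] for Y ~ Poi(lam), by summing the pmf."""
    pmf, total = exp(-lam), 0.0
    for y in range(terms):
        total += pmf * charlier(n, y, lam) * charlier(m, y, lam)
        pmf *= lam / (y + 1)
    return total


lam = 2.5  # illustrative value
gram = [[poisson_inner(n, m, lam) for m in range(5)] for n in range(5)]
# gram is numerically diagonal, with entries n! / lam**n on the diagonal.
```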

For $nX\sim\mathsf{Poi}(np)$ , we have

[TABLE]

which is also true for $p=0$ .

Applying the orthogonal property of Charlier polynomials and assuming $p>0$ , we have

[TABLE]

where $L_{m}(x)$ stands for the Laguerre polynomial with order $m$ , which is defined as:

[TABLE]

If we further assume $M\geq\max\left\{\frac{n(p-q)^{2}}{p},j\right\}$ , we have

[TABLE]

D-E Proof of Lemma 19

It follows from the fact that $\mathbb{E}\prod_{i=0}^{k-1}(\hat{p}-\frac{i}{n})=p^{k}$ for $n\hat{p}\sim\mathsf{Poi}(np)$ [21, Ex. 2.8] that $\hat{A}_{j}$ is unbiased for estimating $(p-q)^{j}$ . It follows from the Lehmann–Scheffé Theorem [33, Chap. 2, Thm. 1.11] and the complete sufficiency of $(\hat{p},\hat{q})$ ([33, Chap. 1, Thm. 6.22]) that $\hat{A}_{j}$ is the unique uniformly minimum variance unbiased estimator for $(p-q)^{j}$ .

We now work out a different form of $\hat{A}_{j}$ . It follows from the binomial theorem that for any fixed $r>0$ ,

[TABLE]

Clearly, the following estimator is also unbiased for estimating $(p-q)^{j}$ :

[TABLE]

where $g_{k,r}(\hat{p})$ and $g_{j-k,r}(\hat{q})$ are the unique uniformly minimum variance unbiased estimators for $(p-r)^{k}$ and $(q-r)^{j-k}$ introduced in Lemma 18, respectively. It follows from the uniqueness of $\hat{A}_{j}$ that

[TABLE]
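The identity obtained from uniqueness can be confirmed in exact rational arithmetic. The sketch below assumes that $\hat{A}_{j}=\sum_{k=0}^{j}\binom{j}{k}(-1)^{j-k}\prod_{h=0}^{k-1}\left(\hat{p}-\frac{h}{n}\right)\prod_{h=0}^{j-k-1}\left(\hat{q}-\frac{h}{n}\right)$ and that $g_{k,r}$ has the falling-factorial form of Lemma 18; both explicit forms, together with the function names and the test values, are assumptions made for illustration. It compares $\hat{A}_{j}$ with the recentred estimator $\sum_{k}\binom{j}{k}(-1)^{j-k}g_{k,r}(\hat{p})\,g_{j-k,r}(\hat{q})$ for several values of $r$.

```python
from fractions import Fraction
from math import comb


def falling_power(x, k, n):
    """prod_{h=0}^{k-1} (x - h/n) in exact arithmetic."""
    out = Fraction(1)
    for h in range(k):
        out *= x - Fraction(h, n)
    return out


def g(j, r, x, n):
    """Assumed form of g_{j,r}(x), unbiased for (p - r)**j when n x ~ Poi(n p)."""
    return sum(comb(j, k) * (-r) ** (j - k) * falling_power(x, k, n)
               for k in range(j + 1))


def a_hat(j, p_hat, q_hat, n):
    """Assumed form of A_hat_j: expansion of (p - q)**j in falling powers."""
    return sum(comb(j, k) * (-1) ** (j - k)
               * falling_power(p_hat, k, n) * falling_power(q_hat, j - k, n)
               for k in range(j + 1))


def a_hat_recentred(j, r, p_hat, q_hat, n):
    """Expansion of ((p - r) - (q - r))**j using g_{k,r} and g_{j-k,r}."""
    return sum(comb(j, k) * (-1) ** (j - k)
               * g(k, r, p_hat, n) * g(j - k, r, q_hat, n)
               for k in range(j + 1))


n, j = 10, 5
p_hat, q_hat = Fraction(3, n), Fraction(7, n)
r = Fraction(2, 7)  # arbitrary centring point; the result does not depend on it
same = a_hat(j, p_hat, q_hat, n) == a_hat_recentred(j, r, p_hat, q_hat, n)
```

The equality is exact for every $r$, which is what allows the proof to pick the convenient value $r=\frac{p+q}{2}$ later on.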

Using $\|X\|_{2}=(\mathbb{E}[X^{2}])^{1/2}$ and the triangle inequality for the norm $\|X\|_{2}$ , we have

[TABLE]

where we have used the independence of $\hat{p}$ and $\hat{q}$ in the last step.

Define $M_{1}=\frac{n(p-r)^{2}}{p}\vee j$ , $M_{2}=\frac{n(q-r)^{2}}{q}\vee j$ , and set $r=\frac{p+q}{2}$ . Define $M=2(p-q)^{2}\vee\frac{8j(p\vee q)}{n}$ . It follows from Lemma 18 that

[TABLE]

D-F Proof of Lemma 20

Equation (84) follows from [24, Lemma 9.5.5]. Now we prove the bound on $|h_{j,s}|$ . Note that the moment generating function of $\hat{p}-p$ is given by

[TABLE]

Written as a formal power series in $z$ , the previous identity becomes

[TABLE]

Hence, by comparing the coefficients of $n^{j-s}z^{s}$ on both sides, we obtain

[TABLE]

Moreover,

[TABLE]

Then,

[TABLE]

Since $1\leq j\leq s$ , we have

[TABLE]

Now we consider the maximization problem $\max_{x\geq 0}\frac{x^{s}e^{x}}{x^{x}}$ . Taking derivatives shows that this function attains its unique maximum at the point $x^{*}$ which satisfies the following:

[TABLE]

Recalling that the Lambert $W$ function is defined over $[-1/e,\infty)$ by the equation $W(z)e^{W(z)}=z$ , we know that

[TABLE]
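The location of the maximizer can be checked numerically. Taking logarithms, the objective becomes $s\ln x+x-x\ln x$, whose derivative $\frac{s}{x}-\ln x$ vanishes exactly when $x\ln x=s$, that is, at $x^{*}=e^{W(s)}=s/W(s)$. The sketch below computes $W(s)$ with a small Newton iteration (a hypothetical helper written for this illustration, not a library routine) and evaluates the resulting $x^{*}$.

```python
from math import exp, log


def lambert_w(z, iters=50):
    """Principal branch W(z) for z > 0 via Newton's method on w * exp(w) = z."""
    w = log(1.0 + z)  # lies at or above the root, so the iteration decreases to it
    for _ in range(iters):
        ew = exp(w)
        w -= (w * ew - z) / (ew * (w + 1.0))
    return w


def log_objective(x, s):
    """Logarithm of x**s * exp(x) / x**x for x > 0."""
    return s * log(x) + x - x * log(x)


def maximizer(s):
    """Stationary point solving x * ln(x) = s, namely x = s / W(s)."""
    return s / lambert_w(s)


s = 10.0  # illustrative value
x_star = maximizer(s)
residual = x_star * log(x_star) - s  # should vanish up to rounding
```

Since the second derivative of the log-objective, $-\frac{s}{x^{2}}-\frac{1}{x}$, is negative for $x>0$, the stationary point is the unique maximizer.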

The following upper bound on $W(s)$ was proved in [34]: for any $s>e$ ,

[TABLE]

where we have used the fact that $\max_{x>0}\frac{\ln x}{x}=\frac{1}{e}$ .

Hence, for any $s\geq 3$ ,

[TABLE]

which also holds for $s=2$ since $h_{1,2}=1$ .

D-G Proof of Lemma 21

It is clear that when $p\leq\frac{1}{n}$ , the statement is true. It suffices to consider the case of $p>\frac{1}{n}$ . Introduce the function $g_{n}(p)$ as follows:

[TABLE]

It is evident that $g_{n}(p)\leq\frac{1}{p^{j}}$ and

[TABLE]

We have

[TABLE]

Since the function $g_{n}(p)$ is continuously differentiable on $(0,\infty)$ , we have

[TABLE]

where we applied Lemma 22 in the last step. Hence,

[TABLE]

where in the last step we used the assumption that $p\geq\frac{1}{n}$ . Consequently,

[TABLE]

where we have used the fact that for any $p\geq 0$ ,

[TABLE]

D-H Proof of Lemma 24

The following upper bound is straightforward:

[TABLE]

Regarding the other upper bound and the lower bound, we use the exact analytic expression for $\mathbb{E}|X-\lambda|$ given in [35]: for a random variable $X\sim\mathsf{Poi}(\lambda)$ ,

[TABLE]

where $[\lambda]$ denotes the greatest integer less than or equal to $\lambda$ .

When $0<\lambda\leq 1$ , we have

[TABLE]

which implies that if $0<q\leq\frac{1}{n}$ ,

[TABLE]

Regarding the final lower bound, it suffices to show that for $X\sim\mathsf{Poi}(\lambda),\lambda\geq 1$ , we have

[TABLE]

Hence, it suffices to show

[TABLE]

for all $\lambda\geq 1$ . It is equivalent to

[TABLE]

for $\lambda\in[n,n+1)$ and every integer $n\geq 1$ .

Since the function $2\sqrt{2\lambda}e^{-\lambda}\frac{\lambda^{n}}{n!}$ is monotonically increasing for $\lambda\in[n,n+1/2]$ and monotonically decreasing for $\lambda\in[n+1/2,n+1)$ , it is bounded below on $[n,n+1)$ by the smaller of its values at $\lambda=n$ and $\lambda=n+1$ , so it suffices to consider integer values of $\lambda$ . Hence, it suffices to show that for any integer $n\geq 1$ ,

[TABLE]

which is equivalent to

[TABLE]

It follows from [36] that for any positive integer $n$ ,

[TABLE]

which implies (624) since $\sqrt{2\pi}e^{\frac{1}{12n}}<\sqrt{8}$ for all positive integers $n$ .
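The argument above can be illustrated numerically. The sketch below assumes the exact expression is the classical closed form $\mathbb{E}|X-\lambda|=2e^{-\lambda}\frac{\lambda^{[\lambda]+1}}{[\lambda]!}$ and reads the final step as the factorial bound $n!\leq\sqrt{8n}\left(\frac{n}{e}\right)^{n}$; dividing the closed form by $\sqrt{\lambda/2}$ gives the function $2\sqrt{2\lambda}e^{-\lambda}\frac{\lambda^{n}}{n!}$ studied above, so the corresponding lower bound is read here as $\mathbb{E}|X-\lambda|\geq\sqrt{\lambda/2}$ for $\lambda\geq 1$. These readings and the function names are assumptions made for illustration.

```python
from math import e, exp, factorial, floor, sqrt


def poisson_mad_direct(lam, terms=400):
    """E|X - lam| for X ~ Poi(lam), summing the pmf over k = 0, ..., terms - 1."""
    pmf, total = exp(-lam), 0.0
    for k in range(terms):
        total += pmf * abs(k - lam)
        pmf *= lam / (k + 1)
    return total


def poisson_mad_closed(lam):
    """Assumed closed form 2 exp(-lam) lam**(m + 1) / m! with m = floor(lam)."""
    m = floor(lam)
    return 2.0 * exp(-lam) * lam ** (m + 1) / factorial(m)


def stirling_ratio(n):
    """sqrt(8 n) (n / e)**n / n!, which the final step requires to be >= 1."""
    return sqrt(8.0 * n) * (n / e) ** n / factorial(n)


lams = [0.3, 1.0, 2.7, 5.0, 12.4]  # illustrative values
gaps = [abs(poisson_mad_direct(l) - poisson_mad_closed(l)) for l in lams]
ratios = [stirling_ratio(n) for n in range(1, 101)]
```

The two evaluations of the mean absolute deviation agree to floating-point accuracy, and every entry of `ratios` is at least one, in line with the factorial bound used in the last step.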

D-I Proof of Lemma 25

We first assume $q\geq p$ . Applying the relation

[TABLE]

where $(x)_{+}=\max\{x,0\},(x)_{-}=-\min\{x,0\}$ , we have

[TABLE]

Construct a random variable $\hat{p}$ such that $n\hat{p}\sim\mathsf{Poi}(np)$ is on the same probability space as $\hat{q}$ , with the relationship $n\hat{q}=n\hat{p}+Z$ , where $Z$ is independent of $\hat{p}$ and $Z\sim\mathsf{Poi}(n(q-p))$ . Hence, $\hat{q}\geq\hat{p}$ with probability one. We have

[TABLE]

where we applied Lemma 24 in the last step. The case of $q\leq p$ can be proved analogously.
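The coupling rests on the fact that the sum of independent $\mathsf{Poi}(np)$ and $\mathsf{Poi}(n(q-p))$ variables is $\mathsf{Poi}(nq)$. The following sketch checks this by convolving the two probability mass functions; the parameter values and helper names are hypothetical.

```python
from math import exp


def poisson_pmf(lam, size):
    """First `size` probabilities of Poi(lam), built by the pmf recursion."""
    out = [exp(-lam)]
    for k in range(1, size):
        out.append(out[-1] * lam / k)
    return out


def convolve(a, b):
    """pmf of the sum of two independent variables on 0, ..., len(a) - 1."""
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(len(a))]


n, p, q = 30, 0.10, 0.25                 # illustrative values with q >= p
size = 60
pmf_p = poisson_pmf(n * p, size)         # law of n * p_hat
pmf_z = poisson_pmf(n * (q - p), size)   # law of Z
pmf_sum = convolve(pmf_p, pmf_z)         # law of n * p_hat + Z
pmf_q = poisson_pmf(n * q, size)         # law of n * q_hat
max_gap = max(abs(u - v) for u, v in zip(pmf_sum, pmf_q))
```

Under this construction $n(\hat{q}-\hat{p})=Z\geq 0$, so $\mathbb{E}|\hat{q}-\hat{p}|=\frac{\mathbb{E}Z}{n}=q-p$.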

Regarding the lower bound, we have

[TABLE]

where the lower bound follows by taking $q=p$ and applying Lemma 24.

D-J Proof of Lemma 26

For $n\hat{q}\sim\mathsf{Poi}(nq)$ ,

[TABLE]

where we used the fact that $||a|-|b||\leq|a-b|$ .

Bibliography

[1] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley, 2006.
[3] E. L. Lehmann and J. P. Romano, Testing statistical hypotheses. Springer, 2005.
[4] L. Devroye, L. Györfi, and G. Lugosi, “A probabilistic theory of pattern recognition,” 1996.
[5] G. Valiant and P. Valiant, “The power of linear estimators,” in Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on. IEEE, 2011, pp. 403–412.
[6] P. Valiant and G. Valiant, “Estimating the unseen: improved estimators for entropy and other properties,” in Advances in Neural Information Processing Systems, 2013, pp. 2157–2165.
[7] J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distributions,” Information Theory, IEEE Transactions on, vol. 61, no. 5, pp. 2835–2885, 2015.
[8] Y. Wu and P. Yang, “Minimax rates of entropy estimation on large alphabets via best polynomial approximation,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3702–3720, 2016.
