Direct Estimation of Information Divergence Using Nearest Neighbor Ratios

Morteza Noshad; Kevin R. Moon; Salimeh Yasaei Sekeh; Alfred O. Hero III

arXiv:1702.05222 · cs.IT · November 22, 2017


TL;DR

This paper introduces a graph-based method for directly estimating Rényi and f-divergences from sample data, achieving optimal convergence rates and improved computational efficiency over existing techniques.

Contribution

The authors develop a novel graph-theoretic estimator for divergence measures that attains parametric convergence rates and is more computationally efficient than previous methods.

Findings

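1. The estimator achieves an MSE rate of $O(N^{-2\gamma/(\gamma+d)})$ for $\gamma$-Hölder smooth functions.

2. The ensemble estimator attains the parametric MSE rate of $O(1/N)$ for densities with continuous and bounded derivatives up to order $d$, under extra conditions at the support set boundary.

3. The method is computationally more tractable than competing divergence estimators.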

Abstract

We propose a direct estimation method for Rényi and f-divergence measures based on a new graph theoretical interpretation. Suppose that we are given two sample sets $X$ and $Y$, respectively with $N$ and $M$ samples, where $\eta := M/N$ is a constant value. Considering the $k$-nearest neighbor ($k$-NN) graph of $Y$ in the joint data set $(X, Y)$, we show that the average powered ratio of the number of $X$ points to the number of $Y$ points among all $k$-NN points is proportional to Rényi divergence of $X$ and $Y$ densities. A similar method can also be used to estimate f-divergence measures. We derive bias and variance rates, and show that for the class of $\gamma$-Hölder smooth functions, the estimator achieves the MSE rate of $O(N^{-2\gamma/(\gamma+d)})$. Furthermore, by using a weighted ensemble estimation technique, for density functions with continuous and bounded…

Equations

$$D_{\alpha}(f_{1}(x)\|f_{2}(x))=\frac{1}{\alpha-1}\log J_{\alpha}(f_{1},f_{2}),$$

$$D_{g}(f_{1}(x)\|f_{2}(x))=\mathbb{E}_{f_{2}}\left[g\!\left(\frac{f_{1}(x)}{f_{2}(x)}\right)\right],$$

$$|f(y)-f(x)|\leq G_{f}\|y-x\|^{\gamma},$$

$$\widetilde{D}_{\alpha}(X,Y):=\frac{1}{\alpha-1}\log\left[\frac{\eta^{\alpha}}{M}\sum_{i=1}^{M}\left(\frac{N_{i}}{M_{i}+1}\right)^{\alpha}\right],$$

$$\widehat{J}_{\alpha}(X,Y):=\frac{\eta^{\alpha}}{M}\sum_{i=1}^{M}\left(\frac{N_{i}}{M_{i}+1}\right)^{\alpha}.$$

$$\widehat{D}_{\alpha}(X,Y):=\min\left\{\max\left\{\widetilde{D}_{\alpha}(X,Y),0\right\},\frac{1}{|1-\alpha|}\log\frac{C_{U}}{C_{L}}\right\}.$$

$$\widehat{D}_{g}(X,Y):=\max\left\{\frac{1}{M}\sum_{i=1}^{M}\widetilde{g}\!\left(\frac{\eta N_{i}}{M_{i}+1}\right),0\right\},$$

$$\mathbb{B}\left[\widehat{D}_{\alpha}(X,Y)\right]=O\!\left(\left(\frac{k}{N}\right)^{\gamma/d}\right)+O\!\left(\frac{1}{k}\right).$$

$$\mathbb{V}\left[\widehat{D}_{\alpha}(X,Y)\right]\leq O\!\left(\frac{1}{N}\right)+O\!\left(\frac{1}{M}\right).$$

$$\min_{w}\|w\|_{2}\quad\text{subject to}\quad\sum_{l\in\mathcal{L}}w(l)=1,\quad\sum_{l\in\mathcal{L}}w(l)\,l^{i/d}=0,\ i\in\mathbb{N},\ i\leq d.$$

$$\mathbb{E}_{\rho_{k}(x)}\left[\sup_{y\in B(x,\rho_{k}(x))}|f(y)-f(x)|\right]\leq\epsilon_{\gamma,k},$$

$$\left|\mathbb{E}\left[g(\widehat{T})-g(T)\right]\right|\leq H_{g}\left(\sqrt{\mathbb{V}[\widehat{T}]}+\left|\mathbb{B}[\widehat{T}]\right|\right).$$

$$\mathbb{B}\left[\widehat{D}_{\alpha}(X,Y)\right]\leq C\left(\left|\mathbb{B}\left[\widehat{J}_{\alpha}(X,Y)\right]\right|+\sqrt{\mathbb{V}\left[\widehat{J}_{\alpha}(X,Y)\right]}\right),$$

$$\mathbb{E}\left[\widehat{J}_{\alpha}(X,Y)\right]=\eta^{\alpha}\,\mathbb{E}_{Y_{1}\sim f_{2}(x)}\,\mathbb{E}\left[\left(\frac{N_{1}}{M_{1}+1}\right)^{\alpha}\,\Big|\,Y_{1}\right].$$

$$\Pr\left(Q_{k}(Y_{1})\in X\right),\qquad\Pr\left(Q_{k}(Y_{1})\in Y\right)$$

$$\mathbb{E}\left[\frac{N_{1}}{M_{1}+1}\,\Big|\,Y_{1}\right]=\mathbb{E}\left[N_{1}\,|\,Y_{1}\right]\mathbb{E}\left[(M_{1}+1)^{-1}\,|\,Y_{1}\right].$$

$$\mathbb{E}\left[N_{1}\,|\,Y_{1}\right]=k\,\frac{f_{1}(Y_{1})}{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}+O(k\epsilon_{\gamma,k}).$$

$$\mathbb{E}\left[M_{1}\,|\,Y_{1}\right]=\frac{k\eta f_{2}(Y_{1})}{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}+O(k\epsilon_{\gamma,k}).$$

$$\mathbb{E}\left[(U+1)^{-1}\right]=\frac{1}{\lambda}\left(1-e^{-\lambda}\right).$$

$$\mathbb{E}\left[(M_{1}+1)^{-1}\,|\,Y_{1}\right]=k^{-1}\left[\frac{\eta f_{2}(Y_{1})}{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}+O(\epsilon_{\gamma,k})\right]^{-1}+O\!\left(\frac{e^{-vk}}{k}\right),$$

$$\mathbb{E}\left[\frac{N_{1}}{M_{1}+1}\,\Big|\,Y_{1}\right]=\frac{f_{1}(Y_{1})}{\eta f_{2}(Y_{1})}+O(\epsilon_{\gamma,k})+O(e^{-vk}).$$

$$\mathbb{E}\left[\left(\frac{N_{1}}{M_{1}+1}\right)^{\alpha}\,\Big|\,Y_{1}\right]=\eta^{-\alpha}\left(\frac{f_{1}(Y_{1})}{f_{2}(Y_{1})}\right)^{\alpha}+O(\epsilon_{\gamma,k})+O(e^{-vk})+O(N^{-\frac{1}{2}}).$$

$$\mathbb{B}\left[\overline{J}_{\alpha}(X,Y)\right]=O(\epsilon_{\gamma,k})+O(e^{-vk})+O(N^{-\frac{1}{2}}).$$

$$\mathbb{E}\left[\widehat{J}_{\alpha}(X,Y)\right]=\mathbb{E}\left[\overline{J}_{\alpha}(X,Y)\right]+O(1/k).$$

$$S_{k}(x):=\left\{y:d(x,y)\leq\rho_{k}(x)\right\}.$$

$$\alpha_{k}(x):=\frac{\int_{S_{k}(x)\cap\mathcal{X}}dz}{\int_{S_{k}(x)}dz}.$$

Full text


Morteza Noshad, University of Michigan, Electrical Engineering and Computer Science, Ann Arbor, Michigan, U.S.A.

Kevin R. Moon, Yale University, Genetics and Applied Math Departments, New Haven, Connecticut, U.S.A.

Salimeh Yasaei Sekeh, University of Michigan, Electrical Engineering and Computer Science, Ann Arbor, Michigan, U.S.A.

Alfred O. Hero III, University of Michigan, Electrical Engineering and Computer Science, Ann Arbor, Michigan, U.S.A.

Abstract

We propose a direct estimation method for Rényi and f-divergence measures based on a new graph theoretical interpretation. Suppose that we are given two sample sets $X$ and $Y$, with $N$ and $M$ samples respectively, where $\eta:=M/N$ is a constant value. Considering the $k$-nearest neighbor ($k$-NN) graph of $Y$ in the joint data set $(X,Y)$, we show that the average powered ratio of the number of $X$ points to the number of $Y$ points among all $k$-NN points is proportional to the Rényi divergence of the $X$ and $Y$ densities. A similar method can also be used to estimate f-divergence measures. We derive bias and variance rates, and show that for the class of $\gamma$-Hölder smooth functions, the estimator achieves the MSE rate of $O\!\left({N^{-2\gamma/(\gamma+d)}}\right)$. Furthermore, by using a weighted ensemble estimation technique, for density functions with continuous and bounded derivatives of up to the order $d$, and some extra conditions at the support set boundary, we derive an ensemble estimator that achieves the parametric MSE rate of $O(1/N)$. Our estimator requires no boundary correction, and remarkably, the boundary issues do not show up. Our approach is also more computationally tractable than other competing estimators, which makes it appealing in many practical applications.

This research was partially supported by ARO grant W911NF-15-1-0479.

I Introduction

Shannon entropy, mutual information, and the Kullback-Leibler (KL) divergence are major information theoretic measures. Shannon entropy can measure diversity or uncertainty of samples, while KL-divergence is a measure of dissimilarity between two probability distributions, and mutual information is a measure of dependency between two random variables [1]. Rényi proposed a divergence measure which generalizes KL-divergence [2]. The f-divergence is another well-studied general family, which comprises many important divergence measures such as KL-divergence, total variation distance, and $\alpha$-divergence [3]. These measures have a wide range of applications in information and coding theory, statistics, and machine learning [1, 4, 5].

A major class of estimators for these measures is non-parametric: in contrast to parametric estimators, they make minimal assumptions on the density functions. One approach in this class is plug-in estimation, in which we find an estimate of a distribution function and then plug it into the measure function. $k$-Nearest Neighbor ($k$-NN) and Kernel Density Estimator (KDE) methods are examples of this approach. Another approach is direct estimation, in which we find a relationship between the measure function and a functional in Euclidean space. In a seminal work in 1959, Beardwood et al. derived the asymptotic behavior of the weighted functional of minimal graphs such as $k$-NN and TSP of $N$ i.i.d. random points [6]. They showed that the sum of weighted edges of these graphs converges to the integral of a weighted density function, which can be interpreted as Rényi entropy. Since then, this work has been of great interest in the signal processing and machine learning communities. More recent studies of direct graph theoretical approaches include the estimation of Rényi entropy using minimal graphs [7], in which the authors investigate the convergence rates, as well as the estimation of Henze-Penrose divergence using MST graphs [8]. Yet the extension to Rényi divergence and f-divergences has remained an open question. Moreover, among the various estimators of information measures, developing accurate and computationally tractable approaches has often been a challenge. Therefore, for practical and computational reasons, direct graphical algorithms have received attention in the literature, including in this work.

In this work, we propose an estimation method for Rényi and f-divergences based on a direct graph estimation method. We show that given two sample sets $X$ and $Y$ with respective densities of $f_{1}$ and $f_{2}$ , and the $k$ -nearest neighbor ( $k$ -NN) graph of $Y$ in the joint data set $(X,Y)$ , the average powered ratio of the number of $X$ points to the number of $Y$ points among all $k$ -NN points converges to the Rényi divergence. Using this fact, we design a consistent estimator for the Rényi and f-divergences.

Unlike most distance-based divergence estimators, our proposed estimator can use non-Euclidean metrics, which makes this estimator appealing in many information theoretic and machine learning applications. Our estimator requires no boundary correction, and surprisingly, the boundary issues do not show up. This is because the proposed estimator automatically cancels the extra bias of the boundary points in the ratio of nearest neighbor points. Our approach is more computationally tractable than other estimators, with a time complexity of $O(kN\log N)$, which is the time required to construct the $k$-NN graph [9]. For example, for $k=N^{1/(d+1)}$ we get a complexity of $O(N^{(d+2)/(d+1)}\log N)$. We show that for the class of $\gamma$-Hölder smooth functions, the estimator achieves the MSE rate of $O(N^{-2\gamma/(\gamma+d)})$. Furthermore, by using the theory of optimally weighted ensemble estimation [10, 5], for density functions with continuous and bounded derivatives of up to the order $d$, and some extra conditions at the support set boundary, we derive an ensemble estimator that achieves the optimal MSE rate of $O(1/N)$, which is independent of the dimension. Finally, the current work is an important step towards extending the direct estimation method studied in [11, 12] to more general information theoretic measures.

Several previous works have investigated an estimator for a particular type of divergence measures. $k$ -NN [13], KDE [14], and histogram [15] estimators are among the studied plug-in estimators for the f-divergence family. In general, most of these estimators suffer from several restrictions such as lack of analytic convergence rates, or high computational complexity.

Recent works have focused on the MSE convergence rates for plug-in divergence estimators, such as KDE. Singh and Póczos proposed estimators for general density functionals and Rényi divergence, based on the kernel density plug-in estimator [14][16], which can achieve the convergence rate of $O(1/N)$ when the densities are at least $d$ times differentiable. In a similar approach, Kandasamy et al. proposed another KDE-based estimator for general density functionals and divergence measures, which can achieve the convergence rate of $O(1/N)$ when the densities are at least $d/2$ times differentiable [17].

Moon et al. proposed simple kernel density plug-in estimators using weighted ensemble methods to improve the rate [10][18]. The proposed estimator can achieve the $O(1/N)$ convergence rate when the densities are at least $(d+1)/2$ times differentiable. The main drawback of these estimators is handling the bias at the support set boundary. For example, using the estimators proposed in [14, 17] requires knowledge of the densities' support set and numerous computations at the support boundary, which become complicated when the dimension increases. To circumvent this issue, Moon et al. [10] assumed smoothness conditions at the support set boundary, which may not always be true in practice. In contrast, our basic estimator does not require any smoothness assumptions on the support set boundary, although our ensemble estimator does. Regarding time complexity, our estimator takes $O(kN\log N)$ time, whereas KDE-based estimators take $O(N^{2})$ time.

A rather different method for estimating f-divergences is suggested by Nguyen et al [19], which is based on a variational representation of f-divergences that connects the estimation problem to a convex risk minimization problem. This approach achieves the parametric rate of $O(1/N)$ when the likelihood ratio is at least $d/2$ times differentiable. However, the algorithm’s time complexity is even worse than $O(N^{2})$ .

II A direct estimator of divergence measures

In this section, we first introduce the Rényi and f-divergence measures. Then we propose an estimator based on a graph theoretical interpretation, and we outline our main theoretical results, which will be proven in section III.

Consider two density functions $f_{1}$ and $f_{2}$ with support $\mathcal{M}\subseteq\mathbb{R}^{d}$ . The Rényi divergence between $f_{1}$ and $f_{2}$ is

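$$\begin{aligned}D_{\alpha}(f_{1}(x)\|f_{2}(x))&=\frac{1}{\alpha-1}\log\int f_{1}(x)^{\alpha}f_{2}(x)^{1-\alpha}\,dx\\&=\frac{1}{\alpha-1}\log J_{\alpha}(f_{1},f_{2}),\end{aligned}$$

where in the second line, $J_{\alpha}(f_{1},f_{2})$ is defined as $J_{\alpha}(f_{1},f_{2}):=\mathbb{E}_{f_{2}}\left[\left(\frac{f_{1}(x)}{f_{2}(x)}\right)^{\alpha}\right]$.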

Another general divergence family, f-divergence, is also defined as follows [3].

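$$\begin{aligned}D_{g}(f_{1}(x)\|f_{2}(x))&=\int g\!\left(\frac{f_{1}(x)}{f_{2}(x)}\right)f_{2}(x)\,dx\\&=\mathbb{E}_{f_{2}}\left[g\!\left(\frac{f_{1}(x)}{f_{2}(x)}\right)\right],\end{aligned}$$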

where $g$ is a smooth and convex function such that $g(1)=0$ . KL-divergence, Hellinger distance and total variation distance are particular cases of this family. Note that for our approach, we only assume that $g$ is smooth.

We assume that the densities are lower bounded by $C_{L}>0$ and upper bounded by $C_{U}$. Also, $f_{1}$ and $f_{2}$ belong to the Hölder smoothness class with parameter $\gamma$:

Definition

Given a support $\mathcal{X}\subseteq\mathbb{R}^{d}$ , a function $f:\mathcal{X}\to\mathbb{R}$ is called Hölder continuous with parameter $0<\gamma\leq 1$ , if there exists a positive constant $G_{f}$ , depending on $f$ , such that

$$|f(y)-f(x)|\leq G_{f}\|y-x\|^{\gamma},$$

for every $x\neq y\in\mathcal{X}$ .

The function $g(x)$ in the f-divergence definition is also assumed to be Lipschitz continuous; i.e. $g$ is Hölder continuous with $\gamma=1$.

Remark 1

The $\gamma$-Hölder smoothness family comprises a large class of continuous functions including continuously differentiable functions and Lipschitz continuous functions. Also note that for $\gamma>1$, any $\gamma$-Hölder continuous function on any bounded and connected support is constant.
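To make the Hölder condition concrete, the sketch below numerically probes the smallest admissible constant $G_f$ on a grid. The helper `holder_constant`, the grid, and the choice of $f(x)=\sqrt{x}$ on $[0,1]$ are illustrative assumptions and do not come from the paper; a grid search only gives a lower bound on the true constant.

```python
import math


def holder_constant(f, gamma, grid):
    """Largest ratio |f(y) - f(x)| / |y - x|**gamma over all grid pairs.

    This is only a lower bound on the true Hölder constant G_f,
    because it inspects finitely many pairs.
    """
    best = 0.0
    for i, x in enumerate(grid):
        for y in grid[i + 1:]:
            best = max(best, abs(f(y) - f(x)) / abs(y - x) ** gamma)
    return best


# Hypothetical example: sqrt is 1/2-Hölder on [0, 1] with constant 1,
# but it is not Lipschitz (gamma = 1) because its slope blows up at 0.
grid = [i / 400 for i in range(401)]
g_half = holder_constant(math.sqrt, 0.5, grid)
g_one = holder_constant(math.sqrt, 1.0, grid)
print(g_half, g_one)
```

With $\gamma=1/2$ the ratio stays bounded, whereas with $\gamma=1$ it grows as the grid is refined near $0$, which is the behaviour a non-Lipschitz function should show.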

Nearest Neighbor Ratio (NNR) Estimator:

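Consider the i.i.d. samples $X=\left\{X_{1},...,X_{N}\right\}$ drawn from $f_{1}$ and $Y=\left\{Y_{1},...,Y_{M}\right\}$ drawn from $f_{2}$. We define the set $Z:=X\cup Y$, and consider the $k$-NN points of each point $Y_{i}$ in the set $Y$, denoted by $Q_{k}(Y_{i})$. Let $N_{i}$ and $M_{i}$ be the number of points of the sets $X$ and $Y$ among the $k$-NN points of $Y_{i}$, respectively. Then an estimator for Rényi divergence is

$$\widetilde{D}_{\alpha}(X,Y):=\frac{1}{\alpha-1}\log\left[\frac{\eta^{\alpha}}{M}\sum_{i=1}^{M}\left(\frac{N_{i}}{M_{i}+1}\right)^{\alpha}\right],\tag{4}$$

where $\eta:=M/N$. Similarly, using the alternative form of the Rényi divergence in terms of $J_{\alpha}$, we have

$$\widehat{J}_{\alpha}(X,Y):=\frac{\eta^{\alpha}}{M}\sum_{i=1}^{M}\left(\frac{N_{i}}{M_{i}+1}\right)^{\alpha}.\tag{5}$$

Note that the estimator defined in (4) can be negative and unstable in extreme cases. To correct this, we propose the NNR estimator for Rényi divergence denoted by $\widehat{D}_{\alpha}(X,Y)$:

$$\widehat{D}_{\alpha}(X,Y):=\min\left\{\max\left\{\widetilde{D}_{\alpha}(X,Y),0\right\},\frac{1}{|1-\alpha|}\log\frac{C_{U}}{C_{L}}\right\}.\tag{6}$$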

The NNR f-divergence estimator is defined as

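$$\widehat{D}_{g}(X,Y):=\max\left\{\frac{1}{M}\sum_{i=1}^{M}\widetilde{g}\!\left(\frac{\eta N_{i}}{M_{i}+1}\right),0\right\},$$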

where $\widetilde{g}(x):=\max\left\{g(x),g\!\left({C_{L}/C_{U}}\right)\right\}$ .

The intuition behind the proposed estimators is that the ratio $\frac{\eta N_{i}}{M_{i}+1}$ can be considered an estimate of the density ratio at $Y_{i}$. Note that if the densities $f_{1}$ and $f_{2}$ are almost equal, then for each point $Y_{i}$, $\eta N_{i}\approx M_{i}+1$, and therefore both $\widehat{D}_{\alpha}(X,Y)$ and $\widehat{D}_{g}(X,Y)$ tend to zero. In the following theorems we derive upper bounds on the bias and variance rates. Consider the bias and variance definitions as $\mathbb{B}[\hat{T}]=\mathbb{E}[\hat{T}]-T$ and $\mathbb{V}[\hat{T}]=\mathbb{E}[\hat{T}^{2}]-\mathbb{E}[\hat{T}]^{2}$, respectively, where $\hat{T}$ is an estimator of the parameter $T$.
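The estimators above can be sketched in a few lines of Python. This is a minimal illustration, assuming brute-force Euclidean neighbour search in pure Python rather than the $k$-NN graph construction of [9]. The function names `knn_counts`, `nnr_renyi`, and `nnr_f_divergence`, the optional arguments `c_low`, `c_up`, and `ratio_floor` (standing in for $C_L$, $C_U$, and $C_L/C_U$), and the toy samples are all assumptions made for this example. The toy densities do not satisfy the boundedness assumptions and serve only as a smoke test.

```python
import heapq
import math
import random


def knn_counts(X, Y, k):
    """Return (N_i, M_i) for each Y_i: how many of its k nearest neighbours
    in the pooled set Z = X u Y (Y_i itself excluded) come from X and from Y."""
    counts = []
    for i, y in enumerate(Y):
        # Tag every candidate neighbour with its distance and its origin.
        cand = [(math.dist(y, x), True) for x in X]
        cand += [(math.dist(y, y2), False) for j, y2 in enumerate(Y) if j != i]
        nearest = heapq.nsmallest(k, cand, key=lambda t: t[0])
        n_i = sum(1 for _, from_x in nearest if from_x)
        counts.append((n_i, k - n_i))
    return counts


def nnr_renyi(X, Y, k, alpha, c_low=None, c_up=None):
    """Clipped NNR estimate of the Rényi divergence of order alpha (alpha != 1)."""
    eta = len(Y) / len(X)
    ratios = [n / (m + 1) for n, m in knn_counts(X, Y, k)]
    j_hat = eta ** alpha * sum(r ** alpha for r in ratios) / len(Y)
    if j_hat <= 0.0:
        raise ValueError("no X points found in any k-NN set")
    d_tilde = math.log(j_hat) / (alpha - 1)     # unclipped estimator
    d_hat = max(d_tilde, 0.0)                   # lower clip at zero
    if c_low is not None and c_up is not None:  # upper clip needs density bounds
        d_hat = min(d_hat, math.log(c_up / c_low) / abs(1 - alpha))
    return d_hat


def nnr_f_divergence(X, Y, k, g, ratio_floor=None):
    """NNR estimate of the f-divergence generated by g."""
    eta = len(Y) / len(X)

    def g_tilde(t):
        return g(t) if ratio_floor is None else max(g(t), g(ratio_floor))

    total = sum(g_tilde(eta * n / (m + 1)) for n, m in knn_counts(X, Y, k))
    return max(total / len(Y), 0.0)


def sample(n, power, rng):
    """2D points with coordinates U**power; power = 1 is uniform on the unit square."""
    return [(rng.random() ** power, rng.random() ** power) for _ in range(n)]


rng = random.Random(0)
Y = sample(300, 1, rng)
X_same = sample(300, 1, rng)   # same distribution as Y
X_diff = sample(300, 4, rng)   # mass pushed towards the origin
d_same = nnr_renyi(X_same, Y, k=20, alpha=0.5)
d_diff = nnr_renyi(X_diff, Y, k=20, alpha=0.5)
h_same = nnr_f_divergence(X_same, Y, 20, lambda t: (math.sqrt(t) - 1.0) ** 2)
print(d_same, d_diff, h_same)
```

For identically distributed samples both estimates should be close to zero, while the estimate for `X_diff` should be clearly larger. The brute-force search costs $O(NM)$ distance evaluations per call, so it is meant for small examples only.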

Theorem II.1

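The bias of the NNR estimator for Rényi divergence, defined in (6), can be bounded as

$$\mathbb{B}\left[\widehat{D}_{\alpha}(X,Y)\right]=O\!\left(\left(\frac{k}{N}\right)^{\gamma/d}\right)+O\!\left(\frac{1}{k}\right).\tag{8}$$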

Here $\gamma$ is the Hölder smoothness parameter.

Theorem II.2

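The variance of the NNR estimator is bounded as

$$\mathbb{V}\left[\widehat{D}_{\alpha}(X,Y)\right]\leq O\!\left(\frac{1}{N}\right)+O\!\left(\frac{1}{M}\right).$$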

Remark 2

The same variance bound holds true for the RV $\widehat{J}_{\alpha}(X,Y)$ . Also bias and variance results easily extend to the f-divergence estimator.

Remark 3

Note that in most cases, the $1/k$ term in (8) is the dominant error term, and in order to have an asymptotically unbiased NNR estimator, $k$ should be a growing function of $N$. The $1/k$ term actually comes from the error of the Poissonization technique used in the proof. By equating the terms $O\!\left({(k/N)^{\gamma/d}}\right)$ and $O(1/k)$, it turns out that for $k_{opt}=O\!\left({N^{\frac{\gamma}{d+\gamma}}}\right)$, we get the optimal MSE rate of $O\!\left({N^{\frac{-2\gamma}{d+\gamma}}}\right)$. The optimal choice for $k$ can be compared to the optimum value $k=O\!\left({\sqrt{N}}\right)$ in [4], where a plug-in $k$-NN estimator is used. Also, considering the computational complexity of $O(kN\log N)$ to construct the $k$-NN graph [9], we see that there is a trade-off between MSE rate and complexity for different values of $k$. In the particular case of optimal MSE, the computational complexity of this method is $O\!\left({N^{\frac{d+2\gamma}{d+\gamma}}\log N}\right)$.
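The balancing argument in this remark can be checked with exact rational arithmetic. The sketch below is only a bookkeeping aid: the function name `nnr_rate_exponents` and the returned dictionary keys are assumptions for this example, and all quantities are exponents of $N$ with constants and the $\log N$ factor ignored.

```python
from fractions import Fraction


def nnr_rate_exponents(gamma, d):
    """Exponents of N obtained by balancing (k/N)^(gamma/d) against 1/k."""
    gamma, d = Fraction(gamma), Fraction(d)
    k_exp = gamma / (d + gamma)              # k_opt ~ N^k_exp
    smooth_bias = (k_exp - 1) * gamma / d    # exponent of (k/N)^(gamma/d)
    poisson_bias = -k_exp                    # exponent of 1/k
    assert smooth_bias == poisson_bias       # the two bias terms balance
    # MSE = bias^2 + variance, with the variance contributing exponent -1.
    mse_exp = max(2 * poisson_bias, Fraction(-1))
    return {"k": k_exp, "bias": poisson_bias, "mse": mse_exp, "time": 1 + k_exp}


rates = nnr_rate_exponents(gamma=1, d=2)
print(rates)
```

For $\gamma=1$ and $d=2$ this gives $k_{opt}\sim N^{1/3}$, an MSE exponent of $-2/3$, and a graph construction cost of order $N^{4/3}\log N$, matching the closed-form expressions in the remark.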

Under extra conditions on the densities and support set boundary, we can improve the bias rate by applying the ensemble theory in [10, 5]. Assume that the density functions are in the Hölder space $\Sigma(\gamma,L)$, which consists of functions on $\mathcal{X}$ with continuous derivatives up to order $q=\left\lfloor\gamma\right\rfloor\geq d$ and whose $q$th partial derivatives are Hölder continuous with exponent $\gamma^{\prime}:=\gamma-q$. We also assume that the density derivatives up to order $d$ vanish at the boundary. Let $\mathcal{L}:=\left\{l_{1},...,l_{L}\right\}$ be a set of index values with $l_{i}<c$ for some constant $c$. Let $k(l):=\left\lfloor l\sqrt{N}\right\rfloor$. The weighted ensemble estimator is defined as $\widehat{D}_{w}:=\sum_{l\in\mathcal{L}}w(l)\widehat{D}_{k(l)}$, where $\widehat{D}_{k(l)}$ is the NNR estimator of Rényi $\alpha$-divergence, using the $k(l)$-NN graph.

Theorem II.3

Let $L>d$ and $w_{0}$ be the solution to:

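$$\begin{aligned}\min_{w}\quad&\|w\|_{2}\\\text{subject to}\quad&\sum_{l\in\mathcal{L}}w(l)=1,\\&\sum_{l\in\mathcal{L}}w(l)\,l^{i/d}=0,\quad i\in\mathbb{N},\ i\leq d.\end{aligned}$$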

Then the MSE rate of the ensemble estimator $\widehat{D}_{w_{0}}$ is $O(1/N)$ .
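The weight problem in Theorem II.3 is small enough to solve directly. The sketch below assumes that the objective is the Euclidean norm $\|w\|_2$, as in the ensemble theory of [10], so that the solution is the minimum-norm solution $w=A^{T}(AA^{T})^{-1}b$ of the linear constraints. The helper names `solve_linear`, `ensemble_weights`, and `k_of_l`, and the index set `ls`, are assumptions for this example.

```python
import math


def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def ensemble_weights(ls, d):
    """Minimum 2-norm weights with sum w(l) = 1 and sum w(l) l^(i/d) = 0, i = 1..d."""
    # Row i = 0 encodes the normalisation; rows i = 1..d encode the bias constraints.
    A = [[l ** (i / d) for l in ls] for i in range(d + 1)]
    b = [1.0] + [0.0] * d
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in A] for r1 in A]
    z = solve_linear(gram, b)                  # solves (A A^T) z = b
    return [sum(z[i] * A[i][j] for i in range(d + 1)) for j in range(len(ls))]


def k_of_l(l, n_samples):
    """Neighbourhood size k(l) = floor(l * sqrt(N)) used by each base estimator."""
    return math.floor(l * math.sqrt(n_samples))


ls = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # hypothetical index set with L = 6 > d = 2
w = ensemble_weights(ls, d=2)
print(w, [k_of_l(l, 400) for l in ls])
```

The ensemble estimate would then be `sum(w_l * D_hat_l)` over base estimates computed with `k_of_l(l, N)` neighbours. Some weights are negative and can be large in magnitude, which is why the norm of $w$ is minimized.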

III Proof

In this section we derive the bias terms of the NNR estimator. The variance bound for the NNR estimator is more straightforward and can be derived using the Efron-Stein inequality. Also, for proving the MSE rate of the ensemble variant of the NNR estimator, we need more accurate bias rates, which are provided in the arXiv version. So, for the variance and ensemble estimation proofs we refer the reader to the Appendix section of the arXiv version of the paper. First, we provide a smoothness lemma for the densities. Unless stated otherwise, all proofs of lemmas are provided in the arXiv version.

Lemma III.1

Suppose that the density function $f(x)$ belongs to the $\gamma$-Hölder smoothness class. Then if $B(x,r)$ denotes the sphere with center $x$ and radius $r=\rho_{k}(x)$, where $\rho_{k}(x)$ is defined as the $k$-NN distance at the point $x$, we have the following smoothness condition:

$$\mathbb{E}_{\rho_{k}(x)}\left[\sup_{y\in B(x,\rho_{k}(x))}|f(y)-f(x)|\right]\leq\epsilon_{\gamma,k},$$

where $\epsilon_{\gamma,k}:=O\!\left({(k/N)^{\gamma/d}}\right)+O\!\left({\mathcal{C}(k)}\right)$, and we have $\mathcal{C}(k):=\exp(-3k^{1-\delta})$ for a fixed $\delta\in(2/3,1)$.

We first state the bias proof for Rényi divergence, and then we extend the method to f-divergence. It is easier to work with $\widehat{J}_{\alpha}(X,Y)$ defined in (5), instead of $\widehat{D}_{\alpha}(X,Y)$ . The following lemma provides the essential tool to make a relation between $\mathbb{B}\!\left({\widehat{D}}\right)$ and $\mathbb{B}\!\left({\widehat{J}}\right)$ .

Lemma III.2

Assume that $g(x):\mathcal{X}\to\mathbb{R}$ is Lipschitz continuous with constant $H_{g}>0$ . If $\widehat{T}$ is a RV estimating a constant value $T$ with the bias $\mathbb{B}[\widehat{T}]$ and the variance $\mathbb{V}[\widehat{T}]$ , then the bias of $g(\widehat{T})$ can be upper bounded by

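$$\left|\mathbb{E}\left[g(\widehat{T})-g(T)\right]\right|\leq H_{g}\left(\sqrt{\mathbb{V}[\widehat{T}]}+\left|\mathbb{B}[\widehat{T}]\right|\right).$$

An immediate consequence of this lemma is

$$\mathbb{B}\left[\widehat{D}_{\alpha}(X,Y)\right]\leq C\left(\left|\mathbb{B}\left[\widehat{J}_{\alpha}(X,Y)\right]\right|+\sqrt{\mathbb{V}\left[\widehat{J}_{\alpha}(X,Y)\right]}\right),$$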

where $C$ is a constant.

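From Theorem II.2, $\mathbb{V}\!\left[{\widehat{J}_{\alpha}(X,Y)}\right]=O(1/N)$, so we only need to bound $\mathbb{B}\!\left[{\widehat{J}_{\alpha}(X,Y)}\right]$. With $\eta:=M/N$, we have:

$$\mathbb{E}\left[\widehat{J}_{\alpha}(X,Y)\right]=\eta^{\alpha}\,\mathbb{E}_{Y_{1}\sim f_{2}(x)}\,\mathbb{E}\left[\left(\frac{N_{1}}{M_{1}+1}\right)^{\alpha}\,\Big|\,Y_{1}\right].$$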

Now note that $N_{1}$ and $M_{1}$ are not independent since $N_{1}+M_{1}=k$. We use the Poissonizing technique [20][21] and assume that $N_{1}+M_{1}=K$, where $K$ is a Poisson random variable with mean $k$. We represent the Poissonized variant of $\widehat{J}_{\alpha}(X,Y)$ by $\overline{J}_{\alpha}(X,Y)$, and we will show that $\mathbb{E}\left[\widehat{J}_{\alpha}(X,Y)\right]=\mathbb{E}\left[\overline{J}_{\alpha}(X,Y)\right]+O(1/k)$. By the partitioning theorem for a Poisson random variable with Bernoulli trials of probabilities $\Pr\!\left({Q_{i}(Y_{1})\in X}\right)$ and $\Pr\!\left({Q_{i}(Y_{1})\in Y}\right)$, we argue that $N_{1}$ and $M_{1}$ are two independent Poisson RVs. We first compute $\Pr\!\left({Q_{k}(Y_{1})\in X}\right)$ and $\Pr\!\left({Q_{k}(Y_{1})\in Y}\right)$ as follows:

Lemma III.3

Let $\eta:=M/N$. The probabilities that the point $Q_{k}(Y_{1})$ belongs to the sets $X$ and $Y$ are, respectively,

[TABLE]

Using the conditional independence of $N_{1}$ and $M_{1}$ we write

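$$\mathbb{E}\left[\frac{N_{1}}{M_{1}+1}\,\Big|\,Y_{1}\right]=\mathbb{E}\left[N_{1}\,|\,Y_{1}\right]\mathbb{E}\left[(M_{1}+1)^{-1}\,|\,Y_{1}\right].\tag{16}$$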

$\mathbb{E}\left[N_{1}|Y_{1}\right]$ can be simplified as

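$$\mathbb{E}\left[N_{1}\,|\,Y_{1}\right]=k\,\frac{f_{1}(Y_{1})}{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}+O(k\epsilon_{\gamma,k}).$$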

Also similarly,

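$$\mathbb{E}\left[M_{1}\,|\,Y_{1}\right]=\frac{k\eta f_{2}(Y_{1})}{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}+O(k\epsilon_{\gamma,k}).$$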
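The role of Poissonization can be illustrated by simulation. In the sketch below each of the nearest neighbours is labelled as an $X$ point with a fixed probability `p`, which is a simplifying assumption standing in for $\Pr(Q_{i}(Y_{1})\in X)$. The helper names and parameter values are hypothetical. With a fixed total $k$ the two counts are negatively correlated, while with a Poisson total they become uncorrelated, as the partitioning theorem predicts.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication method for Poisson sampling; fine for small lam."""
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def split_counts(total, p, rng):
    """Label each of `total` neighbours as an X point with probability p."""
    n = sum(1 for _ in range(total) if rng.random() < p)
    return n, total - n


def covariance(pairs):
    size = len(pairs)
    mean_a = sum(a for a, _ in pairs) / size
    mean_b = sum(b for _, b in pairs) / size
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / size


rng = random.Random(1)
k, p, trials = 10, 0.5, 20000
fixed = [split_counts(k, p, rng) for _ in range(trials)]
poissonized = [split_counts(poisson(k, rng), p, rng) for _ in range(trials)]
cov_fixed = covariance(fixed)          # theory: -k p (1 - p) = -2.5
cov_poisson = covariance(poissonized)  # theory: 0
print(cov_fixed, cov_poisson)
```

The vanishing covariance is only a necessary sign of independence, but it shows why the product form of the conditional expectation becomes available after Poissonization.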

Lemma III.4

If $U$ is a Poisson random variable with the mean $\lambda>1$ , then

$$\mathbb{E}\left[(U+1)^{-1}\right]=\frac{1}{\lambda}\left(1-e^{-\lambda}\right).$$

Using this lemma for $M_{1}$ yields

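$$\mathbb{E}\left[(M_{1}+1)^{-1}\,|\,Y_{1}\right]=k^{-1}\left[\frac{\eta f_{2}(Y_{1})}{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}+O(\epsilon_{\gamma,k})\right]^{-1}+O\!\left(\frac{e^{-vk}}{k}\right),$$

Here $v$ is some positive constant. Therefore, (16) becomes

$$\mathbb{E}\left[\frac{N_{1}}{M_{1}+1}\,\Big|\,Y_{1}\right]=\frac{f_{1}(Y_{1})}{\eta f_{2}(Y_{1})}+O(\epsilon_{\gamma,k})+O(e^{-vk}).$$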
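As a sanity check on the leading term, the product of the two conditional expectations can be written out with all $O(\cdot)$ error terms dropped. This is only a restatement of the steps above under that simplification, not an additional result:

```latex
\begin{aligned}
\mathbb{E}\left[\frac{N_{1}}{M_{1}+1}\,\Big|\,Y_{1}\right]
&\approx
\underbrace{\frac{k\,f_{1}(Y_{1})}{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}}_{\mathbb{E}[N_{1}|Y_{1}]}
\cdot
\underbrace{\frac{1}{k}\,\frac{f_{1}(Y_{1})+\eta f_{2}(Y_{1})}{\eta f_{2}(Y_{1})}}_{\mathbb{E}[(M_{1}+1)^{-1}|Y_{1}]}
=\frac{f_{1}(Y_{1})}{\eta f_{2}(Y_{1})},\\[4pt]
\eta^{\alpha}\left(\frac{f_{1}(Y_{1})}{\eta f_{2}(Y_{1})}\right)^{\alpha}
&=\left(\frac{f_{1}(Y_{1})}{f_{2}(Y_{1})}\right)^{\alpha}.
\end{aligned}
```

Averaging the last expression over $Y_{1}\sim f_{2}$ gives $J_{\alpha}(f_{1},f_{2})$, which is why the factor $\eta^{\alpha}$ appears in the estimator.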

Using lemma III.2 and theorem II.2, we obtain

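$$\mathbb{E}\left[\left(\frac{N_{1}}{M_{1}+1}\right)^{\alpha}\,\Big|\,Y_{1}\right]=\eta^{-\alpha}\left(\frac{f_{1}(Y_{1})}{f_{2}(Y_{1})}\right)^{\alpha}+O(\epsilon_{\gamma,k})+O(e^{-vk})+O(N^{-\frac{1}{2}}).$$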

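By applying an equation similar to the expression for $\mathbb{E}\left[\widehat{J}_{\alpha}(X,Y)\right]$ above, we get

$$\mathbb{B}\left[\overline{J}_{\alpha}(X,Y)\right]=O(\epsilon_{\gamma,k})+O(e^{-vk})+O(N^{-\frac{1}{2}}).$$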

Lemma III.5

De-Poissonizing $\overline{J}_{\alpha}(X,Y)$ adds $O(\frac{1}{k})$ error:

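$$\mathbb{E}\left[\widehat{J}_{\alpha}(X,Y)\right]=\mathbb{E}\left[\overline{J}_{\alpha}(X,Y)\right]+O(1/k).$$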

At this point the bias proof of NNR estimator for Rényi divergence is complete, and since $O\!\left({e^{-vk}}\right)$ and $O\!\left({N^{-{\frac{1}{2}}}}\right)$ are of higher order compared to $O\!\left({\epsilon_{\gamma,k}}\right)$ , we obtain the final bias rate in (8). The bias proof of NNR estimator for f-divergence is similar, and by using the lemma III.2 for $g$ , we can follow the same steps to prove the bias bound. The complete proof is provided in the arXiv version.

IV Numerical Results

In this section we provide numerical results to show the consistency of the proposed estimator and compare the estimation quality in terms of different parameters such as $N$ and $k$ . In our experiments, we choose i.i.d samples for $X$ and $Y$ from different independent distributions such as Gaussian, truncated Gaussian and uniform functions.

The first experiment, shown in Figure 1, shows the mean estimated KL-divergence as N grows for $k$ equal to $20,40,60$ . The divergence measure is between a 2D Gaussian RV with mean $[0,0]$ and variance of $2I_{2}$ , and a uniform distribution with $x,y\in[-1,1]$ . For each case we repeat the experiment $100$ times, and compute the mean of the estimated value and the standard deviation error bars. For small sample sizes, smaller $k$ results in smaller bias error, which is due to the $\!\left({\frac{k}{N}}\right)^{\gamma/d}$ bias term. As $N$ grows, we get larger bias for small values of $k$ , which is due to the fact that the $\!\left({1/k}\right)$ term dominates. If we compare the standard deviations for different values of $k$ at $N=4000$ , they are almost equal, which verifies the fact that variance is independent of $k$ .

Figure 2 shows the MSE of the NNR estimator of Rényi divergence with $\alpha=0.5$ for two independent, truncated normal RVs. The RVs are 2D with means $\mu_{1}=\mu_{2}=[0,0]$ and covariance matrices $\sigma_{1}=I_{2}$ and $\sigma_{2}=3I_{2}$, where $I_{2}$ is the identity matrix of size $2$. Both of the RVs are truncated to the range $x\in[-2,2]$ and $y\in[-2,2]$. In this figure we show the MSE for three different sample sizes of $100,200$, and $300$ for different values of $k$. As $k$ increases initially, MSE decreases due to the $O(1/k)$ bias term. After reaching an optimal point, MSE increases as $k$ increases, indicating that the other bias terms begin to dominate. The optimal $k$ increases with the sample size, which validates our theory.

Figure 3 shows the MSE of the NNR estimator of Rényi divergence with $\alpha=2$ versus $N$ , for two i.i.d. Normal RVs for three different dimension sizes: $2,4$ , and $8$ . $k=90$ is fixed so that the $O\!\left({1/k}\right)$ term in the bias can be ignored relative to the $O\!\left({(k/N)^{\gamma/d}}\right)$ term. As dimension grows, the MSE decreases almost linearly in the logarithmic scale, which verifies the $O\!\left({(k/N)^{\gamma/d}}\right)$ bias term.

Finally, in Figure 4, we compare our estimator with two standard plug-in estimators, $k$-NN and KDE. For each of these estimators we estimate the density at each $y\in Y$, and then compute the divergence measure from its definition. The graph shows the MSE for Rényi divergence ($\alpha=0.5$) between two Gaussian random variables with the same mean and different variances ($\sigma_{1}^{2}=I_{2},\sigma_{2}^{2}=3I_{2}$) as a function of sample size, $N$. For both the NNR and $k$-NN estimators we use the optimal value for $k$, and for the KDE estimator we use the optimal bandwidth. According to this figure, the NNR estimator outperforms the other methods.

V Conclusion

In this paper we proposed a direct estimation method for Rényi and f-divergence measures based on a new graph theoretical interpretation. We proved bias and variance convergence rates, and validated our results by numerical experiments. Direct estimation procedures that converge for a fixed number $k$ of nearest neighbors are a worthwhile topic for future work.

A. Bias Proof

In this section we give proofs for the Lemmas III.1, III.2, III.3, III.4 and III.5.

For proving Lemma III.1, we need to derive a bound on the moments of $k$ -NN distances. We define the $k$ -NN ball centered at $x$ as

$$S_{k}(x):=\left\{y:d(x,y)\leq\rho_{k}(x)\right\}.$$

Let $\textbf{V}_{k,N}(x)$ denote the volume of the $k$ -NN ball with $N$ samples. Set

$$\alpha_{k}(x):=\frac{\int_{S_{k}(x)\cap\mathcal{X}}dz}{\int_{S_{k}(x)}dz}.$$

Let $\mathcal{X_{I}}$ and $\mathcal{X_{B}}$ respectively denote the interior support and boundary of the support. For a point $x\in\mathcal{X}_{I}$ we have $\alpha_{k}=1$ , and for $x\in\mathcal{X_{B}}$ we have $\alpha_{k}<1$ . Note that the definition of interior and boundary points depends on $k$ and $N$ .

Lemma V.1

We have the following relation for any $t\in\mathbb{R}$ and for each point $x\in\mathcal{X_{I}}$ with density $f(x)$ :

[TABLE]

where $u(x)=g^{\prime}(f(x))h(x)$, and $h$ is some bounded function of the density which is defined in [22].

Proof:

We start with a result from [22], A.25. Let $g:\mathbb{R}^{+}\to\mathbb{R}$ be some arbitrary function, then we have the following relation

[TABLE]

where $g_{1}$ and $g_{2}$ are bias correction functions which depend on $g$. We also have $\mathcal{C}(k):=\exp(-3k^{1-\delta})$ for a fixed $\delta\in(2/3,1)$. For example, if we set $k=(\log(N))^{1/(1-\delta)}$, then $O(\mathcal{C}(k))=O(1/N^{3})$. Note that this term is negligible compared to other bias terms in our work.

Now according to [22], if we set $g(x)=x^{-\beta}$ , then we have $g_{1}(k,N)=\frac{\Gamma(k)}{\Gamma(k-\beta)(k-1)^{\beta}}$ and $g_{2}(k,N)=0$ , which yields

[TABLE]

Finally, using the approximation $\frac{\Gamma(k)}{\Gamma(k-\beta)}=k^{\beta}+O(1/k)$ results in (26). ∎
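The Gamma-ratio approximation used in this last step can be checked numerically. The sketch below relies only on `math.lgamma` from the standard library; the helper name `gamma_ratio` and the test values of $k$ and $\beta$ are assumptions for this example. It reports the relative error of $k^{\beta}$ as an approximation of $\Gamma(k)/\Gamma(k-\beta)$.

```python
import math


def gamma_ratio(k, beta):
    """Gamma(k) / Gamma(k - beta), computed via log-gamma to avoid overflow."""
    return math.exp(math.lgamma(k) - math.lgamma(k - beta))


def relative_error(k, beta):
    """Relative gap between Gamma(k)/Gamma(k - beta) and k**beta."""
    return abs(gamma_ratio(k, beta) / k ** beta - 1.0)


beta = 0.5
err_small_k = relative_error(10, beta)
err_large_k = relative_error(1000, beta)
print(err_small_k, err_large_k)
```

The relative error shrinks roughly in proportion to $1/k$, which is the behaviour the approximation step depends on.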

Now for the case of a bounded support, we derive an upper bound on $k$ -NN distances for the points at the boundary:

Lemma V.2

For every point $x\in\mathcal{X_{B}}$ and any $t\in\mathbb{R}$ we have

[TABLE]

Proof:

Define ${V}_{k,N}(x):=\frac{k}{N\alpha_{k}(x)f(x)}$. Let $p(k,N)$ denote any positive function satisfying $p(k,N)=\Theta\!\left({(k/N)^{2/d}}\right)+\frac{\sqrt{6}}{k^{\delta/2}}$ for some $\delta>0$. Further consider the event $E_{1}$ as

[TABLE]

and $E_{2}$ as its complementary event. By using (B.2) in [22] (Appendix B), we have

[TABLE]

Moreover, we can simplify (30) as:

[TABLE]

Further we write $\mathbb{E}\left[\rho_{k}^{\gamma}(x)\right]$ as the sum of conditional expectations:

[TABLE]

where in the second line we have used (31) and also the fact that $\rho_{k}(x)$ is bounded from above because of the bounded support.

∎

Proof of Lemma III.1:

From the definition of Hölder smoothness, for every $y\in B(x,\rho_{k}(x))$ we have

[TABLE]

Using Lemmas V.1 and V.2 results in

[TABLE]

where $\epsilon_{\gamma,k}:=O\!\left({(k/N)^{\gamma/d}}\right)+O\!\left({\mathcal{C}(k)}\right)$. Note that all other terms in (26) are of higher order and can be ignored. ∎

Proof of Lemma III.2:

[TABLE]

In the second line we have used the triangle inequality for the first term, and the Lipschitz condition for the second term. Again in the third line, we have applied the Lipschitz condition for the first term, and finally in the fourth line we have used the Cauchy-Schwarz inequality.

∎

Proof of Lemma III.3:

Consider the following lemma, which is proved immediately after the proof of Lemma III.3:

Lemma V.3

For any point $y\in\mathcal{X}$, define $\xi_{1}(y):=f_{1}(y)-f_{1}(Y_{1})$ and $\xi_{2}(y):=f_{2}(y)-f_{2}(Y_{1})$. Then $\Pr\!\left({Q_{k}(Y_{1})\in X}\right)$ can be derived as

[TABLE]

where $\tau_{1}(Y_{1})$ and $\tau_{2}(Y_{1})$ are defined as

[TABLE]

and $\mathcal{U}\!\left({x}\right):=1+\sum_{i=1}^{\infty}(-1)^{i}\!\left({x}\right)^{i}$ .

Now from Lemma III.1 we can simply write $\tau_{1}(Y_{1})=O(\epsilon_{\gamma,k})$ and $\tau_{2}(Y_{1})=O(\epsilon_{\gamma,k})$ which results in:

[TABLE]

Remark 4

It can similarly be proven that

[TABLE]

∎

Proof of Lemma V.3:

Let $B(Q_{k}(Y_{1}),\epsilon)$ be the sphere with the center $Q_{k}(Y_{1})$ (the $k$ -NN point of $Y_{1}$ ) and some small radius $\epsilon>0$ . Also let $E_{X}$ and $E_{Z}$ denote the following events:

[TABLE]

We use the notation $\Pr\!\left({E_{X}(y)}\right)$ to denote $\Pr\!\left({E_{X}|Q_{k}(Y_{1})=y}\right)$.

Let $f_{Q_{k}(Y_{1})}$ be the density function of the RV $Q_{k}(Y_{1})$. Then $\Pr\!\left({Q_{k}(Y_{1})\in X}\right)$ can be written as:

[TABLE]

where $\Pr\!\left({Q_{k}(Y_{1})\in X|Q_{k}(Y_{1})=y}\right)$ can be formulated using $E_{X}(y)$ and $E_{Y}(y)$ as

[TABLE]

Let $P_{f}\!\left({y,\epsilon}\right)$ denote the probability of the sphere $B\!\left({y,\epsilon}\right)$ under the density $f$. Then there exists a real function $\Delta_{1}(\epsilon)$ such that for any $\epsilon>0$ we have

[TABLE]

where $c_{d}$ is the volume of the unit ball in dimension $d$. From the definition of the density function we have

[TABLE]

So, from (44) and (45) we get $\lim_{\epsilon\to 0}\Delta_{1}(\epsilon)/\epsilon^{d}=0$ .

Now we compute $\Pr(E_{X}(y))$ as

[TABLE]

where $\Delta_{2}(\epsilon):=\Delta_{1}(\epsilon)+\sum_{i=2}^{N}(-1)^{i}{N\choose i}P_{f_{1}}\!\left({y,\epsilon}\right)^{i}$ . Note that $\lim_{\epsilon\to 0}\Delta_{2}(\epsilon)/\epsilon^{d}=0$ .

Similarly, for $\Pr(E_{Z})$ we can prove that

[TABLE]

where $\Delta^{\prime}_{2}(\epsilon)$ is a function satisfying $\lim_{\epsilon\to 0}\Delta^{\prime}_{2}(\epsilon)/\epsilon^{d}=0$ .

From (43), and considering the fact that (46) and (47) hold true for any $\epsilon>0$, we get

[TABLE]

where $\eta=M/N$. Considering the Taylor expansion of $\frac{A+a}{B+b}$ for any real numbers $A,B,a,b$ such that $a\ll A$ and $b\ll B$, we have

[TABLE]

where $\mathcal{U}\!\left({x}\right):=\sum_{i=1}^{\infty}(-1)^{i}\!\left({x}\right)^{i}$ . Consequently, by using this fact and relation (48) we have

[TABLE]

and $\tau_{1}(Y_{1})$ and $\tau_{2}(Y_{1})$ are given by

[TABLE]

∎

Proof:

From the definition of a Poisson RV, we can write

[TABLE]

∎

Proof:

We use the following theorem from [21] to de-Poissonize the estimator.

Theorem V.4

Assume a sequence $a_{n}$ is given, and its Poisson transform is $F(z)$:

[TABLE]

Consider a linear cone $S_{\theta}=\left\{z:\left|\arg(z)\right|\leq\theta,\theta<\pi/2\right\}$ . Let the following conditions hold for some constants $R>0$ , $\alpha<1$ and $\beta\in\mathbb{R}$ :

•

For $z\in S_{\theta}$ ,

[TABLE]

•

For $z\notin S_{\theta}$ ,

[TABLE]

Then we have the following expansion that holds for every fixed $m$ :

[TABLE]

where $\sum_{ij}b_{ij}x^{i}y^{j}=\mathrm{exp}\!\left({x\log(1+y)-xy}\right)$ .
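To make the role of the Poisson transform concrete, the sketch below evaluates it numerically for a toy sequence. It assumes the standard definition $F(z)=\sum_{n\geq 0}a_{n}\frac{z^{n}}{n!}e^{-z}$, since the displayed equation is not reproduced here. The sequence $a_{n}=1/(n+1)$ is a hypothetical choice made only because its transform has the closed form $(1-e^{-z})/z$; it is not a quantity from the paper, and the function name `poisson_transform` and its truncation rule are likewise assumptions.

```python
import math


def poisson_transform(a, z, n_max=None):
    """Numerically evaluate F(z) = sum_n a(n) * z^n / n! * exp(-z) for real z > 0.

    The sum is truncated well beyond the bulk of the Poisson(z) mass
    (the truncation rule is a heuristic assumption).
    """
    if n_max is None:
        n_max = int(z + 20.0 * math.sqrt(z) + 50)
    total = 0.0
    for n in range(n_max + 1):
        # log of the Poisson(z) probability mass at n, computed stably
        log_w = -z + n * math.log(z) - math.lgamma(n + 1)
        total += a(n) * math.exp(log_w)
    return total


def a(n):
    """Toy sequence a_n = 1 / (n + 1), chosen for illustration only."""
    return 1.0 / (n + 1)


n = 50
F_n = poisson_transform(a, float(n))
print(a(n), F_n, abs(a(n) - F_n))
```

For this slowly varying sequence the gap between $a_{n}$ and $F(n)$ is of order $1/n^{2}$, which is the kind of correction term the de-Poissonization expansion quantifies.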

Let $\widehat{J}_{\alpha,k}(X,Y)$ and $\overline{J}_{\alpha,k}(X,Y)$ respectively represent the RVs $\widehat{J}_{\alpha}(X,Y)$ and $\overline{J}_{\alpha}(X,Y)$ with the parameter $k$ .

Using the de-Poissonization theorem, we take $a_{k}:=\mathbb{E}\left[\widehat{J}_{\alpha,k}(X,Y)\right]$ and $F(k):=\mathbb{E}\left[\overline{J}_{\alpha,k}(X,Y)\right]$. Since we are only interested in values of $k$ for which $\lim_{N\to\infty}\frac{k}{N}=0$, we can assume $F(z)=O(1)$. So both the first and second conditions of Theorem V.4 are satisfied. Then from (56), for $m=1$:

[TABLE]

where $\beta=0$ .

∎

Finally, we note that the bias proof for $\widehat{D}_{g}(X,Y)$ is very similar to the bias proof of $\widehat{D}_{\alpha}(X,Y)$ and follows by the same steps.

B. Ensemble Estimator

In this section we state the MSE proof of the ensemble estimator. Assume that the density functions are from the Hölder space $\Sigma(\gamma,L)$ , which consists of those functions on $\mathcal{X}$ having continuous derivatives up to order $q$ and the $q$ th partial derivatives are Hölder continuous with exponent $\gamma^{\prime}$ , where $q:=\left\lfloor\gamma\right\rfloor$ and $\gamma^{\prime}:=\gamma-q$ . We first compute the bias of interior points, by providing the following lemma.

Lemma V.5

For a constant parameter $\kappa\in\mathbb{N}$, define $\mathcal{X_{I}^{\kappa}}:=\left\{x|x\in\mathcal{X},\alpha_{\kappa}(x)=1\right\}$ and $\mathcal{X_{B}^{\kappa}}:=\left\{x|x\in\mathcal{X},\alpha_{\kappa}(x)<1\right\}$. Then for any point $Y_{1}\in\mathcal{X}$ and any $k\leq\kappa$ we have

[TABLE]

where $v$ is a constant defined in Lemma III.4 and $\theta_{\gamma}(Y_{1})$ is given by

[TABLE]

where $a_{i}(Y_{1})$ are constants depending on $Y_{1}$.

Proof:

Suppose that the density $f$ is $q$ times differentiable, and all of the $q$ derivatives are bounded. Let $y=Q_{k}(x)$. Also let $r=\rho_{k}(x)$, where $\rho_{k}(x)$ is defined as the $k$-NN distance at the point $x$. We can write $y=x+u\rho_{k}(x)$, where $u$ is a unit vector. Then the Taylor expansion of $f(y)$ around $x$ is as follows

[TABLE]

So we apply Lemma V.3 with the following choices for $\xi_{j}(Y_{1})$ , $j\in\left\{1,2\right\}$

[TABLE]

which results in

[TABLE]

For the interior points, after simplifying $\tau_{i}(Y_{1})$ given in equation (V.3), and using (26) we get

[TABLE]

where $a_{i}(Y_{1})$ is a constant depending only on $Y_{1}$ .

For boundary points, by using a result in [16] (Bias Proof), we can bound the densities and get the desired upper bound. According to this result, for any $x\in\mathcal{X_{B}}$ and any $|i|<\gamma$, we have

[TABLE]

where $h$ is the distance from $x$ to the boundary, and $L$ is a constant. Now note that since $\alpha_{\kappa}(x)<1$ and the $\kappa$ -NN ball meets the boundary, we have $h<\rho_{\kappa}(x)$ . Therefore, using the triangle inequality for (59) and setting $k=\kappa$ , for every point $y=Q_{\kappa}(x)\in\mathcal{X}$ we have

[TABLE]

where in the third line we have used (63) and the fact that $h<\rho_{\kappa}(x)$ . Using the bound on $k$ -NN distances for the boundary points derived in Lemma V.2, we have $\mathbb{E}\left[\rho^{q}_{\kappa}(x)\right]<O\!\left({(\kappa/N)^{q/d}}\right)$ . After simplifying $\tau_{i}(Y_{1})$ given in equation (V.3), we get

[TABLE]

The rest of the proof for both interior and boundary points follows similarly by replacing $O\!\left({\epsilon_{\gamma,k}}\right)$ by $\tau_{1}(Y_{1})+\tau_{2}(Y_{1})$ in (III.3), and finally we get a result similar to (III).

∎

Lemma V.6

The bias of the estimator can be derived as follows

[TABLE]

Proof:

Define $P_{\kappa,N}:=\Pr\!\left({Y_{1}\in\mathcal{X_{I}^{\kappa}}}\right)$, $\bar{P}_{\kappa,N}:=1-P_{\kappa,N}$ and $\phi_{i,\kappa}(N):=P_{\kappa,N}\mathbb{E}\left[a_{i}(Y_{1})|Y_{1}\in\mathcal{X_{I}^{\kappa}}\right]$. Using Lemma V.5 we have

[TABLE]

Using equations (13) and (III) concludes the bias rate for $D_{\alpha}(X,Y)$ . ∎

Proof:

The proof follows by using the ensemble theorem in ([10], Theorem 4) with the parameters $\psi_{i}(l)=l^{i/d}$ and $\phi^{\prime}_{i,d}(N)=\phi_{i,\kappa}(N)/N^{i/d}$.

∎
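The ensemble construction can be illustrated with a small weight computation. The sketch below assumes the usual form of weighted ensemble estimation: given base estimators indexed by parameter values $l$, choose weights $w(l)$ of minimum Euclidean norm that sum to one and satisfy $\sum_{l}w(l)\psi_{i}(l)=0$ with $\psi_{i}(l)=l^{i/d}$. The index range $i=1,\dots,d$, the parameter grid, and the function name `ensemble_weights` are assumptions made for this example; the exact constraint set used in [10] may differ.

```python
import numpy as np


def ensemble_weights(ls, d):
    """Minimum-norm weights w with sum(w) = 1 and sum_l w(l) * l^(i/d) = 0, i = 1..d.

    `ls` is a hypothetical grid of ensemble parameters; it needs more than
    d + 1 entries for the constraints to leave freedom in w.
    """
    ls = np.asarray(ls, dtype=float)
    # First row enforces sum-to-one; remaining rows cancel the psi_i bias terms.
    A = np.vstack([np.ones_like(ls)] + [ls ** (i / d) for i in range(1, d + 1)])
    b = np.zeros(d + 1)
    b[0] = 1.0
    # For an underdetermined system, lstsq returns the minimum-norm solution.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w


ls = np.arange(1, 13)
w = ensemble_weights(ls, d=2)
print(w.sum(), [float(np.dot(w, ls ** (i / 2))) for i in (1, 2)])
```

The weighted combination $\sum_{l}w(l)\widehat{D}^{(l)}$ of base estimates then cancels the bias terms proportional to $\psi_{i}(l)$, at the cost of a variance factor governed by $\|w\|_{2}^{2}$.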

B. Variance Proof

Proof:

First note that the variance proof for $Y_{i}=\!\left({N_{i}/(M_{i}+1)}\right)^{\alpha}$ and $\widehat{J}_{\alpha}(X,Y)$ is contained in the proof for $\widehat{D}_{\alpha}(X,Y)$, and the proof for $\widehat{D}_{g}(X,Y)$ is similar. So here we only focus on the variance proof of $\widehat{D}_{\alpha}(X,Y)$.

Assume that we have two sets of nodes $X_{i}$, $1\leq i\leq N$, and $Y_{j}$, $1\leq j\leq M$. Without loss of generality, assume that $N<M$. We consider the $M-N$ virtual random points $X_{N+1},...,X_{M}$ with the same distribution as $X_{i}$, and define $Z_{i}:=(X_{i},Y_{i})$. To apply the Efron-Stein inequality to $Z:=(Z_{1},...,Z_{M})$, we consider another independent copy of $Z$, denoted $Z^{\prime}:=(Z^{\prime}_{1},...,Z^{\prime}_{M})$, and define $Z^{(i)}:=(Z_{1},...,Z_{i-1},Z^{\prime}_{i},Z_{i+1},...,Z_{M})$. Let $\widehat{D}_{\alpha}(Z):=\widehat{D}_{\alpha}(X,Y)$ and $\widehat{J}_{\alpha}(Z):=\widehat{J}_{\alpha}(X,Y)$. Then, according to the Efron-Stein inequality, we have
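For orientation, the quantities whose variance is analyzed here can be computed with a few lines of code. The sketch below is a brute-force illustration, not the authors' implementation. It assumes the plug-in form $\widehat{J}_{\alpha}=\frac{\eta^{\alpha}}{M}\sum_{i=1}^{M}\left(N_{i}/(M_{i}+1)\right)^{\alpha}$ with $\eta=M/N$ and $\widehat{D}_{\alpha}=\frac{1}{\alpha-1}\log\widehat{J}_{\alpha}$, where $N_{i}$ and $M_{i}$ count the $X$ and $Y$ points among the $k$ nearest neighbors of $Y_{i}$ in the joint set. The normalization, the exclusion of $Y_{i}$ from its own neighborhood, and the function name `nnr_renyi_divergence` are assumptions.

```python
import numpy as np


def nnr_renyi_divergence(X, Y, k, alpha):
    """Nearest-neighbor-ratio style estimate of D_alpha(f1 || f2), X ~ f1, Y ~ f2."""
    N, M = len(X), len(Y)
    eta = M / N
    Z = np.vstack([X, Y])  # joint set; the first N rows are the X points
    # Squared Euclidean distance from every Y_i to every point of Z (brute force).
    d2 = ((Y[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    d2[np.arange(M), N + np.arange(M)] = np.inf  # exclude Y_i itself (assumption)
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]  # indices of the k nearest points
    N_i = (nn < N).sum(axis=1)  # number of X points among the k-NN of Y_i
    M_i = k - N_i               # number of Y points among the k-NN of Y_i
    J_hat = eta ** alpha * np.mean((N_i / (M_i + 1.0)) ** alpha)
    return np.log(J_hat) / (alpha - 1.0)


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
Y = rng.standard_normal((500, 2))
d_same = nnr_renyi_divergence(X, Y, k=20, alpha=0.5)         # identical densities
d_shift = nnr_renyi_divergence(X, Y + 1.0, k=20, alpha=0.5)  # shifted mean for Y
print(d_same, d_shift)
```

With identical densities the estimate should be close to zero, and it should grow once the two samples are separated. A tree-based neighbor search would replace the quadratic distance matrix for larger samples.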

[TABLE]
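The Efron-Stein inequality bounds the variance of a statistic by half the sum of expected squared changes when one coordinate is resampled: $\mathbb{V}[g(Z)]\leq\frac{1}{2}\sum_{i}\mathbb{E}\left[(g(Z)-g(Z^{(i)}))^{2}\right]$. The Monte Carlo sketch below checks this numerically on a toy statistic, the maximum of $M$ independent uniform variables. The statistic, the sample sizes, and the function name `efron_stein_check` are illustrative assumptions and are unrelated to the estimator analyzed in the proof.

```python
import numpy as np


def efron_stein_check(stat, M=10, trials=20000, seed=0):
    """Monte Carlo estimates of Var[stat(Z)] and of the Efron-Stein upper bound.

    Z has M i.i.d. Uniform(0, 1) coordinates; Z' is an independent copy.
    """
    rng = np.random.default_rng(seed)
    Z = rng.random((trials, M))
    Z_prime = rng.random((trials, M))
    base = stat(Z)
    bound = 0.0
    for i in range(M):
        Z_i = Z.copy()
        Z_i[:, i] = Z_prime[:, i]  # resample only the i-th coordinate
        bound += 0.5 * np.mean((base - stat(Z_i)) ** 2)
    return base.var(), bound


variance, bound = efron_stein_check(lambda z: z.max(axis=1))
print(variance, bound)
```

The same recipe applies to any statistic of independent inputs: bounding each single-coordinate resampling difference, as done for $B_{\alpha,i}$ in this proof, yields a bound on the variance.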

Using the Mean Value Theorem, and going back to the definition $\widehat{D}_{\alpha}(Z)=\frac{1}{\alpha-1}\log\!\left({\widehat{J}_{\alpha}(Z)}\right)$, there exists a constant $C_{\alpha}$ such that

[TABLE]

Therefore, we only need to bound the RHS of (69), which is also an upper bound for $\mathbb{V}\!\left[{\widehat{J}_{\alpha}(Z)}\right]$ .

[TABLE]

First, we give an upper bound on the first term in the bound above; the second term can be bounded similarly. Define

[TABLE]

Then we have

[TABLE]

where in the last line we used $\mathbb{E}\left[B_{\alpha,i}B_{\alpha,j}\right]=\mathbb{E}\left[B_{\alpha,i}\right]\mathbb{E}\left[B_{\alpha,j}\right]=\mathbb{E}\left[B_{\alpha,i}\right]^{2}$ for $i\neq j$ . Next, we only need to find bounds on $\mathbb{E}\left[B_{\alpha,2}\right]$ and $\mathbb{E}\left[B_{\alpha,2}^{2}\right]$ . In the following lemma we derive the essential bounds.

Lemma V.7

$\mathbb{E}\left[B_{\alpha,2}\right]$ and $\mathbb{E}\left[B_{\alpha,2}^{2}\right]$ satisfy the following relations:

[TABLE]

∎

Proof:

The proof is similar for $\mathbb{E}\left[B_{\alpha,2}\right]$ and $\mathbb{E}\left[B^{2}_{\alpha,2}\right]$ . So here we only focus on $\mathbb{E}\left[B_{\alpha,2}\right]$ . We can assume that we re-sample $X$ and $Y$ separately, and both of the events are similar. Let $B^{X}_{\alpha,2}$ and $B^{Y}_{\alpha,2}$ denote the re-sampling difference in (71) when we only re-sample either $X_{1}$ or $Y_{1}$ points, respectively. Then it is easy to show that $\mathbb{E}\left[B_{\alpha,2}\right]\leq\mathbb{E}\left[B^{X}_{\alpha,2}\right]+\mathbb{E}\left[B^{Y}_{\alpha,2}\right]$ .

Considering the re-sampling of $X_{1}$ , we can write

[TABLE]

where $E_{00}$ is the event that neither $X_{1}$ nor $X^{\prime}_{1}$ falls among the $k$ nearest neighbor points of $X_{2}$, $E_{01}$ is the event that $X_{1}$ falls among the $k$ nearest neighbor points of $X_{2}$ but $X^{\prime}_{1}$ does not, $E_{10}$ is the event that $X^{\prime}_{1}$ falls among the $k$ nearest neighbor points of $X_{2}$ but $X_{1}$ does not, and $E_{11}$ is the event that both $X_{1}$ and $X^{\prime}_{1}$ fall among the $k$ nearest neighbor points of $X_{2}$. Note that under both $E_{00}$ and $E_{11}$ we have $B_{\alpha,2}=0$. Also, since the events $E_{01}$ and $E_{10}$ are symmetric, we only consider the event $E_{01}$:

[TABLE]

Going back to (74), we have

[TABLE]

By using a Taylor expansion, there exists a constant $e_{1}$ such that

[TABLE]

Note that $N_{2}^{\alpha-1}/(M_{2}+1)^{\alpha-1}$ is bounded from above by $(C_{U}/C_{L})^{\alpha-1}$. Also, from (III) we get $\mathbb{E}\left[(M_{2}+1)^{-1}\right]=O\!\left({1/k}\right)$. Thus,

[TABLE]

We can similarly show that

[TABLE]

So, as a result, (Proof:) becomes

[TABLE]

Using a similar approach, one can show that $\mathbb{E}\left[B^{Y}_{\alpha,2}\right]=O(1/N)$. So, finally, we have $\mathbb{E}\left[B_{\alpha,2}\right]=O(1/N)$. ∎

From (Proof:) and Lemma V.7 we get

[TABLE]

Using a similar approach, we can also simply show that

[TABLE]

Finally using (68), (69) and (Proof:) we get

[TABLE]

Bibliography


[1] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
[2] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 547–561, University of California Press, 1961.
[3] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.
[4] K. R. Moon and A. O. Hero, “Ensemble estimation of multivariate f-divergence,” in Information Theory (ISIT), 2014 IEEE International Symposium on, pp. 356–360, IEEE, 2014.
[5] K. R. Moon, M. Noshad, S. Y. Sekeh, and A. O. Hero III, “Information theoretic structure learning with confidence,” in Proc IEEE Int Conf Acoust Speech Signal Process, 2017.
[6] J. Beardwood, J. H. Halton, and J. M. Hammersley, “The shortest path through many points,” in Math Proc Cambridge, vol. 55, pp. 299–327, Cambridge Univ Press, 1959.
[7] A. O. Hero, J. Costa, and B. Ma, “Asymptotic relations between minimal graphs and alpha-entropy,” Comm. and Sig. Proc. Lab. (CSPL), Dept. EECS, University of Michigan, Ann Arbor, Tech. Rep, vol. 334, 2003.
[8] J. H. Friedman and L. C. Rafsky, “Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests,” The Annals of Statistics, pp. 697–717, 1979.