On the Complexity of Estimating Renyi Divergences

Maciej Skorski

arXiv:1702.01666·cs.IT·February 9, 2017


TL;DR

This paper investigates the difficulty of estimating Renyi divergences between distributions, revealing that sample complexity depends heavily on rare events and can be unbounded, especially for divergence orders greater than one.

Contribution

The paper extends previous work on Renyi entropy estimation by providing new bounds and techniques, highlighting the dependence of sample complexity on small probability events.

Findings

01

Sample complexity is unbounded when the reference distribution q assigns arbitrarily small probabilities to some events.

02

For divergence order > 1, bounds depend on probabilities of p and q.

03

Worst-case complexity is polynomial only when q's probabilities are non-negligible.

Abstract

This paper studies the complexity of estimating Renyi divergences of discrete distributions: $p$ observed from samples and the baseline distribution $q$ known a priori. Extending the results of Acharya et al. (SODA'15) on estimating Renyi entropy, we present improved estimation techniques together with upper and lower bounds on the sample complexity. We show that, in contrast to Renyi entropy estimation, where a sublinear (in the alphabet size) number of samples suffices, the sample complexity depends heavily on events that are unlikely under $q$, and is unbounded in general (regardless of the estimation technique used). For any divergence of order greater than $1$, we provide upper and lower bounds on the number of samples that depend on the probabilities of $p$ and $q$. We conclude that the worst-case sample complexity is polynomial in the alphabet size if and only if the…

Tables

Table I: A brief summary of our results for the problem of estimating the Renyi divergence $D_{\alpha}(p\parallel q)$ (where the divergence parameter $\alpha>1$ is a fixed constant) between the known baseline distribution $q$ and a distribution $p$ learned from samples, both over an alphabet of size $k$. The complexity is the number of samples needed to estimate the divergence up to a constant error and with success probability at least $\frac{2}{3}$.

Assumption	Complexity	Comment	Reference
$\min_{i} q_{i} = \Theta(k^{-1}) = \max_{i} q_{i}$	$O(k^{1-\frac{1}{\alpha}})$	almost uniform $q$, sublinear complexity	Corollary 1
no assumptions	$\Omega(k^{\frac{1}{2}})$	complexity at least square root	Corollary 3
$\min_{i} q_{i} = k^{-\omega(1)}$	$k^{\omega(1)}$	negligible masses in $q$, super-polynomial complexity	Corollary 4
$\min_{i} q_{i} = k^{-O(1)}$	$k^{O(1)}$	non-negligible masses in $q$, polynomial complexity	Corollary 2

Equations

$$D_{\alpha}(p\parallel q)=\frac{1}{\alpha-1}\log\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}$$

$$-\frac{1}{\alpha-1}\log\sum_{i}p_{i}^{\alpha}=-D_{\alpha}(p\parallel q_{\mathcal{A}})+\log|\mathcal{A}|$$

$$\Pr_{x_{i}\leftarrow p}\left[\left|\mathsf{Est}^{q}(x_{1},\ldots,x_{n})-D_{\alpha}(p\parallel q)\right|>\delta\right]<\epsilon$$

$$M_{\alpha}(p,q)\overset{def}{=}2^{(\alpha-1)D_{\alpha}(p\parallel q)}=\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}$$

$$\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\frac{1}{n^{\alpha-r}}\cdot\frac{\sum_{i}\frac{p_{i}^{\alpha+r}}{q_{i}^{2\alpha-2}}}{\left(\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}\right)^{2}}\ll\epsilon\delta^{2}$$

$$\frac{\sum_{i}\frac{p_{i}^{\alpha+r}}{q_{i}^{2\alpha-2}}}{\left(\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}\right)^{2}}=k^{O(1)}\cdot\frac{\sum_{i}p_{i}^{\alpha+r}}{\left(\sum_{i}p_{i}^{\alpha}\right)^{2}}$$

$$k^{O(1)}\cdot\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\frac{1}{n^{\alpha-r}}\cdot\frac{\sum_{i}p_{i}^{\alpha+r}}{\left(\sum_{i}p_{i}^{\alpha}\right)^{2}}<0.3$$

$$k^{O(1)}\cdot\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\left(\frac{n}{k^{\frac{\alpha-1}{\alpha}}}\right)^{r-\alpha}<0.3$$

$$k^{O(1)}\cdot\left(\left(1+\frac{k^{\frac{\alpha-1}{\alpha}}}{n}\right)^{\alpha}-1\right)<0.3$$

$$k^{O(1)}\cdot\alpha\cdot\frac{k^{\frac{\alpha-1}{\alpha}}}{n}<0.3$$

$$C_{1},\quad C_{2}$$

$$n=\Omega\left(\max\left(\sqrt{C_{2}},C_{1}\right)\right)$$

$$\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}=\begin{cases}O(k^{-1})&i\neq i_{0}\\ k^{c(\alpha-1)-d\alpha}&i=i_{0}\end{cases}$$

$$\max_{i}\frac{p_{i}^{\alpha-2}}{q_{i}^{\alpha-1}}$$

$$c(\alpha-1)-d\alpha$$

$$c(\alpha-1)-d(\alpha-2)$$

$$n=\Omega(k^{d})$$

$$M_{\alpha}^{\mathsf{Est}}(p,q)=\frac{1}{n^{\alpha}}\sum_{i}\left(n\hat{p}_{i}\right)^{\underline{\alpha}}q_{i}^{1-\alpha}$$

$$\mathrm{Var}\left[\sum_{i}q_{i}^{1-\alpha}\frac{n_{i}^{\underline{\alpha}}}{n^{\alpha}}\right]\leqslant\sum_{i}\frac{n_{i}^{\alpha}\left((n_{i}+\alpha)^{\alpha}-n_{i}^{\alpha}\right)}{q_{i}^{2\alpha-2}n^{2\alpha}}=\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\frac{1}{n^{\alpha-r}}\sum_{i}\frac{p_{i}^{\alpha+r}}{q_{i}^{2\alpha-2}}$$

$$\mathsf{failure}=1-\delta\leqslant\frac{\sum_{i}q_{i}^{1-\alpha}}{M_{\alpha}(p,q)}\leqslant 1+\delta$$

$$\Pr[\mathsf{failure}]\leqslant\delta^{-2}\frac{1}{q_{j}^{\alpha-1}}\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\frac{1}{n^{\alpha-r}}\left(\sum_{i}\frac{p_{i}^{\alpha}}{q_{j}^{\alpha-1}}\right)^{-\frac{\alpha-r}{\alpha}}$$

$$M(p+\delta,q)$$

$$+\,\alpha(\alpha-1)\sum_{i}\delta_{i}^{2}\left(p_{i}+\min(0,\delta_{i})\right)^{\alpha-2}q_{i}^{1-\alpha}$$

$$M(p+\delta,q)$$

$$+\,\frac{1}{4}\alpha(\alpha-1)\sum_{i}\delta_{i}^{2}p_{i}^{\alpha-2}q_{i}^{1-\alpha}$$

$$M(p^{\prime},q)\geqslant M(p,q)\left(1+C_{1}\delta+C_{2}\delta^{2}\right)$$


Full text

Maciej Skorski, IST Austria. Supported by the European Research Council consolidator grant (682815-TOCNeT).

Abstract

This paper studies the complexity of estimating Renyi divergences of discrete distributions: $p$ observed from samples and the baseline distribution $q$ known a priori. Extending the results of Acharya et al. (SODA’15) on estimating Renyi entropy, we present improved estimation techniques together with upper and lower bounds on the sample complexity.

We show that, in contrast to Renyi entropy estimation, where a sublinear (in the alphabet size) number of samples suffices, the sample complexity depends heavily on events that are unlikely under $q$, and is unbounded in general (regardless of the estimation technique used). For any divergence of order greater than $1$, we provide upper and lower bounds on the number of samples that depend on the probabilities of $p$ and $q$. We conclude that the worst-case sample complexity is polynomial in the alphabet size if and only if the probabilities of $q$ are non-negligible.

This gives theoretical insights into heuristics used in applied papers to handle numerical instability, which occurs for small probabilities of $q$. Our result explains that small probabilities should be handled with care not only because of numerical issues, but also because of a blow-up in sample complexity.

Keywords:

Renyi divergence, sampling complexity, anomaly detection

I Introduction

I-A Renyi Divergences in Anomaly Detection

A popular statistical approach to detect anomalies in real-time data is to compare the empirical distribution of certain features (updated on the fly) against a stored “profile” (learned from past observations or computed off-line) used as a reference distribution. Significant deviations of the observed distribution from the assumed profile trigger an alarm [GMT05].

This technique, among many other applications, is often used to detect DDoS attacks in network traffic [GCFJP+15, PKY15]. To quantify the deviation between the actual data and the reference distribution, one needs to employ a suitable dissimilarity metric. In this context, based on empirical studies, Renyi divergences were suggested as good dissimilarity measures [LZY09, XLZ11, BBK15, GCFJP+15, PKY15].

While the divergence can be evaluated based on theoretical models (for example, one uses fractional Brownian motions to simulate real network traffic and Poisson distributions to model DDoS traffic [XLZ11]), much more important (especially for real-time detection) is estimation on the basis of samples. The related literature focuses mainly on tuning the performance of specific implementations by choosing appropriate parameters (such as a suitable definition or the sampling frequency) based on empirical evidence. On the other hand, not much is known about the theoretical performance of estimating Renyi divergences for general discrete distributions (continuous distributions need extra smoothness assumptions [SP14]). A limited case is estimating Renyi entropy [AOST15], which corresponds to the uniform reference distribution.

In this paper, we attempt to fill this gap by providing better estimators for the Renyi divergence, together with theoretical guarantees on their performance. In our approach, motivated by the aforementioned applications to anomaly detection, we assume that the reference distribution $q$ is explicitly known and the other distribution $p$ can only be observed from i.i.d. samples.

I-B Our Contribution and Related Works

Better estimators for a-priori known reference distributions

In the literature, Renyi divergences are typically estimated by straightforward plug-in estimators (see [LZY09, XLZ11, BBK15, GCFJP+15, PKY15]). In this approach, one puts the empirical distribution (estimated from samples) into the divergence formula in place of the true distribution. Unfortunately, plug-in estimators have poor statistical properties; for example, they are heavily biased. This affects the number of samples required to get a reliable estimate.

To obtain reliable estimates with as few samples as possible, we extend the techniques from [AOST15]. The key idea is to use falling powers to estimate power sums of a distribution (this trick is in fact a bias correction method). The estimator is given in Algorithm 1.

For certain cases (where the reference distribution is close to uniform) we estimate the divergence with the number of samples sublinear in the alphabet size, whereas plug-in estimators need a superlinear number of samples. In particular for the uniform reference distribution $q$ , we recover the same upper bounds for estimating Renyi entropy as in [AOST15].
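The falling-power idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order $\alpha$ is an integer at least 2, symbols are indexed $0,\ldots,k-1$, and the helper names `falling_power`, `power_sum_estimate` and `divergence_estimate` are hypothetical (they do not come from the paper). The power-sum formula follows the estimator $\frac{1}{n^{\alpha}}\sum_{i}(n\hat{p}_{i})^{\underline{\alpha}}q_{i}^{1-\alpha}$ used in Appendix A.

```python
import math
from collections import Counter


def falling_power(x, a):
    """Return x * (x - 1) * ... * (x - a + 1) for a non-negative integer a."""
    result = 1
    for j in range(a):
        result *= x - j
    return result


def power_sum_estimate(samples, q, alpha):
    """Bias-corrected estimate of sum_i p_i^alpha / q_i^(alpha - 1).

    samples: list of symbol indices drawn i.i.d. from the unknown p
    q:       known reference probabilities, indexed by symbol
    alpha:   integer order >= 2 (an assumption of this sketch)
    """
    n = len(samples)
    counts = Counter(samples)  # counts[i] = n * (empirical frequency of symbol i)
    total = sum(falling_power(c, alpha) * q[i] ** (1 - alpha)
                for i, c in counts.items())
    return total / n ** alpha


def divergence_estimate(samples, q, alpha):
    """Convert the power-sum estimate into a divergence estimate (base-2 logs)."""
    m = power_sum_estimate(samples, q, alpha)
    if m <= 0:
        # No symbol was observed alpha times: too few samples for a finite value.
        return float("-inf")
    return math.log2(m) / (alpha - 1)


print(power_sum_estimate([0, 0, 1, 1], [0.5, 0.5], 2))
```

A plug-in estimator would use the ordinary power `c ** alpha` in place of `falling_power(c, alpha)`; the falling power is what removes the bias under Poisson sampling.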

Upper and lower bounds on the sample complexity

We show that the sample complexity of estimating the divergence of an unknown $p$, observed from samples, to an explicitly known $q$ depends on the reference distribution $q$ itself. When $q$ has no overly small probabilities, non-trivial estimation is possible, even with a number of samples sublinear in the alphabet size, for any $p$. However, when $q$ takes arbitrarily small values, the complexity depends on inverse powers of the probability masses of $p$ and is *unbounded* (for a fixed alphabet) without extra assumptions on $p$. We stress that these lower bounds are no-go results independent of the estimation technique. For a more quantitative comparison, see Table I.

Complexity instability vs numerical instability

Our results provide theoretical insights about heuristic “patches” to the Renyi divergence formula suggested in the applied literature. Since the formula is numerically unstable when one of the probability masses $q_{i}$ becomes arbitrarily small (see Definition 2), authors have suggested omitting or rounding up very small probabilities of $q$ (see for example [LZY09, PKY15]).

In accordance with this, as shown in Table I, the sample complexity is also unstable when unlikely events occur in the reference distribution $q$. Moreover, this is the case even if the distribution $q$ is perfectly known. We therefore conclude that small probabilities of $q$ are very subtle not only because of numerical instability, but more importantly because the sample complexity is unstable.

I-C Our techniques

For upper bounds we merely borrow and extend techniques from [AOST15]. For lower bounds, however, our approach is different. We find a pair of distributions which are close in total variation yet have very different divergences to $q$, by a variational approach (writing down an explicit optimization program). As a result, we can obtain our lower bounds for any accuracy. In turn, the argument in [AOST15], even if it can be extended to the Renyi divergence, has inherent limitations that make it work only for sufficiently small accuracies. Thus our lower bound technique, in comparison to [AOST15], offers lower bounds valid in all regimes of the accuracy parameter, in particular for the constant values used in the applied literature.

In fact, our technique strictly improves the known lower bounds on estimating collision entropy. Taking the special case when $q$ is uniform, we obtain that the sample complexity for estimating collision entropy is $\Omega(k^{\frac{1}{2}})$ even for constant accuracy, while the results in [AOST15] guarantee this only for very small $\delta$ (no exact threshold is given, and hidden constants may be dependent on $\delta$), which is captured by the notation $\tilde{\tilde{\Omega}}(k^{\frac{1}{2}})$.

I-D Organization

In Section II we introduce necessary notions and notations. Upper bounds on the sample complexity are discussed in Section III and lower bounds in Section IV. We conclude our work in Section V.

II Preliminaries

For a distribution $p$ over an alphabet $\mathcal{A}=\{a_{1},\ldots,a_{k}\}$ we denote $p_{i}=p(a_{i})$. All logarithms are to base $2$.

Definition 1 (Total variation).

The total variation of two distributions $p,p^{\prime}$ over the same finite alphabet equals $d_{TV}(p,p^{\prime})=\frac{1}{2}\sum_{i}|p_{i}-p^{\prime}_{i}|$ .

Below we recall the definition of Renyi divergence (we refer the reader to [EH14] for a survey of its properties).

Definition 2 (Renyi divergence).

The Renyi divergence of order $\alpha$ (in short: Renyi $\alpha$ -divergence) of two distributions $p,q$ having the same support is defined by

$$D_{\alpha}(p\parallel q)=\frac{1}{\alpha-1}\log\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}.\tag{1}$$

By setting uniform $q$ we get the relation to Renyi entropy.

Remark 1 (Renyi entropy vs Renyi divergence).

For any $p$ over $\mathcal{A}$ the Renyi entropy of order $\alpha$ equals

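$$-\frac{1}{\alpha-1}\log\sum_{i}p_{i}^{\alpha}=-D_{\alpha}(p\parallel q_{\mathcal{A}})+\log|\mathcal{A}|,$$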

where $q_{\mathcal{A}}$ is the uniform distribution over $\mathcal{A}$ .
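The relation in Remark 1 can be checked numerically. The following sketch assumes the formula $D_{\alpha}(p\parallel q)=\frac{1}{\alpha-1}\log_{2}\sum_{i}p_{i}^{\alpha}/q_{i}^{\alpha-1}$ from Definition 2; the function names and the example distribution are hypothetical and chosen only for illustration.

```python
import math


def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) with base-2 logarithms, for alpha > 1 and full-support q."""
    power_sum = sum(pi ** alpha / qi ** (alpha - 1) for pi, qi in zip(p, q))
    return math.log2(power_sum) / (alpha - 1)


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha, written as in Remark 1."""
    return -math.log2(sum(pi ** alpha for pi in p)) / (alpha - 1)


p = [0.5, 0.25, 0.125, 0.125]       # example distribution (illustrative choice)
k = len(p)
uniform = [1 / k] * k               # q_A, the uniform distribution on the alphabet

lhs = renyi_entropy(p, 2)
rhs = -renyi_divergence(p, uniform, 2) + math.log2(k)
print(lhs, rhs)
```

Both printed values agree up to floating-point rounding, and `renyi_divergence(p, p, 2)` evaluates to zero, as expected for identical distributions.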

Definition 3 (Renyi’s divergence estimation).

Fix an alphabet $\mathcal{A}$ of size $k$ , and two distributions $p$ and $q$ over $\mathcal{A}$ . Let $\mathsf{Est}^{q}:\mathcal{A}^{n}\rightarrow\mathbb{R}$ be an algorithm which receives $n$ independent samples of $p$ on its input. We say that $\mathsf{Est}^{q}$ provides an additive $(\delta,\epsilon)$ -approximation to the Renyi $\alpha$ -divergence of $p$ from $q$ if

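$$\Pr_{x_{i}\leftarrow p}\left[\left|\mathsf{Est}^{q}(x_{1},\ldots,x_{n})-D_{\alpha}(p\parallel q)\right|>\delta\right]<\epsilon.\tag{2}$$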

Definition 4 (Renyi’s divergence estimation complexity).

The sample complexity of estimating the Renyi divergence given $q$ with probability error $\epsilon$ and additive accuracy $\delta$ is the minimal number $n$ for which there exists an algorithm satisfying Equation 2 for all $p$ .

It turns out that it is very convenient not to work directly with estimators for Renyi divergence, but rather with estimators for weighted power sums.

Definition 5 (Divergence power sums).

The power sum corresponding to the $\alpha$ divergence of $p$ and $q$ is defined as

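$$M_{\alpha}(p,q)\overset{def}{=}2^{(\alpha-1)D_{\alpha}(p\parallel q)}=\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}.\tag{3}$$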

The following lemma shows that estimating the power sums (Equation 3) with a relative error of $O(\delta)$ is equivalent to estimating the divergences (Equation 1) with an absolute error of $O(\delta/(\alpha-1))$.

Lemma 1 (Equivalence of Additive and Multiplicative Estimations).

Suppose that $m$ is a number such that $M_{\alpha}(p,q)=m\cdot(1+\delta)$, where $|\delta|<\frac{1}{2}$. Then $d=\frac{1}{\alpha-1}\log m$ satisfies $D_{\alpha}(p\parallel q)=d+O(1/(\alpha-1))\cdot\delta$. Conversely, if $d$ is such that $D_{\alpha}(p\parallel q)=d+\delta$, where $|\delta|<\frac{1}{2}$, then $m=2^{(\alpha-1)d}$ satisfies $M_{\alpha}(p,q)=m\cdot(1+O(\alpha-1)\cdot\delta)$.

The proof is a straightforward consequence of the first order Taylor’s approximation, and will appear in the full version.
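The first direction of Lemma 1 can be illustrated numerically. The sketch below assumes base-2 logarithms and the relation $D_{\alpha}=\frac{1}{\alpha-1}\log_{2}M_{\alpha}$ from Definition 2: if the power sum is off by a factor $(1+\delta)$, the divergence is off by $\frac{\log_{2}(1+\delta)}{\alpha-1}$, and for $|\delta|\leqslant\frac{1}{2}$ this is at most $\frac{2|\delta|}{\alpha-1}$ in absolute value. The function name is hypothetical.

```python
import math


def additive_error_from_relative(delta, alpha):
    """Additive error in the divergence caused by a relative error delta in the power sum."""
    return math.log2(1 + delta) / (alpha - 1)


alpha = 3
for delta in (-0.4, -0.1, 0.1, 0.4):
    err = additive_error_from_relative(delta, alpha)
    bound = 2 * abs(delta) / (alpha - 1)
    # The last column reports whether the O(delta / (alpha - 1)) bound holds.
    print(delta, err, abs(err) <= bound)
```

The constant 2 is one convenient choice that works on the interval $|\delta|\leqslant\frac{1}{2}$; the lemma itself only needs some absolute constant.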

III Upper Bounds on the Sample Complexity

Below we state our upper bounds for the sample complexity. The result is very similar to the formula in [AOST15] before simplifications, except the fact that in our statement there are additional weights coming from possibly non-uniform $q$ and it can’t be further simplified.

Theorem 1 (Generalizing [AOST15]).

For any distributions $p,q$ over an alphabet of size $k$ , if the number $n$ satisfies

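$$\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\frac{1}{n^{\alpha-r}}\cdot\frac{\sum_{i}\frac{p_{i}^{\alpha+r}}{q_{i}^{2\alpha-2}}}{\left(\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}\right)^{2}}\ll\epsilon\delta^{2},$$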

then the complexity of estimating the Renyi $\alpha$ -divergence of $p$ to given $q$ is at most $n$ .
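To get a feeling for the condition of Theorem 1, one can evaluate its left-hand side, $\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}n^{-(\alpha-r)}\cdot\frac{\sum_{i}p_{i}^{\alpha+r}/q_{i}^{2\alpha-2}}{(\sum_{i}p_{i}^{\alpha}/q_{i}^{\alpha-1})^{2}}$, for concrete distributions. The sketch below assumes an integer $\alpha$; the function name `theorem1_lhs` is hypothetical, and because the constant hidden in $\ll$ is not specified, the code only computes the quantity and does not decide whether a given $n$ is sufficient.

```python
from math import comb


def theorem1_lhs(p, q, alpha, n):
    """Left-hand side of the sufficient condition in Theorem 1 (integer alpha)."""
    power_sum = sum(pi ** alpha / qi ** (alpha - 1) for pi, qi in zip(p, q))
    total = 0.0
    for r in range(alpha):
        weighted = sum(pi ** (alpha + r) / qi ** (2 * alpha - 2)
                       for pi, qi in zip(p, q))
        total += comb(alpha, r) * weighted / (n ** (alpha - r) * power_sum ** 2)
    return total


uniform = [0.25] * 4
for n in (10, 100, 1000):
    # For uniform p = q over 4 symbols and alpha = 2 this equals 4/n^2 + 2/n.
    print(n, theorem1_lhs(uniform, uniform, 2, n))
```

The quantity decreases as $n$ grows, so the smallest admissible $n$ can be found by a simple search once a target threshold is fixed.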

The proof is deferred to the appendix; below we discuss corollaries. The first corollary shows that the complexity is sublinear when the reference distribution is close to uniform.

Corollary 1 (Sublinear complexity for almost uniform reference probabilities, extending [AOST15]).

Let $p,q$ be distributions over an alphabet of size $k$ , and $\alpha>1$ be a constant. Suppose that $\max_{i}q_{i}=O(k^{-1})$ and $\min_{i}q_{i}=\Omega(k^{-1})$ . Then the complexity of estimating the Renyi $\alpha$ -divergence with respect to $q$ , up to constant accuracy and probability error at most $\frac{1}{3}$ , is $O\left(k^{\frac{\alpha-1}{\alpha}}\right)$ .

As shown in the next corollary, the complexity is polynomial only if the reference probabilities are not negligible.

Corollary 2 (Polynomial complexity for non-negligible reference probabilities).

Let $p,q$ be distributions over an alphabet of size $k$ . Suppose that $\min_{i}{q_{i}}=k^{-O(1)}$ , and let $\alpha>1$ be a constant. Then the complexity of estimating the Renyi $\alpha$ -divergence with respect to $q$ , up to a constant accuracy and probability error at most $0.3$ (in the sense of Definition 4) is $k^{O(1)}$ .

Proof of Corollary 2.

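Under our assumptions $\sum_{i}\frac{p_{i}^{\alpha+r}}{q_{i}^{2\alpha-2}}=k^{O(1)}\cdot\sum_{i}p_{i}^{\alpha+r}$. Since $q_{i}\leqslant 1$, we get $\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}\geqslant\sum_{i}p_{i}^{\alpha}$. Consequently,

$$\frac{\sum_{i}\frac{p_{i}^{\alpha+r}}{q_{i}^{2\alpha-2}}}{\left(\sum_{i}\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}\right)^{2}}=k^{O(1)}\cdot\frac{\sum_{i}p_{i}^{\alpha+r}}{\left(\sum_{i}p_{i}^{\alpha}\right)^{2}}.$$

By Theorem 1, it therefore suffices to choose $n$ such that

$$k^{O(1)}\cdot\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\frac{1}{n^{\alpha-r}}\cdot\frac{\sum_{i}p_{i}^{\alpha+r}}{\left(\sum_{i}p_{i}^{\alpha}\right)^{2}}<0.3.$$

By the discussion in [AOST15] we know that for $r=0,\ldots,\alpha-1$ we have $\frac{\sum_{i}p_{i}^{\alpha+r}}{\left(\sum_{i}p_{i}^{\alpha}\right)^{2}}\leqslant k^{(\alpha-1)\cdot\frac{\alpha-r}{\alpha}}$. Thus we need to find $n$ that satisfies

$$k^{O(1)}\cdot\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\left(\frac{n}{k^{\frac{\alpha-1}{\alpha}}}\right)^{r-\alpha}<0.3,$$

By the inequality $\sum_{j\geqslant 0}\binom{\beta}{j}u^{j}\leqslant(1+u)^{\beta}$ (which follows from the Taylor expansion for any positive real number $\beta$) and the symmetry of binomial coefficients, we need

$$k^{O(1)}\cdot\left(\left(1+\frac{k^{\frac{\alpha-1}{\alpha}}}{n}\right)^{\alpha}-1\right)<0.3.$$

By the Taylor expansion $(1+u)^{\alpha}=1+O(\alpha u)$, valid for $u\leqslant\frac{1}{\alpha}$, it suffices that

$$k^{O(1)}\cdot\alpha\cdot\frac{k^{\frac{\alpha-1}{\alpha}}}{n}<0.3,$$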

which finishes the proof. ∎
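The two algebraic steps at the end of the proof of Corollary 2 can be sanity-checked numerically for integer $\alpha$. Writing $u=k^{\frac{\alpha-1}{\alpha}}/n$, the sum $\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}u^{\alpha-r}$ equals $(1+u)^{\alpha}-1$, and for $u\leqslant\frac{1}{\alpha}$ this is at most $(\mathrm{e}-1)\alpha u$, which is one explicit instance of the $O(\alpha u)$ bound. The function names below are hypothetical, and the constant $\mathrm{e}-1$ is our choice for illustration rather than a value taken from the paper.

```python
import math
from math import comb


def tail_sum(alpha, u):
    """sum_{r=0}^{alpha-1} C(alpha, r) * u^(alpha - r), as it appears in the proof."""
    return sum(comb(alpha, r) * u ** (alpha - r) for r in range(alpha))


def closed_form(alpha, u):
    """(1 + u)^alpha - 1, the closed form of the sum above."""
    return (1 + u) ** alpha - 1


for alpha in (2, 3, 5):
    u = 1 / alpha  # largest value for which the linear bound is claimed
    print(alpha, tail_sum(alpha, u), closed_form(alpha, u), (math.e - 1) * alpha * u)
```

The first two printed columns coincide up to rounding, and both stay below the third.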

Proof of Corollary 1.

The corollary can be concluded by inspecting the proof of Corollary 2. The bounds are the same except that the factor $k^{O(1)}$ is replaced by $\Theta(1)^{\alpha}$ . For constant $\alpha$ , the final condition reduces to $n\geqslant\Omega\left(k^{\frac{\alpha-1}{\alpha}}\right)$ . ∎

IV Sample Complexity Lower Bounds

The following theorem provides lower bounds on the sample complexity for any distributions $p$ and $q$. Since the statement is somewhat technical, we discuss only corollaries and refer to the appendix for a proof.

Theorem 2 (Sample Complexity Lower Bounds).

Let $p,q$ be two fixed distributions, $\delta\in(0,0.5)$ and numbers $C_{1},C_{2}\geqslant 0$ be given by

[TABLE]

for some $\delta_{i}$ satisfying $\delta_{i}\geqslant-\frac{1}{2}$, $\sum_{i}\delta_{i}p_{i}=0$, and $\sum_{i}p_{i}|\delta_{i}|=\delta$. Then for any fixed $\alpha>1$, estimating the Renyi divergence to $q$ (in the sense of Definition 3) with error probability $\frac{1}{3}$ and up to a constant accuracy requires at least

$$n=\Omega\left(\max\left(\sqrt{C_{2}},C_{1}\right)\right)$$

samples from $p$.

By choosing appropriate numbers in Theorem 2 we can obtain bounds for different settings.

Corollary 3 (Lower bounds for general case).

Estimating the Renyi divergence always requires $\Omega\left(k^{\frac{1}{2}}\right)$ samples.

Proof of Corollary 3.

In Theorem 2 we choose the uniform $p$ and the perturbation with $\delta_{i}=\frac{k}{4}$ for the index $i=i_{0}$ minimizing $q_{i}$, and $\delta_{i}=-\frac{k}{4(k-1)}$ elsewhere. This gives us $C_{1}\geqslant 0$ and $C_{2}\geqslant\Omega(k^{2})\cdot\frac{q_{i_{0}}^{1-\alpha}}{\sum_{i}q_{i}^{1-\alpha}}$ (with the constant dependent on $\alpha$), which is $\Omega(k)$ because $\frac{q_{i_{0}}^{1-\alpha}}{\sum_{i}q_{i}^{1-\alpha}}\geqslant k^{-1}$ by our choice of $i_{0}$. The claimed bound then follows from Theorem 2. ∎
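The perturbation chosen in the proof of Corollary 3 can be verified with exact rational arithmetic. The sketch below checks the constraints of Theorem 2 ($\delta_{i}\geqslant-\frac{1}{2}$ and $\sum_{i}\delta_{i}p_{i}=0$), computes $\sum_{i}p_{i}|\delta_{i}|$, and confirms the ratio bound $\frac{q_{i_{0}}^{1-\alpha}}{\sum_{i}q_{i}^{1-\alpha}}\geqslant k^{-1}$. The reference distribution `q`, the alphabet size and the function name are hypothetical values picked only for the example.

```python
from fractions import Fraction


def corollary3_perturbation(k, i0):
    """Uniform p and the perturbation delta from the proof of Corollary 3."""
    p = [Fraction(1, k)] * k
    delta = [Fraction(-k, 4 * (k - 1))] * k
    delta[i0] = Fraction(k, 4)
    return p, delta


k = 10
alpha = 2
# Hypothetical reference distribution with its smallest mass at index 0.
q = [Fraction(1, 100)] + [Fraction(11, 100)] * 9
i0 = min(range(k), key=lambda i: q[i])

p, delta = corollary3_perturbation(k, i0)
mean_shift = sum(d * pi for d, pi in zip(delta, p))        # should vanish
total_shift = sum(pi * abs(d) for d, pi in zip(delta, p))  # the delta of Theorem 2
ratio = q[i0] ** (1 - alpha) / sum(qi ** (1 - alpha) for qi in q)

print(mean_shift, total_shift, min(delta), ratio)
```

Because $i_{0}$ minimizes $q_{i}$ and $\alpha>1$, the term $q_{i_{0}}^{1-\alpha}$ is the largest of the $k$ summands in the denominator, which is why the ratio cannot drop below $k^{-1}$.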

Corollary 4 (Polynomial complexity requires non-negligible probability masses).

For sufficiently large $k$, if $\min_{i}q_{i}=k^{-\omega(1)}$ then there exists a distribution $p$ dependent on $k$ such that the sample complexity of estimation is at least $k^{\omega(1)}$.

Proof of Corollary 4.

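Fix one alphabet symbol $a_{i_{0}}$ and positive real numbers $c,d$. Let $q$ put the probability $\frac{1}{k^{c}}$ on $a_{i_{0}}$ and be uniform elsewhere. Also let $p$ put the probability $\frac{1}{k^{d}}$ on $a_{i_{0}}$ and be uniform elsewhere. We have

$$\frac{p_{i}^{\alpha}}{q_{i}^{\alpha-1}}=\begin{cases}O(k^{-1})&i\neq i_{0}\\ k^{c(\alpha-1)-d\alpha}&i=i_{0}\end{cases}$$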

and

[TABLE]

Choose $d$ so that it satisfies

[TABLE]

For example, $d=\frac{\alpha-1}{\alpha}\cdot c$ works (both for $\alpha\geqslant 2$ and for $1<\alpha<2$). From Theorem 2 (where we take $\delta_{i}=\frac{1}{2}$ for $i=i_{0}$ and constant $\delta_{i}$ elsewhere; our conditions on $d$ ensure that $C_{1}\geqslant 0$ and $C_{2}\geqslant\Omega(k^{2d})$, respectively) we obtain that for sufficiently large $k$ the minimal number of samples is

$$n=\Omega(k^{d}).$$

Note that if $c=\omega(1)$ our choice of $d$ implies that also $d=\omega(1)$ , and thus the corollary follows. ∎
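The construction in the proof of Corollary 4 is easy to reproduce numerically. The sketch below builds $q$ with mass $k^{-c}$ and $p$ with mass $k^{-d}$ on a single symbol (both uniform elsewhere), using $d=\frac{\alpha-1}{\alpha}\cdot c$, and evaluates the terms $p_{i}^{\alpha}/q_{i}^{\alpha-1}$. The concrete values of `k`, `c` and `alpha` and the helper `spiked` are hypothetical, picked so that the floating-point arithmetic is exact at the special symbol.

```python
def spiked(k, i0, mass):
    """Distribution with probability `mass` on symbol i0 and uniform elsewhere."""
    rest = (1 - mass) / (k - 1)
    return [mass if i == i0 else rest for i in range(k)]


k, c, alpha, i0 = 16, 2.0, 2, 0
d = (alpha - 1) / alpha * c          # the choice of d suggested in the proof

q = spiked(k, i0, k ** (-c))
p = spiked(k, i0, k ** (-d))
terms = [pi ** alpha / qi ** (alpha - 1) for pi, qi in zip(p, q)]

# Term at i0 versus the predicted k^(c(alpha-1) - d*alpha), and a typical other term.
print(terms[i0], k ** (c * (alpha - 1) - d * alpha), terms[1], 1 / k)
```

With this choice of $d$ the exponent $c(\alpha-1)-d\alpha$ vanishes, so the single rare symbol contributes a constant to the power sum while every other symbol contributes only about $k^{-1}$.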

V Conclusion

We extended the techniques recently used to analyze the complexity of entropy estimation to the problem of estimating Renyi divergence. We showed that in general there are no uniform bounds on the sample complexity, and the complexity is polynomial in the alphabet size if and only if the reference distribution doesn’t take negligible probability masses (explained by the numerical properties of the divergence formula).

Appendix A Proof of Theorem 1

Proof of Theorem 1 (sketch).

We follow essentially the same proof strategy as in [AOST15], with the only difference that we estimate weighted power sums $\sum_{i}q_{i}^{1-\alpha}{p_{i}^{\alpha}}$ corresponding to the divergence, instead of sums $\sum_{i}p_{i}^{\alpha}$ corresponding to the entropy. Let $\hat{p_{i}}$ be the empirical frequency of the $i$ -th symbol in the stream $X_{1},\ldots,X_{n}$ . Consider the following estimator for $2^{(\alpha-1)D_{\alpha}(p,q)}$ .

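$$M_{\alpha}^{\mathsf{Est}}(p,q)=\frac{1}{n^{\alpha}}\sum_{i}\left(n\hat{p}_{i}\right)^{\underline{\alpha}}q_{i}^{1-\alpha}$$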

Note that this is precisely the power sum defined in Algorithm 1. By Lemma 1 it suffices to consider this estimator with the multiplicative error $O(\delta)$ (for constant $\alpha$ ).

In particular, we use the fact that we can randomize $n$ and make it a sample from the Poisson distribution of the same mean. This transformation doesn’t hurt the estimator convergence, but on the other hand makes the empirical frequencies independent (see [AOST15] for more details).
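The reason falling powers pair well with Poisson sampling is a standard identity: if $N$ is Poisson with mean $\lambda$, then $\mathbb{E}[N^{\underline{\alpha}}]=\lambda^{\alpha}$. Under Poissonization the count of the $i$-th symbol has mean $np_{i}$, so its falling power is an unbiased estimate of $(np_{i})^{\alpha}$. The sketch below checks the identity by summing the Poisson probability mass function directly; the function names and the truncation point `cutoff` are our own illustrative choices.

```python
import math


def falling_power(x, a):
    """Return x * (x - 1) * ... * (x - a + 1)."""
    result = 1
    for j in range(a):
        result *= x - j
    return result


def poisson_factorial_moment(lam, a, cutoff=200):
    """E[N^(falling a)] for N ~ Poisson(lam), summed over 0..cutoff."""
    pmf = math.exp(-lam)  # P[N = 0]
    total = 0.0
    for x in range(cutoff + 1):
        total += pmf * falling_power(x, a)
        pmf *= lam / (x + 1)  # recurrence P[N = x + 1] = P[N = x] * lam / (x + 1)
    return total


lam, a = 3.5, 3
print(poisson_factorial_moment(lam, a), lam ** a)
```

The truncation is harmless for moderate means because the Poisson tail beyond the cutoff is negligible; for large `lam` the cutoff would need to grow accordingly.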

Under the Poisson sampling and with notations as in Algorithm 1 we arrive at the formula

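$$\mathrm{Var}\left[\sum_{i}q_{i}^{1-\alpha}\frac{n_{i}^{\underline{\alpha}}}{n^{\alpha}}\right]\leqslant\sum_{i}\frac{n_{i}^{\alpha}\left((n_{i}+\alpha)^{\alpha}-n_{i}^{\alpha}\right)}{q_{i}^{2\alpha-2}n^{2\alpha}}=\sum_{r=0}^{\alpha-1}\binom{\alpha}{r}\frac{1}{n^{\alpha-r}}\sum_{i}\frac{p_{i}^{\alpha+r}}{q_{i}^{2\alpha-2}}.$$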

The next reduction is the observation that it suffices to construct an estimator that fails with probability at most $\frac{1}{3}$, as the success probability can then be amplified by the median trick [AOST15]. In general, it is fairly standard in the literature to simply present estimators with constant error probability [CDGR16].
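The median trick mentioned above can be sketched as follows: run the constant-error estimator on several independent batches of samples and report the median of the outputs. The function `median_amplify` and its callback parameters are hypothetical names for this illustration; any estimator that maps a batch of samples to a number could be plugged in.

```python
import statistics


def median_amplify(estimator, draw_samples, n, repetitions):
    """Median of `repetitions` independent runs of a constant-error estimator.

    estimator:    function mapping a list of samples to a real-valued estimate
    draw_samples: function returning a fresh batch of n samples
    """
    estimates = [estimator(draw_samples(n)) for _ in range(repetitions)]
    return statistics.median(estimates)


# Toy usage: three of the five runs are close to 1.0 and two are far off.
outputs = iter([1.0, 100.0, 1.1, 0.9, -50.0])
result = median_amplify(lambda batch: batch[0],
                        lambda n: [next(outputs)] * n,
                        n=3, repetitions=5)
print(result)
```

The median is far from the truth only if at least half of the runs fail, so when each run fails with probability at most $\frac{1}{3}$, the overall failure probability decays exponentially in the number of repetitions.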

Let’s define the success event

[TABLE]

By Chebyshev’s Inequality we obtain the following bound

[TABLE]

(consistent with [AOST15] for uniform $q$ ) which finishes the proof. ∎

Appendix B Proof of Theorem 2

Proof of Theorem 2.

We can assume that $\max(C_{1},C_{2})\geqslant 1$, as otherwise the estimate on $n$ is trivial. We start with the following lemma (a similar technique is used in [AOST15]; our exposition is different).

Lemma 2.

Suppose that there exists a $(\delta,\epsilon)$ -estimator for the Renyi divergence as in Definition 3, which uses $n$ samples, where $\epsilon<\frac{1}{2}$ . Then the following is true: any two distributions $p,p^{\prime}$ that are $\frac{1-2\epsilon}{n}$ -close in total variation, must satisfy $|D_{\alpha}(p\parallel q)-D_{\alpha}(p^{\prime}\parallel q)|<2\delta$ .

Proof of Lemma 2.

The lemma follows from the following observation: if the estimator fails with probability at most $\epsilon$ on both distributions $p$ and $p^{\prime}$, then one can build a distinguisher for the $n$-fold products $p^{n}$ and $p^{\prime n}$ by comparing the algorithm's output against the threshold $\frac{1}{2}\left(D_{\alpha}(p,q)+D_{\alpha}(p^{\prime},q)\right)$. If $D_{\alpha}(p,q)-D_{\alpha}(p^{\prime},q)\geqslant 2\delta$, this distinguisher works with advantage $1-2\epsilon$ in total variation. We complete the proof by the standard hybrid argument: if the $n$-fold products $p^{n}$ and $p^{\prime n}$ are $1-2\epsilon$ apart in total variation, then the distributions $p$ and $p^{\prime}$ must be at least $\frac{1-2\epsilon}{n}$ apart. ∎
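Lemma 2 can be read as a recipe for lower bounds: two distributions that are close in total variation but far apart in divergence force $n>\frac{1-2\epsilon}{d_{TV}(p,p^{\prime})}$. The sketch below illustrates this with a reference distribution that has one tiny probability. All concrete numbers and the helper names are hypothetical and serve only as an example.

```python
import math


def total_variation(p, p2):
    """Total variation distance, as in Definition 1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, p2))


def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) with base-2 logarithms."""
    power_sum = sum(pi ** alpha / qi ** (alpha - 1) for pi, qi in zip(p, q))
    return math.log2(power_sum) / (alpha - 1)


def spiked(k, mass):
    """Probability `mass` on symbol 0, the remainder spread uniformly."""
    rest = (1 - mass) / (k - 1)
    return [mass] + [rest] * (k - 1)


q = spiked(4, 1e-6)    # reference distribution with one tiny probability
p = q                  # first candidate, with divergence 0 to q
p2 = spiked(4, 1e-3)   # second candidate, still very close to p

tv = total_variation(p, p2)
gap = renyi_divergence(p2, q, 2) - renyi_divergence(p, q, 2)
eps = 1 / 3
sample_bound = (1 - 2 * eps) / tv  # any estimator with 2 * delta <= gap needs more samples

print(tv, gap, sample_bound)
```

Shrinking the rare probability in `q` (and the spike in `p2` accordingly) keeps the divergence gap roughly constant while driving the total variation distance to zero, which is the mechanism behind the unbounded sample complexity.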

By combining this with Lemma 1, it suffices to prove that $M_{\alpha}(p^{\prime},q)\geqslant(1+\Omega(1))M_{\alpha}(p,q)$ for some $p,p^{\prime}$ that are close in total variation by $\frac{O(1)}{\max(\sqrt{C_{2}},C_{1})}$ .

Recall that $M_{\alpha}(p,q)=\sum_{i}{p_{i}^{\alpha}}{q_{i}^{1-\alpha}}$ . Consider any vector $\delta$ such that $\delta_{i}\geqslant-p_{i}$ and $\sum_{i}\delta_{i}=0$ (in particular, $p^{\prime}=p+\delta$ is a probability distribution). By the first order Taylor approximation

[TABLE]

Assuming that $\delta_{i}\geqslant-\frac{1}{2}p_{i}$ we obtain

[TABLE]

Changing variables by $\delta_{i}=\delta^{\prime}_{i}p_{i}$, denoting $p^{\prime}_{i}=p_{i}+p_{i}\delta^{\prime}_{i}$ and $\delta=\sum_{i}|\delta_{i}|$ gives us $p^{\prime}$ and $p$ that are $O(\delta)$ away in total variation and

$$M(p^{\prime},q)\geqslant M(p,q)\left(1+C_{1}\delta+C_{2}\delta^{2}\right).$$

Consider now two cases. Assume first that $C_{1}\geqslant 1$. The inequality $M(p^{\prime},q)\geqslant M(p,q)(1+C_{1}\delta)$ implies an additive error of $\Omega(C_{1}\delta)$ in estimation. Note that $\delta$ can be scaled (by a factor smaller than 1, as $C_{1}\geqslant 1$) so that $C_{1}\delta=\Omega(1)$. The distance between $p$ and $p^{\prime}$ is then $O\left(\frac{1}{C_{1}}\right)$. Suppose now that $C_{2}\geqslant 1$. Similarly, by scaling $\delta$ (which is possible because $C_{2}\geqslant 1$) we can arrive at $C_{2}\delta^{2}=\Omega(1)$. Then the inequality $M(p^{\prime},q)\geqslant M(p,q)(1+C_{2}\delta^{2})$ yields an additive error of $\Omega(1)$ in estimation, and the distance between $p$ and $p^{\prime}$ is $O\left(\sqrt{\frac{1}{C_{2}}}\right)$. The bounds on $n$ follow, because by Lemma 2 we must have $\frac{1}{n}<O\left(\frac{1}{C_{1}}\right)$ or $\frac{1}{n}<O\left(\sqrt{\frac{1}{C_{2}}}\right)$. ∎

Bibliography

[AOST15] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh and Himanshu Tyagi “The Complexity of Estimating Rényi Entropy” In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, 2015, pp. 1855–1869 DOI: 10.1137/1.9781611973730.124
[BBK15] Monowar H. Bhuyan, D. K. Bhattacharyya and Jugal K. Kalita “An empirical evaluation of information metrics for low-rate and high-rate DDoS attack detection” In Pattern Recognition Letters 51, 2015, pp. 1–7 DOI: 10.1016/j.patrec.2014.07.019
[CDGR16] Clément L. Canonne, Ilias Diakonikolas, Themis Gouleakis and Ronitt Rubinfeld “Testing Shape Restrictions of Discrete Distributions” In 33rd Symposium on Theoretical Aspects of Computer Science, STACS 2016, February 17-20, 2016, Orléans, France, 2016, pp. 25:1–25:14 DOI: 10.4230/LIPIcs.STACS.2016.25
[EH14] Tim van Erven and Peter Harremoës “Rényi Divergence and Kullback-Leibler Divergence” In IEEE Trans. Information Theory 60.7, 2014, pp. 3797–3820 DOI: 10.1109/TIT.2014.2320500
[GCFJP+15] Vincenzo Gulisano, Mar Callau-Zori, Zhang Fu, Ricardo Jiménez-Peris, Marina Papatriantafilou and Marta Patiño-Martínez “STONE: A streaming DDoS defense framework” In Expert Syst. Appl. 42.24, 2015, pp. 9620–9633 DOI: 10.1016/j.eswa.2015.07.027
[GMT05] Yu Gu, Andrew McCallum and Don Towsley “Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation” In Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, IMC ’05 Berkeley, CA, USA: USENIX Association, 2005, pp. 32–32 URL: http://dl.acm.org/citation.cfm?id=1251086.1251118
[LZY09] Ke Li, Wanlei Zhou and Shui Yu “Effective metric for detecting distributed denial-of-service attacks based on information divergence” In IET Communications 3.12, 2009, pp. 1851–1860 DOI: 10.1049/iet-com.2008.0586
[PKY15] Sirikarn Pukkawanna, Youki Kadobayashi and Suguru Yamaguchi “Network-based mimicry anomaly detection using divergence measures” In International Symposium on Networks, Computers and Communications, ISNCC 2015, Yasmine Hammamet, Tunisia, May 13-15, 2015, 2015, pp. 1–7 DOI: 10.1109/ISNCC.2015.7238570
