On the Complexity of Estimating Renyi Divergences
Maciej Skorski

TL;DR
This paper investigates the difficulty of estimating Renyi divergences between distributions, revealing that sample complexity depends heavily on rare events and can be unbounded, especially for divergence orders greater than one.
Contribution
The paper extends previous work on Renyi entropy estimation by providing new bounds and techniques, highlighting the dependence of sample complexity on small probability events.
Findings
Sample complexity is unbounded for small probability events.
For divergence order > 1, bounds depend on probabilities of p and q.
Worst-case complexity is polynomial only when q's probabilities are non-negligible.
Table I.

| Assumption on the reference distribution $q$ | Sample complexity (in the alphabet size) | Reference |
|---|---|---|
| $q$ almost uniform | sublinear | Corollary 1 |
| no assumptions | at least square root | Corollary 3 |
| negligible masses in $q$ | super-polynomial | Corollary 4 |
| non-negligible mass in $q$ | polynomial | Corollary 2 |
Maciej Skorski, IST Austria. Supported by the European Research Council consolidator grant (682815-TOCNeT).
Email: [email protected]
Abstract
This paper studies the complexity of estimating Renyi divergences of discrete distributions: $p$ observed from samples and the baseline distribution $q$ known a priori. Extending the results of Acharya et al. (SODA'15) on estimating Renyi entropy, we present improved estimation techniques together with upper and lower bounds on the sample complexity.
We show that, in contrast to estimating Renyi entropy, where a sublinear (in the alphabet size) number of samples suffices, the sample complexity depends heavily on events that are unlikely under $q$, and is unbounded in general (no matter which estimation technique is used). For any divergence of order bigger than $1$, we provide upper and lower bounds on the number of samples that depend on the probabilities of $p$ and $q$. We conclude that the worst-case sample complexity is polynomial in the alphabet size if and only if the probabilities of $q$ are non-negligible.
This gives theoretical insights into heuristics used in applied papers to handle numerical instability, which occurs for small probabilities of $q$. Our result explains that small probabilities should be handled with care not only because of numerical issues, but also because of a blow-up in sample complexity.
Keywords:
Renyi divergence, sampling complexity, anomaly detection
I Introduction
I-A Renyi Divergences in Anomaly Detection
A popular statistical approach to detect anomalies in real-time data is to compare the empirical distribution of certain features (updated on the fly) against a stored "profile" (learned from past observations or computed off-line) used as a reference distribution. Significant deviations of the observed distribution from the assumed profile trigger an alarm [GMT05].
This technique, among many other applications, is often used to detect DDoS attacks in network traffic [GCFJP+15, PKY15]. To quantify the deviation between the actual data and the reference distribution, one needs to employ a suitable dissimilarity metric. In this context, based on empirical studies, Renyi divergences were suggested as good dissimilarity measures [LZY09, XLZ11, BBK15, GCFJP+15, PKY15].
While the divergence can be evaluated based on theoretical models (for example, fractional Brownian motions are used to simulate real network traffic and Poisson distributions to model DDoS traffic [XLZ11]), much more important (especially for real-time detection) is estimation on the basis of samples. The related literature focuses mainly on tuning the performance of specific implementations by choosing appropriate parameters (such as a suitable definition or the sampling frequency) based on empirical evidence. On the other hand, not much is known about the theoretical performance of estimating Renyi divergences for general discrete distributions (continuous distributions need extra smoothness assumptions [SP14]). A limited case is estimating Renyi entropy [AOST15], which corresponds to the uniform reference distribution.
In this paper, we attempt to fill the gap by providing better estimators for the Renyi divergence, together with theoretical guarantees on the performance. In our approach, motivated by the applications to anomaly detection mentioned above, we assume that the reference distribution is explicitly known and the other distribution can only be observed from i.i.d. samples.
I-B Our Contribution and Related Works
Better estimators for a-priori known reference distributions
In the literature Renyi divergences are typically estimated by straightforward plug-in estimators (see [LZY09, XLZ11, BBK15, GCFJP+15, PKY15]). In this approach, one puts the empirical distribution (estimated from samples) into the divergence formula, in place of the true distribution. Unfortunately, plug-in estimators have poor statistical properties; for example, they are heavily biased. This affects the number of samples required to get a reliable estimate.
To obtain reliable estimates from as few samples as possible, we extend the techniques from [AOST15]. The key idea is to use falling powers to estimate power sums of a distribution (this trick is in fact a bias correction method). The estimator is illustrated in Algorithm 1.
For certain cases (where the reference distribution is close to uniform) we estimate the divergence with a number of samples sublinear in the alphabet size, whereas plug-in estimators need a superlinear number of samples. In particular, for the uniform reference distribution $q$, we recover the same upper bounds for estimating Renyi entropy as in [AOST15].
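As a rough illustration of the falling-power idea, the sketch below estimates the weighted power sum $\sum_i q_i^{1-\alpha} p_i^{\alpha}$ for an integer order $\alpha \ge 2$ and turns it into a divergence estimate. The function names, the base-2 logarithm, and the normalization by $n^{\alpha}$ (the natural choice under Poisson sampling) are assumptions of this sketch, not details confirmed by the text; the plug-in variant is included only for comparison.

```python
import math
from collections import Counter


def falling_power(x, k):
    """Return x * (x - 1) * ... * (x - k + 1) for a non-negative integer k."""
    result = 1
    for j in range(k):
        result *= x - j
    return result


def power_sum_estimate(samples, q, alpha):
    """Bias-corrected estimate of sum_i q_i^(1-alpha) * p_i^alpha (integer alpha)."""
    n = len(samples)
    counts = Counter(samples)
    total = 0.0
    for symbol, n_i in counts.items():
        # Falling powers replace n_i ** alpha; this is the bias correction.
        total += q[symbol] ** (1 - alpha) * falling_power(n_i, alpha)
    return total / n ** alpha


def plug_in_power_sum(samples, q, alpha):
    """Plug-in estimate: empirical frequencies inserted directly into the sum."""
    n = len(samples)
    counts = Counter(samples)
    return sum(q[s] ** (1 - alpha) * (c / n) ** alpha for s, c in counts.items())


def renyi_divergence_estimate(samples, q, alpha):
    """Convert the power-sum estimate into a divergence estimate (base-2 log)."""
    s = power_sum_estimate(samples, q, alpha)
    if s <= 0:
        return float("-inf")  # too few repeated symbols to estimate anything
    return math.log2(s) / (alpha - 1)
```

On very short streams the corrected estimate can differ substantially from the plug-in value, since symbols seen fewer than $\alpha$ times contribute nothing to the falling-power sum.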
Upper and lower bounds on the sample complexity
We show that the sample complexity of estimating the divergence of an unknown $p$ observed from samples to an explicitly known $q$ depends on the reference distribution $q$ itself. When $q$ does not take too small probabilities, non-trivial estimation is possible, even sublinear in the alphabet size for any $p$. However, when $q$ takes arbitrarily small values, the complexity depends on inverse powers of the probability masses of $q$ and is *unbounded* (for a fixed alphabet), without extra assumptions on $p$. We stress that these lower bounds are no-go results independent of the estimation technique. For a more quantitative comparison, see Table I.
Complexity instability vs numerical instability
Our results provide theoretical insights about heuristic "patches" to the Renyi divergence formula suggested in the applied literature. Since the formula is numerically unstable when one of the probability masses becomes arbitrarily small (see Definition 2), authors have suggested omitting or rounding up very small probabilities of $q$ (see for example [LZY09, PKY15]).
In accordance with this, as shown in Table I, the sample complexity is also unstable when unlikely events occur in the reference distribution $q$. Moreover, this is the case even if the distribution $q$ is perfectly known. We therefore conclude that small probabilities of $q$ are very subtle not only because of numerical instability, but more importantly because the sample complexity is unstable.
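To make the numerical side of this concrete, the sketch below evaluates the standard Renyi divergence formula $\frac{1}{\alpha-1}\log_2\sum_i p_i^{\alpha} q_i^{1-\alpha}$ for a fixed $p$ while one probability mass of the reference distribution shrinks. The alphabet size, the distributions, and the helper `reference_with_rare_symbol` are hypothetical choices made only for illustration.

```python
import math


def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha (base-2 log), for distributions given as lists."""
    power_sum = sum(
        pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0
    )
    return math.log2(power_sum) / (alpha - 1)


def reference_with_rare_symbol(size, eps):
    """Hypothetical reference distribution: mass eps on symbol 0, uniform elsewhere."""
    rest = (1.0 - eps) / (size - 1)
    return [eps] + [rest] * (size - 1)


p = [0.25] * 4  # the observed distribution is kept fixed
values = [
    renyi_divergence(p, reference_with_rare_symbol(4, eps), 2)
    for eps in (1e-1, 1e-3, 1e-6, 1e-9)
]
# The divergence keeps growing as the rare mass of the reference shrinks.
```

The growth is driven entirely by the single term $p_0^{\alpha} q_0^{1-\alpha}$, which is why rounding up tiny reference probabilities changes the value so sharply.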
I-C Our techniques
For upper bounds we merely borrow and extend techniques from [AOST15]. For lower bounds, however, our approach is different. We find a pair of distributions which are close in total variation yet have very different divergences to $q$, by a variational approach (writing down an explicit optimization program). As a result, we can obtain our lower bounds for any accuracy. In turn, the argument in [AOST15], even if it can be extended to the Renyi divergence, has inherent limitations that make it work only for sufficiently small accuracies. Thus our lower bound technique, in comparison to [AOST15], offers lower bounds valid in all regimes of the accuracy parameter, in particular for the constant values used in the applied literature.
In fact, our technique strictly improves known lower bounds on estimating collision entropy. Taking the special case when $q$ is uniform, we obtain that the sample complexity of estimating collision entropy is at least of the order of the square root of the alphabet size even for constant accuracy, while the results in [AOST15] guarantee this only for very small accuracy parameters (no exact threshold is given, and the hidden constants may depend on the accuracy), as captured by the asymptotic notation used there.
I-D Organization
In Section II we introduce necessary notions and notations. Upper bounds on the sample complexity are discussed in Section III and lower bounds in Section IV. We conclude our work in Section V.
II Preliminaries
For a distribution $p$ over an alphabet $\mathcal{A} = \{a_1, \ldots, a_K\}$ we denote $p_i = p(a_i)$. All logarithms are at base $2$.
**Definition 1 (Total variation).**
The total variation of two distributions $p, q$ over the same finite alphabet equals $d_{TV}(p, q) = \frac{1}{2}\sum_i |p_i - q_i|$.
Below we recall the definition of Renyi divergence (we refer the reader to [EH14] for a survey of its properties).
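**Definition 2 (Renyi divergence).**
The Renyi divergence of order $\alpha$ (in short: Renyi $\alpha$-divergence) of two distributions $p, q$ having the same support is defined by
$$D_\alpha(p \parallel q) = \frac{1}{\alpha-1}\log\sum_i \frac{p_i^{\alpha}}{q_i^{\alpha-1}}. \tag{1}$$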
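By setting $q$ uniform we get the relation to Renyi entropy.
**Remark 1 (Renyi entropy vs Renyi divergence).**
For any $p$ over $\mathcal{A}$ the Renyi entropy of order $\alpha$ equals
$$H_\alpha(p) = -D_\alpha(p \parallel q_U) + \log|\mathcal{A}|,$$
where $q_U$ is the uniform distribution over $\mathcal{A}$.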
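The relation can be checked directly from the standard definitions $D_\alpha(p \parallel q) = \frac{1}{\alpha-1}\log\sum_i p_i^{\alpha} q_i^{1-\alpha}$ and $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^{\alpha}$. In the short derivation below, $q_U$ denotes the uniform distribution, so every reference probability equals $1/|\mathcal{A}|$; this notation is assumed here for illustration.

```latex
\begin{align*}
D_\alpha(p \parallel q_U)
  &= \frac{1}{\alpha-1}\log\sum_i p_i^{\alpha}\,|\mathcal{A}|^{\alpha-1} \\
  &= \log|\mathcal{A}| + \frac{1}{\alpha-1}\log\sum_i p_i^{\alpha} \\
  &= \log|\mathcal{A}| - H_\alpha(p).
\end{align*}
```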
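**Definition 3 (Renyi's divergence estimation).**
Fix an alphabet $\mathcal{A}$ of size $K$, and two distributions $p$ and $q$ over $\mathcal{A}$. Let $\mathsf{Est}: \mathcal{A}^n \to \mathbb{R}$ be an algorithm which receives $n$ independent samples of $p$ on its input. We say that $\mathsf{Est}$ provides an additive $(\epsilon, \delta)$-approximation to the Renyi $\alpha$-divergence of $p$ from $q$ if
$$\Pr_{x_1, \ldots, x_n \sim p}\left[\,\left|\mathsf{Est}(x_1, \ldots, x_n) - D_\alpha(p \parallel q)\right| > \epsilon\,\right] < \delta. \tag{2}$$
**Definition 4 (Renyi's divergence estimation complexity).**
The sample complexity of estimating the Renyi divergence given $q$ with probability error $\delta$ and additive accuracy $\epsilon$ is the minimal number $n$ for which there exists an algorithm satisfying Equation 2 for all $p$.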
It turns out that it is very convenient not to work directly with estimators for Renyi divergence, but rather with estimators for weighted power sums.
**Definition 5 (Divergence power sums).**
The power sum corresponding to the divergence of $p$ and $q$ is defined as
$$P_\alpha(p, q) = \sum_i q_i^{1-\alpha}\, p_i^{\alpha}. \tag{3}$$
The following lemma shows that estimating divergences (Equation 1) with a small absolute error and estimating the corresponding power sums (Equation 3) with a small relative error are equivalent.
**Lemma 1 (Equivalence of Additive and Multiplicative Estimations).**
Suppose that is a number such that , where . Then satisfies . The other way around, if is such that , where , then satisfies .
The proof is a straightforward consequence of the first-order Taylor approximation, and will appear in the full version.
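A sketch of why the two error notions match, assuming the power sum is $P_\alpha(p,q) = \sum_i q_i^{1-\alpha} p_i^{\alpha}$, so that $D_\alpha(p \parallel q) = \frac{1}{\alpha-1}\log P_\alpha(p,q)$. The hatted symbols and the relative error $\eta$ are names introduced here for illustration, and constants are not tracked.

```latex
\begin{align*}
\hat{P} = P_\alpha(p,q)\,(1+\eta)
\;\Longrightarrow\;
\hat{D} := \frac{1}{\alpha-1}\log\hat{P}
  &= D_\alpha(p \parallel q) + \frac{\log(1+\eta)}{\alpha-1} \\
  &= D_\alpha(p \parallel q) + O\!\left(\frac{|\eta|}{\alpha-1}\right)
  \quad\text{for } |\eta| \le \tfrac{1}{2}.
\end{align*}
```

Reading the same identity backwards, an additive error in $\hat{D}$ becomes a multiplicative factor $2^{(\alpha-1)(\hat{D} - D_\alpha)}$ on the power sum, which is close to $1$ when the additive error is small.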
III Upper Bounds on the Sample Complexity
Below we state our upper bounds for the sample complexity. The result is very similar to the formula in [AOST15] before simplifications, except that our statement contains additional weights coming from the possibly non-uniform $q$ and cannot be further simplified.
**Theorem 1 (Generalizing [AOST15]).**
For any distributions $p, q$ over an alphabet of size $K$, if the number $n$ satisfies
[TABLE]
then the complexity of estimating the Renyi $\alpha$-divergence of $p$ to $q$ given $q$ is at most $n$.
The proof is deferred to the appendix; below we discuss corollaries. The first corollary shows that the complexity is sublinear when the reference distribution is close to uniform.
**Corollary 1 (Sublinear complexity for almost uniform reference probabilities, extending [AOST15]).**
Let be distributions over an alphabet of size , and be a constant. Suppose that and . Then the complexity of estimating the Renyi -divergence with respect to , up to constant accuracy and probability error at most , is .
As shown in the next corollary, the complexity is polynomial only if the reference probabilities are not negligible.
**Corollary 2 (Polynomial complexity for non-negligible reference probabilities).**
Let be distributions over an alphabet of size . Suppose that , and let be a constant. Then the complexity of estimating the Renyi -divergence with respect to , up to a constant accuracy and probability error at most (in the sense of DefinitionĀ 4) is .
Proof of Corollary 2.
Under our assumptions . Since , we get . By Theorem 1, we conclude that the sufficient condition is
[TABLE]
Therefore, we need to choose $n$ such that
[TABLE]
By the discussion in [AOST15] we know that for we have . Thus we need to find that satisfies
[TABLE]
By the inequality (which follows by the Taylor expansion for any positive real number ) and the symmetry of binomial coefficients we need
[TABLE]
By the Taylor expansion valid for it suffices if
[TABLE]
which finishes the proof. ∎
Proof of Corollary 1.
The corollary can be concluded by inspecting the proof of Corollary 2. The bounds are the same except that the factor is replaced by . For constant , the final condition reduces to . ∎
IV Sample Complexity Lower Bounds
The following theorem provides lower bounds on the sample complexity for any distributions $p$ and $q$. Since the statement is somewhat technical, we discuss only corollaries and refer to the appendix for a proof.
**Theorem 2 (Sample Complexity Lower Bounds).**
Let $p, q$ be two fixed distributions, and numbers be given by
[TABLE]
for some satisfying , , and . Then for any fixed , estimating the Renyi divergence to $q$ (in the sense of Definition 3) with error probability and up to a constant accuracy requires at least
[TABLE]
samples from .
By choosing appropriate numbers in Theorem 2 we can obtain bounds for different settings.
**Corollary 3 (Lower bounds for general case).**
Estimating the Renyi divergence always requires $\Omega(\sqrt{K})$ samples, where $K$ is the alphabet size.
Proof of Corollary 3.
In Theorem 2 we choose the uniform and such that for the index minimizing , and elsewhere. This gives us and (the constant dependent on ) which is bigger than , because by our choice of . ∎
**Corollary 4 (Polynomial complexity requires non-negligible probability masses).**
For sufficiently large , if then there exists a distribution dependent on such that estimation is at least .
Proof of Corollary 4.
Fix one alphabet symbol and real positive numbers . Let put the probability on and be uniform elsewhere. Also let put the probability on and be uniform elsewhere. We have
[TABLE]
and
[TABLE]
Choose so that it satisfies
[TABLE]
for example (works for and ) we obtain from Theorem 2 (where we take for and constant elsewhere, and our conditions on ensure that and respectively ) that for sufficiently large the minimal number of samples is
[TABLE]
Note that if our choice of implies that also , and thus the corollary follows. ∎
V Conclusion
We extended the techniques recently used to analyze the complexity of entropy estimation to the problem of estimating Renyi divergence. We showed that in general there are no uniform bounds on the sample complexity, and that the complexity is polynomial in the alphabet size if and only if the reference distribution does not take negligible probability masses (which is explained by the numerical properties of the divergence formula).
Appendix A Proof of Theorem 1
Proof of Theorem 1 (sketch).
We follow essentially the same proof strategy as in [AOST15], with the only difference that we estimate weighted power sums corresponding to the divergence, instead of sums corresponding to the entropy. Let $n_i$ be the empirical frequency of the $i$-th symbol in the stream $x_1, \ldots, x_n$. Consider the following estimator for $P_\alpha(p, q) = \sum_i q_i^{1-\alpha} p_i^{\alpha}$:
$$\hat{P} = \sum_i q_i^{1-\alpha}\,\frac{n_i^{\underline{\alpha}}}{n^{\alpha}}, \qquad n_i^{\underline{\alpha}} = n_i (n_i - 1) \cdots (n_i - \alpha + 1).$$
Note that this is precisely the power sum defined in Algorithm 1. By Lemma 1 it suffices to consider this estimator with the multiplicative error (for constant ).
In particular, we use the fact that we can randomize the number of samples $n$ and make it a sample from the Poisson distribution of the same mean. This transformation does not hurt the estimator convergence, but on the other hand makes the empirical frequencies $n_i$ independent (see [AOST15] for more details).
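The Poisson sampling step can be pictured with the following sketch, which first draws a random stream length with mean $n$ and then draws that many i.i.d. symbols. The function names are hypothetical, and the simple Poisson sampler (Knuth's multiplication method) is a stand-in chosen here to stay within the standard library; it is only suitable for moderate means.

```python
import math
import random
from collections import Counter


def sample_poisson(mean, rng):
    """Draw one Poisson(mean) variate with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def poissonized_counts(p, n, rng):
    """Draw N ~ Poisson(n) symbols i.i.d. from p and return per-symbol counts and N.

    With a Poisson-distributed stream length, the count of symbol i is itself
    Poisson(n * p[i]) and the counts are mutually independent.
    """
    total = sample_poisson(n, rng)
    symbols = rng.choices(range(len(p)), weights=p, k=total)
    counts = Counter(symbols)
    return [counts.get(i, 0) for i in range(len(p))], total


rng = random.Random(0)
counts, total = poissonized_counts([0.5, 0.3, 0.2], 40, rng)
```

Independence of the counts is what allows the variance of a sum over symbols to be computed term by term.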
Under the Poisson sampling and with notations as in Algorithm 1 we arrive at the formula
[TABLE]
The next reduction is the observation that it suffices to construct an estimator that fails with at most a constant probability, as the success probability can then be amplified by the median trick [AOST15]. In general, it is standard in the literature to simply present estimators with constant error probability [CDGR16].
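The median trick mentioned here can be sketched as follows: an estimator that is accurate with probability bounded above one half is run several times on fresh samples, and the median of the outputs is returned. The helper name `median_of_estimates` and the toy outputs are hypothetical and only illustrate the mechanism.

```python
import statistics


def median_of_estimates(run_estimator, repetitions):
    """Run an estimator several times on fresh samples and return the median output."""
    estimates = [run_estimator() for _ in range(repetitions)]
    return statistics.median(estimates)


# Toy run: two of the five outputs are badly off, yet the median stays accurate.
outputs = iter([1.02, 57.0, 0.98, 1.01, -40.0])
result = median_of_estimates(lambda: next(outputs), 5)  # -> 1.01
```

The median is wrong only if at least half of the runs fail, which becomes exponentially unlikely in the number of repetitions when each run fails with probability bounded below one half.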
Let us define the success event
[TABLE]
By Chebyshev's inequality we obtain the following bound
[TABLE]
(consistent with [AOST15] for uniform $q$) which finishes the proof. ∎
Appendix B Proof of Theorem 2
Proof of Theorem 2.
We can assume that , as otherwise the estimate on is trivial. We start with the following lemma (a similar technique is used in [AOST15]; our exposition is different).
**Lemma 2.**
Suppose that there exists a -estimator for the Renyi divergence as in DefinitionĀ 3, which uses samples, where . Then the following is true: any two distributions that are -close in total variation, must satisfy .
Proof of Lemma 2.
The lemma follows by the following observation: if the estimator fails with probability at most on both distributions and , then one can build a distinguisher for the $n$-fold products and by comparing the algorithm outputs against the threshold . If , this distinguisher works with advantage in total variation. We complete the proof by the standard hybrid argument: if the $n$-fold products and are away by in total variation, then the distributions and must be away. ∎
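The hybrid argument rests on the fact that the total variation distance between $n$-fold products is at most $n$ times the distance between the single-sample distributions. The sketch below checks this on a small hypothetical pair of distributions by enumerating all length-$n$ words; the function names and the example values are chosen here for illustration.

```python
import math
from itertools import product


def total_variation(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def product_distribution(p, n):
    """Probabilities of all length-n words drawn i.i.d. from p, in lexicographic order."""
    return [
        math.prod(p[i] for i in word)
        for word in product(range(len(p)), repeat=n)
    ]


p = [0.5, 0.5]
p_prime = [0.6, 0.4]
n = 3
tv_single = total_variation(p, p_prime)
tv_product = total_variation(product_distribution(p, n), product_distribution(p_prime, n))
# Subadditivity: distinguishing n samples is at most n times easier than one sample.
```

Turned around, if $n$ samples suffice to tell two distributions apart with constant advantage, the distributions themselves must be at distance at least of order $1/n$, which is how the lemma converts closeness in total variation into a lower bound on $n$.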
By combining this with Lemma 1, it suffices to prove that for some that are close in total variation by .
Recall that . Consider any vector such that and (in particular, is a probability distribution). By the first order Taylor approximation
[TABLE]
Assuming that we obtain
[TABLE]
changing variables by , denoting and gives us and that are away in total variation and
[TABLE]
Consider now two cases. Assume first . The inequality implies an additive error of in estimation. Note that can be scaled (by a factor smaller than 1, as ) so that . The distance between and is then at least . Suppose now that . Similarly, by scaling (which is possible because we have ) we can arrive at . Then the inequality yields an additive error in estimation, and the distance between and is . The bounds on follow, because by Lemma 2 we must have or . ∎
References
- [AOST15] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh and Himanshu Tyagi. "The Complexity of Estimating Rényi Entropy". In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, 2015, pp. 1855–1869. DOI: 10.1137/1.9781611973730.124
- [BBK15] Monowar H. Bhuyan, D. K. Bhattacharyya and Jugal K. Kalita. "An empirical evaluation of information metrics for low-rate and high-rate DDoS attack detection". In Pattern Recognition Letters 51, 2015, pp. 1–7. DOI: 10.1016/j.patrec.2014.07.019
- [CDGR16] Clément L. Canonne, Ilias Diakonikolas, Themis Gouleakis and Ronitt Rubinfeld. "Testing Shape Restrictions of Discrete Distributions". In 33rd Symposium on Theoretical Aspects of Computer Science, STACS 2016, February 17-20, 2016, Orléans, France, 2016, pp. 25:1–25:14. DOI: 10.4230/LIPIcs.STACS.2016.25
- [EH14] Tim van Erven and Peter Harremoës. "Rényi Divergence and Kullback-Leibler Divergence". In IEEE Trans. Information Theory 60.7, 2014, pp. 3797–3820. DOI: 10.1109/TIT.2014.2320500
- [GCFJP+15] Vincenzo Gulisano, Mar Callau-Zori, Zhang Fu, Ricardo Jiménez-Peris, Marina Papatriantafilou and Marta Patiño-Martínez. "STONE: A streaming DDoS defense framework". In Expert Syst. Appl. 42.24, 2015, pp. 9620–9633. DOI: 10.1016/j.eswa.2015.07.027
- [GMT05] Yu Gu, Andrew McCallum and Don Towsley. "Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation". In Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, IMC '05. Berkeley, CA, USA: USENIX Association, 2005, pp. 32–32. URL: http://dl.acm.org/citation.cfm?id=1251086.1251118
- [LZY09] Ke Li, Wanlei Zhou and Shui Yu. "Effective metric for detecting distributed denial-of-service attacks based on information divergence". In IET Communications 3.12, 2009, pp. 1851–1860. DOI: 10.1049/iet-com.2008.0586
- [PKY15] Sirikarn Pukkawanna, Youki Kadobayashi and Suguru Yamaguchi. "Network-based mimicry anomaly detection using divergence measures". In International Symposium on Networks, Computers and Communications, ISNCC 2015, Yasmine Hammamet, Tunisia, May 13-15, 2015, 2015, pp. 1–7. DOI: 10.1109/ISNCC.2015.7238570
