Non-uniform Bounds in the Poisson Approximation with Applications to Informational Distances. I
S. G. Bobkov, G. P. Chistyakov, F. Götze

S. G. Bobkov (1), G. P. Chistyakov (2), and F. Götze (2)
(1) Sergey G. Bobkov, School of Mathematics, University of Minnesota, 127 Vincent Hall, 206 Church St. S.E., Minneapolis, MN 55455, USA; research was partially supported by SFB 1283, Humboldt Foundation, and NSF grant
(2) Gennadiy P. Chistyakov and Friedrich Götze, Fakultät für Mathematik, Universität Bielefeld, Postfach 100131, 33501 Bielefeld, Germany; research was partially supported by SFB 1283
Abstract.
We explore asymptotically optimal bounds for deviations of Bernoulli convolutions from the Poisson limit in terms of the Shannon relative entropy and the Pearson $\chi^2$-distance. The results are based on proper non-uniform estimates for densities. This part deals with the so-called non-degenerate case.
Key words and phrases:
$\chi^2$-divergence, Relative entropy, Poisson approximation
1991 Mathematics Subject Classification:
Primary 60E, 60F
1. Introduction
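Let $X_1, \dots, X_n$ be independent Bernoulli random variables taking the two values, $1$ (interpreted as a success) and $0$ (as a failure), with respective probabilities $p_j$ and $q_j = 1 - p_j$. The total number of successes $W = X_1 + \dots + X_n$ takes values $k = 0, 1, \dots, n$ with probabilities
$$\mathbb{P}\{W = k\} = \sum p_1^{\varepsilon_1} q_1^{1-\varepsilon_1} \cdots p_n^{\varepsilon_n} q_n^{1-\varepsilon_n}, \tag{1.1}$$
where the summation runs over all 0-1 sequences $\varepsilon_1, \dots, \varepsilon_n$ such that $\varepsilon_1 + \dots + \varepsilon_n = k$. Although this expression is difficult to determine in case of arbitrary $p_j$ and large $n$, it can be well approximated by the Poisson probabilities under quite general assumptions. Putting
$$\lambda = p_1 + \dots + p_n,$$
let $Z$ be a Poisson random variable with parameter $\lambda$ (for short, $Z \sim P_\lambda$), i.e.,
$$\mathbb{P}\{Z = k\} = \frac{\lambda^k}{k!}\, e^{-\lambda}, \quad k = 0, 1, 2, \dots$$
It has long been known that, if $\max_j p_j$ is small, the Poisson law $P_\lambda$ approximates the distribution of $W$, which may be quantified by means of the total variation distance
$$d(W,Z) = \frac{1}{2} \sum_{k=0}^{\infty} |w_k - v_k|,$$
where $w_k = \mathbb{P}\{W = k\}$ and $v_k = \mathbb{P}\{Z = k\}$. In particular, based on the Stein-Chen method, there is the following two-sided bound due to Barbour and Hall involving the functional
$$\lambda_2 = p_1^2 + \dots + p_n^2.$$
Theorem 1.1 [1]. One has
$$\frac{1}{32}\, \min\Big\{1, \frac{1}{\lambda}\Big\}\, \lambda_2 \,\leq\, d(W,Z) \,\leq\, \frac{1 - e^{-\lambda}}{\lambda}\, \lambda_2. \tag{1.2}$$
Here, the parameter $\lambda_2$, or more precisely the ratio $\lambda_2/\lambda$ (for $\lambda$ bounded away from zero), plays a similar role as the Lyapunov ratio in the central limit theorem.
In the i.i.d. case with $p_j = \lambda/n$ and fixed $\lambda$, both sides of (1.2) are of the same order $1/n$. In the case $\lambda \leq 1$, the upper bound in (1.2) is sharp also in the sense that the second inequality becomes an equality for $n = 1$, $p_1 = \lambda$, when both sides are equal to $\lambda (1 - e^{-\lambda})$.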
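As a numerical illustration of Theorem 1.1, the following minimal Python sketch computes the distribution of the number of successes exactly by iterated convolution and compares the total variation distance with both sides of the Barbour-Hall bound in its standard published form, $\frac{1}{32}\min\{1, 1/\lambda\} \sum_j p_j^2 \leq d \leq \frac{1 - e^{-\lambda}}{\lambda} \sum_j p_j^2$ with $\lambda = \sum_j p_j$. The function names, the truncation parameter `extra`, and the sample probabilities are assumptions made for illustration only.

```python
import math


def poisson_binomial_pmf(p):
    """Exact pmf of W = X_1 + ... + X_n for independent Bernoulli(p_j)."""
    w = [1.0]
    for pj in p:
        nxt = [0.0] * (len(w) + 1)
        for k, wk in enumerate(w):
            nxt[k] += wk * (1.0 - pj)   # trial j fails
            nxt[k + 1] += wk * pj       # trial j succeeds
        w = nxt
    return w


def poisson_pmf(lam, kmax):
    """Poisson probabilities for k = 0..kmax via v_k = v_{k-1} * lam / k."""
    v = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        v.append(v[-1] * lam / k)
    return v


def total_variation(p, extra=200):
    """d(W, Z) = (1/2) * sum_k |P{W=k} - P{Z=k}|, Poisson tail cut at n + extra."""
    lam = sum(p)
    w = poisson_binomial_pmf(p)
    v = poisson_pmf(lam, len(p) + extra)
    head = sum(abs(wk - vk) for wk, vk in zip(w, v))  # k = 0..n
    tail = sum(v[len(w):])                            # k > n, where P{W=k} = 0
    return 0.5 * (head + tail)


def barbour_hall_bounds(p):
    """Lower and upper bounds on d(W, Z) in the Barbour-Hall form."""
    lam = sum(p)
    lam2 = sum(pj * pj for pj in p)
    lower = min(1.0, 1.0 / lam) * lam2 / 32.0
    upper = (1.0 - math.exp(-lam)) / lam * lam2
    return lower, upper


if __name__ == "__main__":
    p = [0.05 + 0.01 * j for j in range(20)]  # hypothetical success probabilities
    print(barbour_hall_bounds(p), total_variation(p))
```

For a single trial ($n = 1$) the computed distance coincides with the upper bound, in agreement with the sharpness remark above.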
Theorem 1.1 refined many previous results in this direction, starting from bounds for the i.i.d. case by Prokhorov [17] and bounds for the general case by Le Cam [14]. In particular, Le Cam obtained the upper bound
[TABLE]
For large $\lambda$, Kerstan [12] and Chen [4], respectively, improved these bounds to
[TABLE]
See also [10], [23], [21], [18], [19], [2] and the references therein. A certain refinement of the lower bound in (1.2) was obtained in Sason [20].
While (1.2) provides a sharp estimate for the total variation distance, one may wonder whether or not similar approximation bounds hold for the stronger informational distances. As a first interesting example, one may consider the relative entropy
$$D(W\|Z) = \sum_{k=0}^{\infty} \mathbb{P}\{W = k\}\, \log \frac{\mathbb{P}\{W = k\}}{\mathbb{P}\{Z = k\}},$$
often called the Kullback-Leibler distance, or an informational divergence of the distribution of $W$ from the distribution of $Z$. It dominates the total variation distance $d(W,Z)$ in view of the Pinsker inequality $d(W,Z)^2 \leq \frac{1}{2}\, D(W\|Z)$. In this context, lower and upper bounds for the relative entropy were studied by Harremoës [6], [7], and Harremoës and Ruzankin [9]. In particular, in the i.i.d. case, it was shown in [9] that
[TABLE]
If with a fixed (or just bounded) value of , these estimates provide the rate of Poisson approximation
[TABLE]
The general non-i.i.d. scenario (with not necessarily equal probabilities ) has been partially studied as well. A simple upper estimate , analogous to Le Cam’s bound (1.3), may be found in [6], cf. also Johnson [11]. It is however not so sharp as (1.4). A tighter upper bound
[TABLE]
was later derived by Kontoyiannis, Harremoës and Johnson [13]. If with , it yields reflecting a correct decay with respect to up to a constant, according to (1.4). Nevertheless, in the general case, Pinsker’s inequality and the bounds (1.2)-(1.3) suggest that a further sharpening such as
[TABLE]
might be possible by involving rather than the functional . To compare the two quantities, note that, by Cauchy’s inequality, . Hence, the inequality (1.6) would be sharper compared to (1.5) modulo a -dependent factor. An upper bound such as (1.6) may also be inspired by the lower bound
[TABLE]
recently derived by Harremoës, Johnson and Kontoyiannis [8]. It is consistent with (1.4) and also shows that the constant is best possible.
As it turns out, (1.6) does hold in the so-called non-degenerate situation, and in essence, (1.7) may be reversed (we say that the range of is non-degenerate, if with , or if , and implicitly mean that the resulting inequalities may contain or as fixed parameters). Moreover, one can further sharpen (1.6) by replacing the relative entropy with the Pearson $\chi^2$-distance, as well as with other Rényi/Tsallis distances. To avoid technical complications, let us restrict ourselves to the $\chi^2$-divergence which is given by
$$\chi^2(W,Z) = \sum_{k=0}^{\infty} \frac{\big(\mathbb{P}\{W = k\} - \mathbb{P}\{Z = k\}\big)^2}{\mathbb{P}\{Z = k\}}.$$
It is a divergence type quantity which dominates the relative entropy: $D(W\|Z) \leq \chi^2(W,Z)$. For a general theory of informational distances, we refer interested readers to the recent review by van Erven and Harremoës [5]; additional material may be found in the books [15], [16], [22], [11]. Here, we reverse the bound (1.7) and prove:
Theorem 1.2. If , then with some absolute constant we have
[TABLE]
The condition is readily fulfilled as long as all (note that, if , then necessarily and then ). Similar bounds as in (1.8) remain to hold under the weaker assumption with a constant depending on , cf. Proposition 6.2 below. This assumption may actually be replaced with the requirement that is bounded. More precisely, in the second part of the paper it will be shown that without any restriction, up to some universal factors, we have
[TABLE]
where
[TABLE]
This shows that in general the bound (1.7) cannot be reversed.
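The relations between the three distances discussed in this section can be checked numerically. The sketch below is only an illustration under stated assumptions: it writes $\lambda = \sum_j p_j$ and $\lambda_2 = \sum_j p_j^2$, computes the total variation distance, the relative entropy and the $\chi^2$-divergence between the law of the number of successes and the Poisson law with the same mean, and reports the relative entropy normalized by $(\lambda_2/\lambda)^2$. The function names and the sample probabilities are hypothetical, and the Poisson tail is truncated at `n + extra`.

```python
import math


def poisson_binomial_pmf(p):
    """Exact pmf of the number of successes in independent Bernoulli(p_j) trials."""
    w = [1.0]
    for pj in p:
        nxt = [0.0] * (len(w) + 1)
        for k, wk in enumerate(w):
            nxt[k] += wk * (1.0 - pj)
            nxt[k + 1] += wk * pj
        w = nxt
    return w


def poisson_pmf(lam, kmax):
    """Poisson probabilities for k = 0..kmax."""
    v = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        v.append(v[-1] * lam / k)
    return v


def informational_distances(p, extra=200):
    """Total variation, relative entropy and chi-square distance to Poisson(lam)."""
    n = len(p)
    lam = sum(p)
    lam2 = sum(pj * pj for pj in p)
    w = poisson_binomial_pmf(p)
    v = poisson_pmf(lam, n + extra)
    tail = sum(v[n + 1:])  # Poisson mass on k > n, where P{W = k} = 0
    tv = 0.5 * (sum(abs(w[k] - v[k]) for k in range(n + 1)) + tail)
    kl = sum(w[k] * math.log(w[k] / v[k]) for k in range(n + 1) if w[k] > 0)
    chi2 = sum((w[k] - v[k]) ** 2 / v[k] for k in range(n + 1)) + tail
    return {
        "tv": tv,
        "kl": kl,
        "chi2": chi2,
        "kl_over_ratio_sq": kl / (lam2 / lam) ** 2,
    }


if __name__ == "__main__":
    # Hypothetical i.i.d. example with moderate n, so no Poisson term underflows.
    for name, value in informational_distances([0.1] * 30).items():
        print(name, value)
```

The sketch divides by the Poisson probabilities for $k \leq n$, so it is meant for moderate $n$ only; for very large $n$ these probabilities may underflow and a log-domain computation would be needed.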
For the study of the asymptotic behavior of and in terms of and , we derive new bounds for the difference between densities of and , that is, for
[TABLE]
To this aim, one has to consider different zones of ’s, distinguishing between “small” and “large” values. The case can be handled directly leading to the non-uniform density bound
[TABLE]
It easily yields sharp upper bounds for all above distances as in Theorems 1.1-1.2 in the case of small , at least up to numerical factors (cf. Proposition 3.3 and 3.4). To treat larger values of , a more sophisticated analysis in the complex plane is involved – using the closeness of the generating functions associated with the sequences and . In particular, the following statement may be of independent interest.
Theorem 1.3. For all integer ,
[TABLE]
Moreover, putting , , we have
[TABLE]
Let us clarify the meaning of the last bound, assuming that with some constant . If and , then with some , it gives
[TABLE]
while for , we also have
[TABLE]
Since is of order at most on a sufficiently large part of measured by , these non-uniform bounds explain the possibility of upper bounds in Theorem 1.2.
The paper is organized as follows. First we describe several general bounds on the probability function of the Poisson law (Section 2). In Section 3, we consider the deviations and prove Theorem 1.2 in case . Sections 4-5 are devoted to non-uniform bounds and the proof of Theorem 1.3, which is used to complete the proof of Theorem 1.2 for . Uniform bounds for large are discussed in Section 7. There we shall demonstrate that in a typical situation, when the ratio is small, the Poisson approximation considerably improves the rate of normal approximation described by the Berry-Esseen bound in the central limit theorem.
2. Gaussian Type Bounds on Poisson Probabilities
When bounding the Poisson probabilities
[TABLE]
with a fixed parameter , it is convenient to use the well-known Stirling-type two-sided bound:
[TABLE]
In particular, it implies the following Gaussian type estimates.
Lemma 2.1. For all ,
[TABLE]
Moreover, if , then
[TABLE]
Here, the lower bound may be improved in the region as
[TABLE]
Proof. Applying the lower estimate in (2.1), we get
[TABLE]
where
[TABLE]
The function is concave on the half-axis , with . Hence, for all , thus proving the first assertion (2.2).
Assuming that (with ), we necessarily have . In this interval, consider the function with parameter . The second derivative
[TABLE]
is vanishing at the point , while . This means that is concave on and convex on . Since also , we have for all , if and only if this inequality is fulfilled at . But , so the optimal value is . Hence, , and we arrive at the upper bound in (2.3).
Similarly, applying the upper estimate in (2.1), we get
[TABLE]
Choosing , consider the function in the interval . Since , it is concave on and is convex on . Since and , this means that is the point of minimum of . Therefore, , that is, for all , giving the lower bound in (2.3).
Finally, to get the refinement (2.4) in the region , consider the function for . Since and , this function is increasing. Therefore, , that is, for all . ∎
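The Stirling-type step behind such Gaussian type estimates can be checked numerically. Since the exact form of the two-sided bound used in this section is not reproduced here, the sketch below assumes Robbins' classical version of Stirling's formula, $\sqrt{2\pi k}\,(k/e)^k e^{1/(12k+1)} < k! < \sqrt{2\pi k}\,(k/e)^k e^{1/(12k)}$ for $k \geq 1$, which yields two-sided bounds on the logarithm of the Poisson probability of the value $k$, together with the upper estimate $(2\pi k)^{-1/2} \exp\{-\lambda\, h(k/\lambda)\}$, $h(x) = x \log x - x + 1$. All function names are hypothetical.

```python
import math


def log_poisson_pmf(k, lam):
    """log of lam^k e^{-lam} / k!, computed with the log-gamma function."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def stirling_log_bounds(k, lam):
    """Lower and upper bounds on log_poisson_pmf(k, lam) for k >= 1.

    Uses Robbins' bounds on k!; the upper bound on k! gives the lower
    bound on the probability and vice versa.
    """
    base = (k * math.log(lam) - lam
            - (0.5 * math.log(2 * math.pi * k) + k * math.log(k) - k))
    return base - 1.0 / (12 * k), base - 1.0 / (12 * k + 1)


def gaussian_type_upper(k, lam):
    """(2 pi k)^(-1/2) * exp(-lam * h(k / lam)) with h(x) = x log x - x + 1."""
    x = k / lam
    h = x * math.log(x) - x + 1.0
    return math.exp(-lam * h) / math.sqrt(2 * math.pi * k)


if __name__ == "__main__":
    lam = 3.0  # hypothetical Poisson parameter
    for k in (1, 3, 10):
        lo, hi = stirling_log_bounds(k, lam)
        print(k, lo, log_poisson_pmf(k, lam), hi)
```

Since $h(x) \geq 0$ with $h(1) = 0$ and $h''(1) = 1$, the exponent behaves like $-(k - \lambda)^2/(2\lambda)$ for $k$ close to $\lambda$, which is the Gaussian shape referred to in the title of this section.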
3. Elementary Upper Bounds
We keep the same notations as before; in particular,
[TABLE]
while
[TABLE]
with summation over all 0-1 sequences such that . Clearly, for . To eliminate this condition, one may always assume that is arbitrary, by extending the sequence to in case with . This does not change the value of .
First, let us consider the probability that equals .
Lemma 3.1. If , then
[TABLE]
Proof. Expanding the function near zero according to the Taylor formula as in the previous section, write
[TABLE]
Using for , we have
[TABLE]
Hence
[TABLE]
∎
Note that the condition of Lemma 3.1 is fulfilled automatically, if . In that case, the upper bounds of the lemma may easily be reversed up to numerical factors, for example, in the form
[TABLE]
Moreover, if , then also
[TABLE]
Here, the value turns out to be most essential for obtaining lower bounds, since it immediately yields and with some absolute constant .
Returning to upper bounds, recall the notation . In order to involve the values , we need the following:
Lemma 3.2. If , then
[TABLE]
Moreover, for any ,
[TABLE]
Proof. Denote by the collection of all tuples with integer components such , and let . Representing the Poisson random variable as with independent summands , we have that, for any ,
[TABLE]
Hence, we may start with the formula
[TABLE]
where
[TABLE]
For a 0-1 sequence , put
[TABLE]
By the Taylor formula once more,
[TABLE]
Similarly to (3.1)-(3.2), we have
[TABLE]
Therefore,
[TABLE]
Moreover, since , we have , which in turn implies . The two bounds give so that
[TABLE]
Next, applying the multinomial formula, we have
[TABLE]
and
[TABLE]
Thus,
[TABLE]
The remaining terms participating in correspond to the tuples with , which is only possible for . In that case, restricting for definiteness to the constraint , we have
[TABLE]
Similarly, for any ,
[TABLE]
and summing over , we then get
[TABLE]
It remains to combine this bound with the bound (3.5) and apply both in (3.4). Then we finally obtain that
[TABLE]
If , then , and we arrive at the first inequality in (3.3). In the case , one may use , and then we arrive at the second inequality of the lemma. ∎
Note that when , we also have , and then (3.3) may be replaced with a slightly better bound
[TABLE]
Combining Lemmas 3.1–3.2 (cf. (3.6)), we thus obtain the following non-uniform bound on the deviations of .
Proposition 3.3. If , then, for all ,
[TABLE]
The estimates obtained so far are sufficient to establish Theorem 1.2 in the case . In fact, one may weaken the latter condition to , as shown in the next statement. To compare the lower and upper bounds, we recall the lower bound (1.7) of Harremoës, Johnson and Kontoyiannis [8].
Proposition 3.4. If , then
[TABLE]
where depends on as an increasing continuous function with . In particular, if , then
[TABLE]
Proof. Applying Lemmas 3.1-3.2, we get
[TABLE]
where . Expanding the squares of the brackets in this sum results in
[TABLE]
[TABLE]
which is the same as
[TABLE]
Multiplying by , this gives the desired inequality
[TABLE]
with
[TABLE]
It is easy to check that , so that this function is increasing in , with .
For the range , the term appearing in the definition of may be replaced with (according to the inequality (3.7)), which leads to the constant . ∎
4. Generating functions
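The probability function $f(k) = \mathbb{P}\{Z = k\}$ of the Poisson random variable $Z$ satisfies the equation $\lambda f(k-1) = k f(k)$ in integers $k \geq 1$, which immediately implies
$$\mathbb{E}\, \lambda h(Z+1) = \mathbb{E}\, Z h(Z)$$
for any function $h$ on the non-negative integers (as long as the expectations exist). This identity was emphasized by Chen [4] who proposed to consider an approximate equality
$$\mathbb{E}\, \lambda h(W+1) \approx \mathbb{E}\, W h(W)$$
as a characterization of a random variable $W$ being almost Poisson with parameter $\lambda$. This idea was inspired by a similar approach of Charles Stein to problems of normal approximation on the basis of the approximate equality $\mathbb{E}\, \xi h(\xi) \approx \mathbb{E}\, h'(\xi)$ for a random variable $\xi$ which is almost standard normal.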
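The characterizing identity can be illustrated numerically. The following sketch evaluates the gap between the two sides, namely the expectation of $\lambda h(X+1) - X h(X)$, for a Poisson variable and for a sum of independent Bernoulli variables with the same mean. The test function $h = \sin$, the probabilities and the function names are assumptions chosen for illustration. For independent summands, a short direct computation (not taken from the text) shows that the gap equals $\sum_j p_j^2\, \mathbb{E}\,[h(W_j + 2) - h(W_j + 1)]$ with $W_j$ denoting the sum without the $j$-th term, so it is at most $\sum_j p_j^2$ times the largest increment of $h$.

```python
import math


def poisson_binomial_pmf(p):
    """Exact pmf of a sum of independent Bernoulli(p_j) variables."""
    w = [1.0]
    for pj in p:
        nxt = [0.0] * (len(w) + 1)
        for k, wk in enumerate(w):
            nxt[k] += wk * (1.0 - pj)
            nxt[k + 1] += wk * pj
        w = nxt
    return w


def poisson_pmf(lam, kmax):
    """Poisson probabilities for k = 0..kmax (the tail beyond kmax is dropped)."""
    v = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        v.append(v[-1] * lam / k)
    return v


def stein_chen_gap(pmf, lam, h):
    """E[lam * h(X + 1)] - E[X * h(X)] for X distributed as pmf on 0..len(pmf)-1."""
    return sum(pk * (lam * h(k + 1) - k * h(k)) for k, pk in enumerate(pmf))


if __name__ == "__main__":
    p = [0.2, 0.1, 0.3, 0.05]  # hypothetical success probabilities
    lam = sum(p)
    print("Poisson gap:", stein_chen_gap(poisson_pmf(lam, 200), lam, math.sin))
    print("Bernoulli sum gap:", stein_chen_gap(poisson_binomial_pmf(p), lam, math.sin))
```

A bounded test function is used so that truncating the Poisson series at a finite index has a negligible effect on the first gap.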
Another natural approach to the Poisson approximation is based on the comparison of characteristic functions. Since the random variables and take non-negative integer values, one may equivalently consider the associated generating functions.
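The generating function for the Poisson law with parameter $\lambda = p_1 + \dots + p_n$ is given by
$$\mathbb{E}\, z^Z = \sum_{k=0}^{\infty} \mathbb{P}\{Z = k\}\, z^k = e^{\lambda (z-1)} = \prod_{j=1}^n e^{p_j (z-1)}, \tag{4.1}$$
which is an entire function of the complex variable $z$. Correspondingly, the generating function for the distribution of the random variable $W$ in (1.1) is
$$\mathbb{E}\, z^W = \sum_{k=0}^{n} \mathbb{P}\{W = k\}\, z^k = \prod_{j=1}^n (q_j + p_j z), \qquad q_j = 1 - p_j, \tag{4.2}$$
which is a polynomial of degree $n$. Hence, the difference between the involved probabilities may be expressed via the contour integrals by the Cauchy formula
$$\mathbb{P}\{W = k\} - \mathbb{P}\{Z = k\} = \int_{|z| = r} z^{-k}\, \big(\mathbb{E}\, z^W - \mathbb{E}\, z^Z\big)\, d\mu_r(z), \tag{4.3}$$
where $\mu_r$ is the uniform probability measure on the circle $|z| = r$ of an arbitrary radius $r > 0$.
Note that for $z = e^{it}$ with real $t$, the generating functions $\mathbb{E}\, z^W$ and $\mathbb{E}\, z^Z$ become the characteristic functions of $W$ and $Z$, respectively. Hence, closeness of the distributions of these random variables may be studied as a problem of the closeness of the generating functions on the unit circle.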
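The contour representation can be tested numerically by replacing the integral over the circle with an average over equally spaced points. The sketch below assumes the standard generating functions $\prod_j (1 - p_j + p_j z)$ for the sum of Bernoulli variables and $e^{\lambda (z-1)}$ for the Poisson law, and recovers the difference of the two probability functions at a given integer $k$ for several radii. The function names, the number of nodes and the sample probabilities are hypothetical.

```python
import cmath
import math


def poisson_binomial_pmf(p):
    """Exact pmf of a sum of independent Bernoulli(p_j) variables."""
    w = [1.0]
    for pj in p:
        nxt = [0.0] * (len(w) + 1)
        for k, wk in enumerate(w):
            nxt[k] += wk * (1.0 - pj)
            nxt[k + 1] += wk * pj
        w = nxt
    return w


def poisson_pmf(lam, kmax):
    """Poisson probabilities for k = 0..kmax."""
    v = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        v.append(v[-1] * lam / k)
    return v


def cauchy_difference(k, p, r=1.0, num_points=256):
    """Average of z^{-k} (g_W(z) - g_Z(z)) over num_points nodes on |z| = r."""
    lam = sum(p)
    total = 0j
    for m in range(num_points):
        z = r * cmath.exp(2j * math.pi * m / num_points)
        g_w = 1 + 0j
        for pj in p:
            g_w *= (1.0 - pj) + pj * z        # polynomial of degree n
        g_z = cmath.exp(lam * (z - 1.0))      # entire function
        total += (g_w - g_z) * z ** (-k)
    return (total / num_points).real


if __name__ == "__main__":
    p = [0.3, 0.2, 0.4, 0.1, 0.25]  # hypothetical success probabilities
    w = poisson_binomial_pmf(p) + [0.0] * 10
    v = poisson_pmf(sum(p), 15)
    for k in range(4):
        print(k, w[k] - v[k], cauchy_difference(k, p, r=2.0))
```

The discrete average is exact for the polynomial part as soon as the number of nodes exceeds $n$; for the Poisson part it introduces an aliasing error which is negligible for the parameters used here. The freedom in the choice of the radius is what the later sections exploit.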
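Let us now describe first steps based on the application of the formula (4.3). Given complex numbers $a_j, b_j$ ($1 \leq j \leq n$), we have an identity
$$a_1 \cdots a_n - b_1 \cdots b_n = \sum_{j=1}^n (a_j - b_j) \prod_{l < j} a_l \prod_{l > j} b_l \tag{4.4}$$
with the convention that $\prod_{l < j} a_l = 1$ for $j = 1$ and $\prod_{l > j} b_l = 1$ for $j = n$. It implies
$$|a_1 \cdots a_n - b_1 \cdots b_n| \leq \sum_{j=1}^n |a_j - b_j| \prod_{l < j} |a_l| \prod_{l > j} |b_l|. \tag{4.5}$$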
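The product identity used at this step is the standard telescoping decomposition of a difference of two products. A minimal sketch, assuming the ordering in which the factors $a_l$ with $l < j$ precede and the factors $b_l$ with $l > j$ follow the difference $a_j - b_j$ (the reverse ordering is equally valid), verifies the identity and the resulting inequality on random complex numbers. The helper names and the random inputs are hypothetical.

```python
import random


def product(values):
    """Product of a sequence; the empty product equals 1 by convention."""
    out = 1
    for v in values:
        out *= v
    return out


def telescoping_sum(a, b):
    """sum_j (a_j - b_j) * prod_{l<j} a_l * prod_{l>j} b_l."""
    return sum((a[j] - b[j]) * product(a[:j]) * product(b[j + 1:])
               for j in range(len(a)))


def telescoping_bound(a, b):
    """sum_j |a_j - b_j| * prod_{l<j} |a_l| * prod_{l>j} |b_l|."""
    return sum(abs(a[j] - b[j])
               * product(abs(x) for x in a[:j])
               * product(abs(x) for x in b[j + 1:])
               for j in range(len(a)))


if __name__ == "__main__":
    rng = random.Random(0)
    a = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
    b = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
    print(product(a) - product(b), telescoping_sum(a, b))
```

In the application, one factor sequence comes from the generating function of the Bernoulli sum and the other from the product representation of the Poisson generating function, so each term of the sum isolates the contribution of a single summand.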
According to the product representations (4.1)-(4.2) to be used in (4.3), one should choose here and with . Then
[TABLE]
Therefore
[TABLE]
To estimate the terms in this sum, consider the function
[TABLE]
of the complex variable , where the Taylor integral formula is applied in the second representation. If , then so,
[TABLE]
In particular, for with , we have
[TABLE]
hence , and (4.6) yields
[TABLE]
Integrating over the unit circle in (4.3), we then arrive at the uniform bound:
Proposition 4.1. We have
[TABLE]
This is a weakened variant of Le Cam’s bound , specialized to the one-point set . In order to get a similar bound with arbitrary sets, or develop applications to stronger distances, we need sharper forms of (4.9), with the right-hand side properly depending on .
5. Proof of Theorem 1.3
Applying (4.4) with and in (4.3), one may write this formula as
[TABLE]
with
[TABLE]
where the integration is performed over the uniform probability measure on the circle . Let us write , , and estimate by inserting the absolute value sign inside the integral. Then, using (4.5), we get
[TABLE]
Here, in order to estimate , let us return to the function introduced in (4.7), which we need at the values with .
Case 1: . Since , we have, for any ,
[TABLE]
so, by (4.7),
[TABLE]
Case 2: . Then , so, by (4.8),
[TABLE]
Since , we therefore obtain from (5.2) that
[TABLE]
where
[TABLE]
and
[TABLE]
In order to estimate the last integrals, which we need with and , let us first note that
[TABLE]
Hence, using (), we have
[TABLE]
so that
[TABLE]
Here we applied the inequalities () and used the notation
[TABLE]
Thus, we need to bound from below. If , then , so
[TABLE]
This gives
[TABLE]
In case , we use , implying that
[TABLE]
Therefore in this range we have a similar lower bound, namely
[TABLE]
Since , both lower bounds yield
[TABLE]
As a result, (5.5) is simplified to
[TABLE]
The last integral may be extended to the whole real line, which makes sense for large values of , or one may bound the exponential term in the integrand by 1, which makes sense for small values of . These two ways of estimation lead to
[TABLE]
where is a standard normal random variable. In particular, we get the upper bounds
[TABLE]
In view of , from the definition of we also have the bound
[TABLE]
in case , while for
[TABLE]
Applying these bounds in (5.3), we therefore obtain that may be bounded from above by
[TABLE]
where in case and for . Summing over and recalling (5.1), one can estimate from above by
[TABLE]
Now, letting in the case , (5.6) leads to
[TABLE]
and we obtain the first inequality in (1.9). Letting in the case , (5.6) gives
[TABLE]
which is the second inequality in (1.9).
But, if , one may also use (5.6) with and apply the bound , cf. (2.1), giving
[TABLE]
To simplify the numerical constants, note that and . Recalling that for , we finally get the second inequality (1.10),
[TABLE]
∎
6. Consequences of Theorem 1.3
Under the natural requirement that is bounded away from , the bound (5.7) on may be simplified. As before, we use the notations
[TABLE]
Note that and recall that .
Corollary 6.1. If , , then for any integer ,
[TABLE]
In particular, if , then
[TABLE]
If , we also have
[TABLE]
Proof. The assumption ensures that .
If (), then and , so, the right-hand side of (5.7) is bounded from above by
[TABLE]
Choosing , this expression does not exceed the right-hand side of (6.1). Thus, the inequality (1.10) yields (6.1), which in turn immediately implies (6.2).
In case , we apply the inequality (1.9). Since for , the right-hand side of (1.10) is dominated by the right-hand side of (6.1). Thus, we obtain (6.1) without any constraints on , and (6.2) for all .
In case , necessarily . Hence, the right-hand side of (5.7) may be bounded from above by
[TABLE]
Using to bound the first term in the brackets and to bound the second term (using ), we obtain the bound (6.3). ∎
We are now prepared to extend Proposition 3.4 to larger values of under the assumption that is bounded away from 1. The next assertion, being combined with Proposition 3.4, yields Theorem 1.2 with in case and in case .
Proposition 6.2. If and with , then
[TABLE]
where with, for example, .
Proof. The leftmost lower bound in (6.4) is added according to (1.7) (using the Pinsker inequality, it also follows with some constant from Barbour-Hall’s lower bound in Theorem 1.1). Hence, it remains to show the rightmost upper bound in (6.4). Write
[TABLE]
In the range , we apply the inequality (6.2) which gives
[TABLE]
Hence
[TABLE]
In the sequel, we use a simple moment inequality . We also have and , so that
[TABLE]
with (where we used the assumption on the last step).
In order to estimate , we use the following elementary bound
[TABLE]
which holds for any as long as . For the proof, write
[TABLE]
where
[TABLE]
Since the function is decreasing in , we have This gives
[TABLE]
that is, (6.6). In particular, for and (with ),
[TABLE]
So, by (6.6), and using for the chosen range of , we have
[TABLE]
Hence, by (6.3),
[TABLE]
with . Asymptotically with respect to large , this bound is much better than (6.4). Applying as in (2.5) with and using , we have
[TABLE]
This gives
[TABLE]
As a result, we arrive at the desired upper bound in (6.4).
Finally, let us estimate for the range . Returning to (6.7), we have
[TABLE]
where , . Here
[TABLE]
with , , . All these three functions are convex, while is decreasing. In addition, for . Hence It follows that , and thus is the resulting constant in (6.4). ∎
Remark 6.3. Up to a numerical constant, the upper bound in (6.4) immediately implies an upper bound of Theorem 1.1 in case , in view of the relation . Indeed, (6.4) gives , provided that . But, in the other case , there is nothing to prove, since . Note also that, for , the correct upper bound on the total variation distance is of the form . It may be obtained as a consequence of Lemmas 3.1-3.2.
7. Uniform Bounds. Comparison with Normal Approximation
A different choice of the parameter in the proof of Theorem 1.3 may provide various uniform bounds in the Poisson approximation, like in the next assertion. Using the -norm with respect to the counting measure on , let us focus on the deviations of the densities of and and the deviations of their distribution functions. These distances are thus given by
[TABLE]
Putting in (5.6), we arrive at the next assertion which sharpens Proposition 4.1.
Theorem 7.1. We have
[TABLE]
This uniform bound is not new; with a non-explicit numerical factor, it corresponds to Theorem 3.1 in Cekanavicius [3], p. 53. For , this relation is simplified to
[TABLE]
which cannot be improved (modulo a numerical factor) in view of the lower bounds on with mentioned in Section 3. We also have a similar bound for the Kolmogorov distance, , which follows from the upper bound for the stronger total variation distance as in Theorem 1.1.
When, however, is large (and say all ), one would expect to achieve more accurate bounds when replacing the Poisson approximation for by the normal law with mean and variance . Indeed, suppose, for example, that , so that has a binomial distribution with parameters , while the approximating Poisson distribution has parameter with . Here (1.2) only yields , which means that there is no Poisson approximation with respect to the total variation! Nevertheless, the approximation is still meaningful in a weaker sense in terms of the Kolmogorov distance , as well as in terms of . In this case, both and are almost equal to , and the Berry-Esseen theorem provides a correct bound via the triangle inequality for . Since (which holds true for all probability distributions on ), we also have . Note that this inequality also follows from Theorem 7.1. Indeed, when , (7.1) is simplified to
[TABLE]
which yields a correct order for growing . Thus, the two approaches are equivalent for this particular (i.i.d.) example.
To realize whether or not the normal approximation is better or worse than the Poisson approximation in the general non-i.i.d. situation (that is, with different ’s), let us evaluate the corresponding Lyapunov ratio in the central limit theorem and apply the Berry-Esseen bound , where the random variable is distributed according to . Since , the Lyapunov ratio for the sequence is given by
[TABLE]
(note that ). Hence , up to some absolute constant . A similar bound holds for as well when representing as the sum of independent Poisson random variables with parameters . Namely, for the sequence , we have
[TABLE]
Therefore, and hence, by the triangle inequality, . In particular, in a typical situation where , the normal approximation yields
[TABLE]
with some absolute constant . But, this bound is surprisingly worse than (7.2) as long as .
Consider as an example for . Then , , and we get in (7.2), while (7.3) only yields . This example is also illustrative when comparing Theorem 1.2 with (1.5). The first one provides a correct asymptotic (within absolute factors), while (1.5) only gives .
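The comparison between the two approximations can be explored numerically. In the sketch below, the probabilities $p_j = 1/(2\sqrt{j})$, $1 \leq j \leq 400$, are a hypothetical choice made for illustration and are not taken from the example above. The normal law is assumed to have mean $\sum_j p_j$ and variance $\sum_j p_j (1 - p_j)$, and the Kolmogorov distance to the normal law takes into account the left and right limits of the discrete distribution function at each integer. All function names are assumptions.

```python
import math


def poisson_binomial_pmf(p):
    """Exact pmf of a sum of independent Bernoulli(p_j) variables."""
    w = [1.0]
    for pj in p:
        nxt = [0.0] * (len(w) + 1)
        for k, wk in enumerate(w):
            nxt[k] += wk * (1.0 - pj)
            nxt[k + 1] += wk * pj
        w = nxt
    return w


def poisson_pmf(lam, kmax):
    """Poisson probabilities for k = 0..kmax."""
    v = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        v.append(v[-1] * lam / k)
    return v


def kolmogorov_to_poisson(p, extra=200):
    """max_k |P{W <= k} - P{Z <= k}| with Z Poisson of the same mean."""
    w = poisson_binomial_pmf(p)
    v = poisson_pmf(sum(p), len(p) + extra)
    w = w + [0.0] * (len(v) - len(w))
    cdf_w = cdf_v = best = 0.0
    for wk, vk in zip(w, v):
        cdf_w += wk
        cdf_v += vk
        best = max(best, abs(cdf_w - cdf_v))
    return best


def kolmogorov_to_normal(p):
    """sup_x |P{W <= x} - Phi((x - mu) / sigma)| for the matching normal law."""
    mu = sum(p)
    sigma = math.sqrt(sum(pj * (1.0 - pj) for pj in p))
    left = best = 0.0
    for k, wk in enumerate(poisson_binomial_pmf(p)):
        phi = 0.5 * (1.0 + math.erf((k - mu) / (sigma * math.sqrt(2.0))))
        right = left + wk  # value of the distribution function at k
        best = max(best, abs(left - phi), abs(right - phi))
        left = right
    return best


if __name__ == "__main__":
    p = [0.5 / math.sqrt(j) for j in range(1, 401)]  # hypothetical p_j
    print("Poisson:", kolmogorov_to_poisson(p))
    print("normal: ", kolmogorov_to_normal(p))
```

Since the normal distribution function is continuous, its Kolmogorov distance to any integer-valued law is at least half of the largest atom, which for a sum with variance $\sigma^2$ is typically of order $1/\sigma$. This is one way to see why the Poisson approximation may win when $\lambda_2/\lambda$ is small.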
Acknowledgement. The authors would like to thank Igal Sason and two referees for valuable comments and for drawing our attention to additional references related to the Poisson approximation in informational distances.
References
[1] Barbour, A. D.; Hall, P. On the rate of Poisson convergence. Math. Proc. Cambridge Philos. Soc. 95 (1984), no. 3, 473–480.
[2] Barbour, A. D.; Holst, L.; Janson, S. Poisson approximation. Oxford Studies in Probability, 2. Oxford Science Publications. The Clarendon Press, Oxford University Press, New York, 1992. x+277 pp.
[3] Čekanavičius, V. Approximation methods in Probability Theory. Universitext. Springer (2016), 274 pp.
[4] Chen, L. H. Y. Poisson approximation for dependent trials. Ann. Probability 3 (1975), no. 3, 534–545.
[5] van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inform. Theory 60 (2014), no. 7, 3797–3820.
[6] Harremoës, P. Binomial and Poisson distributions as maximum entropy distributions. IEEE Trans. Inform. Theory 47 (2001), no. 5, 2039–2041.
[7] Harremoës, P. Convergence to Poisson distribution in information divergence. Preprint 2, Math. Department, University of Copenhagen, Feb. 2003.
[8] Harremoës, P.; Johnson, O.; Kontoyiannis, I. Thinning and information projections. arXiv:1601.04255, Jan. 2016.
