Non parametric estimation of joint, Renyi-Stallis entropies and mutual information and asymptotic limits
Amadou Diadie Ba, Gane Samb Lo, Cheikh Tidiane Seck

TL;DR
This paper introduces a new non-parametric method for estimating joint entropies and mutual information of discrete variables, with proven consistency and asymptotic properties validated through simulations.
Contribution
It presents a novel estimator for joint probability mass functions and entropy measures, along with theoretical guarantees and empirical validation.
Findings
Estimator is almost surely consistent
Central limit theorems are established for the estimators
Simulation results validate the theoretical properties
Abstract
This paper proposes a new method for estimating the joint probability mass function of a pair of discrete random variables. This estimator is used to construct estimates of the joint Shannon, Rényi and Tsallis entropies and of the mutual information of a pair of discrete random variables. Almost sure consistency and central limit theorems are established. Our theoretical results are validated by simulations.
Taxonomy
Topics: Statistical Mechanics and Entropy · Statistical Distribution Estimation and Applications · Complex Systems and Time Series Analysis
Amadou Diadie Ba*(1), Gane Samb Lo(1,2,3), Cheikh Tidiane Seck(4)*.
(1) LERSTAD, Université Gaston Berger, Sénégal,
(2) Associate Researcher, LASTA, Pierre et Marie University, Paris, FRANCE
(3) Associated Professor, African University of Sciences and Technology, Abuja, NIGERIA.
(4) Université Alioune Diop de Bambey, Sénégal.
Correspondence : Amadou Diadie Ba,
2010 Mathematics Subject Classifications : 94A17, 41A25, 62G05, 62G20, 62H12, 62H17.
Key Words and Phrases : Joint entropy estimation, Joint Rényi, Tsallis entropy, Mutual information estimation.
1. Introduction
1.1. Motivation
Let $X$ and $Y$ be two discrete random variables defined on the same probability space $(\Omega,\mathcal{A},\mathbb{P})$, with respective values $x_1,\dots,x_r$ and $y_1,\dots,y_s$ (with $r>1$ and $s>1$).
The information amount of (or content in) the outcome $(X=x_i,\,Y=y_j)$ is (see Carter (2014))
$$I(x_i,y_j) = -\log p_{i,j},$$
where $p_{i,j}=\mathbb{P}(X=x_i,\,Y=y_j)$.
The joint probability distribution of the events $(X=x_i,\,Y=y_j)$, coupled with the information amount of every event, $-\log p_{i,j}$, forms a random variable whose expected value is the average amount of information, or joint entropy (more specifically, joint Shannon entropy), generated by this joint distribution.
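Definition 1.
Let $X$ and $Y$ be two discrete random variables defined on a probability space $(\Omega,\mathcal{A},\mathbb{P})$, taking respective values in the finite spaces $\mathcal{X}=\{x_1,\dots,x_r\}$ and $\mathcal{Y}=\{y_1,\dots,y_s\}$ (with $r>1$ and $s>1$), and with joint probability mass function (p.m.f.) $(p_{i,j})$, that is,
$$p_{i,j}=\mathbb{P}(X=x_i,\,Y=y_j),\qquad (i,j)\in\{1,\dots,r\}\times\{1,\dots,s\}.$$
(a) The joint Shannon entropy (JSE) of the (ordered) pair of random variables $(X,Y)$ is given by
$$H(X,Y) = -\sum_{i=1}^{r}\sum_{j=1}^{s} p_{i,j}\log p_{i,j}. \tag{1.1}$$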
Entropy is usually measured in bits (binary information units, if the logarithm is taken in base $2$), nats (if the natural logarithm is used), or hartleys (if the logarithm is taken in base $10$), depending on the base of the logarithm used to define it.
For ease of computations and notation convenience, we use the natural logarithm, since logarithms of varying bases are related by a constant.
In what follows, and will (typically) denote the marginal distributions of the bivariate variable whose distribution is denoted by . Additionally, entropies will be considered as functions of p.m.f.’s, since they only take into account probabilities of observing specific events.
Note that, over all pairs of random variables that take on at most $rs$ values with positive probability, the ones with the largest entropy are those which are uniform on their ranges, and these random variables have entropy exactly $\log(rs)$, viz.
[TABLE]
Inspired by the study of $q$-deformed algebras and special functions, various generalizations have been investigated.
Most notably, Rényi (1960) proposed a one-parameter family of entropies extending Shannon entropy.
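(b) The joint Rényi entropy (JRE) of order $\alpha$ of the pair of random variables $(X,Y)$ is defined as
$$R_\alpha(X,Y) = \frac{1}{1-\alpha}\log\Big(\sum_{i=1}^{r}\sum_{j=1}^{s} p_{i,j}^{\alpha}\Big), \tag{1.2}$$
with $\alpha>0$, $\alpha\neq 1$, which, in particular, reduces to the joint Shannon entropy in the limit $\alpha\to 1$.
(c) Also, the joint Tsallis entropy (JTE) of the pair of random variables $(X,Y)$, defined by
$$T_\alpha(X,Y) = \frac{1}{\alpha-1}\Big(1-\sum_{i=1}^{r}\sum_{j=1}^{s} p_{i,j}^{\alpha}\Big), \tag{1.3}$$
has generated a large burst of research activities.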
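(d) The mutual information (MI) of the pair of random variables $(X,Y)$, defined by
$$I(X,Y) = \sum_{i=1}^{r}\sum_{j=1}^{s} p_{i,j}\log\frac{p_{i,j}}{p_{X,i}\,p_{Y,j}}, \tag{1.4}$$
represents the amount of information that $X$ reveals about $Y$ (or vice versa).
Here $p_{X,i}=\mathbb{P}(X=x_i)=\sum_{j=1}^{s}p_{i,j}$ and $p_{Y,j}=\mathbb{P}(Y=y_j)=\sum_{i=1}^{r}p_{i,j}$.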
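In what follows, $\alpha>0$ and $\alpha\neq 1$. An important relation between JRE, JTE and the joint power sum (JPS) is
$$R_\alpha(X,Y) = \frac{1}{1-\alpha}\log S_\alpha(X,Y) \tag{1.5}$$
$$T_\alpha(X,Y) = \frac{1}{\alpha-1}\big(1-S_\alpha(X,Y)\big) \tag{1.6}$$
where
$$S_\alpha(X,Y) = \sum_{i=1}^{r}\sum_{j=1}^{s} p_{i,j}^{\alpha}. \tag{1.7}$$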
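Mutual information is closely related to entropy by
$$I(X,Y) = H(X)+H(Y)-H(X,Y), \tag{1.8}$$
where
$$H(X) = -\sum_{i=1}^{r} p_{X,i}\log p_{X,i}, \qquad p_{X,i}=\mathbb{P}(X=x_i),$$
is the entropy of $X$, and similarly for $Y$.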
This form can also be used for a Venn-diagram, as shown in Figure 1.
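The quantities defined in this subsection can be checked numerically with a short sketch. It assumes the joint p.m.f. is given as a list of rows (one row per value of the first variable), uses the natural logarithm, and takes the usual Tsallis convention $(1-\sum p^{\alpha})/(\alpha-1)$; the function names are illustrative and do not come from the paper.

```python
import math

def joint_shannon(p):
    """Joint Shannon entropy (in nats) of a joint p.m.f. given as a list of rows."""
    return -sum(v * math.log(v) for row in p for v in row if v > 0)

def joint_power_sum(p, alpha):
    """Joint power sum: sum of p_ij ** alpha over all cells with positive mass."""
    return sum(v ** alpha for row in p for v in row if v > 0)

def joint_renyi(p, alpha):
    """Joint Renyi entropy of order alpha (alpha > 0, alpha != 1)."""
    return math.log(joint_power_sum(p, alpha)) / (1 - alpha)

def joint_tsallis(p, alpha):
    """Joint Tsallis entropy of order alpha (alpha > 0, alpha != 1)."""
    return (1 - joint_power_sum(p, alpha)) / (alpha - 1)

def marginal_entropy(q):
    """Shannon entropy (in nats) of a one-dimensional p.m.f."""
    return -sum(v * math.log(v) for v in q if v > 0)

def mutual_information(p):
    """Mutual information computed directly from the joint and marginal p.m.f.'s."""
    px = [sum(row) for row in p]          # row sums: marginal of the first variable
    py = [sum(col) for col in zip(*p)]    # column sums: marginal of the second variable
    return sum(v * math.log(v / (px[i] * py[j]))
               for i, row in enumerate(p)
               for j, v in enumerate(row) if v > 0)

# Hypothetical 2 x 2 joint p.m.f. used only for illustration.
p = [[0.4, 0.1], [0.2, 0.3]]
px = [sum(row) for row in p]
py = [sum(col) for col in zip(*p)]

# MI computed directly agrees with H(X) + H(Y) - H(X, Y).
mi_direct = mutual_information(p)
mi_from_entropies = marginal_entropy(px) + marginal_entropy(py) - joint_shannon(p)
print(mi_direct, mi_from_entropies)
```

Evaluating `joint_renyi` or `joint_tsallis` at an order close to 1 gives values close to `joint_shannon`, which is the limiting behaviour mentioned above.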
In this paper, our aim is to estimate directly the entropies defined above by using a plug-in approach. Relation (1.8) yields an estimator for MI by estimating $H(X)$, $H(Y)$ and $H(X,Y)$ separately and combining them. This corresponds to the $3H$-principle upon which a number of plug-in estimators are based (see Kraskov et al. (2004) for details on this principle).
In contrast, we propose in this paper a plug-in approach that is essentially based on the estimation of the joint probability distribution $(p_{i,j})$, from which we can calculate the marginal distributions of $X$ and $Y$, and then the quantities $H(X,Y)$, $R_\alpha(X,Y)$, $T_\alpha(X,Y)$ and $I(X,Y)$.
This approach is motivated by the fact that studying the joint probability distribution of the pair $(X,Y)$ of discrete random variables taking values, resp., in the finite sets $\mathcal{X}$ and $\mathcal{Y}$ is equivalent to studying the probability distribution of the $rs$ mutually exclusive possible values $(x_i,y_j)$ of $(X,Y)$. This allows us to transform the problem of estimating the joint discrete distribution of the pair $(X,Y)$ into the problem of estimating a simple distribution, say $\mathbf{p}_Z$, of a single discrete random variable $Z$ suitably defined. Given an i.i.d. sample of this latter random variable $Z$, we shall take, as an estimator of the law $\mathbf{p}_Z$, the associated empirical measure and plug it into formulas (1.1), (1.2), (1.3), and (1.4) to obtain estimates of the entropies concerned.
Before turning to the estimation of these entropies, let us highlight some of their important applications. The importance of information measures transcends information theory. Indeed, since shortly after their inception, a wide variety of experimental sciences have found significant applications for joint Shannon entropy, Rényi and Tsallis entropies, and mutual information. For example,
Finance: Philippatos and Wilson (1972);
Machine learning: Moon et al. (2017);
Biological sciences: Timme and Lapish (2018), Krishnaswamy et al. (2014);
Statistics: Liu et al. (2012), Lewi et al. (2006), Pál et al. (2010), Christensen (1997);
Sociology: Reshef et al. (2011);
Neuroscience: Rieke (1999), Schneidman et al. (2003).
Frequently, in those applications, the need arises to estimate information measures empirically: data are generated under an unknown probability law, and we would like to estimate these information measures from the data.
1.2. Previous work
Mutual information estimation from samples remains an active research problem (see Walters et al. (2009), Khan et al. (2007), and Sricharan et al. (2013), to cite a few).
Antos and Kontoyiannis (2001) defined an estimator for the mutual information of discrete random variables and and showed that,
[TABLE]
provided that .
Deemat (2013), using the histogram method and under appropriate assumptions on the tail behavior of the random variables, showed that the mutual information estimate is consistent in probability, that is, for any ,
[TABLE]
This result was also established by Gao et al. (2017a) using the Kraskov–Stögbauer–Grassberger (KSG) method, under regularity and smoothness conditions on, resp., the Radon-Nikodym derivatives of and and on the joint p.d.f., and with assumptions on the joint entropy .
Gao et al. (2017), using the Local Gaussian Density Estimation method, proved that the mutual information estimate is asymptotically unbiased, that is
[TABLE]
Using the nearest neighbors (K-NN) method, Gao et al. (2017) defined a novel estimator for the mutual information of a mixture of random variables . They proved that the proposed estimator is asymptotically unbiased, that is
[TABLE]
provided that and as
Furthermore, they proved that, if in addition as , then
[TABLE]
Goebel et al. (2005) established by Taylor approximation that, in case of independence of the two random variables and , then
[TABLE]
is a second-order approximation of the mutual information.
Then they deduced that if is small enough, ( bit) i.e. and are independent or weakly associated random variables and sufficiently large () then approximately follows a gamma distribution with parameters and .
In this case the mean and variance are given as
[TABLE]
Xianli et al. (2018) used a jackknife approach with kernels of equalized bandwidth to estimate the Shannon mutual information for a pair of discrete random variables and for mixed random variables (with neither purely continuous nor purely discrete distributions).
Beknazaryan et al. (2019) studied mutual information estimation for mixed pairs of random variables. They developed a kernel method to estimate the mutual information between the two random variables. The estimates satisfy a central limit theorem under some regularity conditions on the distributions.
1.3. Overview of the paper
The rest of the paper is organized as follows. In Section 2, we define the auxiliary random variable whose law is exactly the joint law of . In Section 3, we construct plug-in estimates of the joint p.m.f. of and estimates of JSE, JRE, JTE, and MI. Section 4 establishes consistency and asymptotic normality properties of the estimates. Section 5 is devoted to an independence test based on mutual information. In Section 6 we provide a simulation study to assess the performance of our estimators, and we finish with a conclusion in Section 7.
2. Construction of the random variable with law
Let and be two discrete random variables defined on the same probability space and taking the following values
[TABLE]
resp. ( and ).
In addition, let be a random variable defined on the same probability space and taking the following values:
[TABLE]
Denote .
Simple computations give that for any , we have and conversely for any we have
[TABLE]
where denotes the largest integer less than or equal to .
For any possible joint values of the ordered pair , we assign the single value of such that
[TABLE]
and conversely, for any possible value of , is assigned the single pair of values such that
[TABLE]
This means that for any , we have
[TABLE]
where and conversely, for any
[TABLE]
Table 1 illustrates the correspondence between and , for (.
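The one-to-one correspondence between pairs and single indices can be sketched as follows. The sketch assumes a row-major enumeration, $k = s(i-1)+j$ with $1\le i\le r$ and $1\le j\le s$, inverted with an integer part; the displayed formulas of the paper are not reproduced here, so its exact indexing may differ, and the function names are illustrative.

```python
def pair_to_index(i, j, s):
    """Map the pair (i, j), with 1 <= j <= s, to a single index k >= 1 (row-major)."""
    return s * (i - 1) + j

def index_to_pair(k, s):
    """Inverse map: recover (i, j) from k using an integer part (floor division)."""
    i = (k - 1) // s + 1
    j = k - s * (i - 1)
    return i, j

# Enumerate the correspondence for a hypothetical case r = 2, s = 3.
r, s = 2, 3
table = {(i, j): pair_to_index(i, j, s)
         for i in range(1, r + 1) for j in range(1, s + 1)}
for pair, k in sorted(table.items(), key=lambda item: item[1]):
    print(k, pair)
```

Under this assumed enumeration every index in $\{1,\dots,rs\}$ is hit exactly once, which is what makes the law of the single variable equivalent to the joint law of the pair.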
From there, the marginal p.m.f.’s are expressed from the p.m.f. of the random variable by
[TABLE]
Finally, JSE, JRE, JTE and MI are expressed simply in terms of through (2.4), that is
[TABLE]
where .
We now make the following remark:
For most univariate or multivariate entropies, computational problems may arise. So, without loss of generality, suppose
[TABLE]
If Assumption (2.7) holds, we do not have to worry about summation problems. This explains why Assumption (2.7) is systematically used in a great number of works on this topic, for example, in Hall (1987), Singh and Poczos (2014), Krishnamurthy et al. (2014), and recently Ba et al. (2019), to cite a few.
3. Estimation
In this section, we construct an estimate of the p.m.f. from i.i.d. random variables distributed according to , give some essential results needed in the sequel, and finally construct the plug-in estimates of the entropies cited above.
Let be i.i.d. random variables from and distributed according to .
Here, it is worth noting that, in the sequel, , with and integers strictly greater than $1$. This means that cannot be a prime number, so that (2.5) holds.
For a given , define the easiest and most objective estimator of , based on the i.i.d sample by
[TABLE]
where for a fixed .
This means that, for a given , an estimate of based on the i.i.d sample according to is given by
[TABLE]
where for fixed .
From (2.6), estimates of the marginal p.m.f.’s and are
[TABLE]
with
[TABLE]
In the following, we use or interchangeably, since they are equal in view of (2.4) and (2.5), and we denote
[TABLE]
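A minimal sketch of the empirical estimator and of the derived marginal estimates is given below. It assumes the single variable takes the values $1,\dots,rs$ and that the pair $(i,j)$ corresponds to $k=s(i-1)+j$ (an assumed enumeration, as the paper's displayed formulas are not reproduced here); `empirical_pmf` and `marginals` are illustrative names.

```python
from collections import Counter

def empirical_pmf(sample, m):
    """Relative frequency of each value 1..m in the sample (empirical p.m.f.)."""
    n = len(sample)
    counts = Counter(sample)
    return [counts.get(k, 0) / n for k in range(1, m + 1)]

def marginals(pz, r, s):
    """Marginal p.m.f.'s of the pair from the p.m.f. of the single variable.

    Assumes the 0-based position s * i + j in pz holds the mass of the pair
    (i + 1, j + 1), i.e. the row-major enumeration k = s * (i - 1) + j.
    """
    px = [sum(pz[s * i + j] for j in range(s)) for i in range(r)]
    py = [sum(pz[s * i + j] for i in range(r)) for j in range(s)]
    return px, py

# Hypothetical sample of size 8 with r = s = 2, so values lie in {1, 2, 3, 4}.
sample = [1, 1, 2, 3, 4, 4, 4, 1]
pz_hat = empirical_pmf(sample, 4)
px_hat, py_hat = marginals(pz_hat, 2, 2)
print(pz_hat, px_hat, py_hat)
```

Because the marginal estimates are sums of cell frequencies, they automatically sum to one and are consistent whenever the cell frequencies are.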
Before going further, let us give some results concerning the empirical estimator (3.1).
For a given , this empirical estimator is strongly consistent and asymptotically normal. Precisely, for a fixed , when tends to infinity,
[TABLE]
where .
These asymptotic properties derive from the law of large numbers and central limit theorem.
Here and in the following, means the almost sure convergence, , the convergence in distribution, and , means equality in distribution.
Recall that, since for a fixed has a binomial distribution with parameters and success probability , we have
[TABLE]
Denote
[TABLE]
where
By the asymptotic Gaussian limit of the multinomial law (see for example Lo (2016), Chapter 1, Section 4), we have
[TABLE]
where and is the covariance matrix whose elements are:
[TABLE]
By denoting then, we have
[TABLE]
As a consequence, JSE, JRE, and JTE are estimated from the sample by their plug-in counterparts, meaning that we simply insert the consistent p.m.f. estimator computed from (3.1) into the JSE, JRE, and JTE expressions, viz.:
[TABLE]
where and , and are given resp. by (3.2), and (3.3).
In addition, define the JPS estimate
[TABLE]
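The plug-in construction can be sketched end to end from a sample of observed pairs. The sketch assumes natural logarithms and the standard forms of the Rényi and Tsallis entropies expressed through the joint power sum; cells never observed contribute zero. The function name and the dictionary keys are illustrative, not the paper's notation.

```python
import math
from collections import Counter, defaultdict

def plug_in_estimates(pairs, alpha):
    """Plug-in estimates of JSE, JPS, JRE, JTE and MI from a list of (x, y) pairs."""
    n = len(pairs)
    # Empirical joint p.m.f.: relative frequency of each observed pair.
    joint = {xy: c / n for xy, c in Counter(pairs).items()}
    # Empirical marginals obtained by summing the joint frequencies.
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    jse = -sum(p * math.log(p) for p in joint.values())
    jps = sum(p ** alpha for p in joint.values())
    jre = math.log(jps) / (1 - alpha)
    jte = (1 - jps) / (alpha - 1)
    mi = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())
    return {"jse": jse, "jps": jps, "jre": jre, "jte": jte, "mi": mi}

# Hypothetical sample of 10 pairs with empirical joint p.m.f. [[0.4, 0.1], [0.2, 0.3]].
pairs = [(0, 0)] * 4 + [(0, 1)] * 1 + [(1, 0)] * 2 + [(1, 1)] * 3
estimates = plug_in_estimates(pairs, alpha=2.0)
print(estimates)
```

Working from the joint frequencies alone, rather than estimating the three entropies in (1.8) separately, mirrors the approach described in the introduction.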
In the following, we present asymptotic limits of these empirical estimators.
4. Statements of the main results
In this section, we state and prove almost sure consistency and central limit theorem for the estimators defined above.
4.1. Asymptotic limits of joint Shannon entropy estimate.
Denote
[TABLE]
Proposition 1**.**
Let be a probability distribution, let be generated by i.i.d. samples according to and given by (3.5), and let assumption (2.7) be satisfied. Then the following asymptotic results hold
[TABLE]
Proof.
Define the function by .
Let , and set . We have
[TABLE]
by the mean value theorem and where is some number lying in .
Applying again the mean value theorem to the derivative function of , we obtain
[TABLE]
where . Replacing in (4.5), it yields
[TABLE]
Now summing over , it follows that
[TABLE]
so that
[TABLE]
Hence
[TABLE]
since, as ,
[TABLE]
This proves the claim (4.4).
Going back to (4.6), we have
[TABLE]
where
[TABLE]
The asymptotic Gaussian limit of the multinomial law (3.8) guarantees that
[TABLE]
where the asymptotic variance, , is equal to
[TABLE]
It remains to prove that converges in probability to $0$ as .
We have
[TABLE]
By the Bienaymé-Tchebychev inequality, we have, for any fixed and for any
[TABLE]
Therefore which entails that since, as tends to , we have
[TABLE]
All this proves the claim (4.4) and ends the proof of Proposition 1. ∎
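The variance appearing in a central limit theorem of this kind can be illustrated with the classical delta-method computation. The sketch below assumes a p.m.f. with all masses strictly positive, uses the multinomial covariance $\mathrm{diag}(p)-pp^{\top}$ and the gradient of $-\sum_k p_k\log p_k$, and compares the result with the closed form $\sum_k p_k(\log p_k)^2-\big(\sum_k p_k\log p_k\big)^2$. Whether this coincides symbol for symbol with the paper's variance cannot be checked from the text here, so it should be read as a standard computation rather than as the authors' formula.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in nats) of a p.m.f. with strictly positive masses."""
    return -sum(v * math.log(v) for v in p)

def closed_form_variance(p):
    """Variance of log p(Z): sum p (log p)^2 minus the squared entropy."""
    return sum(v * math.log(v) ** 2 for v in p) - shannon_entropy(p) ** 2

def delta_method_variance(p):
    """Quadratic form g' (diag(p) - p p') g with g_k = -(1 + log p_k)."""
    g = [-(1 + math.log(v)) for v in p]
    second_moment = sum(v * gk * gk for v, gk in zip(p, g))
    first_moment = sum(v * gk for v, gk in zip(p, g))
    return second_moment - first_moment ** 2

# Hypothetical p.m.f. on four points, used only for illustration.
p = [0.48, 0.24, 0.16, 0.12]
print(closed_form_variance(p), delta_method_variance(p))
```

For a uniform p.m.f. both expressions vanish, so the Gaussian limit is degenerate in that case and the first-order normal approximation carries no information.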
4.2. Asymptotic limit of joint Renyi and Tsallis entropies estimates
The following proposition concerns the asymptotic limits of JPS estimate given by
[TABLE]
The proof is the same as that of Proposition 1, with the function replaced by the function , and is therefore omitted.
For , denote
[TABLE]
Proposition 2**.**
Under the same conditions as in Proposition 1, the following asymptotic results hold
[TABLE]
Turning now to our second result, note that the relation (1.5) suggests that results similar to those of Proposition 2 can also be obtained for the JRE.
For , denote
[TABLE]
Proposition 3**.**
Under the same assumptions as in Proposition 2, the following asymptotic results hold
[TABLE]
Proof.
For we have
[TABLE]
Using a Taylor expansion of it follows that almost surely,
[TABLE]
Finally this, combined with (4.8) of Proposition 2, proves the claim (4.10).
Let us prove the claim (4.11).
Using the same techniques as in the proof of Proposition 1, we obtain
[TABLE]
where . Dividing both sides by , we get
[TABLE]
Now by Taylor expansion of , it follows that, almost surely,
[TABLE]
thus, from (4.12), we obtain
[TABLE]
but using (4.9), we have that
[TABLE]
Finally
[TABLE]
with
[TABLE]
This proves the claim (4.11) and ends the proof of Proposition 3. ∎
Note also that the relation (1.6) suggests that results similar to those of Proposition 2 can also be obtained for the JTE.
For , denote
[TABLE]
Proposition 4**.**
Under the same assumptions as in Proposition 2, the following asymptotic results hold
[TABLE]
Proof.
The proof follows very simply from Proposition 2, by writing
[TABLE]
∎
4.3. Asymptotic behavior of mutual information estimate
The following proposition establishes the almost sure convergence and the asymptotic normality of the estimator .
Proposition 5**.**
Under the same assumptions as in Proposition 2, the following asymptotic results hold
[TABLE]
where and are given resp. by (4.1) and (4.1).
Proof.
It is straightforward to write
[TABLE]
First, we have, for large enough and for any
[TABLE]
Hence, using that , for small enough, we get, for any fixed
[TABLE]
using (3.10).
Therefore, we have asymptotically
[TABLE]
Finally, (4.16) and (4.17) follow from the Proposition 1.
This ends the proof of Proposition 5. ∎
5. Statistic test of independence based on mutual information
The proposed mutual information estimator is a natural test statistic for independence. Given two random variables and with joint probability distribution , the hypotheses for testing independence are
[TABLE]
versus
[TABLE]
From a random sample according to , we compute the MI estimator .
Clearly, (4.16) implies that under , , as , and a classical result in statistics (see Christensen (1997), Wilks (1938), and Fan et al. (2000)) establishes that $2n$ times the MI estimate approximately follows a $\chi^2$ distribution with $(r-1)(s-1)$ degrees of freedom, for short
[TABLE]
for large.
Then, at significance level $\alpha$, we reject the null hypothesis when the test statistic is greater than the $(1-\alpha)$-th quantile of the $\chi^2$ distribution with $(r-1)(s-1)$ degrees of freedom.
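The test can be sketched from a contingency table of counts. The sketch relies on the classical likelihood-ratio result that, with natural logarithms, $2n$ times the plug-in mutual information is approximately chi-square with $(r-1)(s-1)$ degrees of freedom under independence. The chi-square tail is computed with a plain power series for the regularized incomplete gamma function, which is adequate for moderate statistics only; all function names are illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(chi2_df > x), via the series of the lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    # Series: sum over n of z^n / (a (a+1) ... (a+n)).
    while term > 1e-16 * total and n < 10000:
        n += 1
        term *= z / (a + n)
        total += term
    log_prefix = a * math.log(z) - z - math.lgamma(a)
    return max(0.0, 1.0 - math.exp(log_prefix) * total)

def mi_independence_test(counts):
    """Return (statistic, degrees of freedom, p-value) for an r x s table of counts."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    # Plug-in mutual information in nats; empty cells contribute zero.
    mi = sum((c / n) * math.log(c * n / (row_tot[i] * col_tot[j]))
             for i, row in enumerate(counts)
             for j, c in enumerate(row) if c > 0)
    stat = 2.0 * n * mi
    df = (len(counts) - 1) * (len(counts[0]) - 1)
    return stat, df, chi2_sf(stat, df)

# Two hypothetical 2 x 2 tables: perfectly balanced, and perfectly dependent.
stat_ind, df_ind, p_ind = mi_independence_test([[25, 25], [25, 25]])
stat_dep, df_dep, p_dep = mi_independence_test([[50, 0], [0, 50]])
print(stat_ind, p_ind, stat_dep, p_dep)
```

In practice a library routine for the chi-square distribution would replace `chi2_sf`; the hand-written series is used here only to keep the sketch free of external dependencies.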
6. Simulation
In this section, we start by providing a numerical example to illustrate asymptotic behavior of the different joint entropy measures defined before.
For simplicity, consider two discrete random variables and , each having two outcomes and , and such that
[TABLE]
So the associated random variable , defined by (2.2) and (2.3), is a discrete random variable whose probability distribution is a discrete Zipf distribution with parameters and . Its p.m.f. is defined by
[TABLE]
where refers to the generalized harmonic function.
We have
[TABLE]
is more uncertain than , and the pair is less uncertain than the discrete uniform distribution with range , whose entropy is . The variables and do not seem to have much information in common, only of information.
Table 2 gives the probability distribution , of .
In our applications we simulated i.i.d. samples of size ( according to , and computed the joint entropy estimates.
Figure 2 concerns the JSE estimate, Figure 3 concerns the JRE and JTE estimates (both of order ), whereas Figure 4 concerns the MI estimate, all for the pair .
In each of these figures, the left panels plot the proposed entropy estimator, built from sample sizes of , together with the true entropy of the pair (represented by the horizontal black line). We observe that, as the sample size increases, the value of the proposed estimator approaches the true value, in line with almost sure convergence.
The middle panels show the histogram of the sample, where the red line represents the theoretical normal density with the same mean and standard deviation as the sample.
The right panels show the Q-Q plot of the sample, which displays the observed values against normally distributed data (represented by the red line). The points fall close to a straight line, which is consistent with an approximately normal underlying distribution.
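The consistency behaviour described for the left panels can be reproduced in a small simulation. The sketch below assumes a Zipf law on four points with exponent 1 (the exponent used in the paper is not recoverable from the text here), a fixed random seed, and sample sizes chosen only for illustration; it reports the absolute error of the plug-in Shannon entropy as the sample size grows.

```python
import math
import random
from collections import Counter

def zipf_pmf(m, exponent):
    """Zipf p.m.f. on {1, ..., m}: mass proportional to k ** (-exponent)."""
    weights = [k ** (-exponent) for k in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def shannon(p):
    """Shannon entropy (in nats), ignoring zero masses."""
    return -sum(v * math.log(v) for v in p if v > 0)

def plug_in_shannon(sample):
    """Plug-in entropy estimate: entropy of the empirical p.m.f. of the sample."""
    n = len(sample)
    return shannon([c / n for c in Counter(sample).values()])

random.seed(0)                 # fixed seed so the run is reproducible
p = zipf_pmf(4, 1.0)           # hypothetical exponent, for illustration only
true_h = shannon(p)

errors = {}
for n in (100, 1000, 10000, 100000):
    sample = random.choices(range(1, 5), weights=p, k=n)
    errors[n] = abs(plug_in_shannon(sample) - true_h)

for n, err in errors.items():
    print(n, err)
```

Repeating the draw many times at a fixed sample size and standardizing the errors would give the histogram and Q-Q plot comparisons described for the middle and right panels.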
7. Conclusion
In this paper, we presented a new method for estimating the joint p.m.f. of a pair of discrete random variables. We adopted the plug-in method to construct estimates of the joint Shannon, Rényi and Tsallis entropies, and of the mutual information of an ordered pair of random variables. We established almost sure rates of convergence and asymptotic normality of these estimators.
- [1] Carter, T. (March 2014). An introduction to information theory and entropy. Santa Fe.
- [2] Rényi, A. (1960). On measures of information and entropy. Proc. 4th Berkeley Symposium on Mathematics, Statistics and Probability, pp. 547-561.
- [3] Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E, 69:066138.
- [4] Philippatos, G. C., and Wilson, C. J. (1972). Entropy, market risk, and the selection of efficient portfolios. Appl. Econ., 4, pp. 209-220.
- [5] Moon, K. R., Sricharan, K., and Hero, A. O. (2017). Ensemble estimation of mutual information. IEEE International Symposium on Information Theory (ISIT), eds Durisi G, Studer C (IEEE, Aachen, Germany), pp. 3030-3034.
- [6] Timme, N. M., and Lapish, C. (2018). A Tutorial for Information Theory in Neuroscience. eNeuro, 5(3).
- [7] Krishnaswamy, S., Spitzer, M. H., Mingueneau, M., Bendall, S. C., Litvin, O., Stone, E., Pe'er, D., and Nolan, G. P. (2014). Conditional density-based analysis of T cell signaling in single-cell data. Science, 346(6213):1250689.
- [8] Liu, H., Wasserman, L., and Lafferty, J. D. (2012). Exponential concentration for mutual information estimation with application to forests. Advances in Neural Information Processing Systems, pp. 2537-2545.
