Non parametric estimation of joint, Renyi-Stallis entropies and mutual   information and asymptotic limits

Amadou Diadie Ba; Gane Samb Lo; Cheikh Tidiane Seck

arXiv:1906.06484·stat.ME·January 14, 2020


TL;DR

This paper introduces a new non-parametric method for estimating joint entropies and mutual information of discrete variables, with proven consistency and asymptotic properties validated through simulations.

Contribution

It presents a novel estimator for joint probability mass functions and entropy measures, along with theoretical guarantees and empirical validation.

Findings

01. Estimator is almost surely consistent

02. Central limit theorems are established for the estimators

03. Simulation results validate the theoretical properties

Abstract

This paper proposes a new method for estimating the joint probability mass function of a pair of discrete random variables. This estimator is used to construct estimates of the joint Shannon, Rényi, and Tsallis entropies and of the mutual information of a pair of discrete random variables. Almost sure consistency and central limit theorems are established. Our theoretical results are validated by simulations.

Tables

Table 1. Illustration of the correspondence between $\textbf{p}_{(X,Y)}$ and $\textbf{p}_{Z}$ .

$p_{1, 1} = p_{Z, 1}$	$\dots$	$p_{1, j} = p_{Z, j}$	$\dots$	$p_{1, s} = p_{Z, s}$
$p_{2, 1} = p_{Z, s + 1}$	$\dots$	$p_{2, j} = p_{Z, s + j}$	$\dots$	$p_{2, s} = p_{Z, 2 s}$
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$
$p_{i, 1} = p_{Z, s (i - 1) + 1}$	$\dots$	$p_{i, j} = p_{Z, δ_{i}^{j}}$	$\dots$	$p_{i, s} = p_{Z, s i}$
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$
$p_{r, 1} = p_{Z, s (r - 1) + 1}$	$\dots$	$p_{r, j} = p_{Z, s (r - 1) + j}$	$\dots$	$p_{r, s} = p_{Z, r s}$

Table 2. Joint p.m.f. table of the random variable $Z$ with law $\textbf{p}_{(X,Y)}$

$(X, Y)$	$(x_{1}, y_{1})$	$(x_{1}, y_{2})$	$(x_{2}, y_{1})$	$(x_{2}, y_{2})$
$p_{Z, k}$	$\frac{144}{205}$	$\frac{36}{205}$	$\frac{16}{205}$	$\frac{9}{205}$

Equations

$\mathcal{I}(X=x_{i},Y=y_{j})=\log_{2}\frac{1}{p_{i,j}}$

$p_{i,j}=\mathbb{P}(X=x_{i},Y=y_{j}),\quad\forall(i,j)\in I\times J=\{1,\cdots,r\}\times\{1,\cdots,s\}.$

$H(X,Y)=\sum_{(i,j)\in I\times J}p_{i,j}\log_{2}\frac{1}{p_{i,j}}=\mathbb{E}_{X,Y}\left[\log_{2}\frac{1}{p_{(X,Y)}}\right].$

$H(\textbf{p}_{(X,Y)})\leq\log(rs).$

$R_{\alpha}(\textbf{p}_{(X,Y)})=\frac{1}{1-\alpha}\log\sum_{(i,j)\in I\times J}(p_{i,j})^{\alpha},$

$I(\textbf{p}_{(X,Y)})=\sum_{(i,j)\in I\times J}p_{i,j}\log\frac{p_{i,j}}{p_{X,i}\,p_{Y,j}},$

$\mathcal{S}_{\alpha}(\textbf{p}_{(X,Y)})=\sum_{(i,j)\in I\times J}(p_{i,j})^{\alpha}.$

$I(\textbf{p}_{(X,Y)})=H(\textbf{p}_{X})+H(\textbf{p}_{Y})-H(\textbf{p}_{(X,Y)}),$

$H(\textbf{p}_{X})=-\sum_{i\in I}p_{X,i}\log p_{X,i}$

$\lim_{n\rightarrow+\infty}I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})=I(\textbf{p}_{(X,Y)})\ \ \text{a.s.}$

$\lim_{n\rightarrow+\infty}\mathbb{E}\left(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})-I(\textbf{p}_{(X,Y)})\right)^{2}=0$

$\lim_{n\rightarrow+\infty}\mathbb{P}\left(\left|I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})-I(\textbf{p}_{(X,Y)})\right|>\varepsilon\right)=0.$

$\lim_{n\rightarrow+\infty}\mathbb{E}\left(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\right)=I(\textbf{p}_{(X,Y)}).$

$\lim_{n\rightarrow+\infty}\mathbb{V}ar\left(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\right)=0.$

$I(\textbf{p}_{(X,Y)})\approx\frac{1}{2\log 2}\sum_{(i,j)\in I\times J}\frac{(p_{i,j}-p_{X,i}\,p_{Y,j})^{2}}{p_{X,i}\,p_{Y,j}}$

$x_{1},x_{2},\cdots,x_{r}\ \ \text{and}\ \ y_{1},y_{2},\cdots,y_{s}$

$z_{1},z_{2},z_{3},z_{4},\cdots,z_{rs}.$

$\left(1+\lfloor\frac{k-1}{s}\rfloor,\ k-s\lfloor\frac{k-1}{s}\rfloor\right)\in I\times J,$

$\mathbb{P}(X=x_{i},Y=y_{j})=\mathbb{P}(Z=z_{\delta_{i}^{j}}),\ \ \text{where}\ \ \delta_{i}^{j}=s(i-1)+j,$

$\mathbb{P}(Z=z_{k})=\mathbb{P}\left(X=x_{1+\lfloor\frac{k-1}{s}\rfloor},\ Y=y_{k-s\lfloor\frac{k-1}{s}\rfloor}\right).$

$p_{i,j}=p_{Z,s(i-1)+j}$

$p_{Z,k}=p_{1+\lfloor\frac{k-1}{s}\rfloor,\ k-s\lfloor\frac{k-1}{s}\rfloor}.$

$\begin{array}{cccccc}p_{Z,1}=p_{1,1}&p_{Z,2}=p_{1,2}&\cdots&p_{Z,k}=p_{1+\lfloor\frac{k-1}{s}\rfloor,\,k-s\lfloor\frac{k-1}{s}\rfloor}&\cdots&p_{Z,rs}=p_{r,s}\end{array}$

$H(\textbf{p}_{(X,Y)})=-\sum_{k\in K}p_{Z,k}\log p_{Z,k},\quad R_{\alpha}(\textbf{p}_{(X,Y)})=\frac{1}{1-\alpha}\log\left(\sum_{k\in K}(p_{Z,k})^{\alpha}\right),$

$T_{\alpha}(\textbf{p}_{(X,Y)})=\frac{1}{1-\alpha}\left(\sum_{k\in K}(p_{Z,k})^{\alpha}-1\right),$

$p_{Z,k}>0,\ \ \forall k\in K.$


Taxonomy

Topics: Statistical Mechanics and Entropy · Statistical Distribution Estimation and Applications · Complex Systems and Time Series Analysis

Full text

Joint, Renyi-Stallis entropies and mutual information and asymptotic limits.


Amadou Diadie Ba*(1), Gane Samb Lo(1,2,3), Cheikh Tidiane Seck(4)*.

(1) LERSTAD, Université Gaston Berger, Sénégal,

(2) Associate Researcher, LASTA, Pierre et Marie University, Paris, FRANCE

(3) Associated Professor, African University of Sciences and Technology, Abuja, NIGERIA.

(4) Université Alioune Diop de Bambey, Sénégal.

Correspondence : Amadou Diadie Ba,

2010 Mathematics Subject Classifications : 94A17, 41A25, 62G05, 62G20, 62H12, 62H17.

Key Words and Phrases : Joint entropy estimation, Joint Rényi, Tsallis entropy, Mutual information estimation.

1. Introduction

1.1. Motivation

Let $X$ and $Y$ be two discrete random variables defined on the same probability space $(\Omega,\mathcal{A},\mathbb{P})$ , with respective values $x_{1},\cdots,x_{r}$ and $y_{1},\cdots,y_{s}$ (with $r>1$ and $s>1$ ).

The information amount of (or content in) the outcome $(X=x_{i},Y=y_{j})$ is (see Carter (2014))

$$\mathcal{I}(X=x_{i},Y=y_{j})=\log_{2}\frac{1}{p_{i,j}},$$

where $p_{i,j}=\mathbb{P}(X=x_{i},Y=y_{j})$ .

The joint probability distribution $\textbf{p}_{(X,Y)}=(p_{i,j})_{(i,j)\in I\times J}$ of the events $(X=x_{i},Y=y_{j})$ , coupled with the information amount $\mathcal{I}(X=x_{i},Y=y_{j})$ of every event, forms a random variable whose expected value is the average amount of information, or joint entropy (more specifically, joint Shannon entropy), generated by this joint distribution.

Definition 1.

Let $X$ and $Y$ be two discrete random variables defined on a probability space $(\Omega,\mathcal{A},\mathbb{P})$ , taking respective values in the finite spaces

$X(\Omega)=\{x_{1},x_{2},\cdots,x_{r}\}$ and $Y(\Omega)=\{y_{1},\cdots,y_{s}\}$ (with $r>1$ and $s>1$ ), and with joint probability mass function (p.m.f.) $\textbf{p}_{(X,Y)}=(p_{i,j})_{(i,j)\in I\times J}$ , that is,

$$p_{i,j}=\mathbb{P}(X=x_{i},Y=y_{j}),\quad\forall(i,j)\in I\times J=\{1,\cdots,r\}\times\{1,\cdots,s\}.$$

(a) The joint Shannon entropy (JSE) of the (ordered) pair of random variables $(X,Y)$ is given by

$$H(X,Y)=\sum_{(i,j)\in I\times J}p_{i,j}\log_{2}\frac{1}{p_{i,j}}=\mathbb{E}_{X,Y}\left[\log_{2}\frac{1}{p_{(X,Y)}}\right].\tag{1.1}$$

Entropy is usually measured in bits (binary information units, if $\log_{2}$ is used), nats (if the natural logarithm is used), or hartleys (if $\log_{10}$ is used), depending on the base of the logarithm used to define it.

For ease of computations and notation convenience, we use the natural logarithm, since logarithms of varying bases are related by a constant.

In what follows, $\textbf{p}_{X}=(p_{X,i})_{i\in I}$ and $\textbf{p}_{Y}=(p_{Y,j})_{j\in J}$ will (typically) denote the marginal distributions of the bivariate variable $(X,Y)$ whose distribution is denoted by $\textbf{p}_{(X,Y)}=(p_{i,j})_{(i,j)\in I\times J}$ . Additionally, entropies will be considered as functions of p.m.f.’s, since they only take into account probabilities of observing specific events.

Note that, over all pairs of random variables $(X,Y)$ that take at most $rs$ values with positive probability, the ones with the largest entropy are those which are uniform on their ranges, and these random variables have entropy exactly $\log rs$ , viz.

$$H(\textbf{p}_{(X,Y)})\leq\log(rs).$$
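As a quick numerical illustration of this bound, the following sketch computes the joint Shannon entropy (in nats) of a joint p.m.f. stored as a nested list and compares it with $\log(rs)$ . The function name `joint_shannon_entropy` and the two example tables are illustrative assumptions, not taken from the paper.

```python
import math

def joint_shannon_entropy(p):
    """Joint Shannon entropy (nats) of a joint p.m.f. given as r rows of s probabilities."""
    # Cells with p_ij = 0 are skipped (convention 0 * log 0 = 0).
    return -sum(pij * math.log(pij) for row in p for pij in row if pij > 0)

r, s = 2, 3
uniform = [[1.0 / (r * s)] * s for _ in range(r)]   # uniform law on the rs cells
skewed = [[0.5, 0.2, 0.1], [0.1, 0.05, 0.05]]       # hypothetical non-uniform law

h_uniform = joint_shannon_entropy(uniform)
h_skewed = joint_shannon_entropy(skewed)
bound = math.log(r * s)
print(h_uniform, h_skewed, bound)
```

The uniform table attains the bound, while the skewed table stays strictly below it.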

Inspired by the study of $\alpha$ -deformed algebras and special functions, various generalizations have been investigated.

Most notably, Rényi (1960) proposed a one parameter family of entropies extending Shannon entropy.

(b) The $\alpha-$ joint Rényi entropy (JRE) of the pair of random variables $(X,Y)$ is defined as

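$$R_{\alpha}(\textbf{p}_{(X,Y)})=\frac{1}{1-\alpha}\log\sum_{(i,j)\in I\times J}(p_{i,j})^{\alpha},\tag{1.2}$$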

with $\alpha>0,\ \ \alpha\neq 1$ , which, in particular, reduces to the joint Shannon entropy in the limit $\alpha\rightarrow 1$ .

(c) Also, the $\alpha-$ joint Tsallis entropy (JTE) of the pair of random variables $(X,Y)$ defined by

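$$T_{\alpha}(\textbf{p}_{(X,Y)})=\frac{1}{1-\alpha}\left(\sum_{(i,j)\in I\times J}(p_{i,j})^{\alpha}-1\right)\tag{1.3}$$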

has generated a large burst of research activities.

(d) The mutual information (MI) of the pair of random variables $(X,Y)$ defined by

$$I(\textbf{p}_{(X,Y)})=\sum_{(i,j)\in I\times J}p_{i,j}\log\frac{p_{i,j}}{p_{X,i}\,p_{Y,j}}\tag{1.4}$$

represents the amount of information that $Y$ reveals about $X$ (or vice versa).

Here $p_{X,i}=\sum_{j=1}^{s}p_{i,j}$ and $p_{Y,j}=\sum_{i=1}^{r}p_{i,j}$ .

In what follows, $\alpha>0,\ \ \alpha\neq 1$ . An important relation between JRE, JTE and the joint power sum (JPS) is

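$$R_{\alpha}(\textbf{p}_{(X,Y)})=\frac{1}{1-\alpha}\log\mathcal{S}_{\alpha}(\textbf{p}_{(X,Y)})\quad\text{and}\quad T_{\alpha}(\textbf{p}_{(X,Y)})=\frac{1}{1-\alpha}\left(\mathcal{S}_{\alpha}(\textbf{p}_{(X,Y)})-1\right),$$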

where

$$\mathcal{S}_{\alpha}(\textbf{p}_{(X,Y)})=\sum_{(i,j)\in I\times J}(p_{i,j})^{\alpha}.$$
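The role of the joint power sum can be sketched in a few lines: both the Rényi and the Tsallis entropies are simple transformations of it, and both approach the Shannon entropy as $\alpha\rightarrow 1$ . The function names and the example table below are illustrative assumptions.

```python
import math

def power_sum(p, alpha):
    """Joint power sum: sum of p_ij ** alpha over all cells of the table."""
    return sum(pij ** alpha for row in p for pij in row)

def renyi(p, alpha):
    """Joint Renyi entropy of order alpha (alpha > 0, alpha != 1)."""
    return math.log(power_sum(p, alpha)) / (1.0 - alpha)

def tsallis(p, alpha):
    """Joint Tsallis entropy of order alpha (alpha > 0, alpha != 1)."""
    return (power_sum(p, alpha) - 1.0) / (1.0 - alpha)

def shannon(p):
    """Joint Shannon entropy in nats."""
    return -sum(pij * math.log(pij) for row in p for pij in row if pij > 0)

p = [[0.4, 0.3], [0.2, 0.1]]          # hypothetical joint p.m.f.
print(renyi(p, 2.0), tsallis(p, 2.0))
# Orders close to 1 give values close to the Shannon entropy.
print(renyi(p, 1.0 + 1e-6), tsallis(p, 1.0 + 1e-6), shannon(p))
```

For this table the power sum of order 2 equals 0.30, so the Tsallis entropy of order 2 is 0.7 and the Rényi entropy of order 2 is $-\log 0.3$ .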

Mutual information is closely related to entropy by

$$I(\textbf{p}_{(X,Y)})=H(\textbf{p}_{X})+H(\textbf{p}_{Y})-H(\textbf{p}_{(X,Y)}),\tag{1.8}$$

where

$$H(\textbf{p}_{X})=-\sum_{i\in I}p_{X,i}\log p_{X,i}$$

is the entropy of $X$ and similarly for $Y$ .
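The relation between mutual information and the three entropies can be checked numerically. The sketch below computes the mutual information once from its definition and once as the sum of the marginal entropies minus the joint entropy; names and the example table are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a list of probabilities."""
    return -sum(q * math.log(q) for q in probs if q > 0)

def mutual_information(p):
    """MI (nats) computed directly from a joint p.m.f. given as r rows of s entries."""
    p_x = [sum(row) for row in p]
    p_y = [sum(col) for col in zip(*p)]
    return sum(
        pij * math.log(pij / (p_x[i] * p_y[j]))
        for i, row in enumerate(p)
        for j, pij in enumerate(row)
        if pij > 0
    )

p = [[0.4, 0.3], [0.2, 0.1]]                      # hypothetical joint p.m.f.
p_x = [sum(row) for row in p]
p_y = [sum(col) for col in zip(*p)]
direct = mutual_information(p)
three_h = entropy(p_x) + entropy(p_y) - entropy([q for row in p for q in row])
independent = [[a * b for b in p_y] for a in p_x]  # product of the marginals
print(direct, three_h, mutual_information(independent))
```

The two computations agree up to rounding, and the product table gives a mutual information of zero.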

This form can also be used for a Venn-diagram, as shown in Figure 1.

In this paper, our aim is to estimate directly the entropies defined above by using a plug-in approach. Relation (1.8) yields an estimator of MI by estimating $H(\textbf{p}_{X})$ , $H(\textbf{p}_{Y})$ , and $H(\textbf{p}_{(X,Y)})$ separately and combining them. This corresponds to the $3H$ -principle on which a number of plug-in estimators are based (see Kraskov et al. (2004) for details on this principle).

In contrast, we propose in this paper a plug-in approach that is essentially based on the estimation of the joint probability distribution $\textbf{p}_{(X,Y)}$ from which, we can calculate the marginal distributions $\textbf{p}_{X}=(p_{X,i})_{(i\in I)}$ , $\textbf{p}_{Y}=(p_{Y,j})_{(j\in J)}$ and then the quantities $H(\textbf{p}_{(X,Y)})$ , $R_{\alpha}(\textbf{p}_{(X,Y)})$ , $T_{\alpha}(\textbf{p}_{(X,Y)})$ , and $I(\textbf{p}_{(X,Y)})$ .

This approach is motivated by the fact that studying the joint probability distribution $\textbf{p}_{(X,Y)}$ of the pair of discrete random variables $(X,Y)$ taking values, resp., in the finite sets $\mathcal{X}=\{x_{i},i=1,\cdots,r\}$ and $\mathcal{Y}=\{y_{j},j=1,\cdots,s\}$ is equivalent to studying the probability distribution of the $rs$ mutually exclusive possible values $(x_{i},y_{j})$ of $(X,Y)$ . This allows us to transform the problem of estimating the joint discrete distribution of the pair $(X,Y)$ into the problem of estimating a simple distribution, say $\textbf{p}_{Z}$ , of a single, suitably defined discrete random variable $Z$ . Given an i.i.d. sample of this latter random variable $Z$ , we take the associated empirical measure as an estimator of the law $\textbf{p}_{Z}$ and plug it into formulas (1.1), (1.2), (1.3), and (1.4) to obtain estimates of the entropies concerned.

Before turning to the estimation of these entropies, let us highlight some of their important applications. The importance of information measures transcends information theory. Indeed, since shortly after their inception, a wide variety of experimental sciences have found significant applications for joint Shannon entropy, Rényi and Tsallis entropies, and mutual information. For example,

$\bullet$ Finance: Philippatos $\&$ Wilson (1972);

$\bullet$ Machine learning: Moon et al. (2017);

$\bullet$ Biological sciences: Timme $\&$ Lapish (2018), Krishnaswamy et al. (2014);

$\bullet$ Statistics: Liu et al. (2012), Lewi et al. (2006), Pál et al. (2010), Christensen (1997);

$\bullet$ Sociology: Reshef et al. (2011);

$\bullet$ Neuroscience: Rieke (1999), Schneidman et al. (2003).

Frequently, in those applications, the need arises to estimate information measures empirically: data are generated under an unknown probability law, and we would like to estimate these information measures from the data.

1.2. Previous work

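Mutual information estimation from samples remains an active research problem (see Walters et al. (2009), Khan et al. (2007), and Sricharan et al. (2013), to cite a few).

Antos and Kontoyiannis (2001) defined an estimator for the mutual information of discrete random variables $X$ and $Y$ and showed that

$$\lim_{n\rightarrow+\infty}I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})=I(\textbf{p}_{(X,Y)})\ \ \text{a.s.}\quad\text{and}\quad\lim_{n\rightarrow+\infty}\mathbb{E}\left(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})-I(\textbf{p}_{(X,Y)})\right)^{2}=0,$$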

provided that $I(\textbf{p}_{(X,Y)})<\infty$ .

Deemat (2013), using the histogram method and under appropriate assumptions on the tail behavior of the random variables, showed that the mutual information estimate is consistent in probability, that is, for any $\varepsilon>0$ ,

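$$\lim_{n\rightarrow+\infty}\mathbb{P}\left(\left|I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})-I(\textbf{p}_{(X,Y)})\right|>\varepsilon\right)=0.$$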

This result was also established by Gao et al. (2017a) using the Kraskov–Stögbauer–Grassberger (KSG) method, under regularity and smoothness conditions on, resp., the Radon-Nikodym derivatives of $X$ and $Y$ and the joint p.d.f. $\textbf{p}_{(X,Y)}$ , and under assumptions on the joint entropy $H(\textbf{p}_{(X,Y)})$ .

Gao et al. (2017), using the Local Gaussian Density Estimation method, proved that the mutual information estimate is asymptotically unbiased, that is,

$$\lim_{n\rightarrow+\infty}\mathbb{E}\left(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\right)=I(\textbf{p}_{(X,Y)}).$$

By the $k$ -nearest neighbors (K-NN) method, Gao et al. (2017) defined a novel estimator for the mutual information of a mixture of random variables $(X,Y)$ . They proved that the proposed estimator is asymptotically unbiased, that is,

$$\lim_{n\rightarrow+\infty}\mathbb{E}\left(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\right)=I(\textbf{p}_{(X,Y)}),$$

provided that $k=k(n)\rightarrow+\infty$ and $(k(n)\log n)/n\rightarrow 0$ as $n\rightarrow\infty.$

Furthermore, they proved that, if in addition $(k(n)\log n)^{2}/n\rightarrow 0$ as $n\rightarrow\infty$ , then

$$\lim_{n\rightarrow+\infty}\mathbb{V}ar\left(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\right)=0.$$

Goebel et al. (2005) established by Taylor approximation that, when the two random variables $X$ and $Y$ are independent,

$$I(\textbf{p}_{(X,Y)})\approx\frac{1}{2\log 2}\sum_{(i,j)\in I\times J}\frac{(p_{i,j}-p_{X,i}\,p_{Y,j})^{2}}{p_{X,i}\,p_{Y,j}},$$

which is a second-order approximation of the mutual information.

They then deduced that if $I(\textbf{p}_{(X,Y)})$ is small enough ( $<0.2$ bit), i.e. $X$ and $Y$ are independent or weakly associated random variables, and $n$ is sufficiently large ( $n>50$ ), then $I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})$ approximately follows a gamma distribution with parameters $\alpha=\frac{(r-1)(s-1)}{2}$ and $\beta=\frac{1}{n\log 2}$ .

In this case the mean and variance are given as

[TABLE]
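As a worked check, the moments of such a gamma law follow from the standard formulas, under the assumption (not stated explicitly above) that $\beta$ is a scale parameter, so that the mean is $\alpha\beta$ and the variance is $\alpha\beta^{2}$ :

```latex
\mathbb{E}\big(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\big)\approx\alpha\beta=\frac{(r-1)(s-1)}{2n\log 2},
\qquad
\mathbb{V}ar\big(I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\big)\approx\alpha\beta^{2}=\frac{(r-1)(s-1)}{2(n\log 2)^{2}}.
```

For instance, with $r=s=2$ and $n=100$ , the approximate mean would be $1/(200\log 2)\approx 0.0072$ bit.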

Xianli et al. (2018) used a jackknife version of the kernel estimator with equalized bandwidth to estimate the Shannon mutual information for a pair of discrete random variables and for mixed random variables (with neither purely continuous nor purely discrete distributions).

Beknazaryan et al. (2019) studied mutual information estimation for a mixed pair of random variables. They developed a kernel method to estimate the mutual information between the two random variables. The estimates satisfy a central limit theorem under some regularity conditions on the distributions.

1.3. Overview of the paper

The rest of the paper is organized as follows. In section 2, we define the auxiliary random variable $Z$ whose law is exactly the joint law of $(X,Y)$ . In section 3, we construct plug-in estimates of the joint p.m.f. of $(X,Y)$ and estimates of JSE, JRE, JTE, and MI. Section 4 establishes consistency and asymptotic normality of the estimates. Section 5 is devoted to an independence test based on mutual information. In section 6, we provide a simulation study to assess the performance of our estimators, and we conclude in section 7.

2. Construction of the random variable $Z$ with law $\textbf{p}_{(X,Y)}$

Let $X$ and $Y$ be two discrete random variables defined on the same probability space $(\Omega,\mathcal{A},\mathbb{P})$ and taking the values

$$x_{1},x_{2},\cdots,x_{r}\quad\text{and}\quad y_{1},y_{2},\cdots,y_{s}$$

resp. ( $r>1$ and $s>1$ ).

In addition, let $Z$ be a random variable defined on the same probability space $(\Omega,\mathcal{A},\mathbb{P})$ and taking the values

$$z_{1},z_{2},z_{3},z_{4},\cdots,z_{rs}.$$

Denote $K=\{1,2,\cdots,rs\}$ .

Simple computations show that for any $(i,j)\in I\times J$ , we have $s(i-1)+j=\delta_{i}^{j}\in K$ and conversely, for any $k\in K$ , we have

$$\left(1+\left\lfloor\frac{k-1}{s}\right\rfloor,\ k-s\left\lfloor\frac{k-1}{s}\right\rfloor\right)\in I\times J,$$

where $\lfloor x\rfloor$ denotes the largest integer less than or equal to $x$ .

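To each possible joint value $(x_{i},y_{j})$ of the ordered pair $(X,Y)$ , we assign the single value $z_{\delta_{i}^{j}}$ of $Z$ such that

$$\mathbb{P}(X=x_{i},Y=y_{j})=\mathbb{P}(Z=z_{\delta_{i}^{j}}),\ \ \text{where}\ \ \delta_{i}^{j}=s(i-1)+j,\tag{2.2}$$

and conversely, to each possible value $z_{k}$ of $Z$ is assigned the single pair of values $\left(x_{1+\lfloor\frac{k-1}{s}\rfloor},y_{k-s\lfloor\frac{k-1}{s}\rfloor}\right)$ such that

$$\mathbb{P}(Z=z_{k})=\mathbb{P}\left(X=x_{1+\lfloor\frac{k-1}{s}\rfloor},\ Y=y_{k-s\lfloor\frac{k-1}{s}\rfloor}\right).\tag{2.3}$$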

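This means that for any $(i,j)\in I\times J$ , we have

$$p_{i,j}=p_{Z,s(i-1)+j},\tag{2.4}$$

where $p_{Z,k}=\mathbb{P}(Z=z_{k})$ , and conversely, for any $k\in K$ ,

$$p_{Z,k}=p_{1+\lfloor\frac{k-1}{s}\rfloor,\ k-s\lfloor\frac{k-1}{s}\rfloor}.\tag{2.5}$$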

Table 1 illustrates the correspondence between $p_{i,j}$ and $p_{Z,k}$ , for $(i,j,k)\in I\times J\times K$ .
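The correspondence between pairs $(i,j)$ and labels $k$ is a row-major flattening and its inverse. A minimal sketch, with hypothetical function names and 1-based indices as in the text:

```python
def pair_to_index(i, j, s):
    """Map (i, j) in {1..r} x {1..s} to k = s*(i-1) + j in {1..rs}."""
    return s * (i - 1) + j

def index_to_pair(k, s):
    """Inverse map: k -> (1 + floor((k-1)/s), k - s*floor((k-1)/s))."""
    q = (k - 1) // s
    return 1 + q, k - s * q

r, s = 3, 4
pairs = [(i, j) for i in range(1, r + 1) for j in range(1, s + 1)]
indices = [pair_to_index(i, j, s) for i, j in pairs]
round_trip = [index_to_pair(k, s) for k in indices]
print(indices)
print(round_trip == pairs)
```

Enumerating the pairs row by row produces the labels $1,\cdots,rs$ in order, and applying the inverse map recovers every pair, which is the bijection used throughout the construction of $Z$ .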

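From there, the marginal p.m.f.’s $p_{X,i}$ and $p_{Y,j}$ are expressed in terms of the p.m.f. of the random variable $Z$ by

$$p_{X,i}=\sum_{j=1}^{s}p_{Z,s(i-1)+j}\quad\text{and}\quad p_{Y,j}=\sum_{i=1}^{r}p_{Z,s(i-1)+j}.\tag{2.6}$$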

Finally, JSE, JRE, JTE and MI are expressed simply in terms of $\textbf{p}_{Z}=(p_{Z,k})_{k\in K}$ through (2.4), that is

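$$\begin{aligned}H(\textbf{p}_{(X,Y)})&=-\sum_{k\in K}p_{Z,k}\log p_{Z,k},&R_{\alpha}(\textbf{p}_{(X,Y)})&=\frac{1}{1-\alpha}\log\left(\sum_{k\in K}(p_{Z,k})^{\alpha}\right),\\T_{\alpha}(\textbf{p}_{(X,Y)})&=\frac{1}{1-\alpha}\left(\sum_{k\in K}(p_{Z,k})^{\alpha}-1\right),&I(\textbf{p}_{(X,Y)})&=\sum_{(i,j)\in I\times J}p_{Z,\delta_{i}^{j}}\log\frac{p_{Z,\delta_{i}^{j}}}{p_{X,i}\,p_{Y,j}},\end{aligned}$$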

where $\alpha>0,\ \ \alpha\neq 1$ .
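Working with the flat vector $\textbf{p}_{Z}$ is convenient in practice. The sketch below recovers the two marginals from a flat list using the index rule $k=s(i-1)+j$ (shifted to 0-based list positions) and evaluates the joint Shannon entropy directly on the flat vector; the function names and the numbers are illustrative assumptions.

```python
import math

def marginals_from_flat(p_z, r, s):
    """Recover (p_X, p_Y) from the flat vector p_Z, where p_z[s*i + j] holds p_{i+1, j+1}."""
    p_x = [sum(p_z[s * i + j] for j in range(s)) for i in range(r)]
    p_y = [sum(p_z[s * i + j] for i in range(r)) for j in range(s)]
    return p_x, p_y

def shannon_flat(p_z):
    """Joint Shannon entropy (nats) computed from the flat vector."""
    return -sum(q * math.log(q) for q in p_z if q > 0)

r, s = 2, 3
p_z = [0.10, 0.20, 0.30, 0.15, 0.05, 0.20]   # hypothetical law of Z, rs = 6
p_x, p_y = marginals_from_flat(p_z, r, s)
h = shannon_flat(p_z)
print(p_x, p_y, h)
```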

We now make the following remark.

For most univariate or multivariate entropies, computational problems may arise. So, without loss of generality, suppose

$$p_{Z,k}>0,\quad\forall k\in K.\tag{2.7}$$

If Assumption (2.7) holds, we do not have to worry about summation problems. This explains why Assumption (2.7) is systematically used in a great number of works on this topic, for example, in Hall (1987), Singh and Poczos (2014), Krishnamurthy et al. (2014), and recently Ba et al. (2019), to cite a few.

3. Estimation

In this section, we construct an estimate of the p.m.f. $\textbf{p}_{Z}$ from i.i.d. random variables distributed according to $\textbf{p}_{Z}$ , give some preliminary results needed in the sequel, and finally construct the plug-in estimates of the entropies cited above.

Let $Z_{1},\cdots,Z_{n}$ be $n$ i.i.d. copies of $Z$ , distributed according to $\textbf{p}_{Z}$ .

Here, it is worth noting that, in the sequel, $K=\{1,2,\cdots,rs\}$ , with $r$ and $s$ integers strictly greater than $1$ . This means that $rs$ cannot be a prime number, so that (2.5) holds.

For a given $k\in K$ , define the simplest and most natural estimator of $p_{Z,k}$ , based on the i.i.d. sample $Z_{1},\cdots,Z_{n},$ by

[TABLE]

where $1_{z_{k}}(Z_{\ell})=\begin{cases}1\ \ \text{if}\ \ Z_{\ell}=z_{k}\\ 0\ \ \text{otherwise}\end{cases}$ for a fixed $k\in K$ .

This means that, for a given $(i,j)\in I\times J$ , an estimate of $p_{i,j}$ based on the i.i.d sample $Z_{1},\cdots,Z_{n},$ according to $\textbf{p}_{Z}$ is given by

[TABLE]

where $1_{z_{\delta_{i}^{j}}}(Z_{\ell})=\begin{cases}1\ \ \text{if}\ \ Z_{\ell}=z_{\delta_{i}^{j}}\\ 0\ \ \text{otherwise}\end{cases}$ for fixed $(i,j)\in I\times J$ .
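The indicator-based definition amounts to taking relative frequencies. A minimal sketch, assuming the sample is coded by the labels $k\in\{1,\cdots,rs\}$ and using a hypothetical law for $Z$ (neither the function name nor the numbers come from the paper):

```python
import random
from collections import Counter

def empirical_pmf(sample, m):
    """Relative frequency of each label 1..m in the sample (average of the indicators)."""
    counts = Counter(sample)
    n = len(sample)
    return [counts.get(k, 0) / n for k in range(1, m + 1)]

random.seed(0)                       # fixed seed so the sketch is reproducible
p_z = [0.4, 0.3, 0.2, 0.1]           # hypothetical true law of Z, with rs = 4
sample = random.choices(range(1, 5), weights=p_z, k=10_000)
p_hat = empirical_pmf(sample, 4)
print(p_hat)
```

With 10,000 draws the relative frequencies are typically within about 0.01 of the true probabilities.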

From (2.6), estimates of the marginal p.m.f.’s $p_{X,i}$ and $p_{Y,j}$ are

[TABLE]

with

[TABLE]

In the following, we use equally $p_{Z,k}$ or $p_{i,j}$ since they are equal in consideration of (2.4) and (2.5) and we denote

[TABLE]

Before going further, let us give some results concerning the empirical estimator (3.1).

For a given $k\in K$ , this empirical estimator $\widehat{p}_{Z,k}^{(n)}$ is strongly consistent and asymptotically normal. Precisely, for a fixed $k\in K$ , when $n$ tends to infinity,

[TABLE]

where $G_{p_{Z,k}}\stackrel{{\scriptstyle d}}{{\sim}}\mathcal{N}(0,p_{Z,k}(1-p_{Z,k}))$ .

These asymptotic properties derive from the law of large numbers and central limit theorem.

Here and in the following, $\stackrel{{\scriptstyle a.s.}}{{\longrightarrow}}$ means the almost sure convergence, $\stackrel{{\scriptstyle\mathcal{D}}}{{\rightsquigarrow}}$ , the convergence in distribution, and $\stackrel{{\scriptstyle d}}{{\sim}}$ , means equality in distribution.

Recall that, since for a fixed $k\in K,$ $n\widehat{p}_{Z,k}^{(n)}$ has a binomial distribution with parameters $n$ and success probability $p_{Z,k}$ , we have

[TABLE]

Denote

[TABLE]

where $\Delta_{p_{Z,k}}^{(n)}=\widehat{p}_{Z,k}^{(n)}-p_{Z,k}.$

By the asymptotic Gaussian limit of the multinomial law (see for example Lo (2016), Chapter 1, Section 4), we have

[TABLE]

where $G(\textbf{p}_{Z})=(G_{p_{Z,k}},k\in K)^{t}\stackrel{{\scriptstyle d}}{{\sim}}\mathcal{N}(0,\Sigma_{\textbf{p}_{Z}}),$ and $\Sigma_{\textbf{p}_{Z}}$ is the covariance matrix whose elements are:

[TABLE]
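For orientation, the standard Gaussian limit of a normalized multinomial vector has variances $p_{Z,k}(1-p_{Z,k})$ on the diagonal (matching the variance of $G_{p_{Z,k}}$ stated earlier) and covariances $-p_{Z,k}\,p_{Z,l}$ off the diagonal. The sketch below builds that matrix, on the assumption that this is the matrix $\Sigma_{\textbf{p}_{Z}}$ referred to here; the function name and the example law are hypothetical.

```python
def multinomial_limit_cov(p_z):
    """Covariance of the Gaussian limit of sqrt(n) * (p_hat - p): diag(p) - p p^T."""
    m = len(p_z)
    return [
        [p_z[k] * (1.0 - p_z[k]) if k == l else -p_z[k] * p_z[l] for l in range(m)]
        for k in range(m)
    ]

p_z = [0.4, 0.3, 0.2, 0.1]           # hypothetical law of Z
sigma = multinomial_limit_cov(p_z)
row_sums = [sum(row) for row in sigma]
print(sigma[0], row_sums)
```

Each row sums to zero because the estimated probabilities always sum to one, so the limit is a degenerate Gaussian vector.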

By denoting $a_{X,n}=\sup_{i\in I}|\widehat{p}_{X,i}^{(n)}-p_{X,i}|\ \ \text{and}\ \ a_{Y,n}=\sup_{j\in J}|\widehat{p}_{Y,j}^{(n)}-p_{Y,j}|$ then, we have

[TABLE]

As a consequence, JSE, JRE, and JTE are estimated from the sample $Z_{1},\cdots,Z_{n}$ by their plug-in counterparts, meaning that we simply insert the consistent p.m.f. estimator $\widehat{p}_{Z,k}^{(n)}$ computed from (3.1) in place of $p_{Z,k}$ in the JSE, JRE, and JTE expressions, viz.:

[TABLE]

where $\alpha>0,\ \ \alpha\neq 1$ and $\widehat{p}_{Z,\delta_{i}^{j}}^{(n)}$ , and $\widehat{p}_{X,i}^{(n)},$ are given resp. by (3.2), and (3.3).

In addition, define the JPS estimate

[TABLE]

In the following, we present asymptotic limits of these empirical estimators.

4. Statements of the main results

In this section, we state and prove almost sure consistency and central limit theorem for the estimators defined above.

4.1. Asymptotic limits of joint Shannon entropy estimate.

Denote

[TABLE]

Proposition 1.

Let $\textbf{p}_{(X,Y)}$ be a probability distribution, let $\widehat{\textbf{p}}_{(X,Y)}^{(n)}$ , given by (3.5), be generated from i.i.d. samples $Z_{1},Z_{2},\cdots,Z_{n}$ distributed according to $\textbf{p}_{(X,Y)}$ , and suppose that assumption (2.7) is satisfied. Then the following asymptotic results hold

[TABLE]

Proof.

Define the function $\psi:\,(0,+\infty)\rightarrow\mathbb{R}$ by $\psi(x)=x\log x$ .

Let $(i,j)\in I\times J$ , and set $k=\delta_{i}^{j}\in K$ . We have

[TABLE]

by the mean value theorem, where $\theta_{1,k}^{(n)}$ is some number lying in $(0,1)$ .

Applying the mean value theorem again to the derivative $\psi^{\prime}$ of $\psi$ , we obtain

[TABLE]

where $\theta_{2,k}^{(n)}\in(0,1)$ . Replacing in (4.5), it yields

[TABLE]

Now summing over $(i,j)\in I\times J$ , it follows that

[TABLE]

so that

[TABLE]

Hence

[TABLE]

since, as $n\rightarrow+\infty$ ,

[TABLE]

This proves the claim (4.3).

Going back to (4.6), we have

[TABLE]

where

[TABLE]

The asymptotic Gaussian limit of the multinomial law (3.8) guarantees that

[TABLE]

where the asymptotic variance, $\sigma^{2}(\textbf{p}_{Z})$ , is equal to

[TABLE]

It remains to prove that $\sqrt{n}R_{1,n}$ converges in probability to $0$ as $n\rightarrow+\infty$ .

We have

[TABLE]

By the Bienaymé-Tchebychev inequality, we have, for any fixed $\epsilon>0$ and for any $k\in K$

[TABLE]

Therefore $\sqrt{n}(a_{Z,n})^{2}=o_{\mathbb{P}}(1)$ , which entails that $\sqrt{n}R_{1,n}=o_{\mathbb{P}}(1)$ since, as $n$ tends to $+\infty$ , we have

[TABLE]

All this proves the claim (4.4) and ends the proof of the Proposition 1 ∎
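The limiting variance can also be cross-checked by a first-order (delta-method) heuristic. The computation below is a sketch, not the authors' derivation: it assumes that the Gaussian limit in (3.8) has the usual multinomial covariance, with entries $p_{Z,k}\mathbf{1}_{\{k=l\}}-p_{Z,k}\,p_{Z,l}$ , and it may differ in form from the expression of $\sigma^{2}(\textbf{p}_{Z})$ used in the proof. Since $\psi^{\prime}(x)=1+\log x$ and the joint entropy is minus the sum of the $\psi(p_{Z,k})$ ,

```latex
\begin{aligned}
\sqrt{n}\,\big(H(\widehat{\textbf{p}}_{(X,Y)}^{(n)})-H(\textbf{p}_{(X,Y)})\big)
  &\approx-\sum_{k\in K}\big(1+\log p_{Z,k}\big)\,\sqrt{n}\,\Delta_{p_{Z,k}}^{(n)},\\
\text{limiting variance}
  &=\sum_{k\in K}p_{Z,k}\big(1+\log p_{Z,k}\big)^{2}
    -\Big(\sum_{k\in K}p_{Z,k}\big(1+\log p_{Z,k}\big)\Big)^{2}\\
  &=\sum_{k\in K}p_{Z,k}\big(\log p_{Z,k}\big)^{2}-\big(H(\textbf{p}_{(X,Y)})\big)^{2}.
\end{aligned}
```

In words, under these assumptions the limiting variance is the variance of $\log p_{Z}(Z)$ ; it vanishes when $\textbf{p}_{Z}$ is uniform, in which case the first-order term carries no information and the normal approximation degenerates.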

4.2. Asymptotic limit of joint Renyi and Tsallis entropies estimates

The following proposition concerns the asymptotic limits of JPS estimate $\mathcal{S}_{\alpha}(\widehat{\textbf{p}}_{(X,Y)}^{(n)})$ given by

[TABLE]

Its proof is the same as that of Proposition 1, with the function $\psi$ replaced by the function $\varphi:x\mapsto x^{\alpha}$ , and is hence omitted.

For $\alpha>0,\ \ \alpha\neq 1$ , denote

[TABLE]

Proposition 2.

Under the same conditions as in Proposition 1, the following asymptotic results hold

[TABLE]

Turning now to our second result, note that the relation (1.5) suggests that results similar to those of Proposition 2 could also be extended to the JRE.

For $\alpha>0,\ \ \alpha\neq 1$ , denote

[TABLE]

Proposition 3.

Under the same assumptions as in Proposition 2, the following asymptotic results hold

[TABLE]

Proof.

For $\alpha\in(0,1)\cup(1,+\infty),$ we have

[TABLE]

Using a Taylor expansion of $\log(1+y)$ it follows that almost surely,

[TABLE]

Finally this, combined with (4.8) of Proposition 2, proves the claim (4.10).

Let us prove the claim (4.11).

Using the same techniques as in the proof of Proposition 1, we obtain

[TABLE]

where $\varphi(x)=x^{\alpha}$ . So that dividing each member by $\sqrt{n}\mathcal{S}_{\alpha}(\textbf{p}_{(X,Y)})$ , we get

[TABLE]

Now by Taylor expansion of $\log(1+y)$ , it follows that, almost surely,

[TABLE]

thus, from (4.12), we obtain

[TABLE]

but using (4.9), we have that

[TABLE]

Finally

[TABLE]

with

[TABLE]

This proves the claim (4.11) and ends the proof of the Proposition 3 .

∎

Note also that the relation (1.6) suggests that results similar to those of Proposition 2 could also be extended to the JTE.

For $\alpha>0,\ \ \alpha\neq 1$ , denote

[TABLE]

Proposition 4.

Under the same assumptions as in Proposition 2, the following asymptotic results hold

[TABLE]

Proof.

The proof follows very simply from Proposition 2, by writing

[TABLE]

∎

4.3. Asymptotic behavior of mutual information estimate

The following proposition establishes the almost sure convergence and the asymptotic normality of the estimator $I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})$ .

Proposition 5.

Under the same assumptions as in Proposition 2, the following asymptotic results hold

[TABLE]

where $A_{H}(\textbf{p}_{(X,Y)})$ and $\sigma_{H}^{2}(\textbf{p}_{(X,Y)})$ are given resp. by (4.1) and (4.2).

Proof.

It is straightforward to write

[TABLE]

First, we have, for $n$ large enough and for any $(i,j)\in I\times J$

[TABLE]

Hence, using that $\log(1+x)\approx x$ , for $x$ small enough, we get, for any fixed $(i,j)\in I\times J$

[TABLE]

using (3.10).

Therefore, we have asymptotically

[TABLE]

Finally, (4.16) and (4.17) follow from the Proposition 1.

This ends the proof of the Proposition 5.

∎

5. Statistical test of independence based on mutual information

The proposed mutual information estimator is a natural test statistic for independence. Given two random variables $X$ and $Y$ with joint probability distribution $\textbf{p}_{(X,Y)}=(p_{i,j})_{(i,j)\in[1,r]\times[1,s]}$ , the hypotheses for testing independence are

[TABLE]

versus

[TABLE]

From a random sample $Z_{1},Z_{2},\cdots,Z_{n}$ according to $\textbf{p}_{(X,Y)}$ , we compute the MI estimator $I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})$ .

Clearly, (4.16) implies that under $H_{0}$ , $I(\widehat{\textbf{p}}_{(X,Y)}^{(n)})\stackrel{{\scriptstyle a.s}}{{\longrightarrow}}0$ , as $n\rightarrow\infty$ , and a classical result in statistics (see Christensen (1997), Wilks (1938), and Fan et al. (2000)) establishes that $2nI(\widehat{\textbf{p}}_{(X,Y)}^{(n)})$ approximately follows a $\chi^{2}$ distribution with $(r-1)(s-1)$ degrees of freedom; in short,

[TABLE]

for $n$ large.

Then, at significance level $\alpha\in(0,1)$ , we reject the null hypothesis $H_{0}$ , when $2nI(\widehat{\textbf{p}}_{(X,Y)}^{(n)})$ is greater than the $(1-\alpha)$ -th quantile of $\chi_{(r-1)(s-1)}^{2}$ .
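The decision rule can be sketched for the simplest case of a $2\times 2$ table, where the $\chi^{2}$ law has one degree of freedom and its tail probability has the closed form $\mathbb{P}(\chi_{1}^{2}>x)=\mathrm{erfc}(\sqrt{x/2})$ . The function names and the count tables are illustrative assumptions; for general $r$ and $s$ one would need a $\chi^{2}$ quantile routine with $(r-1)(s-1)$ degrees of freedom from a statistics library instead.

```python
import math

def mi_from_counts(counts):
    """Plug-in mutual information (nats) from an r x s table of counts."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    return sum(
        (c / n) * math.log(c * n / (row_tot[i] * col_tot[j]))
        for i, row in enumerate(counts)
        for j, c in enumerate(row)
        if c > 0
    )

def independence_test_2x2(counts, level=0.05):
    """Test H0 (independence) for a 2 x 2 table using 2 n I_hat ~ chi2 with 1 d.f."""
    n = sum(sum(row) for row in counts)
    stat = 2.0 * n * mi_from_counts(counts)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value, p_value < level

# Hypothetical tables with the same proportions but different sample sizes.
stat, p_value, reject = independence_test_2x2([[144, 36], [16, 9]])
stat_big, p_big, reject_big = independence_test_2x2([[1440, 360], [160, 90]])
print(stat, p_value, reject)
print(stat_big, p_big, reject_big)
```

The two tables have identical cell proportions, so the estimated mutual information is the same, but the statistic grows linearly with $n$ : the weak dependence is not detected with 205 observations and is clearly detected with 2050.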

6. Simulation

In this section, we start by providing a numerical example to illustrate asymptotic behavior of the different joint entropy measures defined before.

For simplicity, consider two discrete random variables $X$ and $Y$ , each having two outcomes, $x_{1},x_{2}$ and $y_{1},y_{2}$ respectively, and such that

[TABLE]

So the associated random variable $Z$ , defined by (2.2) and (2.3), is a discrete random variable whose probability distribution is a discrete Zipf distribution $Z_{\beta,m}$ with parameters $\beta=2$ and $m=4$ . Its p.m.f. is defined by

[TABLE]

where $\sum_{j=1}^{m}j^{-\beta}$ is the generalized harmonic number.

We have

[TABLE]

$Y$ is more uncertain than $X$ , and the pair $(X,Y)$ is less uncertain than the discrete uniform distribution with range $[1,4]$ , whose entropy is $\log 4=1.386294$ . The variables $X$ and $Y$ do not seem to have much information in common: only $0.0072269860\ \ \text{nat}$ .

Table 2 defines the probability distribution $\textbf{p}_{Z}$ of $Z$ .
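The entries of Table 2 and the reported amount of shared information can be reproduced exactly. The sketch below builds the Zipf p.m.f. with $\beta=2$ and $m=4$ using exact fractions, reshapes it into the $2\times 2$ joint law with the rule $k=s(i-1)+j$ , and evaluates the entropies in nats; the function names are illustrative assumptions.

```python
import math
from fractions import Fraction

def zipf_pmf(beta, m):
    """Exact p.m.f. of a Zipf law on {1..m}: p_k proportional to k**(-beta)."""
    weights = [Fraction(1, k ** beta) for k in range(1, m + 1)]
    total = sum(weights)                     # generalized harmonic number
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(float(q) * math.log(float(q)) for q in probs if q > 0)

p_z = zipf_pmf(beta=2, m=4)                  # law of Z in the example
r, s = 2, 2
p_x = [sum(p_z[s * i + j] for j in range(s)) for i in range(r)]
p_y = [sum(p_z[s * i + j] for i in range(r)) for j in range(s)]
h_joint, h_x, h_y = entropy(p_z), entropy(p_x), entropy(p_y)
mi = h_x + h_y - h_joint
print(p_z, h_x, h_y, h_joint, mi)
```

The resulting fractions are 144/205, 36/205, 16/205, and 9/205, the entropy of $Y$ exceeds that of $X$ , and the mutual information agrees with the value quoted above.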

In our applications, we simulated i.i.d. samples of size $n$ ( $n=100,200,\cdots,30000$ ) according to $\textbf{p}_{Z}$ and computed the joint entropy estimates.

Figure 2 concerns the JSE estimate, Figure 3 concerns the JRE and JTE estimates (both of order $\alpha=2$ ), and Figure 4 concerns the MI estimate, all for the pair $(X,Y)$ .

In each of these figures, the left panels plot the proposed entropy estimator, built from sample sizes $n=100,200,\cdots,30000$ , together with the true entropy of the pair $(X,Y)$ (represented by the horizontal black line). We observe that, as the sample size $n$ increases, the value of the proposed estimator approaches the true value, in line with the almost sure convergence established above.

The middle panels show the histogram of the sample, where the red line represents the density of the normal distribution with the same mean and standard deviation as the sample.

The right panels show the Q-Q plot of the sample, which displays the observed values against normally distributed data (represented by the red line). The points fall approximately along a straight line, which is consistent with an underlying normal distribution.
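A simulation of this kind can be sketched with the standard library alone. The code below draws i.i.d. samples from the law of Table 2, tracks the error of the plug-in JSE estimate for a few sample sizes, and then collects normalized errors over repeated samples to summarize their spread. The seed, the sample sizes, and the number of repetitions are arbitrary choices made for this sketch and are not the settings used for the figures.

```python
import math
import random
from collections import Counter

P_Z = [144 / 205, 36 / 205, 16 / 205, 9 / 205]    # law of Z from Table 2
TRUE_JSE = -sum(q * math.log(q) for q in P_Z)

def jse_estimate(n, rng):
    """Plug-in joint Shannon entropy (nats) from n i.i.d. draws of Z."""
    counts = Counter(rng.choices(range(4), weights=P_Z, k=n))
    return -sum((c / n) * math.log(c / n) for c in counts.values())

rng = random.Random(2020)                          # arbitrary seed, for reproducibility
errors = {n: abs(jse_estimate(n, rng) - TRUE_JSE) for n in (100, 1_000, 30_000)}

# Normalized errors sqrt(n) * (H_hat - H) over repeated samples.
n, reps = 2_000, 300
z = [math.sqrt(n) * (jse_estimate(n, rng) - TRUE_JSE) for _ in range(reps)]
mean_z = sum(z) / reps
sd_z = math.sqrt(sum((v - mean_z) ** 2 for v in z) / (reps - 1))
print(errors, mean_z, sd_z)
```

The list `z` is what one would feed to a histogram or a Q-Q plot; its sample mean should be close to zero and its standard deviation roughly stable across sample sizes if the normal approximation is adequate.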

7. Conclusion

In this paper, we presented a new method for estimating the joint p.m.f. of a pair of discrete random variables. We adopted the plug-in method to construct estimates of the joint Shannon, Rényi, and Tsallis entropies, and of the mutual information of an ordered pair of random variables. We established almost sure rates of convergence and asymptotic normality of these estimators.

Bibliography

1. Carter, T. (March 2014). An introduction to information theory and entropy. Santa Fe.
2. Rényi, A. (1960). On measures of information and entropy. Proc. 4th Berkeley Symposium on Mathematics, Statistics and Probability, pp. 547-561.
3. Kraskov, A., Stögbauer, H., Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E, 69:066138.
4. Philippatos, G.C., Wilson, C.J. (1972). Entropy, market risk, and the selection of efficient portfolios. Appl. Econ., 4, pp. 209–220.
5. Moon, K.R., Sricharan, K., Hero, A.O. (2017). Ensemble estimation of mutual information. IEEE International Symposium on Information Theory (ISIT), eds Durisi G, Studer C (IEEE, Aachen, Germany), pp. 3030–3034.
6. Timme, N.M., Lapish, C. (2018). A Tutorial for Information Theory in Neuroscience. eNeuro, 5(3).
7. Krishnaswamy, Matthew H Spitzer, Michael Mingueneau, Sean C Bendall, Oren Litvin, Erica Stone, Dana Pe'er, and Garry P Nolan (2014). Conditional density-based analysis of T cell signaling in single-cell data. Science, 346(6213):1250689.
8. Liu, H., Wasserman, L., Lafferty, J.D. (2012). Exponential concentration for mutual information estimation with application to forests. Advances in Neural Information Processing Systems, pp. 2537-2545.
