Shannon Shakes Hands with Chernoff: Big Data Viewpoint On Channel Information Measures

Shanyun Liu; Rui She; Jiaxun Lu; Pingyi Fan

arXiv:1701.03237·cs.IT·January 13, 2017


TL;DR

This paper reexamines Shannon, Renyi, and Chernoff information measures from a big data perspective using the ACE algorithm, revealing their similarities and proposing a conjecture about channel information.

Contribution

It introduces a big data viewpoint to compare Shannon, Renyi, and Chernoff measures and proposes a conjecture on channel information depending solely on channel parameters.

Findings

1. Shannon and Chernoff mutual information decompositions are nearly identical.
2. Shannon and Chernoff measures effectively represent the same information.
3. A conjecture that channel information is determined solely by channel parameters.

Abstract

Shannon entropy is the most crucial foundation of Information Theory and has proven effective in many fields such as communications. Rényi entropy and Chernoff information are two other popular measures of information with wide applications. The mutual information is an effective measure of channel information because it reflects the relation between output variables and input variables. In this paper, we reexamine these channel information measures from a big data viewpoint by means of the ACE algorithm. The simulation results show that the decompositions of Shannon and Chernoff mutual information with respect to channel parameters are almost the same. In this sense, Shannon shakes hands with Chernoff since they are different measures of the same information quantity. We also propose a conjecture that there is a nature of channel information which is only decided by…

Equations

$$I_{S}(X;Y)=D(p(x,y)||p(x)p(y))=\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \tag{1}$$

$$D(p||q)=\sum_{x\in\mathcal{X}}p(x)\log\frac{p(x)}{q(x)} \tag{2}$$

$$I_{C}(P_{1},P_{2})\triangleq-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}P_{1}^{\alpha}(x)P_{2}^{1-\alpha}(x)\right) \tag{3}$$

$$I_{C}(X;Y)\triangleq-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}\sum_{y}p^{\alpha}(x,y)\left(p(x)p(y)\right)^{1-\alpha}\right) \tag{4}$$

$$\begin{aligned}I_{S}(X;Y)&=H(Y)-H(Y|X)\\&=\sum_{k=1}^{2}\sum_{j=1}^{2}p(x_{k},y_{j})\log\frac{p(x_{k},y_{j})}{p(x_{k})p(y_{j})}\\&=(1-\varepsilon)\log\frac{1-\varepsilon}{\lambda(1-\varepsilon)+(1-\lambda)\varepsilon}+\varepsilon\log\frac{\varepsilon}{\lambda\varepsilon+(1-\lambda)(1-\varepsilon)}\end{aligned} \tag{5}$$

$$\begin{aligned}I_{C}(X;Y)&=-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}P_{1}^{\alpha}(x)P_{2}^{1-\alpha}(x)\right)\\&=-\min_{0\leq\alpha\leq 1}\log\Big((1-\varepsilon)^{\alpha}\left(\lambda(1-\varepsilon)+(1-\lambda)\varepsilon\right)^{1-\alpha}+\varepsilon^{\alpha}\left(\lambda\varepsilon+(1-\lambda)(1-\varepsilon)\right)^{1-\alpha}\Big)\end{aligned} \tag{6}$$

$$I_{S}(X;Y)=D(p(x,y)||p(x)p(y))=\sum_{k=1}^{M}\sum_{j=1}^{M}p(x_{k},y_{j})\log\frac{p(x_{k},y_{j})}{p(x_{k})p(y_{j})} \tag{7}$$

$$I_{C}(X;Y)=-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}\sum_{y}p^{\alpha}(x,y)\left(p(x)p(y)\right)^{1-\alpha}\right) \tag{8}$$

$$e^{2}(\theta,\phi_{1},\dots,\phi_{p})=\frac{E\left\{\left[\theta(Y)-\sum_{i=1}^{p}\phi_{i}(X_{i})\right]^{2}\right\}}{E\left[\theta^{2}(Y)\right]} \tag{9}$$

Updates used in Alg.1:

$$\phi_{k,1}(X_{k})=E\left[\theta(Y)-\sum_{i\neq k}\phi_{i}(X_{i})\,\Big|\,X_{k}\right];$$

$$\theta_{1}(Y)=\frac{E\left[\sum_{i=1}^{p}\phi_{i}(X_{i})\,\big|\,Y\right]}{\left\|E\left[\sum_{i=1}^{p}\phi_{i}(X_{i})\,\big|\,Y\right]\right\|};$$

$$Y=f(X_{1},X_{2},...,X_{p}). \tag{10}$$

$$\theta(Y)=\sum_{i=1}^{p}\phi_{i}(X_{i})+\delta \tag{11}$$

$$Y=f(\textbf{X})\overset{(a)}{=}\theta^{-1}\left(\sum_{i=1}^{p}\phi_{i}(X_{i})\right)=\psi\left(\varphi(\textbf{X})\right) \tag{12}$$

$$\varPsi=\varphi(\textbf{X}) \tag{13}$$

$$\mathrm{I}=\psi(\varPsi) \tag{14}$$


Taxonomy

Topics: Computability, Logic, AI Algorithms · Neural Networks and Applications · Evolutionary Algorithms and Applications

Full text

Shannon Shakes Hands with Chernoff: Big Data Viewpoint On Channel Information Measures

Shanyun Liu, Rui She, Jiaxun Lu, Pingyi Fan

Tsinghua National Laboratory for Information Science and Technology (TNList),

Department of Electronic Engineering, Tsinghua University, Beijing, P.R. China

E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract

Shannon entropy is the most crucial foundation of Information Theory and has proven effective in many fields such as communications. Rényi entropy and Chernoff information are two other popular measures of information with wide applications. The mutual information is an effective measure of channel information because it reflects the relation between output variables and input variables. In this paper, we reexamine these channel information measures from a big data viewpoint by means of the ACE algorithm. The simulation results show that the decompositions of Shannon and Chernoff mutual information with respect to channel parameters are almost the same. In this sense, Shannon shakes hands with Chernoff since they are different measures of the same information quantity. We also propose a conjecture that there is a nature of channel information which is decided only by the channel parameters.

Index Terms:

Shannon Information; Chernoff Information; Rényi Divergence; ACE; Big Data

I Introduction

Shannon entropy is the most crucial foundation of Information Theory. In traditional Information Theory and its applications, relative entropy (or Kullback-Leibler distance) is the basic information divergence. Shannon Information Theory has been useful in developing communications, data science and other information-related subjects [1]. Owing to its success, a large body of literature attempts to advance these concepts. Unfortunately, almost none of these extensions has been widely adopted except the Rényi entropy [2][3]. The Rényi entropy has a wide range of applications. For example, according to [4], it is useful in traditional signal processing. The Rényi entropy has also been used in independent component analysis (ICA), because information measures based on it provide distance measures among a cluster of probability densities [5]. Moreover, some studies suggest that it can help with blind source separation (BSS) [6][7]. The Chernoff information was developed to measure information based on Rényi entropy [3]. In fact, the Chernoff information is the error exponent obtained when minimizing the overall probability of error, so it is very useful in hypothesis testing [8].

Over the last 68 years of Information Theory development, a large number of concepts have been proposed to describe channel information. Among them, the most popular is channel capacity, which is defined as the maximum of the mutual information. The mutual information is a measure of the amount of information: it is the reduction in the uncertainty of one random variable when the other is given [8]. Compared with channel capacity, the mutual information is easier to use and extend, without losing the ability to measure the channel “information”. Therefore, so that the analysis can be extended to non-Shannon entropies, the mutual information between channel input and output is adopted in this paper.

Unlike the traditional research methods of Information Theory, we revisit this classical problem from a big data viewpoint. The era of big data is coming: large amounts of data can be obtained easily, and data processing is becoming more and more important. We rely on two rules about data. First, the data does not lie. Second, in the era of big data we only need to guarantee that our conclusion is correct with very large probability rather than with probability one [9]. In this paper, we choose the alternating conditional expectation (ACE) algorithm as the tool for processing data.

The ACE algorithm was proposed by Breiman and Friedman (1985) [10] for estimating the transformations of a dependent variable and a set of independent variables in multiple regression that produce the maximal correlation among these variables [11]. Because it enables the separation of multiple variables, it can help us analyze multivariate functions. Its effectiveness and correctness were established in [10]. With the ACE algorithm, it is easy to separate the effects of different channel parameters and thereby find the natural connection between channel information and the channel parameters. Furthermore, one can also investigate the relation between Shannon and Chernoff.

The main contributions of this paper can be summarized as follows. We decompose the Shannon and Chernoff channel mutual information w.r.t. the channel parameters using the ACE algorithm. The simulation results show that Shannon shakes hands with Chernoff from a big data viewpoint. Based on these results, we put forward the conjecture that there is a nature of channel information, and that Shannon mutual information, Chernoff mutual information and other information measures are all different measures of the same information quantity. This conclusion can help us construct new information measures or judge whether a new channel information measure is reasonable.

I-A Introduction of mutual Information

I-A1 Shannon mutual Information

The mutual information is a measure of the amount of information that one random variable contains about another variable. In fact, it describes the reduction in the uncertainty of one random variable if another one is known [8]. According to [8], the Shannon mutual information is defined as:

$$I_{S}(X;Y)=D(p(x,y)||p(x)p(y))=\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \tag{1}$$

where $X$ and $Y$ are two random variables with marginal probability mass functions $p(x)$ and $p(y)$ and joint probability mass function $p(x,y)$. Note that $I_{S}(X;Y)$ is the relative entropy (or Kullback-Leibler distance) between the joint distribution $p(x,y)$ and the product distribution $p(x)p(y)$. The relative entropy is given by [8]:

$$D(p||q)=\sum_{x\in\mathcal{X}}p(x)\log\frac{p(x)}{q(x)} \tag{2}$$
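The definition of Shannon mutual information as a relative entropy can be sketched in a few lines of Python. This is only an illustration: the function names, the list-of-rows layout for the joint pmf, and the use of natural logarithms (nats) are assumptions, since the paper does not fix a logarithm base.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats; p and q are pmfs as equal-length lists."""
    # Terms with p(x) = 0 contribute nothing (the usual 0 log 0 = 0 convention).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def shannon_mutual_information(joint):
    """I_S(X;Y) = D(p(x,y) || p(x)p(y)); joint[k][j] holds p(x_k, y_j)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in py]  # same row-major order
    return kl_divergence(flat_joint, flat_product)

independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y independent
noiseless = [[0.5, 0.0], [0.0, 0.5]]        # Y is an exact copy of X
mi_independent = shannon_mutual_information(independent)
mi_noiseless = shannon_mutual_information(noiseless)
print(mi_independent, mi_noiseless)
```

For the independent pair the joint equals the product of the marginals, so the result is 0; for the noiseless pair it is $\log 2\approx 0.693$ nats.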

I-A2 Chernoff mutual Information

The Chernoff information arises from the classical hypothesis testing problem. It is the resulting error exponent when minimizing the overall probability of error [8]. As defined in [8], it is given by

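$$I_{C}(P_{1},P_{2})\triangleq-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}P_{1}^{\alpha}(x)P_{2}^{1-\alpha}(x)\right) \tag{3}$$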

where $\alpha$ is a real parameter with $0\leq\alpha\leq 1$.

Using the Chernoff information to describe the relationship between two variables, a new mutual information is defined as

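$$I_{C}(X;Y)\triangleq-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}\sum_{y}p^{\alpha}(x,y)\left(p(x)p(y)\right)^{1-\alpha}\right) \tag{4}$$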

In this paper, we call it the Chernoff mutual information, by analogy with the Shannon mutual information.
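The minimization over $\alpha$ in the Chernoff information has to be done numerically in general. A minimal sketch follows, assuming natural logarithms and a golden-section search, which is valid here because $\log\sum_{x}P_{1}^{\alpha}(x)P_{2}^{1-\alpha}(x)$ is convex in $\alpha$. The function names and the choice of search method are illustrative and are not taken from the paper.

```python
import math

def chernoff_information(p1, p2, tol=1e-10):
    """I_C(P1, P2) = -min over 0 <= a <= 1 of log(sum_x P1(x)^a P2(x)^(1-a)), in nats."""
    def log_sum(a):
        # Restrict to the common support so that 0**0 never enters the sum.
        return math.log(sum(u ** a * v ** (1.0 - a)
                            for u, v in zip(p1, p2) if u > 0.0 and v > 0.0))
    lo, hi = 0.0, 1.0
    g = (math.sqrt(5.0) - 1.0) / 2.0  # golden-ratio step
    while hi - lo > tol:
        a1 = hi - g * (hi - lo)
        a2 = lo + g * (hi - lo)
        if log_sum(a1) < log_sum(a2):
            hi = a2   # the minimum lies in [lo, a2]
        else:
            lo = a1   # the minimum lies in [a1, hi]
    return -log_sum(0.5 * (lo + hi))

def chernoff_mutual_information(joint):
    """Chernoff information between p(x,y) and p(x)p(y); joint[k][j] = p(x_k, y_j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in py]
    return chernoff_information(flat_joint, flat_product)

symmetric = chernoff_information([0.9, 0.1], [0.1, 0.9])
independent = chernoff_mutual_information([[0.25, 0.25], [0.25, 0.25]])
print(symmetric, independent)
```

For the symmetric pair $(0.9,0.1)$ and $(0.1,0.9)$ the minimum sits at $\alpha=0.5$ by symmetry, so the first value should be close to $-\log(2\sqrt{0.09})=-\log 0.6$; for independent variables the joint equals the product and the result is 0.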

In fact, both the Shannon mutual information and the Chernoff mutual information depict the inner relationship between two random variables. The difference is that they adopt different measures of the distance between two distributions: the relative entropy is adopted in the Shannon mutual information, and the Rényi divergence is used in the Chernoff mutual information [3].

I-B Outline of the Paper

The rest of this paper is organized as follows. Section II briefly introduces several special channel models, namely the binary symmetric channel (BSC) and the multiple symmetric channel (MSC), and also gives a brief introduction to the ACE algorithm. In Section III, some simulated examples are given. Next, Section IV analyzes the simulated examples. Finally, we present the conclusion in Section V.

II Channel Information Measures and ACE Algorithm Description

The mutual information between the input and output of a channel can be used to describe the transmission capability of a channel. For example, the maximum mutual information is defined as the channel capacity for a discrete memoryless channel. Hence the mutual information serves as a measure of channel information.

II-A Binary Symmetric Channel

Consider the BSC, which is shown in Fig.1. It is a binary channel in which the probabilities of the input symbols are $(\lambda,1-\lambda)$ and the transmission error probability is $\varepsilon$. It is argued in [8] that, although it is the simplest model, it reflects common characteristics of general channels with errors.

The Shannon mutual information is given by:

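$$\begin{aligned}I_{S}(X;Y)&=H(Y)-H(Y|X)\\&=\sum_{k=1}^{2}\sum_{j=1}^{2}p(x_{k},y_{j})\log\frac{p(x_{k},y_{j})}{p(x_{k})p(y_{j})}\\&=(1-\varepsilon)\log\frac{1-\varepsilon}{\lambda(1-\varepsilon)+(1-\lambda)\varepsilon}+\varepsilon\log\frac{\varepsilon}{\lambda\varepsilon+(1-\lambda)(1-\varepsilon)}.\end{aligned} \tag{5}$$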

It can also be written as $I_{S}(X;Y)=D(p(x,y)||p(x)p(y))$. The Chernoff mutual information is given by:

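$$\begin{aligned}I_{C}(X;Y)&=-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}P_{1}^{\alpha}(x)P_{2}^{1-\alpha}(x)\right)\\&=-\min_{0\leq\alpha\leq 1}\log\Big((1-\varepsilon)^{\alpha}\left(\lambda(1-\varepsilon)+(1-\lambda)\varepsilon\right)^{1-\alpha}+\varepsilon^{\alpha}\left(\lambda\varepsilon+(1-\lambda)(1-\varepsilon)\right)^{1-\alpha}\Big)\end{aligned} \tag{6}$$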

Unfortunately, there is no closed-form solution for the minimization in Eq.(6). As a result, the Chernoff mutual information is arduous to analyze.

At first sight, the Shannon mutual information appears completely different from the Chernoff mutual information.
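Although Eq.(6) has no closed form, both BSC expressions are easy to evaluate numerically. The sketch below evaluates the closed-form expressions of Eq.(5) and Eq.(6) exactly as printed, for $0<\lambda\leq 1$ and $0<\varepsilon<1$. The function names, the natural logarithm, and the uniform grid over $\alpha$ are assumptions made for illustration.

```python
import math

def bsc_shannon(lam, eps):
    """Closed-form expression of Eq.(5) as printed, in nats."""
    q0 = lam * (1 - eps) + (1 - lam) * eps        # first output probability
    q1 = lam * eps + (1 - lam) * (1 - eps)        # second output probability
    return (1 - eps) * math.log((1 - eps) / q0) + eps * math.log(eps / q1)

def bsc_chernoff(lam, eps, grid=2001):
    """Expression of Eq.(6) as printed, minimized over a uniform grid of alpha values."""
    q0 = lam * (1 - eps) + (1 - lam) * eps
    q1 = lam * eps + (1 - lam) * (1 - eps)
    def log_sum(a):
        return math.log((1 - eps) ** a * q0 ** (1 - a) + eps ** a * q1 ** (1 - a))
    return -min(log_sum(i / (grid - 1)) for i in range(grid))

s = bsc_shannon(0.3, 0.1)
c = bsc_chernoff(0.3, 0.1)
print(s, c)
```

A grid is used here only for transparency; any one-dimensional convex minimizer over $[0,1]$ would serve equally well.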

II-B Multiple Symmetric Channel

The MSC is shown in Fig.2. The number of input symbols is $M$. The probabilities of the input symbols are $(\lambda_{1},\lambda_{2},...,\lambda_{M})$ with $\lambda_{1}+\lambda_{2}+...+\lambda_{M}=1$ ($\lambda_{i}>0$). The transmission error probability of each one is $\frac{\varepsilon}{M-1}$. The BSC is the special case of the MSC with $M=2$.

Similarly, the Shannon mutual information is given by:

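$$I_{S}(X;Y)=D(p(x,y)||p(x)p(y))=\sum_{k=1}^{M}\sum_{j=1}^{M}p(x_{k},y_{j})\log\frac{p(x_{k},y_{j})}{p(x_{k})p(y_{j})} \tag{7}$$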

The Chernoff mutual information is given by:

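$$I_{C}(X;Y)=-\min_{0\leq\alpha\leq 1}\log\left(\sum_{x}\sum_{y}p^{\alpha}(x,y)\left(p(x)p(y)\right)^{1-\alpha}\right) \tag{8}$$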
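Both MSC measures can be evaluated from the joint pmf. The sketch below assumes the usual reading of the MSC, which cannot be checked against Fig.2 here: each symbol is received correctly with probability $1-\varepsilon$ and is mapped to each of the other $M-1$ symbols with probability $\frac{\varepsilon}{M-1}$. Function names, natural logarithms and the grid over $\alpha$ are likewise illustrative choices.

```python
import math

def msc_joint(lams, eps):
    """Joint pmf p(x_k, y_j) of an M-ary symmetric channel (M >= 2), rows indexed by input."""
    m = len(lams)
    return [[lam * ((1 - eps) if j == k else eps / (m - 1)) for j in range(m)]
            for k, lam in enumerate(lams)]

def joint_and_product(joint):
    """Flatten p(x,y) and p(x)p(y) in the same row-major order."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return ([v for row in joint for v in row],
            [a * b for a in px for b in py])

def msc_shannon(lams, eps):
    """Shannon mutual information D(p(x,y) || p(x)p(y)), in nats."""
    p, q = joint_and_product(msc_joint(lams, eps))
    return sum(u * math.log(u / v) for u, v in zip(p, q) if u > 0.0)

def msc_chernoff(lams, eps, grid=1001):
    """Chernoff mutual information, minimized over a uniform grid of alpha values."""
    p, q = joint_and_product(msc_joint(lams, eps))
    def log_sum(a):
        return math.log(sum(u ** a * v ** (1 - a)
                            for u, v in zip(p, q) if u > 0.0 and v > 0.0))
    return -min(log_sum(i / (grid - 1)) for i in range(grid))

uniform3 = [1 / 3, 1 / 3, 1 / 3]
useless = (msc_shannon(uniform3, 2 / 3), msc_chernoff(uniform3, 2 / 3))
noiseless = (msc_shannon(uniform3, 0.0), msc_chernoff(uniform3, 0.0))
print(useless, noiseless)
```

With $M=3$ and $\varepsilon=2/3$ every output is equally likely regardless of the input, so both measures vanish; with $\varepsilon=0$ and uniform inputs both equal $\log 3$.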

II-C ACE Algorithm Description

Much research in regression analysis has examined the optimal transformation between one or more predictors and the response. Unlike traditional multiple regression algorithms, which require a priori knowledge of the functional forms, the ACE algorithm of Breiman and Friedman (1985) [10] requires no such knowledge and yields non-parametric transformations. That is to say, it is a fully automated algorithm for estimating the optimal transformations between predictors and response. Furthermore, it can also be used to estimate the maximal correlation among random variables. For implementation details of the ACE algorithm, see [10][11].

Assume the random variables $X_{1},X_{2},\cdots,X_{p}$ are predictors and $Y$ is the response. Supposing $\phi_{1}(X_{1}),\phi_{2}(X_{2}),\cdots,\phi_{p}(X_{p}),\theta(Y)$ are arbitrary zero-mean functions of the corresponding variables, the residual error is

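$$e^{2}(\theta,\phi_{1},\dots,\phi_{p})=\frac{E\left\{\left[\theta(Y)-\sum_{i=1}^{p}\phi_{i}(X_{i})\right]^{2}\right\}}{E\left[\theta^{2}(Y)\right]} \tag{9}$$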

The algorithm is summarized in Alg.1 as in [10].
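To make the alternating structure concrete, here is a minimal sketch of the ACE idea in plain Python. It is not the implementation used in the paper or in [10]: the conditional expectations are approximated by averages inside equal-count bins (a crude stand-in for the smoothers of [10]), a fixed number of sweeps replaces the stopping rule based on $e^{2}$, and the toy response $y=\exp(x_{1}+x_{2})$ is a hypothetical example chosen because it becomes additive after a log transform.

```python
import math
import random

def center(v):
    """Subtract the sample mean so the estimated function has zero mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def unit_norm(v):
    """Scale to unit sample second moment, i.e. E[theta^2] = 1."""
    s = math.sqrt(sum(a * a for a in v) / len(v))
    return [a / s for a in v]

def cond_mean(target, given, bins):
    """Crude E[target | given]: average target inside equal-count bins of given."""
    n = len(given)
    order = sorted(range(n), key=lambda i: given[i])
    size = max(1, n // bins)
    out = [0.0] * n
    for start in range(0, n, size):
        idx = order[start:start + size]
        m = sum(target[i] for i in idx) / len(idx)
        for i in idx:
            out[i] = m
    return out

def ace(xs, y, iters=20, bins=20):
    """xs: list of p predictor columns, y: response column. Returns (theta, phis)."""
    n, p = len(y), len(xs)
    theta = unit_norm(center(y))
    phis = [[0.0] * n for _ in range(p)]
    for _ in range(iters):
        for k in range(p):
            # phi_k <- E[theta(Y) - sum_{i != k} phi_i(X_i) | X_k]
            resid = [theta[j] - sum(phis[i][j] for i in range(p) if i != k)
                     for j in range(n)]
            phis[k] = center(cond_mean(resid, xs[k], bins))
        # theta <- E[sum_i phi_i(X_i) | Y], rescaled to unit norm
        total = [sum(phi[j] for phi in phis) for j in range(n)]
        theta = unit_norm(center(cond_mean(total, y, bins)))
    return theta, phis

def correlation(a, b):
    a, b = center(a), center(b)
    return sum(u * v for u, v in zip(a, b)) / math.sqrt(
        sum(u * u for u in a) * sum(v * v for v in b))

rng = random.Random(0)
x1 = [rng.uniform(-1, 1) for _ in range(2000)]
x2 = [rng.uniform(-1, 1) for _ in range(2000)]
y = [math.exp(a + b) for a, b in zip(x1, x2)]   # hypothetical toy response
theta, phis = ace([x1, x2], y)
fit = [u + v for u, v in zip(*phis)]
corr = correlation(theta, fit)
print(round(corr, 3))
```

The printed value is the sample correlation between $\theta(Y)$ and $\sum_{i}\phi_{i}(X_{i})$, the same quantity used later to judge the quality of the decomposition.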

In this paper, we use the ACE algorithm to help us separate multivariate functions. Let $X_{1},X_{2},\cdots,X_{p}$ be the channel parameters and $Y$ be the measured value of channel information. The functional relation between them is:

$$Y=f(X_{1},X_{2},...,X_{p}). \tag{10}$$

It is assumed that $X_{1},X_{2},...,X_{p}$ are known and independent. Moreover, the functional relation $f$ is also known, since it is defined by us to describe the channel information. Therefore, $Y$ is easy to compute, and together they form a data set $\{Y,X_{1},\cdots,X_{p}\}$. This data set meets the preconditions of the ACE algorithm, so the ACE algorithm can be used to analyze it. As a result, one obtains

$$\theta(Y)=\sum_{i=1}^{p}\phi_{i}(X_{i})+\delta \tag{11}$$

where $\delta$ is the residual error. In this case, it is easy to find the separate influence of each corresponding channel parameter.

III Simulated Example

In this section, various numerical simulation results are presented to analyze the two channel information measures. We conduct Monte Carlo simulations to compare the Shannon and Chernoff mutual information in different channels using the ACE algorithm. The simulation procedure is as follows.

III-A Binary Symmetric Channel

We generate $20000$ observations from Eq.(5) and Eq.(6), where $\lambda$ is the probability of the input symbol and $\varepsilon$ is the transmission error. $\lambda$ and $\varepsilon$ are drawn independently from a uniform distribution $U(0,1)$. In this case, the channel information is a function of two input parameters.
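The data-generation step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the names `bsc_pair` and `simulate_bsc`, the natural logarithm, the grid over $\alpha$, the random seed, and the small `margin` that keeps the draws strictly inside $(0,1)$ are all hypothetical choices, and the example uses 2000 rows instead of the $20000$ used in the paper to keep the run short.

```python
import math
import random

def bsc_pair(lam, eps, grid=201):
    """Return (I_S, I_C) from the expressions of Eq.(5) and Eq.(6) as printed, in nats."""
    q0 = lam * (1 - eps) + (1 - lam) * eps
    q1 = lam * eps + (1 - lam) * (1 - eps)
    i_s = (1 - eps) * math.log((1 - eps) / q0) + eps * math.log(eps / q1)
    i_c = -min(math.log((1 - eps) ** a * q0 ** (1 - a) + eps ** a * q1 ** (1 - a))
               for a in (i / (grid - 1) for i in range(grid)))
    return i_s, i_c

def simulate_bsc(n, seed=0, margin=1e-9):
    """Draw (lambda, epsilon) uniformly and attach both measures to each draw."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        lam = rng.uniform(margin, 1 - margin)   # avoid log(0) at the endpoints
        eps = rng.uniform(margin, 1 - margin)
        rows.append((lam, eps) + bsc_pair(lam, eps))
    return rows

rows = simulate_bsc(2000)
print(rows[0])
```

Each row is a tuple $(\lambda,\varepsilon,I_{S},I_{C})$; the columns $(\lambda,\varepsilon)$ play the role of the predictors and either $I_{S}$ or $I_{C}$ plays the role of the response $Y$ when the data set is passed to an ACE routine.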

The ACE algorithm is applied to this simulated data set and the results are shown in Fig.3. The correlation between $\theta(y)$ and $\phi_{1}(\lambda)+\phi_{2}(\varepsilon)$ is extremely close to $1$. Furthermore, the error of the ACE decomposition $\delta$ is invariably close to zero. Both observations show that the ACE decomposition results are excellent.

After the ACE decomposition, the curves of $\phi_{1}(\lambda)$ and $\phi_{2}(\varepsilon)$ for the Shannon mutual information almost coincide with those for the Chernoff mutual information. Within the permitted error, the two sets of curves can be regarded as identical. $\phi_{1}(\lambda)$ is a monotonically decreasing function: the greater $\lambda$, the faster $\phi_{1}(\lambda)$ declines. Furthermore, $\phi_{2}(\varepsilon)$ is symmetric about the line $\varepsilon=0.5$ because the channel is symmetric in $\varepsilon$.

The results for the Shannon mutual information are smaller than those for the Chernoff mutual information, but the overall trend is identical. $\theta(y)$ increases rapidly when $y$ is close to zero, and the slope of its curve decreases as $y$ increases.

III-B Multiple Symmetric Channel

We generate $60000$ observations from Eq.(7) and Eq.(8), where $\lambda_{1},\lambda_{2},\lambda_{3}$ are the probabilities of the input symbols and $\varepsilon$ is the transmission error. They are generated randomly and independently, and are bounded in $(0,1)$. In this case, the channel information is a function of four input parameters.

Fig.4 shows characteristics similar to those in Fig.3. The value of the ACE decomposition error $\delta$ is still very small, and the correlation between $\theta(y)$ and $\phi_{1}(\lambda_{1})+\phi_{2}(\lambda_{2})+\phi_{3}(\lambda_{3})+\phi_{4}(\varepsilon)$ is close to $1$. These two points again verify the validity of the ACE algorithm.

The curves of $\theta(y)$, $\phi_{1}(\lambda_{1})$, $\phi_{2}(\lambda_{2})$, $\phi_{3}(\lambda_{3})$ and $\phi_{4}(\varepsilon)$ are clearly almost the same for the two channel information measures. It is worth noting that the minimum of $\phi_{4}(\varepsilon)$ occurs at $\varepsilon\approx 0.75$. The value of $\phi_{4}(\varepsilon)$ is large when $\varepsilon$ is very close to $0$ or $1$. Moreover, $\phi_{1}(\lambda_{1})$, $\phi_{2}(\lambda_{2})$ and $\phi_{3}(\lambda_{3})$ are very similar. Their curves are flat and close to zero when the independent variable (the probability of the input symbol) is less than $0.7$, and decrease rapidly as the independent variable approaches $1$. This is reasonable because these input variables are equivalent for the channel.

As illustrated in Fig.4, the result for the Shannon mutual information is smaller than that for the Chernoff mutual information, but the overall trend is identical. $\theta(y)$ again increases rapidly when $y$ is close to zero, and the slope of its curve decreases as $y$ increases.

IV Discussion and Physical Explanation

IV-A Shannon Shakes Hands with Chernoff

On closer examination, more interesting conclusions emerge. In this section, we analyze the ACE simulation results from the viewpoint of big data. Several principles lie behind this approach. First of all, the data does not lie, so the ACE results provide true and useful information. Furthermore, we always turn to numerical analysis when theoretical analysis is difficult. The complicated expressions for the Shannon and Chernoff mutual information are hard to analyze intuitively, especially when they are multivariate, so it is quite appropriate to analyze them numerically with the ACE algorithm. In fact, the probably approximately correct (PAC) model is sufficient most of the time [9]. In this model, one only needs to construct algorithms that are guaranteed to be correct with high probability (not necessarily one). This model is the foundation of the Support Vector Machine (SVM) [9].

The ACE result is

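$$Y=f(\textbf{X})\overset{(a)}{=}\theta^{-1}\left(\sum_{i=1}^{p}\phi_{i}(X_{i})\right)=\psi\left(\varphi(\textbf{X})\right) \tag{12}$$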

where $Y$ is the channel information and $\textbf{X}$ is the set of all the input channel parameters. The equality ($a$) holds because the residual error $\delta$ can be ignored. From the figures above, $\phi_{i}(X_{i})$ is almost the same for both the Shannon and the Chernoff mutual information. For these two measures, the only difference is $\theta(\cdot)$: the function $\psi(\cdot)$ is different while $\varphi\left(\textbf{X}\right)$ is the same. From Fig.3 and Fig.4, it is also noted that $\theta(\cdot)$ is a monotonic function. As mentioned above, the curves of $\theta(\cdot)$ have the same trend, and the values of $\theta(Y)$ for the Shannon mutual information are invariably smaller than those for the Chernoff mutual information. Therefore, Shannon's $\theta^{-1}(\cdot)$ is bigger than Chernoff's. Indeed, the observations agree well with this corollary, which explains the phenomena in Fig.5.

In this sense, Shannon shakes hands with Chernoff. Over the last decades, Shannon information and Chernoff information have always been considered two distinct kinds of channel information measures, related only in that the Kullback-Leibler divergence is the special case of the Rényi divergence when $\alpha=1$. The channel information measures deduced from them look very different even for a channel as simple as the BSC. However, with the aid of the ACE algorithm, we find the inner relation between them: they are different metrics of the same physical quantity. For a channel at a given time, if all of its parameters are given, then its entire state and all its properties are determined. Thus the information about the same channel is always consistent. For example, the kilometer and the centimeter are both measures of length, but they are used in different situations for convenience. Generally speaking, we do not use the centimeter to measure the distance between two cities; in contrast, we would use it to measure the length of a steel rule. Their true values are identical when measuring the same thing, even though their numerical values differ. Similarly, the Shannon and Chernoff information are two metrics of the same quantity of information.

IV-B The Nature of Channel Information

The ability of a channel to convey information is decided by its input parameters. That is to say, the parameters represent the channel. As a result, we hold the opinion that these channel parameters decide the nature of the channel. Note that $\varphi\left(\textbf{X}\right)$ in Eq.(12) is the same whether we adopt the Shannon or the Chernoff channel information measure. In this paper, we take $\varphi\left(\textbf{X}\right)$ as the nature of channel information and we get

$$\varPsi=\varphi(\textbf{X}) \tag{13}$$

where $\varPsi$ is the nature of channel information. It is decided only by the channel parameters and contains the core of the channel information. Any other physical quantity describing channel information is just a function of $\varPsi$, namely

$$\mathrm{I}=\psi(\varPsi) \tag{14}$$

where $\rm{I}$ is a measurement of channel information. As far as the Shannon and Chernoff mutual information are concerned, their only difference is the function $\psi(\cdot)$. Each can be seen as a compound function: the inner function is $\varPsi$, a function of the channel parameters, and the outer function is a function of $\varPsi$. For different channel information measures, the inner function is always the same but the outer function differs.

We conjecture that if a novel information measure is put forward to measure channel information, the ACE algorithm will, to a very large extent, decompose it into a function $\psi(\cdot)$ of the same $\varPsi$. This rule can be used to judge the rationality of a new information measure. The conjecture also provides a way to propose new information measures, since one can be constructed by designing the function $\psi(\cdot)$. In fact, even if this conjecture does not hold with probability one, a new measure that conforms to the rule is correct with large probability, and its relation to previous information measures, such as the Shannon and Chernoff information, is easy to find.
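One consequence of this compound-function view can be written out explicitly. The following is a sketch that assumes the residual $\delta$ is negligible and that the outer function for the Shannon measure is strictly monotonic and hence invertible, as the monotonic $\theta(\cdot)$ curves suggest; the subscripts $S$ and $C$ are notation introduced here for the two measures.

$$I_{S}=\psi_{S}(\varPsi),\qquad I_{C}=\psi_{C}(\varPsi)\quad\Longrightarrow\quad\varPsi=\psi_{S}^{-1}(I_{S})\quad\Longrightarrow\quad I_{C}=\psi_{C}\left(\psi_{S}^{-1}(I_{S})\right)$$

Under these assumptions, each measure is a function of the other alone, with no further dependence on the individual channel parameters, which is the sense in which the two are different scales for the same quantity.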

V Conclusion

In this paper, different channel information measures in several channels are revisited from a Big Data viewpoint using the ACE algorithm. The decomposition results w.r.t. the independent variables are the same, with only a slight difference in the function of the channel information values. That is to say, Shannon shakes hands with Chernoff, in that they are just two metrics of the same quantity of channel information. For every channel, there is a nature of channel information, which is decided only by the channel parameters no matter what information measure is adopted. The Big Data viewpoint, embodied here by the ACE algorithm, provides a new way to reexamine Information Theory and reveals that Shannon and Chernoff shake hands with each other, which has been ignored for decades. Furthermore, this method can help us construct new information measures while keeping the nature of the channel information unchanged. We also put forward a criterion for judging whether a new channel information measure is appropriate.

Acknowledgement

This work was supported by the China Major State Basic Research Development Program (973 Program) No. 2012CB316100(2) and the National Natural Science Foundation of China (NSFC) No. 61171064.

Bibliography


[1] S. Verdú, “Fifty years of Shannon theory,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2057–2078, 1998.
[2] A. Rényi, “On measures of entropy and information,” in Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1961, pp. 547–561.
[3] T. van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
[4] J. C. Principe, D. Xu, and J. Fisher, “Information theoretic learning,” Unsupervised Adaptive Filtering, vol. 1, pp. 265–319, 2000.
[5] Y. Bao and H. Krim, “Renyi entropy based divergence measures for ICA,” in 2003 IEEE Workshop on Statistical Signal Processing. IEEE, 2003, pp. 565–568.
[6] K. E. Hild, D. Erdogmus, J. Príncipe et al., “Blind source separation using Renyi’s mutual information,” IEEE Signal Processing Letters, vol. 8, no. 6, pp. 174–176, 2001.
[7] K. E. Hild, D. Erdogmus, and J. C. Principe, “An analysis of entropy estimators for blind source separation,” Signal Processing, vol. 86, no. 1, pp. 182–194, 2006.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. New Jersey, USA: Wiley, 2006.