$H(X)$ vs. $H(f(X))$
Ferdinando Cicalese, Luisa Gargano, Ugo Vaccaro

Abstract
It is well known that the entropy $H(X)$ of a finite random variable $X$ is always greater than or equal to the entropy $H(f(X))$ of a function $f$ of $X$, with equality if and only if $f$ is one-to-one. In this paper, we give tight bounds on $H(f(X))$ when the function $f$ is not one-to-one, and we illustrate a few scenarios where this matters. As an intermediate step towards our main result, we prove a lower bound on the entropy of a probability distribution, when only a bound on the ratio between the maximum and the minimum probability is known. Our lower bound improves previous results in the literature, and it could find applications outside the present scenario.
Ferdinando Cicalese
Università di Verona, Verona, Italy
Email: [email protected]
Luisa Gargano
Università di Salerno, Salerno, Italy
Email: [email protected]
Ugo Vaccaro
Università di Salerno, Salerno, Italy
Email: [email protected]
I The Problem
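Let $\mathcal{X}=\{x_1,\ldots,x_n\}$ be a finite alphabet, and $X$ be any random variable (r.v.) taking values in $\mathcal{X}$ according to the probability distribution $\mathbf{p}=(p_1,\ldots,p_n)$, that is, such that $P\{X=x_i\}=p_i$, for $i=1,\ldots,n$. A well-known and widely used inequality (see [5], Exercise 2.4) states that
$$H(X)\ge H(f(X)), \tag{1}$$
where $f$ is any function defined on $\mathcal{X}$, and $H(\cdot)$ denotes the Shannon entropy. Moreover, equality holds in (1) if and only if $f$ is one-to-one. The main purpose of this paper is to sharpen inequality (1) by deriving tight bounds on $H(f(X))$ when $f$ is not one-to-one. More precisely, given the r.v. $X$, an integer $2\le m<n$, a set $\mathcal{Y}_m=\{y_1,\ldots,y_m\}$, and the family of surjective functions $\mathcal{F}_m=\{f \mid f:\mathcal{X}\to\mathcal{Y}_m,\ |f(\mathcal{X})|=m\}$, we want to compute the values
$$\max_{f\in\mathcal{F}_m}H(f(X))\quad\text{and}\quad\min_{f\in\mathcal{F}_m}H(f(X)). \tag{2}$$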
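For very small alphabets, the maximum and minimum of $H(f(X))$ over surjective functions can be found by exhaustive search, which gives a concrete feel for the problem. The following Python sketch is an illustration only: the helper names (`entropy`, `image_distribution`, `extreme_entropies`) and the example distribution are assumptions introduced here, not part of the paper, and the enumeration is exponential in the alphabet size.

```python
from itertools import product
from math import log2


def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in dist if x > 0)


def image_distribution(p, labels, m):
    """Distribution of f(X) when f sends symbol i to the value labels[i]."""
    z = [0.0] * m
    for prob, label in zip(p, labels):
        z[label] += prob
    return z


def extreme_entropies(p, m):
    """Brute-force max and min of H(f(X)) over all surjections onto m values."""
    values = [
        entropy(image_distribution(p, labels, m))
        for labels in product(range(m), repeat=len(p))
        if len(set(labels)) == m  # keep surjective maps only
    ]
    return max(values), min(values)


# Hypothetical example distribution (n = 4) mapped onto m = 3 values.
p = [0.4, 0.3, 0.2, 0.1]
h_max, h_min = extreme_entropies(p, 3)
print(f"H(X) = {entropy(p):.4f}, max = {h_max:.4f}, min = {h_min:.4f}")
```

In this example the maximum is reached by merging the two smallest probabilities and the minimum by merging the two largest; both values stay below $H(X)$, as the inequality $H(X)\ge H(f(X))$ requires.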
II The Results
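For any probability distribution $\mathbf{p}=(p_1,\ldots,p_n)$, with $p_1\ge p_2\ge\cdots\ge p_n\ge 0$, and integer $2\le m<n$, let us define the probability distributions $R_m(\mathbf{p})=(r_1,\ldots,r_m)$ as follows: if $p_1<1/m$ we set $R_m(\mathbf{p})=(1/m,\ldots,1/m)$, whereas if $p_1\ge 1/m$ we set $R_m(\mathbf{p})=(r_1,\ldots,r_m)$, where
$$r_i=\begin{cases}p_i & \text{for } i=1,\ldots,i^*,\\[2pt] \dfrac{1}{m-i^*}\displaystyle\sum_{j=i^*+1}^{n}p_j & \text{for } i=i^*+1,\ldots,m,\end{cases} \tag{3}$$
and $i^*$ is the maximum index $i$ such that $p_i\ge\frac{1}{m-i}\sum_{j=i+1}^{n}p_j$. A somewhat similar operator was introduced in [9].
Additionally, we define the probability distributions $Q_m(\mathbf{p})=(q_1,\ldots,q_m)$ in the following way:
$$q_i=\begin{cases}\displaystyle\sum_{k=1}^{n-m+1}p_k & \text{for } i=1,\\[2pt] p_{n-m+i} & \text{for } i=2,\ldots,m.\end{cases} \tag{4}$$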
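The following theorem provides the results sought in (2).
Theorem 1.
For any r.v. $X$ taking values in the alphabet $\mathcal{X}=\{x_1,\ldots,x_n\}$ according to the probability distribution $\mathbf{p}=(p_1,\ldots,p_n)$, and for any $2\le m<n$, it holds that
$$H(R_m(\mathbf{p}))-\alpha\le\max_{f\in\mathcal{F}_m}H(f(X))\le H(R_m(\mathbf{p})), \tag{5}$$
where $\alpha=1-(1+\ln(\ln 2))/\ln 2<0.08608$, and
$$\min_{f\in\mathcal{F}_m}H(f(X))=H(Q_m(\mathbf{p})). \tag{6}$$
Therefore, the function $f\in\mathcal{F}_m$ for which $H(f(X))$ is minimum maps all the elements $x_1,\ldots,x_{n-m+1}$ to a single element, and it is one-to-one on the remaining elements $x_{n-m+2},\ldots,x_n$.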
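As a rough illustration of the two operators, the sketch below computes them for a small example. It assumes the following reading of the definitions: $R_m$ keeps the $i^*$ largest probabilities and spreads the remaining mass evenly over the other $m-i^*$ components (or returns the uniform distribution when $p_1<1/m$), while $Q_m$ merges the $n-m+1$ largest probabilities and keeps the rest. The function names `r_operator` and `q_operator` are hypothetical, and exact rational arithmetic is used only to avoid floating-point ties in the comparison that defines $i^*$.

```python
from fractions import Fraction
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a sequence of probabilities."""
    return -sum(float(x) * log2(float(x)) for x in dist if x > 0)


def r_operator(p, m):
    """Assumed form of R_m(p); p must be sorted in non-increasing order."""
    if p[0] < Fraction(1, m):
        return [Fraction(1, m)] * m
    # i_star: largest i (1-based) with p_i >= (p_{i+1} + ... + p_n) / (m - i)
    i_star = max(i for i in range(1, m) if p[i - 1] >= sum(p[i:]) / (m - i))
    tail = sum(p[i_star:]) / (m - i_star)
    return list(p[:i_star]) + [tail] * (m - i_star)


def q_operator(p, m):
    """Assumed form of Q_m(p): merge the n - m + 1 largest probabilities."""
    n = len(p)
    return [sum(p[: n - m + 1])] + list(p[n - m + 1:])


# Hypothetical example with n = 4 and m = 3.
p = [Fraction(5, 10), Fraction(2, 10), Fraction(2, 10), Fraction(1, 10)]
r = r_operator(p, 3)
q = q_operator(p, 3)
print("R_3(p) =", r, " H =", round(entropy(r), 4))
print("Q_3(p) =", q, " H =", round(entropy(q), 4))
```

For this example the sketch gives $R_3(\mathbf{p})=(1/2,1/4,1/4)$ and $Q_3(\mathbf{p})=(7/10,1/5,1/10)$, so the entropy of any three-valued function of $X$ lies between $H(Q_3(\mathbf{p}))$ and $H(R_3(\mathbf{p}))=1.5$ bits.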
Before proving Theorem 1 and discussing its consequences, we note that there are compelling reasons why we are unable to determine the exact value of the maximum in (5) and, consequently, the form of the function $f\in\mathcal{F}_m$ that attains the bound. Indeed, computing the value $\max_{f\in\mathcal{F}_m}H(f(X))$ is an NP-hard problem. It is easy to understand the difficulty of the problem already in the simple case $m=2$. To that purpose, consider any function $f\in\mathcal{F}_2$, that is $f:\mathcal{X}\to\mathcal{Y}_2=\{y_1,y_2\}$, and let $X$ be any r.v. taking values in $\mathcal{X}$ according to the probability distribution $\mathbf{p}=(p_1,\ldots,p_n)$. Let $\sigma_1=\sum_{x_i:f(x_i)=y_1}p_i$ and $\sigma_2=\sum_{x_i:f(x_i)=y_2}p_i$. Then, $H(f(X))=-\sigma_1\log\sigma_1-\sigma_2\log\sigma_2$, and it is maximal in correspondence of a function $f$ that makes the sums $\sigma_1$ and $\sigma_2$ as equal as possible. This is equivalent to the well-known NP-hard problem Partition on the instance $\{p_1,\ldots,p_n\}$ (see [7]). (Footnote 2: In the full version of the paper we will show that the problem of computing the value $\max_{f\in\mathcal{F}_m}H(f(X))$ is strongly NP-hard.) Since the function $f\in\mathcal{F}_m$ for which $H(f(X))\ge H(R_m(\mathbf{p}))-\alpha$, that is, the lower bound in (5), can be efficiently constructed, we have also the following important consequence of Theorem 1.
Corollary 1.
There is a polynomial time algorithm to approximate the NP-hard problem of computing the value
$$\max_{f\in\mathcal{F}_m}H(f(X))$$
with an additive approximation factor of $\alpha=1-(1+\ln(\ln 2))/\ln 2<0.08608$.
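The link between the two-valued case and Partition, discussed before Corollary 1, can be made concrete with a small exhaustive search. The sketch below is only an illustration under stated assumptions: the function names and the example distribution are hypothetical, and the search visits every subset, so it is exponential in $n$ (consistent with the hardness remark, not a way around it).

```python
from itertools import combinations
from math import log2


def binary_entropy(s):
    """Entropy in bits of the two-point distribution (s, 1 - s)."""
    if s <= 0.0 or s >= 1.0:
        return 0.0
    return -s * log2(s) - (1.0 - s) * log2(1.0 - s)


def subset_masses(p):
    """Total mass of every nonempty proper subset of symbols sent to y_1."""
    indices = range(len(p))
    return [
        sum(p[i] for i in subset)
        for size in range(1, len(p))
        for subset in combinations(indices, size)
    ]


# Hypothetical example distribution.
p = [0.35, 0.3, 0.2, 0.15]
masses = subset_masses(p)

# Maximizing H(f(X)) and balancing the two sums select the same split.
best_by_entropy = max(masses, key=binary_entropy)
best_by_balance = min(masses, key=lambda s: abs(s - 0.5))
print(best_by_entropy, best_by_balance, binary_entropy(best_by_entropy))
```

Because the binary entropy is symmetric about $1/2$ and increasing on $[0,1/2]$, ranking splits by entropy is the same as ranking them by how close the two sums are, which is exactly the Partition objective.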
A key tool for the proof of Theorem 1 is the following result, proved in the second part of Section IV.
Theorem 2.
Let $\mathbf{p}=(p_1,\ldots,p_n)$ be a probability distribution such that $p_1\ge p_2\ge\cdots\ge p_n>0$. If $p_1/p_n\le\rho$ then
$$H(\mathbf{p})\ge\log n-\left(\frac{\rho\ln\rho}{\rho-1}-1-\ln\frac{\rho\ln\rho}{\rho-1}\right)\frac{1}{\ln 2}. \tag{7}$$
Theorem 2 improves on several papers (see [17] and references quoted therein) that have studied the problem of estimating $H(\mathbf{p})$ when only a bound on the ratio $p_1/p_n$ is known. (Footnote 3: One can see that our bound (7) is tighter than the one in [17].) We believe the result to be of independent interest. For instance, it can also be used to improve existing bounds on the leaf-entropy of parse trees generated by the Tunstall algorithm.
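A quick numerical sanity check of a bound of this kind is sketched below. It assumes that the gap between $\log n$ and $H(\mathbf{p})$ is at most $(c-1-\ln c)/\ln 2$ with $c=\rho\ln\rho/(\rho-1)$; this closed form, the function names, and the random test distributions are assumptions of the sketch rather than statements taken from the paper. For $\rho=2$ the gap evaluates to roughly $0.0861$ bits.

```python
import random
from math import log, log2


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(x * log2(x) for x in dist if x > 0)


def entropy_gap(rho):
    """Assumed upper bound on log2(n) - H(p) when p_max / p_min <= rho, rho > 1."""
    c = rho * log(rho) / (rho - 1.0)
    return (c - 1.0 - log(c)) / log(2.0)


def random_ratio_bounded(n, rho, rng):
    """Random distribution whose max/min probability ratio is at most rho."""
    weights = [rng.uniform(1.0, rho) for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed so the run is repeatable
rho = 2.0
samples = [random_ratio_bounded(rng.randint(2, 20), rho, rng) for _ in range(500)]

# Smallest observed value of H(p) - (log2(n) - gap); it should not be negative.
worst_slack = min(entropy(d) - (log2(len(d)) - entropy_gap(rho)) for d in samples)
print(f"gap(rho=2) = {entropy_gap(rho):.5f}, worst slack = {worst_slack:.5f}")
```

Random sampling cannot prove the inequality, but a negative slack would immediately flag a mistake in the assumed closed form.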
To prove our results, we use ideas and techniques from Majorization Theory [15], a mathematical framework that has proved to be very useful in Information Theory (e.g., see [2, 3, 9, 10] and references quoted therein).
III Some Applications
Besides its inherent naturalness, the problem of estimating the entropy $H(X)$ vs. $H(f(X))$ has several interesting applications. We highlight some of them here, postponing a more complete discussion to the full version of the paper.
In the area of clustering, one seeks a mapping $f$ (deterministic or stochastic) from some data, generated by a r.v. $X$ taking values in a set $\mathcal{X}$, to “clusters” in $\mathcal{Y}$, where $|\mathcal{Y}|<|\mathcal{X}|$. A widely employed measure to appraise the goodness of a clustering algorithm is the information that the clusters retain about the original data, measured by the mutual information $I(X;f(X))$ (see [6, 11] and references quoted therein). In general, one wants to choose $f$ such that $|f(\mathcal{X})|$ is small but $I(X;f(X))$ is large. The authors of [8] (see also [13]) proved that, given the random variable $X$, among all mappings $f$ that maximize $I(X;f(X))$ (under the constraint that $|f(\mathcal{X})|$ is fixed) there is a maximizing function that is deterministic. Since in the case of deterministic functions it holds that $I(X;f(X))=H(f(X))$, finding the clustering $f$ of $\mathcal{X}$ (into a fixed number of clusters) that maximizes the mutual information $I(X;f(X))$ is equivalent to our problem of finding the function $f$ that attains the upper bound in (2). (Footnote 4: In [13] the authors consider the problem of determining the function $f$ that maximizes $I(X;f(Y))$, where $X$ is the r.v. at the input of a DMC and $Y$ is the corresponding output. Our scenario could be seen as the particular case when the DMC is noiseless. However, the results in [13] do not imply ours since the authors give algorithms only for binary input channels (i.e., $n=2$, which makes the problem completely trivial in our case). Instead, our results are relevant to those of [13]. For instance, we obtain that the general maximization problem considered in [13] is NP-hard, a fact unnoticed in [13].)
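Another scenario where our results directly find applications is the one considered in [18]. There, the author considers the problem of best approximating a probability distribution $\mathbf{p}=(p_1,\ldots,p_n)$ with a shorter one $\mathbf{q}=(q_1,\ldots,q_m)$, $m<n$. The criterion with which one chooses $\mathbf{q}$, given $\mathbf{p}$, is the following. Given $\mathbf{p}$ and $\mathbf{q}$, define the quantity $D(\mathbf{p},\mathbf{q})$ as $D(\mathbf{p},\mathbf{q})=2W(\mathbf{p},\mathbf{q})-H(\mathbf{p})-H(\mathbf{q})$, where $W(\mathbf{p},\mathbf{q})$ is the minimum entropy of a bivariate probability distribution that has $\mathbf{p}$ and $\mathbf{q}$ as marginals. Then, the “best” approximation $\mathbf{q}^*$ of $\mathbf{p}$ is chosen as the probability distribution with $m$ components that minimizes $D(\mathbf{p},\mathbf{q})$ over all $\mathbf{q}$. The author of [18] shows that $\mathbf{q}^*$ can be characterized in the following way. Given $\mathbf{p}=(p_1,\ldots,p_n)$, call $\mathbf{q}=(q_1,\ldots,q_m)$ an aggregation of $\mathbf{p}$ into $m<n$ components if there is a partition of $\{1,\ldots,n\}$ into disjoint sets $I_1,\ldots,I_m$ such that $q_k=\sum_{i\in I_k}p_i$, for $k=1,\ldots,m$. In [18] it is proved that the vector $\mathbf{q}^*$ that best approximates $\mathbf{p}$ (according to $D$) is the aggregation of $\mathbf{p}$ into $m$ components of maximum entropy. Since any aggregation $\mathbf{q}$ of $\mathbf{p}$ can be seen as the distribution of the r.v. $f(X)$, where $f$ is some appropriate function and $X$ is a r.v. distributed according to $\mathbf{p}$ (and, vice versa, any deterministic $f$ gives a r.v. $f(X)$ whose distribution is an aggregation of the distribution of $X$), one gets that the problem of computing the “best” approximation $\mathbf{q}^*$ of $\mathbf{p}$ is NP-hard. The bound (5) allows us to provide an approximation algorithm to construct a probability distribution $\mathbf{q}$ such that $D(\mathbf{p},\mathbf{q})\le D(\mathbf{p},\mathbf{q}^*)+0.08608$, improving on [4], where an approximation algorithm for the same problem with an additive error of $1$ was provided.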
There are other problems that can be cast in our scenario. For instance, Baez et al. [1] give an axiomatic characterization of the Shannon entropy in terms of information loss. Stripping away the Category Theory language of [1], the information loss of a r.v. $X$ amounts to the difference $H(X)-H(f(X))$, where $f$ is any deterministic function. Our Theorem 1 allows one to quantify the extreme values of the information loss of a r.v., when the support of $f(X)$ is known.
There is also a vast literature (see [14], Section 3.3, and references quoted therein) studying the “leakage of a program […] defined as the (Shannon) entropy of the partition ” [14]. One can easily see that their “leakage” is the same as the entropy $H(f(X))$, where $X$ is the r.v. modeling the program input, and $f$ is the function describing the input-output relation of the program. In Section 8 of the same paper the authors study the problem of maximizing or minimizing the leakage, in the case the program is stochastic, using standard techniques based on Lagrange multipliers. They do not consider the (harder) case of deterministic programs (i.e., deterministic $f$'s), and our results are likely to be relevant in that context.
Finally, we remark that our problem can also be seen as a problem of quantizing the alphabet of a discrete source into a smaller one (e.g., [16]), and the goal is to maximize the mutual information between the original source and the quantized one.
IV The Proofs
We first recall the important concept of majorization among probability distributions.
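Definition 1.
[15] Given two probability distributions $\mathbf{a}=(a_1,\ldots,a_n)$ and $\mathbf{b}=(b_1,\ldots,b_n)$ with $a_1\ge\cdots\ge a_n\ge 0$ and $b_1\ge\cdots\ge b_n\ge 0$, we say that $\mathbf{a}$ is majorized by $\mathbf{b}$, and write $\mathbf{a}\preceq\mathbf{b}$, if and only if
$$\sum_{k=1}^{i}a_k\le\sum_{k=1}^{i}b_k\quad\text{for all } i=1,\ldots,n.$$
Without loss of generality we assume that all the probability distributions we deal with have been ordered in non-increasing order. We also use the majorization relationship between vectors of unequal lengths, by properly padding the shorter one with the appropriate number of $0$'s at the end.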
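Consider an arbitrary function $f:\mathcal{X}\to\mathcal{Y}_m$, $f\in\mathcal{F}_m$. Any r.v. $X$ taking values in $\mathcal{X}=\{x_1,\ldots,x_n\}$, according to the probability distribution $\mathbf{p}=(p_1,\ldots,p_n)$, and the function $f$ naturally induce a r.v. $f(X)$, taking values in $\mathcal{Y}_m=\{y_1,\ldots,y_m\}$ according to the probability distribution whose values are given by the expressions
$$P\{f(X)=y_j\}=\sum_{x\in\mathcal{X}:\,f(x)=y_j}P\{X=x\},\quad j=1,\ldots,m.$$
Let $\mathbf{z}=(z_1,\ldots,z_m)$ be the vector containing the values $P\{f(X)=y_1\},\ldots,P\{f(X)=y_m\}$ ordered in non-increasing fashion. For convenience, we state the following self-evident fact about the relationships between $\mathbf{z}$ and $\mathbf{p}$.
Claim 1.
There is a partition of $\{1,\ldots,n\}$ into disjoint sets $I_1,\ldots,I_m$ such that $z_j=\sum_{i\in I_j}p_i$, for $j=1,\ldots,m$.
Therefore, $\mathbf{z}$ is an aggregation of $\mathbf{p}$. Given a r.v. $X$ distributed according to $\mathbf{p}$, and any $f\in\mathcal{F}_m$, by simply applying the definition of majorization one can see that the (ordered) probability distribution of the r.v. $f(X)$ is majorized by $Q_m(\mathbf{p})$, as defined in (4). Therefore, by invoking the Schur concavity of the entropy function $H$ (see [15], p. 101 for the statement, and [10] for an improvement), saying that $H(\mathbf{a})\ge H(\mathbf{b})$ whenever $\mathbf{a}\preceq\mathbf{b}$, we get that $H(f(X))\ge H(Q_m(\mathbf{p}))$. From this, the equality (6) immediately follows.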
We need the following two simple but important results, stated and proved in [4] with a different terminology.
Lemma 1.
[4] For $\mathbf{p}$ and $\mathbf{z}$ as above, it holds that $\mathbf{p}\preceq\mathbf{z}$.
In other words, for any r.v. $X$ and function $f$, the probability distribution of $f(X)$ always majorizes that of $X$.
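The majorization order and the statement of Lemma 1 are easy to check numerically on small examples. The sketch below compares prefix sums of the sorted vectors, padding the shorter one with zeros; the names `is_majorized_by` and `aggregate`, the tolerance, and the example partition are hypothetical choices made for illustration.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(x * log2(x) for x in dist if x > 0)


def is_majorized_by(a, b, tol=1e-12):
    """True when a is majorized by b (prefix sums of sorted a never exceed b's)."""
    size = max(len(a), len(b))
    a_sorted = sorted(a, reverse=True) + [0.0] * (size - len(a))
    b_sorted = sorted(b, reverse=True) + [0.0] * (size - len(b))
    prefix_a = prefix_b = 0.0
    for x, y in zip(a_sorted, b_sorted):
        prefix_a += x
        prefix_b += y
        if prefix_a > prefix_b + tol:
            return False
    return True


def aggregate(p, blocks):
    """Distribution of f(X) when f is constant on each block of indices."""
    return [sum(p[i] for i in block) for block in blocks]


# Hypothetical example: merge the first and last symbols, keep the others.
p = [0.4, 0.3, 0.2, 0.1]
z = aggregate(p, [[0, 3], [1], [2]])

print(is_majorized_by(p, z))      # the aggregation majorizes p
print(entropy(z) <= entropy(p))   # so its entropy cannot be larger
```

The second printed value reflects Schur concavity: moving up in the majorization order can only decrease the entropy.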
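Lemma 2.
[4] For any $\mathbf{p}=(p_1,\ldots,p_n)$, $m<n$, and probability distribution $\mathbf{q}=(q_1,\ldots,q_m)$ such that $\mathbf{p}\preceq\mathbf{q}$, it holds that
$$R_m(\mathbf{p})\preceq\mathbf{q},$$
where $R_m(\mathbf{p})$ is the probability distribution defined in (3).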
From Lemmas 1 and 2, and by applying the Schur concavity of the entropy function $H$, we get the following result.
Corollary 2.
For any r.v. $X$ taking values in $\mathcal{X}$ according to a probability distribution $\mathbf{p}$, and for any $f\in\mathcal{F}_m$, it holds that
$$H(f(X))\le H(R_m(\mathbf{p})).$$
The above corollary implies that
$$\max_{f\in\mathcal{F}_m}H(f(X))\le H(R_m(\mathbf{p})).$$
Therefore, to complete the proof of Theorem 1 we need to show that we can construct a function $f\in\mathcal{F}_m$ such that
$$H(f(X))\ge H(R_m(\mathbf{p}))-\alpha,$$
with $\alpha$ as in Theorem 1, or, equivalently, that we can construct an aggregation of $\mathbf{p}$ into $m$ components, whose entropy is at least $H(R_m(\mathbf{p}))-\alpha$. We prove this fact in the following lemma.
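Lemma 3.
For any $\mathbf{p}=(p_1,\ldots,p_n)$ and $2\le m<n$, we can construct an aggregation $\mathbf{q}=(q_1,\ldots,q_m)$ of $\mathbf{p}$ such that
$$H(\mathbf{q})\ge H(R_m(\mathbf{p}))-\alpha.$$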
Proof:
We will assemble the aggregation $\mathbf{q}$ through the Huffman algorithm. We first make the following observation. For the purposes of this paper, each step of the Huffman algorithm consists in merging the two smallest elements $x$ and $y$ of the current probability distribution, deleting $x$ and $y$ and substituting them with the single element $x+y$, and reordering the new probability distribution from the largest element to the smallest (ties are arbitrarily broken). Immediately after the step in which $x$ and $y$ are merged, each element $z$ in the new and reduced probability distribution that finds itself positioned at the “right” of $x+y$ (if there is such a $z$) has a value that satisfies $x+y\le 2z$ (since, by choice, $x,y\le z$). Let $\mathbf{q}=(q_1,\ldots,q_m)$ be the ordered probability distribution obtained by executing exactly $n-m$ steps of the Huffman algorithm, starting from the distribution $\mathbf{p}$. Denote by $i_q$ the maximum index such that for each $j=1,\ldots,i_q$ the component $q_j$ has not been produced by a merge operation of the Huffman algorithm. In other words, $i_q$ is the maximum index such that for each $j=1,\ldots,i_q$ it holds that $q_j=p_j$. Notice that we allow $i_q$ to be equal to $0$. Therefore $q_{i_q+1}$ has been produced by a merge operation. At the step in which the value $q_{i_q+1}$ was created, it holds that $q_{i_q+1}\le 2z$, for any $z$ at the “right” of $q_{i_q+1}$. At later steps, the inequality still holds, since elements at the right of $q_{i_q+1}$ could have only increased their values.
Let be the sum of the last (smallest) components of . The vector is a probability distribution such that the ratio between its largest and its smallest component is upper bounded by 2. By Theorem 2, with , it follows that
[TABLE]
where . Therefore, we have
[TABLE]
Let and observe that coincides with in the first components, as it does . What we have shown is that
[TABLE]
We now observe that , where is the index that intervenes in the definition of our operator (see (3)). In fact, by the definition of one has , that also implies
[TABLE]
Moreover, since the first components of are the same as in , we also have . This, together with relation (14), implies
[TABLE]
Equation (15) clearly implies since is by definition, the maximum index such that From the just proved inequality , we have also
[TABLE]
Using (13), (16), and the Schur concavity of the entropy function, we get
[TABLE]
thus completing the proof of the Lemma (and of Theorem 1). ∎
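The construction used in the proof of Lemma 3 (repeatedly merging the two smallest probabilities until $m$ components remain) can be sketched in a few lines. The code below is an illustration, not the authors' implementation: the function names and the example distribution are hypothetical, the constant is taken as $\alpha=1-(1+\ln(\ln 2))/\ln 2$, and the comparison is made against a brute-force optimum, which is feasible only for tiny alphabets.

```python
import heapq
from itertools import product
from math import log, log2

ALPHA = 1 - (1 + log(log(2))) / log(2)  # about 0.0861 bits (assumed constant)


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(x * log2(x) for x in dist if x > 0)


def huffman_aggregation(p, m):
    """Merge the two smallest masses until only m components remain."""
    heap = list(p)
    heapq.heapify(heap)
    while len(heap) > m:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        heapq.heappush(heap, x + y)
    return sorted(heap, reverse=True)


def best_aggregation_entropy(p, m):
    """Brute-force maximum entropy over all aggregations into m components."""
    best = 0.0
    for labels in product(range(m), repeat=len(p)):
        if len(set(labels)) == m:  # every component must be used
            z = [0.0] * m
            for prob, label in zip(p, labels):
                z[label] += prob
            best = max(best, entropy(z))
    return best


# Hypothetical example with n = 6 symbols aggregated into m = 3 components.
p = [0.35, 0.25, 0.15, 0.1, 0.1, 0.05]
q = huffman_aggregation(p, 3)
h_q = entropy(q)
h_best = best_aggregation_entropy(p, 3)
print(q, round(h_q, 4), round(h_best, 4), round(h_best - h_q, 4))
```

Since the optimum is itself at most $H(R_m(\mathbf{p}))$, the guarantee of Lemma 3 implies that the printed difference should never exceed $\alpha$; in this example it is well below that value.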
We now prove Theorem 2. Again, we use tools from majorization theory. Consider an arbitrary probability distribution with and . Let us define the probability distribution
[TABLE]
where . It is easy to verify that .
Lemma 4.
Let with be any probability distribution with . The probability distribution satisfies
Proof:
For any , it holds that
[TABLE]
Consider now some and assume by contradiction that . It follows that . As a consequence we get the contradiction . ∎
Lemma 4 and the Schur concavity of the entropy imply that . We can therefore prove Theorem 2 by showing the appropriate upper bound on .
Lemma 5.
It holds that
[TABLE]
Proof:
Consider the class of probability distributions of the form
[TABLE]
having the first components equal to and the last equal to , for suitable , and such that
[TABLE]
Clearly, for and one has , and we can prove the lemma by upper bounding the maximum (over all and ) of . Let
[TABLE]
From (18), for any value of , one has that
[TABLE]
Set . We have
[TABLE]
Since for any , the function is -convex in this interval, and it is upper bounded by the maximum between the two extrema values and . Therefore, we can upper bound by the maximum value among
[TABLE]
for . We now interpret as a continuous variable, and we differentiate with respect to . We get
[TABLE]
that is positive if and only if Therefore, the desired upper bound on can be obtained by computing the value of , where and . The value of turns out to be equal to
[TABLE]
∎
We conclude the paper by showing how Theorems 1 and 2 allow us to design an approximation algorithm for the second problem mentioned in Section III, that is, the problem of constructing a probability distribution $\mathbf{q}=(q_1,\ldots,q_m)$ such that $D(\mathbf{p},\mathbf{q})\le D(\mathbf{p},\mathbf{q}^*)+0.08608$. Our algorithm improves on the result presented in [4], where an approximation algorithm for the same problem with an additive error of $1$ was provided.
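Let $\mathbf{q}=(q_1,\ldots,q_m)$ be the probability distribution constructed in Lemma 3 and let us recall that the first $i_q$ components of $\mathbf{q}$ coincide with the first $i_q$ components of $\mathbf{p}$. In addition, for each $j=i_q+1,\ldots,m$ there is a set $I_j\subseteq\{i_q+1,\ldots,n\}$ such that $q_j=\sum_{i\in I_j}p_i$ and the $I_j$'s form a partition of $\{i_q+1,\ldots,n\}$ (i.e., $\mathbf{q}$ is an aggregation of $\mathbf{p}$ into $m$ components).
We now build a bivariate probability distribution $\mathbf{M}=[m_{ji}]$, having $\mathbf{p}$ and $\mathbf{q}$ as marginals, as follows:
- in the first $i_q$ rows and columns, the matrix $\mathbf{M}$ has non-zero components only on the diagonal, namely $m_{jj}=p_j=q_j$ and $m_{ji}=0$ for any $i,j\le i_q$ such that $i\ne j$;
- for each row $j=i_q+1,\ldots,m$ the only non-zero elements are the ones in the columns corresponding to elements of $I_j$, and precisely, for each $i\in I_j$ we set $m_{ji}=p_i$.
It is not hard to see that $\mathbf{M}$ has $\mathbf{p}$ and $\mathbf{q}$ as marginals. Moreover we have that $H(\mathbf{M})=H(\mathbf{p})$, since by construction the only non-zero components of $\mathbf{M}$ coincide with the set of components of $\mathbf{p}$. Let $\mathcal{C}(\mathbf{p},\mathbf{q})$ be the set of all bivariate probability distributions having $\mathbf{p}$ and $\mathbf{q}$ as marginals. Recall that $\alpha=1-(1+\ln(\ln 2))/\ln 2<0.08608$. We have that
$$\begin{aligned}
D(\mathbf{p},\mathbf{q}) &= 2\min_{\mathbf{N}\in\mathcal{C}(\mathbf{p},\mathbf{q})}H(\mathbf{N})-H(\mathbf{p})-H(\mathbf{q}) && \text{(19)}\\
&\le 2H(\mathbf{M})-H(\mathbf{p})-H(\mathbf{q}) && \text{(20)}\\
&= H(\mathbf{p})-H(\mathbf{q}) && \text{(21)}\\
&\le H(\mathbf{p})-H(R_m(\mathbf{p}))+\alpha && \text{(22)}\\
&\le H(\mathbf{p})-H(\mathbf{q}^*)+\alpha && \text{(23)}\\
&\le D(\mathbf{p},\mathbf{q}^*)+\alpha,
\end{aligned}$$
where (19) is the definition of $D(\mathbf{p},\mathbf{q})$; (20) follows from (19) since $\mathbf{M}\in\mathcal{C}(\mathbf{p},\mathbf{q})$; (21) follows from (20) because $H(\mathbf{M})=H(\mathbf{p})$; (22) follows from Lemma 3; (23) follows from (22), the known fact that $\mathbf{q}^*$ is an aggregation of $\mathbf{p}$ (see [18]) and Lemmas 1 and 2. Finally, the general inequality $H(\mathbf{p})-H(\mathbf{q}^*)\le D(\mathbf{p},\mathbf{q}^*)$ is formula (48) in [12].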
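The coupling used in this argument can be written down directly from a partition. The sketch below builds a joint matrix whose rows are indexed by the components of the aggregation and whose columns are indexed by the original symbols, then checks the two marginals and the joint entropy. It assumes the distance is $D(\mathbf{p},\mathbf{q})=2W(\mathbf{p},\mathbf{q})-H(\mathbf{p})-H(\mathbf{q})$ with $W$ the minimum entropy of a coupling; the function name `coupling_from_partition` and the example are hypothetical.

```python
from math import log2


def entropy(values):
    """Shannon entropy in bits of any iterable of probabilities."""
    return -sum(v * log2(v) for v in values if v > 0)


def coupling_from_partition(p, blocks):
    """Joint matrix: row j carries p[i] in column i for every i in blocks[j]."""
    matrix = [[0.0] * len(p) for _ in blocks]
    for j, block in enumerate(blocks):
        for i in block:
            matrix[j][i] = p[i]
    return matrix


# Hypothetical example: the last two symbols are merged into one component.
p = [0.4, 0.3, 0.2, 0.1]
blocks = [[0], [1], [2, 3]]
q = [sum(p[i] for i in block) for block in blocks]

M = coupling_from_partition(p, blocks)
row_sums = [sum(row) for row in M]            # should reproduce q
col_sums = [sum(col) for col in zip(*M)]      # should reproduce p
joint_entropy = entropy(v for row in M for v in row)

# Upper bound on the assumed distance D(p, q) obtained from this coupling.
distance_bound = 2 * joint_entropy - entropy(p) - entropy(q)
print(row_sums, col_sums, round(joint_entropy, 4), round(distance_bound, 4))
```

Because every column has a single non-zero entry, the joint entropy equals $H(\mathbf{p})$, and the bound on the distance reduces to $H(\mathbf{p})-H(\mathbf{q})$.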
References
[1] J.C. Baez, T. Fritz and T. Leinster, “A Characterization of entropy in terms of information loss”, Entropy, vol. 17, 772–789, 2015.
[2] F. Cicalese and U. Vaccaro, “Supermodularity and subadditivity properties of the entropy on the majorization lattice”, IEEE Transactions on Information Theory, vol. 48, 933–938, 2002.
[3] F. Cicalese and U. Vaccaro, “Bounding the average length of optimal source codes via majorization theory”, IEEE Transactions on Information Theory, vol. 50, 633–637, 2004.
[4] F. Cicalese, L. Gargano, and U. Vaccaro, “Approximating probability distributions with short vectors, via information theoretic distance measures”, in: Proceedings of ISIT 2016, pp. 1138–1142, 2016.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, 2nd edition (2006).
[6] L. Faivishevsky and J. Goldberger, “Nonparametric information theoretic clustering algorithm”, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 351–358, 2010.
[7] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman (1979).
[8] B.C. Geiger and R.A. Amjad, “Hard Clusters Maximize Mutual Information”, arXiv:1608.04872 [cs.IT].
