$H(X)$ vs. $H(f(X))$

Ferdinando Cicalese; Luisa Gargano; Ugo Vaccaro

arXiv:1704.07059·cs.IT·April 25, 2017


TL;DR

This paper derives tight bounds on the entropy of a function of a random variable when the function is not one-to-one, improving existing bounds and exploring scenarios where this is relevant.

Contribution

It provides new tight bounds on $H(f(X))$ for non-injective functions and introduces an improved lower bound on distribution entropy based on probability ratio constraints.

Findings

01. Tight bounds on $H(f(X))$ for non-one-to-one functions.

02. An improved lower bound on distribution entropy based on max-min probability ratio.

03. Illustrations of scenarios where entropy bounds are significant.

Abstract

It is well known that the entropy $H(X)$ of a finite random variable is always greater than or equal to the entropy $H(f(X))$ of a function $f$ of $X$, with equality if and only if $f$ is one-to-one. In this paper, we give tight bounds on $H(f(X))$ when the function $f$ is not one-to-one, and we illustrate a few scenarios where this matters. As an intermediate step towards our main result, we prove a lower bound on the entropy of a probability distribution, when only a bound on the ratio between the maximum and the minimum probability is known. Our lower bound improves previous results in the literature, and it could find applications outside the present scenario.

Equations

$$H(f(X))\leq H(X), \tag{1}$$

$$\max_{f\in{\cal F}_{m}}H(f(X))\quad\mbox{and}\quad\min_{f\in{\cal F}_{m}}H(f(X)). \tag{2}$$

$$r_{i}=\begin{cases}p_{i} & \mbox{for } i=1,\ldots,i^{*},\\ \left(\sum_{j=i^{*}+1}^{n}p_{j}\right)/(m-i^{*}) & \mbox{for } i=i^{*}+1,\ldots,m,\end{cases} \tag{3}$$

$$q_{i}=\begin{cases}\sum_{k=1}^{n-m+1}p_{k}, & \mbox{for } i=1,\\ p_{n-m+i}, & \mbox{for } i=2,\ldots,m.\end{cases} \tag{4}$$

$$\max_{f\in{\cal F}_{m}}H(f(X))\in\left[H(R_{m}({\bf p}))-\alpha,\;H(R_{m}({\bf p}))\right], \tag{5}$$

$$\min_{f\in{\cal F}_{m}}H(f(X))=H(Q_{m}({\bf p})). \tag{6}$$

$$\max_{f\in{\cal F}_{m}}H(f(X)),$$

$$H({\bf p})\geq\log n-\left(\frac{\rho\ln\rho}{\rho-1}-1-\ln\frac{\rho\ln\rho}{\rho-1}\right)\frac{1}{\ln 2}. \tag{7}$$

$$\sum_{k=1}^{i}a_{k}\leq\sum_{k=1}^{i}b_{k},\quad\mbox{for all } i=1,\ldots,n.$$

$$\forall y_{j}\in{\cal Y}\qquad P\{f(X)=y_{j}\}=\sum_{x\in{\cal X}:f(x)=y_{j}}P\{X=x\}.$$

$$R_{m}({\bf p})\preceq{\bf a},$$

$$H(f(X))\leq H(R_{m}({\bf p})).$$

$$\max_{f\in{\cal F}_{m}}H(f(X))\leq H(R_{m}({\bf p})).$$

$$H(f(X))\geq H(R_{m}({\bf p}))-\left(1-\frac{1+\ln(\ln 2)}{\ln 2}\right),$$

$$H({\bf q})\geq H(R_{m}({\bf p}))-\left(1-\frac{1+\ln(\ln 2)}{\ln 2}\right).$$

$$H({\bf q}^{\prime})\geq\log(m-i_{q})-\alpha,$$

$$H({\bf q})\geq H({\bf q}^{*})-\alpha. \tag{13}$$

$$\frac{\sum_{j=i_{q}+1}^{m}q_{j}}{m-i_{q}}\leq q_{i_{q}+1}\leq q_{i_{q}}=p_{i_{q}}. \tag{14}$$

$$\frac{\sum_{j=i_{q}+1}^{n}p_{j}}{m-i_{q}}\leq p_{i_{q}}. \tag{15}$$

$${\bf q}^{*}\preceq R_{m}({\bf p}). \tag{16}$$

$$H({\bf q})\geq H({\bf q}^{*})-\alpha\geq H(R_{m}({\bf p}))-\alpha,$$

$${\bf z}_{\rho}({\bf p})=(z_{1},\ldots,z_{n})=(\underbrace{\rho p_{n},\ldots,\rho p_{n}}_{i\ \mbox{times}},\;1-(n+i(\rho-1)-1)p_{n},\;\underbrace{p_{n},\ldots,p_{n}}_{n-i-1\ \mbox{times}}),$$

$$p_{1}+\ldots+p_{j}\leq jp_{1}\leq j(\rho p_{n})=z_{1}+\ldots+z_{j}.$$

$$\log n-H({\bf z}_{\rho}({\bf p}))\leq\left(\frac{\rho\ln\rho}{\rho-1}-1-\ln\frac{\rho\ln\rho}{\rho-1}\right)\frac{1}{\ln 2}.$$

$${\bf z}_{\rho}(x,i)=(\rho x,\ldots,\rho x,\;1-(n+i(\rho-1)-1)x,\;x,\ldots,x),$$

$$1-(n+i(\rho-1)-1)x\in[x,\rho x). \tag{18}$$

$$f(x,i)=\log n-H({\bf z}_{\rho}(x,i))=\log n+i\rho x\log(\rho x)+\bigl(1-(n+i(\rho-1)-1)x\bigr)\log\bigl(1-(n+i(\rho-1)-1)x\bigr)+(n-i-1)x\log x.$$

$$x\in\left(\frac{1}{n+(i+1)(\rho-1)},\frac{1}{n+i(\rho-1)}\right]$$


Taxonomy

Topics: Wireless Communication Security Techniques · Statistical Mechanics and Entropy · Risk and Portfolio Optimization

Full text


Ferdinando Cicalese

Università di Verona, Verona, Italy

Email: [email protected]

Luisa Gargano

Università di Salerno, Salerno, Italy

Email: [email protected]

Ugo Vaccaro

Università di Salerno, Salerno, Italy

Email: [email protected]


I The Problem

Let ${\cal X}=\{x_{1},\ldots,x_{n}\}$ be a finite alphabet, and $X$ be any random variable (r.v.) taking values in ${\cal X}$ according to the probability distribution ${\bf p}=(p_{1},p_{2},\ldots,p_{n})$ , that is, such that $P\{X=x_{i}\}=p_{i}$ , for $i=1,2,\ldots,n$ . A well-known and widely used inequality (see [5], Exercise 2.4) states that

$$H(f(X))\leq H(X), \tag{1}$$

where $f:{\cal X}\to{\cal Y}$ is any function defined on ${\cal X}$ , and $H(\cdot)$ denotes the Shannon entropy. Moreover, equality holds in (1) if and only if $f$ is one-to-one. The main purpose of this paper is to sharpen inequality (1) by deriving tight bounds on $H(f(X))$ when $f$ is not one-to-one. More precisely, given the r.v. $X$ , an integer $2\leq m<n$ , a set ${\cal Y}_{m}=\{y_{1},\ldots,y_{m}\}$ , and the family of surjective functions ${\cal F}_{m}=\{f|\;f:{\cal X}\to{\cal Y}_{m},\ |f({\cal X})|=m\}$ , we want to compute the values

$$\max_{f\in{\cal F}_{m}}H(f(X))\quad\mbox{and}\quad\min_{f\in{\cal F}_{m}}H(f(X)). \tag{2}$$

II The Results

For any probability distribution ${\bf p}=(p_{1},p_{2},\ldots,p_{n})$ , with $p_{1}\geq p_{2}\geq\ldots\geq p_{n}\geq 0$ , and integer $2\leq m<n$ , let us define the probability distribution $R_{m}({\bf p})=(r_{1},\ldots,r_{m})$ as follows: if $p_{1}<1/m$ we set $R_{m}({\bf p})=(1/m,\ldots,1/m)$ , whereas if $p_{1}\geq 1/m$ we set $R_{m}({\bf p})=(r_{1},\ldots,r_{m})$ , where

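$$r_{i}=\begin{cases}p_{i} & \mbox{for } i=1,\ldots,i^{*},\\ \left(\sum_{j=i^{*}+1}^{n}p_{j}\right)/(m-i^{*}) & \mbox{for } i=i^{*}+1,\ldots,m,\end{cases} \tag{3}$$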

and $i^{*}$ is the maximum index $i$ such that $p_{i}\geq\frac{\sum_{j=i+1}^{n}p_{j}}{m-i}$ . A somewhat similar operator was introduced in [9].

Additionally, we define the probability distribution $Q_{m}({\bf p})=(q_{1},\ldots,q_{m})$ in the following way:

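$$q_{i}=\begin{cases}\sum_{k=1}^{n-m+1}p_{k}, & \mbox{for } i=1,\\ p_{n-m+i}, & \mbox{for } i=2,\ldots,m.\end{cases} \tag{4}$$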

The following theorem provides the results sought in (2).

Theorem 1.

For any r.v. $X$ taking values in the alphabet ${\cal X}=\{x_{1},x_{2},\ldots,x_{n}\}$ according to the probability distribution ${\bf p}=(p_{1},p_{2},\ldots,p_{n})$ , and for any $2\leq m<n$ , it holds that

$$\max_{f\in{\cal F}_{m}}H(f(X))\in\left[H(R_{m}({\bf p}))-\alpha,\;H(R_{m}({\bf p}))\right], \tag{5}$$

where $\alpha=1-({1+\ln(\ln 2)})/{\ln 2}<0.0861$ , and

$$\min_{f\in{\cal F}_{m}}H(f(X))=H(Q_{m}({\bf p})). \tag{6}$$

Therefore, the function $f\in{\cal F}_{m}$ for which $H(f(X))$ is minimum maps all the elements $x_{1},\ldots,x_{n-m+1}\in{\cal X}$ to a single element, and it is one-to-one on the remaining elements $x_{n-m+2},\ldots,x_{n}$ .
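To make the operators in (3) and (4) concrete, here is a minimal Python sketch that computes $R_{m}({\bf p})$, $Q_{m}({\bf p})$ and the resulting bounds of Theorem 1. The names `entropy`, `r_operator`, `q_operator` and the example distribution are illustrative assumptions, not taken from the paper; probabilities are plain floats, so comparisons at exact boundary cases (such as $p_{1}=1/m$) are subject to rounding.

```python
import math

# alpha from Theorem 1, roughly 0.0861
ALPHA = 1 - (1 + math.log(math.log(2))) / math.log(2)

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in dist if x > 0)

def r_operator(p, m):
    """R_m(p) as in (3): keep p_1..p_{i*}, spread the remaining mass evenly."""
    p = sorted(p, reverse=True)
    if p[0] < 1.0 / m:
        return [1.0 / m] * m
    i_star = 1  # i = 1 always qualifies when p_1 >= 1/m
    for i in range(2, m):
        # condition defining i*: p_i >= (sum_{j > i} p_j) / (m - i)
        if p[i - 1] >= sum(p[i:]) / (m - i):
            i_star = i
    tail = sum(p[i_star:]) / (m - i_star)
    return p[:i_star] + [tail] * (m - i_star)

def q_operator(p, m):
    """Q_m(p) as in (4): merge the n-m+1 largest masses, keep the others."""
    p = sorted(p, reverse=True)
    k = len(p) - m + 1
    return [sum(p[:k])] + p[k:]

# Hypothetical example: n = 4 symbols mapped onto m = 3 values.
p, m = [0.5, 0.2, 0.2, 0.1], 3
upper = entropy(r_operator(p, m))  # R_3(p) = (0.5, 0.25, 0.25)
lower = entropy(q_operator(p, m))  # Q_3(p) = (0.7, 0.2, 0.1)
print(f"max H(f(X)) lies in [{upper - ALPHA:.4f}, {upper:.4f}]")
print(f"min H(f(X)) equals {lower:.4f}")
```

Sorting inside each function mirrors the paper's convention that distributions are listed in non-increasing order.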

Before proving Theorem 1 and discussing its consequences, we note that there are quite compelling reasons why we are unable to determine the exact value of the maximum in (5) and, consequently, the form of the function $f\in{\cal F}_{m}$ that attains it. Indeed, computing the value $\max_{f\in{\cal F}_{m}}H(f(X))$ is an NP-hard problem. The difficulty is already apparent in the simple case $m=2$ . Consider any function $f\in{\cal F}_{2}$ , that is $f:{\cal X}\to{\cal Y}_{2}=\{y_{1},y_{2}\}$ , and let $X$ be any r.v. taking values in ${\cal X}$ according to the probability distribution ${\bf p}=(p_{1},p_{2},\ldots,p_{n})$ . Let $z_{1}=\sum_{x\in{\cal X}:f(x)=y_{1}}P\{X=x\},\quad z_{2}=\sum_{x\in{\cal X}:f(x)=y_{2}}P\{X=x\}.$ Then $H(f(X))=-z_{1}\log z_{1}-z_{2}\log z_{2}$ , which is maximized by a function $f\in{\cal F}_{2}$ that makes the sums $z_{1}$ and $z_{2}$ as close to equal as possible. This is equivalent to the well-known NP-hard problem Partition on the instance $\{p_{1},\ldots,p_{n}\}$ (see [7]). [Footnote 2: In the full version of the paper we will show that the problem of computing the value $\max_{f\in{\cal F}_{m}}H(f(X))$ is strongly NP-hard.] Since a function $f\in{\cal F}_{m}$ for which $H(f(X))\geq H(R_{m}({\bf p}))-\alpha$ can be efficiently constructed, we also have the following important consequence of Theorem 1.

Corollary 1.

There is a polynomial time algorithm to approximate the NP-hard problem of computing the value

$$\max_{f\in{\cal F}_{m}}H(f(X)),$$

with an additive approximation error of at most $\alpha<0.0861$ .
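For very small instances the exact maximum can still be found by exhaustive search, which makes the exponential cost tangible. The sketch below enumerates every surjective labelling of $n$ symbols with $m$ values; the function name `brute_force_max_entropy` and the example distribution are hypothetical, and the approach is only meant as a reference check, not as the paper's algorithm.

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in dist if x > 0)

def brute_force_max_entropy(p, m):
    """Try all m**n labellings and keep the best surjective one (exponential in n)."""
    best = -1.0
    for labels in product(range(m), repeat=len(p)):
        if len(set(labels)) < m:
            continue  # f must use all m output values
        z = [0.0] * m
        for prob, label in zip(p, labels):
            z[label] += prob  # mass aggregated on each output value
        best = max(best, entropy(z))
    return best

# Hypothetical example: for m = 2 this is exactly a Partition instance.
p = [0.5, 0.2, 0.2, 0.1]
print(brute_force_max_entropy(p, 2))
print(brute_force_max_entropy(p, 3))
```

For this example, $R_{3}({\bf p})=(0.5,0.25,0.25)$ has entropy $1.5$ bits, so Theorem 1 places the second printed value between $1.5-\alpha$ and $1.5$.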

A key tool for the proof of Theorem 1 is the following result, proved in the second part of Section IV.

Theorem 2.

Let ${\bf p}=(p_{1},p_{2},\ldots,p_{n})$ be a probability distribution such that $p_{1}\geq p_{2}\geq\ldots\geq p_{n}>0$ . If $p_{1}/p_{n}\leq{\rho}$ then

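$$H({\bf p})\geq\log n-\left(\frac{\rho\ln\rho}{\rho-1}-1-\ln\frac{\rho\ln\rho}{\rho-1}\right)\frac{1}{\ln 2}. \tag{7}$$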

Theorem 2 improves on several papers (see [17] and references quoted therein) that have studied the problem of estimating $H({\bf p})$ when only a bound on the ratio $p_{1}/p_{n}$ is known. [Footnote 3: The bound in [17] has this form: if $p_{1}/p_{n}\leq 1+2(e^{\epsilon}-1)+2\sqrt{e^{2\epsilon}-e^{\epsilon}}$ , then $H(X)\geq\log n-\epsilon$ . One can see that our bound (7) is tighter.] We believe the result to be of independent interest. For instance, it can also be used to improve existing bounds on the leaf-entropy of parse trees generated by the Tunstall algorithm.
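The bound of Theorem 2 is easy to evaluate numerically. The following sketch, with the hypothetical helper name `ratio_entropy_lower_bound` and assuming $\rho>1$ (the formula has a removable singularity at $\rho=1$ that the code does not handle), computes the right-hand side of (7) and checks it on one distribution whose max/min ratio is exactly 2.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in dist if x > 0)

def ratio_entropy_lower_bound(n, rho):
    """Right-hand side of (7); requires rho > 1."""
    c = rho * math.log(rho) / (rho - 1)
    return math.log2(n) - (c - 1 - math.log(c)) / math.log(2)

# Illustrative distribution with p_1 / p_n = 2 on n = 4 symbols.
p = [1/3, 1/3, 1/6, 1/6]
bound = ratio_entropy_lower_bound(len(p), 2.0)
print(entropy(p), bound)
print(math.log2(len(p)) - bound)  # gap term at rho = 2
```

At $\rho=2$ the gap term $\log n-\text{bound}$ coincides with the constant $\alpha=1-(1+\ln(\ln 2))/\ln 2$ of Theorem 1, which is how Theorem 2 enters the proof of Lemma 3.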

To prove our results, we use ideas and techniques from Majorization Theory [15], a mathematical framework that has proved to be very useful in Information Theory (e.g., see [2, 3, 9, 10] and references quoted therein).

III Some Applications

Besides its inherent naturalness, the problem of estimating the entropy $H(f(X))$ vs. $H(X)$ has several interesting applications. We highlight some of them here, postponing a more complete discussion to the full version of the paper.

In the area of clustering, one seeks a mapping $f$ (deterministic or stochastic) from some data, generated by a r.v. $X$ taking values in a set ${\cal X}$ , to “clusters” in ${\cal Y}$ , where $|{\cal Y}|\ll|{\cal X}|$ . A widely employed measure to appraise the goodness of a clustering algorithm is the information that the clusters retain about the original data, measured by the mutual information $I(X;f(X))$ (see [6, 11] and references quoted therein). In general, one wants to choose $f$ such that $|f({\cal X})|$ is small but $I(X;f(X))$ is large. The authors of [8] (see also [13]) proved that, given the random variable $X$ , among all mappings $f$ that maximize $I(X;f(X))$ (under the constraint that $|f({\cal X})|$ is fixed) there is a maximizing function $f$ that is deterministic. Since for deterministic functions $H(f(X)|X)=0$ , and hence $I(X;f(X))=H(f(X))-H(f(X)|X)=H(f(X))$ , finding the clustering $f$ of ${\cal X}$ (into a fixed number $m$ of clusters) that maximizes the mutual information $I(X;f(X))$ is equivalent to our problem of finding the function $f$ that attains the maximum in (2). [Footnote 4: In [13] the authors consider the problem of determining the function $f$ that maximizes $I(X;f(Y))$ , where $X$ is the r.v. at the input of a DMC and $Y$ is the corresponding output. Our scenario could be seen as the particular case when the DMC is noiseless. However, the results in [13] do not imply ours since the authors give algorithms only for binary input channels (i.e. $n=2$ , which makes the problem completely trivial in our case). Instead, our results are relevant to those of [13]. For instance, we obtain that the general maximization problem considered in [13] is NP-hard, a fact unnoticed in [13].]

Another scenario where our results directly find applications is the one considered in [18]. There, the author considers the problem of best approximating a probability distribution ${\bf p}=(p_{1},\ldots,p_{n})$ with a shorter one ${\bf q}^{*}=(q^{*}_{1},\ldots,q^{*}_{m})$ , $m\leq n$ . The criterion with which one chooses ${\bf q}^{*}$ , given ${\bf p}$ , is the following. Given ${\bf p}=(p_{1},\ldots,p_{n})$ and ${\bf q}=(q_{1},\ldots,q_{m})$ , define the quantity ${\tt D}({\bf p},{\bf q})$ as $2W({\bf p},{\bf q})-H({\bf p})-H({\bf q})$ , where $W({\bf p},{\bf q})$ is the minimum entropy of a bivariate probability distribution that has ${\bf p}$ and ${\bf q}$ as marginals. Then, the “best” approximation ${\bf q}^{*}$ of ${\bf p}$ is chosen as the probability distribution with $m$ components that minimizes ${\tt D}({\bf p},{\bf q})$ over all ${\bf q}=(q_{1},\ldots,q_{m})$ . The author of [18] shows that ${\bf q}^{*}$ can be characterized in the following way. Given ${\bf p}=(p_{1},\ldots,p_{n})$ , call ${\bf q}=(q_{1},\ldots,q_{m})$ an aggregation of ${\bf p}$ into $m$ components if there is a partition of $\{1,\ldots,n\}$ into disjoint sets $I_{1},\ldots,I_{m}$ such that $q_{k}=\sum_{i\in I_{k}}p_{i}$ , for $k=1,\ldots m$ . In [18] it is proved that the vector ${\bf q}^{*}$ that best approximates ${\bf p}$ (according to ${\tt D}$ ) is the aggregation of ${\bf p}$ into $m$ components of maximum entropy. Since any aggregation ${\bf q}$ of ${\bf p}$ can be seen as the distribution of the r.v. $f(X)$ , where $f$ is some appropriate function and $X$ is a r.v. distributed according to ${\bf p}$ (and, vice versa, any deterministic $f$ gives a r.v. $f(X)$ whose distribution is an aggregation of the distribution of $X$ ), one gets that the problem of computing the “best” approximation ${\bf q}^{*}$ of ${\bf p}$ is NP-hard. The bound (5) allows us to provide an approximation algorithm to construct a probability distribution $\overline{{\bf q}}=(\overline{q}_{1},\ldots,\overline{q}_{m})$ such that ${\tt D}({\bf p},\overline{{\bf q}})\leq{\tt D}({\bf p},{\bf q}^{*})+0.0861$ , improving on [4], where an approximation algorithm for the same problem with an additive error of $1$ was provided.

There are other problems that can be cast in our scenario. For instance, Baez et al. [1] give an axiomatic characterization of the Shannon entropy in terms of information loss. Stripping away the Category Theory language of [1], the information loss of a r.v. $X$ amounts to the difference $H(X)-H(f(X))$ , where $f$ is any deterministic function. Our Theorem 1 allows one to quantify the extreme values of the information loss of a r.v., when the size of the support of $f(X)$ is known.

There is also a vast literature (see [14], Section 3.3, and references therein quoted) studying the “leakage of a program $P$ […] defined as the (Shannon) entropy of the partition $\Pi(P)$ ” [14]. One can easily see that their “leakage” is the same as the entropy $H(f(X))$ , where $X$ is the r.v. modeling the program input, and $f$ is the function describing the input-output relation of the program $P$ . In Section 8 of the same paper the authors study the problem of maximizing or minimizing the leakage, in the case the program $P$ is stochastic, using standard techniques based on Lagrange multipliers. They do not consider the (harder) case of deterministic programs (i.e., deterministic $f$ ’s) and our results are likely to be relevant in that context.

Finally, we remark that our problem can also be seen as a problem of quantizing the alphabet of a discrete source into a smaller one (e.g., [16]), and the goal is to maximize the mutual information between the original source and the quantized one.

IV The Proofs

We first recall the important concept of majorization among probability distributions.

Definition 1.

[15] *Given two probability distributions ${\bf a}=(a_{1},\ldots,a_{n})$ and ${\bf b}=(b_{1},\ldots,b_{n})$ with $a_{1}\geq\ldots\geq a_{n}\geq 0$ and $b_{1}\geq\ldots\geq b_{n}\geq 0$ , we say that ${\bf a}$ is majorized by ${\bf b}$ , and write ${\bf a}\preceq{\bf b}$ , if and only if*

$$\sum_{k=1}^{i}a_{k}\leq\sum_{k=1}^{i}b_{k},\quad\mbox{for all } i=1,\ldots,n.$$

Without loss of generality we assume that all the probability distributions we deal with have been ordered in non-increasing order. We also use the majorization relationship between vectors of unequal lengths, by properly padding the shorter one with the appropriate number of $0$ ’s at the end.
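The majorization test, including the zero padding for vectors of unequal lengths, can be sketched in a few lines of Python. The function name `is_majorized_by` and the tolerance parameter `tol` are assumptions introduced here to absorb floating-point rounding; they are not part of the definition.

```python
from itertools import accumulate

def is_majorized_by(a, b, tol=1e-12):
    """Return True if a is majorized by b (every prefix sum of a is <= that of b)."""
    n = max(len(a), len(b))
    # Sort in non-increasing order and pad the shorter vector with zeros.
    a = sorted(a, reverse=True) + [0.0] * (n - len(a))
    b = sorted(b, reverse=True) + [0.0] * (n - len(b))
    return all(sa <= sb + tol for sa, sb in zip(accumulate(a), accumulate(b)))

# The uniform distribution is majorized by any distribution on the same support.
print(is_majorized_by([0.25] * 4, [0.5, 0.5]))
# A distribution is majorized by each of its aggregations.
print(is_majorized_by([0.5, 0.2, 0.2, 0.1], [0.7, 0.2, 0.1]))
```

Comparing running prefix sums keeps the check linear after sorting, and the padding makes the two total sums coincide automatically for probability vectors.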

Consider an arbitrary function $f:{\cal X}\to{\cal Y}$ , $f\in{\cal F}_{m}$ . Any r.v. $X$ taking values in ${\cal X}=\{x_{1},\ldots,x_{n}\}$ , according to the probability distribution ${\bf p}=(p_{1},\ldots,p_{n})$ , and the function $f$ naturally induce a r.v. $f(X)$ , taking values in ${\cal Y}=\{y_{1},\ldots,y_{m}\}$ according to the probability distribution whose values are given by the expressions

$$\forall y_{j}\in{\cal Y}\qquad P\{f(X)=y_{j}\}=\sum_{x\in{\cal X}:f(x)=y_{j}}P\{X=x\}.$$

Let ${\bf z}=(z_{1},\ldots,z_{m})$ be the vector containing the values $z_{1}=P\{f(X)=y_{1}\},\ldots,z_{m}=P\{f(X)=y_{m}\}$ ordered in non-increasing fashion. For convenience, we state the following self-evident fact about the relationships between ${\bf z}$ and ${\bf p}$ .

Claim 1.

There is a partition of $\{1,\ldots,n\}$ into disjoint sets $I_{1},\ldots,I_{m}$ such that $z_{j}=\sum_{i\in I_{j}}p_{i}$ , for $j=1,\ldots m$ .

Therefore, ${\bf z}$ is an aggregation of ${\bf p}$ . Given a r.v. $X$ distributed according to ${\bf p}$ , and any $f\in{\cal F}_{m}$ , by simply applying the definition of majorization one can see that the (ordered) probability distribution of the r.v. $f(X)$ is majorized by $Q_{m}({\bf p})=(q_{1},\ldots,q_{m})$ , as defined in (4). Therefore, by invoking the Schur concavity of the entropy function $H$ (see [15], p. 101 for the statement, and [10] for an improvement), saying that $H({\bf a})\geq H({\bf b})$ whenever ${\bf a}\preceq{\bf b}$ , we get that $H(f(X))\geq H(Q_{m}({\bf p}))$ . From this, the equality (6) immediately follows.

We need the following two simple, but for us important, results, stated and proved in [4] with a different terminology.

Lemma 1.

[4] *For ${\bf p}$ and ${\bf z}$ as above, it holds that ${\bf p}\preceq{\bf z}.$*

In other words, for any r.v. $X$ and function $f$ , the probability distribution of $f(X)$ always majorizes that of $X$ .

Lemma 2.

[4] *For any $m$ , $2\leq m<n$ , and probability distribution ${\bf a}=(a_{1},\ldots,a_{m})$ such that ${\bf p}\preceq{\bf a}$ , it holds that*

$$R_{m}({\bf p})\preceq{\bf a},$$

*where $R_{m}({\bf p})$ is the probability distribution defined in (3).*

From Lemmas 1 and 2, and by applying the Schur concavity of the entropy function $H$ , we get the following result.

Corollary 2.

For any r.v. $X$ taking values in ${\cal X}$ according to a probability distribution ${\bf p}$ , and for any $f\in{\cal F}_{m}$ , it holds that

$$H(f(X))\leq H(R_{m}({\bf p})).$$

The above corollary implies that

$$\max_{f\in{\cal F}_{m}}H(f(X))\leq H(R_{m}({\bf p})).$$

Therefore, to complete the proof of Theorem 1 we need to show that we can construct a function $f\in{\cal F}_{m}$ such that

$$H(f(X))\geq H(R_{m}({\bf p}))-\left(1-\frac{1+\ln(\ln 2)}{\ln 2}\right),$$

or, equivalently, that we can construct an aggregation of ${\bf p}$ into $m$ components, whose entropy is at least $H(R_{m}({\bf p}))-\left(1-\frac{1+\ln(\ln 2)}{\ln 2}\right).$ We prove this fact in the following lemma.

Lemma 3.

For any ${\bf p}=(p_{1},\ldots,p_{n})$ and $2\leq m<n$ , we can construct an aggregation ${\bf q}=(q_{1},\ldots,q_{m})$ of ${\bf p}$ such that

$$H({\bf q})\geq H(R_{m}({\bf p}))-\left(1-\frac{1+\ln(\ln 2)}{\ln 2}\right).$$

Proof:

We will assemble the aggregation ${\bf q}$ through the Huffman algorithm. We first make the following observation. For the purposes of this paper, each step of the Huffman algorithm consists in merging the two smallest elements $x$ and $y$ of the current probability distribution, deleting $x$ and $y$ and substituting them with the single element $x+y$ , and reordering the new probability distribution from the largest element to the smallest (ties are broken arbitrarily). Immediately after the step in which $x$ and $y$ are merged, each element $z$ in the new and reduced probability distribution that finds itself positioned to the “right” of $x+y$ (if there is such a $z$ ) has a value that satisfies $(x+y)\leq 2z$ (since, by choice, $x,y\leq z$ ). Let ${\bf q}=(q_{1},\dots,q_{m})$ be the ordered probability distribution obtained by executing exactly $n-m$ steps of the Huffman algorithm, starting from the distribution ${\bf p}$ . Denote by $i_{q}$ the maximum index $i$ such that for each $j=1,\dots,i$ the component $q_{j}$ has not been produced by a merge operation of the Huffman algorithm. In other words, $i_{q}$ is the maximum index $i$ such that for each $j=1,\dots,i$ it holds that $q_{j}=p_{j}$ . Notice that we allow $i_{q}$ to be equal to $0$ . Therefore $q_{i_{q}+1}$ has been produced by a merge operation. At the step in which the value $q_{i_{q}+1}$ was created, it holds that $q_{i_{q}+1}\leq 2z$ , for any $z$ to the “right” of $q_{i_{q}+1}$ . At later steps, the inequality $q_{i_{q}+1}\leq 2z$ still holds, since elements to the right of $q_{i_{q}+1}$ can only have increased their values.

Let $S=\sum_{k=i_{q}+1}^{m}q_{k}$ be the sum of the last (smallest) $m-i_{q}$ components of ${\bf q}$ . The vector ${\bf q}^{\prime}=(q_{i_{q}+1}/S,\dots q_{m}/S)$ is a probability distribution such that the ratio between its largest and its smallest component is upper bounded by 2. By Theorem 2, with $\rho=2$ , it follows that

$$H({\bf q}^{\prime})\geq\log(m-i_{q})-\alpha,$$

where $\alpha=1-\frac{1+\ln(\ln 2)}{\ln 2}<0.0861$ . Therefore, we have

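$$\begin{aligned}H({\bf q})&=\sum_{j=1}^{i_{q}}q_{j}\log\frac{1}{q_{j}}+\sum_{j=i_{q}+1}^{m}q_{j}\log\frac{1}{q_{j}}\\&=\sum_{j=1}^{i_{q}}q_{j}\log\frac{1}{q_{j}}-S\log S+S\,H({\bf q}^{\prime})\\&\geq\sum_{j=1}^{i_{q}}q_{j}\log\frac{1}{q_{j}}+S\log\frac{m-i_{q}}{S}-\alpha S\\&\geq\sum_{j=1}^{i_{q}}q_{j}\log\frac{1}{q_{j}}+S\log\frac{m-i_{q}}{S}-\alpha.\end{aligned}$$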

Let ${\bf q}^{*}=(q_{1},q_{2},\dots,q_{i_{q}},\frac{S}{m-i_{q}},\frac{S}{m-i_{q}},\dots,\frac{S}{m-i_{q}}),$ and observe that ${\bf q}^{*}$ coincides with ${\bf p}$ in the first $i_{q}$ components, as does ${\bf q}$ . What we have shown is that

$$H({\bf q})\geq H({\bf q}^{*})-\alpha. \tag{13}$$

We now observe that $i_{q}\leq i^{*}$ , where $i^{*}$ is the index that appears in the definition of our operator $R_{m}({\bf p})$ (see (3)). In fact, by the definition of ${\bf q}$ one has $q_{i_{q}}\geq q_{i_{q}+1}\geq\cdots\geq q_{m}$ , which also implies

$$\frac{\sum_{j=i_{q}+1}^{m}q_{j}}{m-i_{q}}\leq q_{i_{q}+1}\leq q_{i_{q}}=p_{i_{q}}. \tag{14}$$

Moreover, since the first $i_{q}$ components of ${\bf q}$ are the same as in ${\bf p}$ , we also have $\sum_{j=i_{q}+1}^{m}q_{j}=\sum_{j=i_{q}+1}^{n}p_{j}$ . This, together with relation (14), implies

$$\frac{\sum_{j=i_{q}+1}^{n}p_{j}}{m-i_{q}}\leq p_{i_{q}}. \tag{15}$$

Equation (15) clearly implies $i_{q}\leq i^{*}$ since $i^{*}$ is, by definition, the maximum index $i$ such that $p_{i}\geq\frac{\sum_{j=i+1}^{n}p_{j}}{m-i}.$ From the just proved inequality $i^{*}\geq i_{q}$ , we also have

$${\bf q}^{*}\preceq R_{m}({\bf p}). \tag{16}$$

Using (13), (16), and the Schur concavity of the entropy function, we get

$$H({\bf q})\geq H({\bf q}^{*})-\alpha\geq H(R_{m}({\bf p}))-\alpha,$$

thus completing the proof of the Lemma (and of Theorem 1). ∎
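The construction used in the proof of Lemma 3 amounts to running $n-m$ Huffman merge steps. A minimal Python sketch using a binary heap follows; the name `huffman_aggregate` and the example distribution are illustrative assumptions, and ties are broken by whatever order the heap returns, which the proof allows.

```python
import heapq
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in dist if x > 0)

def huffman_aggregate(p, m):
    """Merge the two smallest masses n - m times; return m masses, largest first."""
    heap = list(p)
    heapq.heapify(heap)  # min-heap: the two smallest elements pop first
    while len(heap) > m:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        heapq.heappush(heap, x + y)
    return sorted(heap, reverse=True)

# Hypothetical example with n = 4 and m = 3.
p = [0.5, 0.2, 0.2, 0.1]
q = huffman_aggregate(p, 3)
alpha = 1 - (1 + math.log(math.log(2))) / math.log(2)
print(q, entropy(q))
# R_3(p) = (0.5, 0.25, 0.25) has entropy 1.5 bits, so Lemma 3 guarantees:
print(entropy(q) >= 1.5 - alpha)
```

With a heap each merge costs $O(\log n)$, so the whole aggregation runs in $O(n\log n)$ time, in line with the polynomial-time claim of Corollary 1.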

We now prove Theorem 2. Again, we use tools from majorization theory. Consider an arbitrary probability distribution ${\bf p}=(p_{1},p_{2},\ldots,p_{n})$ with $p_{1}\geq p_{2}\geq\ldots\geq p_{n}>0$ and $p_{1}/p_{n}\leq{\rho}$ . Let us define the probability distribution

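$$\begin{aligned}{\bf z}_{\rho}({\bf p})&=(z_{1},\ldots,z_{n})\\&=(\underbrace{\rho p_{n},\ldots,\rho p_{n}}_{i\ \mbox{times}},\;1-(n+i(\rho-1)-1)p_{n},\;\underbrace{p_{n},\ldots,p_{n}}_{n-i-1\ \mbox{times}}),\end{aligned}$$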

where $i=\left\lfloor{(1-np_{n})}/{p_{n}({\rho}-1)}\right\rfloor$ . It is easy to verify that $p_{n}\leq 1-(n+i({\rho}-1)-1)p_{n}\leq{\rho}p_{n}$ .

Lemma 4.

Let ${\bf p}=(p_{1},p_{2},\ldots,p_{n})$ with $p_{1}\geq p_{2}\geq\ldots\geq p_{n}>0$ be any probability distribution with $p_{1}/p_{n}\leq{\rho}$ . The probability distribution ${\bf z}_{\rho}({\bf p})$ satisfies ${\bf p}\preceq{\bf z}_{\rho}({\bf p}).$

Proof:

For any $j\leq i$ , it holds that

$$p_{1}+\ldots+p_{j}\leq jp_{1}\leq j(\rho p_{n})=z_{1}+\ldots+z_{j}.$$

Consider now some $j\geq i+1$ and assume by contradiction that $p_{1}+\ldots+p_{j}>z_{1}+\ldots+z_{j}$ . It follows that $p_{j+1}+\ldots+p_{n}<z_{j+1}+\ldots+z_{n}=(n-j)p_{n}$ . As a consequence we get the contradiction $p_{n}\leq(p_{j+1}+\ldots+p_{n})/(n-j)<p_{n}$ . ∎
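The extremal distribution ${\bf z}_{\rho}({\bf p})$ and the claim of Lemma 4 can be checked numerically. In the sketch below the function name `z_rho` and the example values are assumptions; the code requires $\rho>1$ and $p_{1}/p_{n}\leq\rho$, and the floor is computed in floating point, so inputs that sit exactly on a boundary may round either way.

```python
import math
from itertools import accumulate

def z_rho(p, rho):
    """Build z_rho(p): i entries rho*p_n, one middle entry, n-i-1 entries p_n."""
    p = sorted(p, reverse=True)
    n, pn = len(p), p[-1]
    i = math.floor((1 - n * pn) / (pn * (rho - 1)))
    middle = 1 - (n + i * (rho - 1) - 1) * pn  # lies between p_n and rho*p_n
    return [rho * pn] * i + [middle] + [pn] * (n - i - 1)

# Hypothetical example: p_1 / p_n = 0.35 / 0.15 < 3 = rho.
p = [0.35, 0.30, 0.20, 0.15]
z = z_rho(p, 3.0)
print(z)
# Lemma 4: every prefix sum of p is at most the matching prefix sum of z.
print(all(sp <= sz + 1e-9 for sp, sz in zip(accumulate(p), accumulate(z))))
```

Here $i=1$ and the middle entry is $0.25$, so ${\bf z}_{\rho}({\bf p})$ concentrates as much mass as the ratio constraint permits on the leading components.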

Lemma 4 and the Schur concavity of the entropy imply that $H({\bf p})\geq H({\bf z}_{\rho}({\bf p}))$ . We can therefore prove Theorem 2 by showing the appropriate upper bound on $\log n-H({\bf z}_{\rho}({\bf p}))$ .

Lemma 5.

It holds that

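$$\log n-H({\bf z}_{\rho}({\bf p}))\leq\left(\frac{\rho\ln\rho}{\rho-1}-1-\ln\frac{\rho\ln\rho}{\rho-1}\right)\frac{1}{\ln 2}.$$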

Proof:

Consider the class of probability distributions of the form

$${\bf z}_{\rho}(x,i)=(\rho x,\ldots,\rho x,\;1-(n+i(\rho-1)-1)x,\;x,\ldots,x),$$

having the first $i$ components equal to ${{\rho}x}$ and the last $n-i-1$ equal to $x$ , for suitable $0\leq x\leq 1/\rho$ , and $i\geq 0$ such that

$$1-(n+i(\rho-1)-1)x\in[x,\rho x). \tag{18}$$

Clearly, for $x=p_{n}$ and $i=\left\lfloor{(1-np_{n})}/{p_{n}({\rho}-1)}\right\rfloor$ one has ${\bf z}_{\rho}({\bf p})={\bf z}_{\rho}(x,i)$ , and we can prove the lemma by upper bounding the maximum (over all $x$ and $i$ ) of $\log n-H({\bf z}_{\rho}(x,i))$ . Let

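$$f(x,i)=\log n-H({\bf z}_{\rho}(x,i))=\log n+i\rho x\log(\rho x)+\bigl(1-(n+i(\rho-1)-1)x\bigr)\log\bigl(1-(n+i(\rho-1)-1)x\bigr)+(n-i-1)x\log x.$$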

From (18), for any value of $i\in\{1,\ldots,n-2\}$ , one has that

$$x\in\left(\frac{1}{n+(i+1)(\rho-1)},\frac{1}{n+i(\rho-1)}\right].$$

Set $A=n+i({\rho}-1)-1$ . We have

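$$\frac{d}{dx}f(x,i)=i\rho\log(\rho x)-A\log(1-Ax)+(n-i-1)\log x,\qquad\frac{d^{2}}{dx^{2}}f(x,i)=\frac{1}{\ln 2}\left(\frac{A}{x}+\frac{A^{2}}{1-Ax}\right).$$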

Since $\frac{d^{2}}{dx^{2}}f(x,i)\geq 0$ for any $x\in\left(\frac{1}{n+(i+1)({\rho}-1)},\frac{1}{n+i({\rho}-1)}\right]$ , the function is $\cup$ -convex in this interval, and it is upper bounded by the maximum of the two extreme values $f(1/(n+(i+1)({\rho}-1)),i)$ and $f(1/(n+i({\rho}-1)),i)$ . Therefore, we can upper bound $f(x,i)$ by the maximum value among

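$$f\left(\frac{1}{n+i(\rho-1)},i\right)=\log n+\frac{i\rho}{n+i(\rho-1)}\log\rho+\log\frac{1}{n+i(\rho-1)}$$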

for $i=1,\ldots,n-1$ . We now interpret $i$ as a continuous variable, and we differentiate $\log n+\frac{i{\rho}}{n+i({\rho}-1)}\log{\rho}+\log\frac{1}{n+i({\rho}-1)}$ with respect to $i$ . We get

$$\frac{1}{\ln 2}\left(\frac{n\rho\ln\rho}{(n+i(\rho-1))^{2}}-\frac{\rho-1}{n+i(\rho-1)}\right),$$

that is positive if and only if $i\leq\frac{n}{{\rho}-1}\left(\frac{{\rho}\ln{\rho}}{{\rho}-1}-1\right).$ Therefore, the desired upper bound on $f(x,i)$ can be obtained by computing the value of $f(\overline{x},\overline{\imath})$ , where $\overline{\imath}=\frac{n}{{\rho}-1}\left(\frac{{\rho}\ln{\rho}}{{\rho}-1}-1\right)$ and $\overline{x}=\frac{1}{n+\overline{\imath}({\rho}-1)}$ . The value of $f(\overline{x},\overline{\imath})$ turns out to be equal to

$$\left(\frac{\rho\ln\rho}{\rho-1}-1-\ln\frac{\rho\ln\rho}{\rho-1}\right)\frac{1}{\ln 2}.$$

∎
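The last step of the proof can be sanity-checked numerically: the endpoint values, maximized over integer $i$, should never exceed the closed-form bound, and the continuous maximizer $\overline{\imath}$ should attain it. The helper names `endpoint_value` and `gap_bound`, and the choice $n=50$, $\rho=2$, are assumptions made for illustration.

```python
import math

def endpoint_value(n, i, rho):
    """f(1/(n + i(rho-1)), i): gap log n - H when i entries are rho*x and n-i are x."""
    d = n + i * (rho - 1)
    return math.log2(n) + (i * rho / d) * math.log2(rho) - math.log2(d)

def gap_bound(rho):
    """Closed-form bound of Lemma 5; requires rho > 1."""
    c = rho * math.log(rho) / (rho - 1)
    return (c - 1 - math.log(c)) / math.log(2)

n, rho = 50, 2.0
worst = max(endpoint_value(n, i, rho) for i in range(1, n))
# Continuous maximizer from the proof.
i_bar = n / (rho - 1) * (rho * math.log(rho) / (rho - 1) - 1)
print(worst, endpoint_value(n, i_bar, rho), gap_bound(rho))
```

Because $i$ is restricted to integers in the actual problem, the first printed value is in general slightly below the bound, while the second coincides with it up to rounding.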

We conclude the paper by showing how Theorems 1 and 2 allow us to design an approximation algorithm for the second problem mentioned in Section III, that is, the problem of constructing a probability distribution $\overline{{\bf q}}=(\overline{q}_{1},\ldots,\overline{q}_{m})$ such that ${\tt D}({\bf p},\overline{{\bf q}})\leq{\tt D}({\bf p},{\bf q}^{*})+0.0861$ . Our algorithm improves on the result presented in [4], where an approximation algorithm for the same problem with an additive error of $1$ was provided.

Let ${\bf q}$ be the probability distribution constructed in Lemma 3 and let us recall that the first $i_{q}$ components of ${\bf q}$ coincide with the first $i_{q}$ components of ${\bf p}$ . In addition, for each $i=i_{q}+1,\dots,m,$ there is a set $I_{i}\subseteq\{i_{q}+1,\dots,n\}$ such that $q_{i}=\sum_{k\in I_{i}}p_{k}$ and the $I_{i}$ ’s form a partition of $\{i_{q}+1,\dots,n\},$ (i.e., ${\bf q}$ is an aggregation of ${\bf p}$ into $m$ components).

We now build a bivariate probability distribution ${{\bf M}}_{q}=[m_{ij}]$ , having ${\bf p}$ and ${\bf q}$ as marginals, as follows:

• in the first $i_{q}$ rows and columns, the matrix ${{\bf M}}_{q}$ has non-zero components only on the diagonal, namely $m_{j\,j}=p_{j}=q_{j}$ and $m_{i\,j}=0$ for any $i,j\leq i_{q}$ such that $i\neq j$ ;

• for each row $i=i_{q}+1,\dots,m$ the only non-zero elements are the ones in the columns corresponding to elements of $I_{i}$ and precisely, for each $j\in I_{i}$ we set $m_{i\,j}=p_{j}.$

It is not hard to see that ${\bf M}_{q}$ has ${\bf p}$ and ${\bf q}$ as marginals. Moreover, we have that $H({\bf M}_{q})=H({\bf p})$ since by construction the non-zero components of ${\bf M}_{q}$ coincide with the components of ${\bf p}.$ Let ${\cal C}({\bf p},{\bf q})$ be the set of all bivariate probability distributions having ${\bf p}$ and ${\bf q}$ as marginals. Recall that $\alpha=1-({1+\ln(\ln 2)})/{\ln 2}<0.0861$ . We have that

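$$\begin{aligned}{\tt D}({\bf p},{\bf q})&=2\min_{{\bf N}\in{\cal C}({\bf p},{\bf q})}H({\bf N})-H({\bf p})-H({\bf q})&\qquad(19)\\&\leq 2H({\bf M}_{q})-H({\bf p})-H({\bf q})&\qquad(20)\\&=H({\bf p})-H({\bf q})&\qquad(21)\\&\leq H({\bf p})-H(R_{m}({\bf p}))+\alpha&\qquad(22)\\&\leq H({\bf p})-H({\bf q}^{*})+\alpha&\qquad(23)\\&\leq{\tt D}({\bf p},{\bf q}^{*})+\alpha,\end{aligned}$$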

where (19) is the definition of ${\tt D}({\bf p},{\bf q})$ ; (20) follows from (19) since ${\bf M}_{q}\in{\cal C}({\bf p},{\bf q})$ ; (21) follows from (20) because $H({\bf M}_{q})=H({\bf p})$ ; (22) follows from Lemma 3; (23) follows from (22), the known fact that ${\bf q}^{*}$ is an aggregation of ${\bf p}$ (see [18]) and Lemmas 1 and 2. Finally, the last inequality is an instance of the general inequality $H({\bf a})-H({\bf b})\leq{\tt D}({\bf a},{\bf b})$ , which is formula (48) in [12].
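The coupling ${\bf M}_{q}$ is simple to build once the partition behind the aggregation is known. The sketch below takes the partition as an explicit list of index blocks; the function name `coupling_from_partition`, the 0-based indexing and the example blocks are assumptions made for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in dist if x > 0)

def coupling_from_partition(p, blocks):
    """Row i holds p[j] in every column j of blocks[i] and zero elsewhere."""
    matrix = [[0.0] * len(p) for _ in blocks]
    for i, block in enumerate(blocks):
        for j in block:
            matrix[i][j] = p[j]
    return matrix

# Hypothetical example: q = (0.5, 0.3, 0.2) aggregates p via these blocks.
p = [0.5, 0.2, 0.2, 0.1]
blocks = [[0], [1, 3], [2]]
M = coupling_from_partition(p, blocks)

row_sums = [sum(row) for row in M]        # marginal q
col_sums = [sum(col) for col in zip(*M)]  # marginal p
joint_entropy = entropy([v for row in M for v in row])
print(row_sums, col_sums)
print(joint_entropy, entropy(p))
```

Each column has a single non-zero entry, which is why the joint entropy equals $H({\bf p})$ and the chain above collapses to $H({\bf p})-H({\bf q})$ at step (21).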

Bibliography

[1] J. C. Baez, T. Fritz and T. Leinster, “A characterization of entropy in terms of information loss”, Entropy, vol. 17, 772–789, 2015.
[2] F. Cicalese and U. Vaccaro, “Supermodularity and subadditivity properties of the entropy on the majorization lattice”, IEEE Transactions on Information Theory, vol. 48, 933–938, 2002.
[3] F. Cicalese and U. Vaccaro, “Bounding the average length of optimal source codes via majorization theory”, IEEE Transactions on Information Theory, vol. 50, 633–637, 2004.
[4] F. Cicalese, L. Gargano, and U. Vaccaro, “Approximating probability distributions with short vectors, via information theoretic distance measures”, in: Proceedings of ISIT 2016, pp. 1138–1142, 2016.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, 2nd edition (2006).
[6] L. Faivishevsky and J. Faivishevsky, “Nonparametric information theoretic clustering algorithm”, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 351–358, 2010.
[7] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman (1979).
[8] B. C. Geiger and R. A. Amjad, “Hard Clusters Maximize Mutual Information”, arXiv:1608.04872 [cs.IT].
