Greedy-Merge Degrading has Optimal Power-Law
Assaf Kartowsky, Ido Tal

TL;DR
This paper proves that the greedy-merge algorithm for degrading channels is optimal in the power-law sense: for a fixed input alphabet size, its loss in mutual information is within a constant factor of an algorithm-independent lower bound as the output alphabet size L varies.
Contribution
The paper establishes that the greedy-merge algorithm achieves a power-law upper bound on mutual information reduction that matches a fundamental lower bound up to a constant factor.
Findings
Greedy-merge is within a constant factor of the optimal lower bound.
The bounds on mutual information reduction are tight in the power-law sense.
The results hold for fixed input alphabet size with varying output size.
Abstract
Consider a channel with a given input distribution. Our aim is to degrade it to a channel with at most L output letters. One such degradation method is the so called "greedy-merge" algorithm. We derive an upper bound on the reduction in mutual information between input and output. For fixed input alphabet size and variable L, the upper bound is within a constant factor of an algorithm-independent lower bound. Thus, we establish that greedy-merge is optimal in the power-law sense.
Assaf Kartowsky and Ido Tal
Department of Electrical Engineering
Technion - Haifa 32000, Israel
E-mail: {kartov@campus, idotal@ee}.technion.ac.il
I Introduction
In myriad digital processing contexts, quantization is used to map a large alphabet to a smaller one. For example, quantizers are an essential building block in receiver design, used to keep the complexity and resource consumption manageable. The quantizer used has a direct influence on the attainable code rate.
Another recent application is related to polar codes [1]. Polar code construction is equivalent to evaluating the misdecoding probability of each channel in a set of synthetic channels. This evaluation cannot be carried out naively, since the output alphabet size of a synthetic channel is intractably large. One approach to circumvent this difficulty is to degrade the evaluated synthetic channel to a channel with manageable output alphabet size [2][3][4][5][6][7].
Given a design parameter $L$, we degrade an initial channel to a new one with output alphabet size at most $L$. We assume that the input distribution is specified, and note that this degradation reduces the mutual information between the channel input and output. In both examples above, this reduction is roughly the loss in code rate due to quantization. We denote the smallest reduction possible by $\Delta I^*$.
Let $|\mathcal{X}|$ denote the channel input alphabet size, and treat it as a fixed quantity. We show that for any input distribution and any initial channel, $\Delta I^* = O(L^{-2/(|\mathcal{X}|-1)})$. Moreover, this bound is attained efficiently, by the greedy-merge algorithm [2][5]. This bound is tighter than the bounds derived in [3], [4], [5] and [6]. In fact, up to constant multipliers (dependent on $|\mathcal{X}|$), this bound is the tightest possible. Namely, [8] proves the existence of an input distribution and a sequence of channels for which $\Delta I^* = \Omega(L^{-2/(|\mathcal{X}|-1)})$. Both bounds have $-2/(|\mathcal{X}|-1)$ as the power of $L$, the same power-law. Note that for noisy channels and a relatively small our bound can be tightened [9]. See also [10], which is especially relevant in the context of small .
II Preliminaries
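We are given an input distribution and a discrete memoryless channel (DMC) $W: \mathcal{X} \to \mathcal{Y}$. Both $\mathcal{X}$ and $\mathcal{Y}$ are assumed finite. Let $X$ and $Y$ denote the random variables that correspond to the channel input and output, respectively. Denote the corresponding distributions $P_X$ and $P_Y$. Let $W(y|x) \triangleq P(Y=y \mid X=x)$. For brevity, let $\pi(x) \triangleq P_X(x)$. Assuming further that $\mathcal{X}$ and $\mathcal{Y}$ are disjoint, we abuse notation and denote $P(X=x \mid Y=y)$ and $P_Y(y)$ as $\pi(x|y)$ and $\pi(y)$, respectively. Without loss of generality, $\pi(x) > 0$ and $\pi(y) > 0$. We do not assume that $W$ is symmetric.
The mutual information between channel input and output is
$$I(W) \triangleq I(X;Y) = \sum_{x \in \mathcal{X}} \eta(\pi(x)) - \sum_{y \in \mathcal{Y}} \pi(y) \sum_{x \in \mathcal{X}} \eta(\pi(x|y)),$$
where $\eta(p) \triangleq -p \log p$ for $p > 0$, zero for $p = 0$, and the logarithm is taken in the natural basis. We note that the input distribution does not necessarily have to be the one that achieves the channel capacity.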
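As an illustration, the mutual information of a small channel can be evaluated numerically. The sketch below assumes the standard form $I(X;Y) = H(X) - H(X|Y)$ written with the term $-p \ln p$; the function names `eta` and `mutual_information` and the matrix layout `W[x, y]` are illustrative choices, not notation from the paper.

```python
import numpy as np

def eta(p):
    """Return -p * ln(p) elementwise, with the convention eta(0) = 0."""
    p = np.asarray(p, dtype=float)
    # Replace zeros by 1 inside the log so that 0 * log(1) = 0.
    return -p * np.log(np.where(p > 0, p, 1.0))

def mutual_information(prior, W):
    """I(X;Y) in nats, for prior[x] = P(X=x) and W[x, y] = P(Y=y | X=x)."""
    prior = np.asarray(prior, dtype=float)
    joint = prior[:, None] * np.asarray(W, dtype=float)  # P(X=x, Y=y)
    p_y = joint.sum(axis=0)                              # P(Y=y)
    used = p_y > 0                                       # ignore unused output letters
    posterior = joint[:, used] / p_y[used]               # P(X=x | Y=y)
    h_x = eta(prior).sum()
    h_x_given_y = (p_y[used] * eta(posterior).sum(axis=0)).sum()
    return float(h_x - h_x_given_y)

# Binary symmetric channel with crossover probability 0.1, uniform input.
bsc = np.array([[0.9, 0.1],
                [0.1, 0.9]])
mi_bsc = mutual_information([0.5, 0.5], bsc)
```

The input distribution is passed explicitly because, as noted above, it need not be the capacity-achieving one.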
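We now define the relation of degradedness between channels. A channel $Q: \mathcal{X} \to \mathcal{Z}$ is said to be (stochastically) degraded with respect to a channel $W: \mathcal{X} \to \mathcal{Y}$, and we write $Q \preccurlyeq W$, if there exists a channel $\Phi: \mathcal{Y} \to \mathcal{Z}$ such that
$$Q(z|x) = \sum_{y \in \mathcal{Y}} W(y|x)\,\Phi(z|y)$$
for all $x \in \mathcal{X}$ and $z \in \mathcal{Z}$. Note that as a result of the data processing theorem, $Q \preccurlyeq W$ implies $I(Q) \le I(W)$.
Although mentioned before, let us properly define the optimal degrading loss for a given pair $(W, L)$ as
$$\Delta I^* \triangleq \min_{Q :\, Q \preccurlyeq W,\ |Q| \le L} \big( I(W) - I(Q) \big),$$
where $|Q|$ denotes the output alphabet size of the channel $Q$. The optimizer $Q$ is the degraded channel that is “closest” to $W$ in the sense of mutual information, yet has at most $L$ output letters.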
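A small numerical sketch of degradation, reading the definition as the composition of the original channel with an intermediate channel (a product of row-stochastic matrices). The names `degrade` and `Phi`, and the example matrices, are assumptions made for illustration.

```python
import numpy as np

def mutual_information(prior, W):
    """I(X;Y) in nats, for prior[x] = P(X=x) and W[x, y] = P(Y=y | X=x)."""
    joint = np.asarray(prior, dtype=float)[:, None] * np.asarray(W, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (p_x @ p_y)[mask])).sum())

def degrade(W, Phi):
    """Q(z|x) = sum_y W(y|x) * Phi(z|y), with Phi[y, z] = P(Z=z | Y=y)."""
    return np.asarray(W, dtype=float) @ np.asarray(Phi, dtype=float)

prior = np.array([0.5, 0.5])
W = np.array([[0.50, 0.30, 0.15, 0.05],
              [0.05, 0.15, 0.30, 0.50]])
# Deterministic intermediate channel: letters 0,1 -> z0 and letters 2,3 -> z1.
Phi = np.array([[1, 0],
                [1, 0],
                [0, 1],
                [0, 1]])
Q = degrade(W, Phi)
loss = mutual_information(prior, W) - mutual_information(prior, Q)
```

Here `loss` is non-negative, in line with the data processing theorem; a deterministic `Phi` of this kind is exactly a merging of output letters.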
III Main result
Our main result is an upper bound on $\Delta I^*$, in terms of $|\mathcal{X}|$ and $L$. This upper bound will follow from analyzing a sub-optimal degrading algorithm, called “greedy-merge”. (For the binary-input case, optimal degrading can be realized through dynamic programming [11][12]. For the non-binary case, we do not know of an efficient realization of optimal degrading.) In each iteration of greedy-merge, we merge the two output letters $y_a, y_b \in \mathcal{Y}$ that result in the smallest decrease of mutual information between input and output, denoted $\Delta I$. Namely, the intermediate channel $\Phi$ maps $y_a$ and $y_b$ to a new symbol, while all other symbols are unchanged by $\Phi$. This is repeated $|\mathcal{Y}| - L$ times, to yield an output alphabet size of $L$. By upper bounding the $\Delta I$ of each iteration we obtain an upper bound on $\Delta I^*$. A key result is the following theorem, stating that there exists a pair of output letters whose merger yields a “small” $\Delta I$.
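The iteration described above can be sketched in Python as follows. This is a naive illustration that searches all pairs in every iteration; it does not attempt the efficient realizations of [2] or [5], and it assumes a strictly positive input distribution and a target size of at least one letter. The names `merge_loss` and `greedy_merge` are hypothetical.

```python
from itertools import combinations
import numpy as np

def eta(p):
    """Return -p * ln(p) elementwise, with eta(0) = 0."""
    p = np.asarray(p, dtype=float)
    return -p * np.log(np.where(p > 0, p, 1.0))

def merge_loss(col_a, col_b):
    """Drop in I(X;Y) when two output letters are merged.

    col_a[x] = P(X=x, Y=y_a) and col_b[x] = P(X=x, Y=y_b).
    """
    p_a, p_b = col_a.sum(), col_b.sum()
    p_ab = p_a + p_b
    merged = col_a + col_b
    return float(p_ab * eta(merged / p_ab).sum()
                 - p_a * eta(col_a / p_a).sum()
                 - p_b * eta(col_b / p_b).sum())

def greedy_merge(prior, W, L):
    """Merge output letters until at most L remain; return (Q, total loss)."""
    prior = np.asarray(prior, dtype=float)
    joint = prior[:, None] * np.asarray(W, dtype=float)
    # Keep one joint-probability column per output letter of positive probability.
    cols = [joint[:, y] for y in range(joint.shape[1]) if joint[:, y].sum() > 0]
    total_loss = 0.0
    while len(cols) > L:
        # Pick the pair whose merger costs the least mutual information.
        loss, a, b = min((merge_loss(cols[a], cols[b]), a, b)
                         for a, b in combinations(range(len(cols)), 2))
        merged = cols[a] + cols[b]
        cols = [c for i, c in enumerate(cols) if i not in (a, b)] + [merged]
        total_loss += loss
    Q = np.stack(cols, axis=1) / prior[:, None]  # back to P(Z=z | X=x)
    return Q, total_loss

rng = np.random.default_rng(0)
prior = np.array([0.3, 0.7])
W = rng.dirichlet(np.ones(8), size=2)  # random channel with 8 output letters
Q, loss = greedy_merge(prior, W, L=3)
```

Because each merger's loss is the exact difference in mutual information, the accumulated `loss` equals the total reduction between the original and the final channel.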
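Theorem 1.
Let a DMC $W: \mathcal{X} \to \mathcal{Y}$ satisfy $|\mathcal{Y}| > 2|\mathcal{X}|$, and let the input distribution be fixed. There exists a pair $y_a, y_b \in \mathcal{Y}$ whose merger results in a channel $Q$ satisfying $\Delta I = O\big(|\mathcal{Y}|^{-\frac{|\mathcal{X}|+1}{|\mathcal{X}|-1}}\big)$. In particular,
$$\Delta I \le \mu(|\mathcal{X}|) \cdot |\mathcal{Y}|^{-\frac{|\mathcal{X}|+1}{|\mathcal{X}|-1}},$$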
where,
[TABLE]
and $\Gamma(\cdot)$ is the Gamma function.
Recall that Theorem 1 is referring to the merger of a single pair of output letters. The following corollary is our main result, and is basically an iterative utilization of Theorem 1.
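Corollary 2.
Let a DMC $W: \mathcal{X} \to \mathcal{Y}$ satisfy $|\mathcal{Y}| > 2|\mathcal{X}|$ and let $L \ge 2|\mathcal{X}|$. Then, for any fixed input distribution $P_X$,
$$\Delta I^* = O\big(L^{-\frac{2}{|\mathcal{X}|-1}}\big).$$
In particular, $\Delta I^* \le \nu(|\mathcal{X}|) \cdot L^{-\frac{2}{|\mathcal{X}|-1}}$, where $\nu(|\mathcal{X}|) \triangleq \frac{|\mathcal{X}|-1}{2}\,\mu(|\mathcal{X}|)$, and $\mu(\cdot)$ was defined in Theorem 1. This bound is attained by greedy-merge, and is tight in the power-law sense.
Proof.
If $L \ge |\mathcal{Y}|$, then obviously $\Delta I^* = 0$, which is not the interesting case. If $2|\mathcal{X}| \le L < |\mathcal{Y}|$, then applying Theorem 1 repeatedly $|\mathcal{Y}| - L$ times yields
$$\Delta I^* \le \sum_{\ell=L+1}^{|\mathcal{Y}|} \mu(|\mathcal{X}|)\,\ell^{-\frac{|\mathcal{X}|+1}{|\mathcal{X}|-1}} \le \mu(|\mathcal{X}|) \int_{L}^{|\mathcal{Y}|} \ell^{-\frac{|\mathcal{X}|+1}{|\mathcal{X}|-1}}\,d\ell \le \nu(|\mathcal{X}|) \cdot L^{-\frac{2}{|\mathcal{X}|-1}},$$
by the monotonicity of $\ell^{-\frac{|\mathcal{X}|+1}{|\mathcal{X}|-1}}$. The bound is tight in the power-law sense, by [8, Theorem 2]. ∎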
Note that for large values of , the Stirling approximation along with some other first order approximations can be applied to simplify to .
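The step from a per-merger bound to the overall power law can be checked numerically. The sketch below assumes that the bound of Theorem 1 decays as $\ell^{-(|\mathcal{X}|+1)/(|\mathcal{X}|-1)}$ in the current output alphabet size $\ell$, which is the exponent consistent with the power law of the corollary; the constant factor is omitted, and the function names are illustrative.

```python
def summed_merge_bound(q, L, n_out):
    """Sum of l^(-(q+1)/(q-1)) over l = L+1, ..., n_out, where q = |X| >= 2."""
    exponent = (q + 1) / (q - 1)
    return sum(l ** -exponent for l in range(L + 1, n_out + 1))

def integral_bound(q, L):
    """Integral of l^(-(q+1)/(q-1)) from L to infinity: (q-1)/2 * L^(-2/(q-1))."""
    return (q - 1) / 2 * L ** (-2 / (q - 1))

# Ratio of the summed per-merger bounds to the closed-form power law.
ratios = {(q, L): summed_merge_bound(q, L, 5000) / integral_bound(q, L)
          for q in (2, 3, 5) for L in (10, 100)}
```

Each ratio is at most one because the summand is decreasing in $\ell$, so the sum is dominated by the integral; this is the role played by monotonicity in the proof.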
IV Proof of Theorem 1
The proof of Theorem 1 will follow from a sphere-packing argument. In the following subsections we define a “distance” function, overcome it not being a metric, and assign different “weights” to different spheres. See [13] for more commentary.
IV-A An alternative “distance” function
Consider the merger of a pair of output letters . The new output alphabet of is . The channel then satisfies , whereas for all we have . Using the shorthand
[TABLE]
one gets that . Denote by , and the vectors corresponding to posterior probabilities associated with and , respectively. Namely, , , and
[TABLE]
Thus, after canceling terms, one gets that
[TABLE]
where .
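The loss of a single merger can be cross-checked numerically. The sketch below assumes the closed form obtained by writing the loss in terms of the posterior vectors of the two merged letters and of the new letter, and compares it with the direct difference of mutual informations. The names `alpha`, `beta` and `gamma` for the three posterior vectors are illustrative labels, and all output letters are assumed to have positive probability.

```python
import numpy as np

def eta(p):
    """Return -p * ln(p) elementwise, with eta(0) = 0."""
    p = np.asarray(p, dtype=float)
    return -p * np.log(np.where(p > 0, p, 1.0))

def mutual_information(prior, W):
    """I(X;Y) in nats; assumes every output letter has positive probability."""
    joint = prior[:, None] * W
    p_y = joint.sum(axis=0)
    return float(eta(prior).sum() - (p_y * eta(joint / p_y).sum(axis=0)).sum())

def merge_loss_closed_form(prior, W, a, b):
    """Loss from merging output letters a and b, via posterior vectors."""
    joint = prior[:, None] * W
    p_a, p_b = joint[:, a].sum(), joint[:, b].sum()
    alpha, beta = joint[:, a] / p_a, joint[:, b] / p_b   # posteriors of a and b
    p_ab = p_a + p_b
    gamma = (p_a * alpha + p_b * beta) / p_ab            # posterior of merged letter
    return float(p_ab * eta(gamma).sum()
                 - p_a * eta(alpha).sum() - p_b * eta(beta).sum())

prior = np.array([0.2, 0.3, 0.5])
W = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.4, 0.4, 0.1],
              [0.1, 0.1, 0.3, 0.5]])
a, b = 0, 1
Q = np.delete(W, b, axis=1)      # drop letter b (returns a copy) ...
Q[:, a] = W[:, a] + W[:, b]      # ... and let letter a absorb its probability
direct = mutual_information(prior, W) - mutual_information(prior, Q)
closed = merge_loss_closed_form(prior, W, a, b)
```

The two values agree up to floating-point error, and the closed form is non-negative by the concavity of the function $-p \ln p$.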
In order to bound , we give two bounds on . The first bound was derived in [5],
[TABLE]
where for and , we define .
The subscript “” in is suggestive of the distance. We will use to denote a probability associated with an input letter, while will denote a “free” real variable, possibly negative. Note that the bound in (6) was derived assuming a uniform input distribution, however remains valid for the general case.
We now derive the second bound on . For the case where ,
[TABLE]
where in we used the concavity of , in the definition of (see (4)), and in the AM-GM inequality and the mean value theorem where for some . Using the monotonicity of we get . Thus,
[TABLE]
where
[TABLE]
The subscript “” in is suggestive of the squaring in the numerator. Combining (6) and (7) yields
[TABLE]
where
[TABLE]
Returning to (5) using (8) we get
[TABLE]
where
[TABLE]
We note that we use in (11) instead of a summation to simplify the upcoming derivations. Moreover, according to (10), it suffices to show the existence of a pair that is “close” in the sense of , assuming that are also small enough.
Since we are interested in lowering the right hand side of (10), we limit our search to a subset of , as was done in [5]. Namely, , which implies
[TABLE]
Hence, and
[TABLE]
We still need to prove the existence of a pair that is “close” in the sense of . To that end, as in [5], we would like to use a sphere-packing approach. A typical use of such an argument assumes a proper metric, yet is not a metric. Specifically, the triangle-inequality does not hold. The absence of a triangle-inequality is a complication that we will overcome, but some care and effort are called for. Broadly speaking, as usually done in sphere-packing, we aim to show the existence of a critical “sphere” radius, . Such a critical radius will ensure the existence of with corresponding and for which .
IV-B Non-intersecting “spheres”
We start by giving explicit equations for the “spheres” corresponding to and .
Lemma 3.
For and , define the sets as
[TABLE]
Then,
[TABLE]
and
[TABLE]
Proof.
Assume . Then satisfies , which is equivalent to , and we get the desired result for . Assume now . If , then , and thus , which implies . If , then , and thus, , which implies . The union of the two yields the desired result for . ∎
Thus, we define , and note that , since takes the of the two distances. Namely,
[TABLE]
where and . To extend to vectors, we define as the set of vectors with real entries that are indexed by , . The set is defined as the set of vectors from with entries summing to , . The set is the set of probability vectors. Namely, the set of vectors from with non-negative entries, . We can now define . For let
[TABLE]
Using (11) and (14) we have a simple characterization of as a box: a Cartesian product of segments. That is,
[TABLE]
We stress that the box contains , but is not necessarily centered at it.
Recall our aim is finding an . Using our current notation, must imply the existence of distinct such that . Note that the set is contained in . However, since the boxes are induced by points in the subspace of , the sphere-packing would yield a tighter result if performed in rather than in . Then, for and , let us define
[TABLE]
When considering in place of , we have gained in that the affine dimension (see [14, Section 2.1.3]) of is while that of is . However, we have lost in simplicity: the set is not a box. Indeed, a moment’s thought reveals that any subset of with more than one element cannot be a box.
We now show how to overcome the above loss. That is, we show a subset of which is — up to a simple transform — a box. Denote the index of the largest entry of a vector as , namely, . In case of ties, define in an arbitrary yet consistent manner. For given, or clear from the context, define as , with index deleted. That is, for a given , , where . Note that for , all the entries sum to one. Thus, given and , we know . Next, for and , define the set
[TABLE]
where and
[TABLE]
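The operation of deleting the largest entry of a vector, described above, can be sketched as follows. The function names are hypothetical, and the tie-breaking rule (first maximal index) is one arbitrary yet consistent choice among many.

```python
import numpy as np

def drop_largest(v):
    """Return the index of the largest entry and the vector with it deleted."""
    v = np.asarray(v, dtype=float)
    x_max = int(np.argmax(v))      # np.argmax returns the first maximal index
    return x_max, np.delete(v, x_max)

def restore(v_prime, x_max):
    """Rebuild a vector whose entries sum to one from its shortened version."""
    missing = 1.0 - v_prime.sum()  # the deleted entry is fixed by the sum constraint
    return np.insert(v_prime, x_max, missing)

alpha = np.array([0.2, 0.5, 0.3])
x_max, alpha_prime = drop_largest(alpha)
alpha_back = restore(alpha_prime, x_max)
```

The round trip works only for vectors whose entries sum to one, which is why the shortened vector together with the deleted index carries the same information as the original.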
Lemma 4.
Let and be given. Let . Then, .
Proof.
It can be easily shown that . Thus, since (18) holds, it suffices to show that
[TABLE]
Indeed, summing the condition in (18) over all gives
[TABLE]
Since is a monotonically non-decreasing function of , we can simplify the above to
[TABLE]
Since both and are in , the middle term in the above is . Thus, (20) follows. ∎
Recall that our plan is to ensure the existence of a “close” pair by using a sphere-packing approach. However, since the triangle inequality does not hold for , we must use a somewhat different approach. Towards that end, define the positive quadrant associated with and as
[TABLE]
where and is as defined in (19).
Lemma 5.
Let be such that . If and have a non-empty intersection, then .
Proof.
By (15), (17), and Lemma 4, it suffices to prove that . Define as the result of applying a prime operation on each member of , where . Hence, we must equivalently prove that . By (18), we must show that for all ,
[TABLE]
Since we know that the intersection of and is non-empty, let be a member of both sets. Thus, we know that for , , and . For each we must consider two cases: and .
Consider first the case . Since and , we conclude that . Conversely, since and, by (19), , we have that . Thus we have shown that both inequalities in (21) hold.
To finish the proof, consider the case . We have already established that . Thus, since by assumption , we have that . Conversely, since and , we have that . We now recall that by (19), the fact that implies that . Thus, . Negating gives , and we have once again proved the two inequalities in (21). ∎
IV-C Weighted “sphere”-packing
The volume of our “sphere” unfortunately depends on . We would like then to alleviate this dependency by defining a density over and derive a lower bound on the weight of . Let be defined as . Next, for , abuse notation and define as . The weight of is then defined as . The following lemma proposes a lower bound on that does not depend on .
Lemma 6.
The weight satisfies
[TABLE]
Proof.
Since is a product,
[TABLE]
where . It can be shown that is decreasing when simply by using the first derivative. As for , it can be shown that is non-zero. Since we conclude that is increasing. By continuity we conclude that is minimal for and thus we get (22). ∎
We divide the letters in to subsets, according to their value. The largest subset is denoted by , and we henceforth fix accordingly. We limit our search to .
Let be the union of all the quadrants corresponding to possible choices of . Namely,
[TABLE]
In order to bound the weight of , we introduce the simpler set .
[TABLE]
The constraint in the following lemma will be motivated shortly.
Lemma 7.
Let . Then, .
Proof.
Assume . Then, there exists such that for all . Hence, for all . Moreover,
[TABLE]
There are two cases to consider. In the case where we have
[TABLE]
where the second inequality is due to the assumption . In the case where , (IV-C) becomes
[TABLE]
where we assumed . Therefore, . ∎
The lemma above and the non-negativity of , enable us to upper bound the weight of , denoted by , using . We define the mapping for all and perform a change of variables. As a result, is mapped to , which is a quadrant of a dimensional ball of a radius. The density function transforms into the unit uniform density function since . Hence, for ,
[TABLE]
where we have used the well known expression for the volume of a multidimensional ball. Finally, we prove Theorem 1.
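For reference, the well-known volume of an $n$-dimensional ball of radius $r$ is $\pi^{n/2} r^n / \Gamma(n/2 + 1)$, and the part of it lying in the positive quadrant is a $2^{-n}$ fraction of that. The sketch below evaluates both; the dimension `n` and radius `r` are left as free parameters, since their specific values in the derivation are not reproduced here, and the function names are illustrative.

```python
import math

def ball_volume(n, r=1.0):
    """Volume of an n-dimensional Euclidean ball of radius r."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

def positive_quadrant_volume(n, r=1.0):
    """Volume of the part of the ball with all coordinates non-negative."""
    return ball_volume(n, r) / 2 ** n

disk_area = ball_volume(2)                        # area of the unit disk
sphere_volume = ball_volume(3)                    # volume of the unit 3-ball
quarter_disk = positive_quadrant_volume(2, 2.0)   # quarter of a radius-2 disk
```

The Gamma function in the denominator is what gives rise to the Gamma function in the constant of Theorem 1.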
Proof of Theorem 1.
Recall that we are assuming . According to the definition of , we get by (12) that
[TABLE]
As a result, we have at least two points in , and are therefore in a position to apply a sphere-packing argument. Towards this end, let be such that the starred equality in the following derivation holds:
[TABLE]
Namely,
[TABLE]
There are two cases to consider. If , then all of (26) holds, by (22), (24) and (25). We take , and deduce the existence of a pair for which . Indeed, assuming otherwise would contradict (26), since each in the sum is contained in , and, by Lemma 5 and our assumption, all summed are disjoint.
We next consider the case . Now, any pair of letters satisfies . Indeed, by (9) and (11),
[TABLE]
where is the maximum norm.
We have proved the existence of for which . By (13) and (27), the proof is finished. ∎
References
- [1] E. Arıkan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, July 2009.
- [2] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6562–6582, October 2013.
- [3] R. Pedarsani, S. H. Hassani, I. Tal, and E. Telatar, “On the construction of polar codes,” in 2011 IEEE Int’l Symp. on Inf. Theory (ISIT), July 2011, pp. 11–15.
- [4] I. Tal, A. Sharov, and A. Vardy, “Constructing polar codes for non-binary alphabets and MACs,” in 2012 IEEE Int’l Symp. on Inf. Theory (ISIT), July 2012, pp. 2132–2136.
- [5] T. C. Gulcu, M. Ye, and A. Barg, “Construction of polar codes for arbitrary discrete memoryless channels,” in 2016 IEEE Int’l Symp. on Inf. Theory (ISIT), July 2016, pp. 51–55.
- [6] U. Pereg and I. Tal, “Channel upgradation for non-binary input alphabets and MACs,” IEEE Trans. Inf. Theory, vol. 63, no. 3, pp. 1410–1424, March 2017.
- [7] I. Tal and A. Vardy, “Channel upgrading for semantically-secure encryption on wiretap channels,” in 2013 IEEE Int’l Symp. on Inf. Theory (ISIT), July 2013, pp. 1561–1565.
- [8] I. Tal, “On the construction of polar codes for channels with moderate input alphabet sizes,” IEEE Trans. Inf. Theory, vol. 63, no. 3, pp. 1501–1509, March 2017.
