How to Quantize $n$ Outputs of a Binary Symmetric Channel to $n-1$ Bits?
Wasim Huleihel, Or Ordentlich

TL;DR
This paper investigates the maximum mutual information achievable by an (n-1)-bit quantizer of a binary symmetric channel output, establishing that the optimal quantizer simply outputs the first n-1 bits of the input.
Contribution
It proves an upper bound on mutual information for (n-1)-bit quantizers, showing the optimal quantizer is a simple truncation of the input vector.
Findings
Maximum mutual information is bounded by (n-1) times (1 - h(α)).
The optimal quantizer is the truncation of the first n-1 bits.
The bound extends from the binary symmetric channel to any binary-input memoryless output-symmetric (BMS) channel, with the capacity of that channel taking the place of 1 - h(α).
Abstract
Suppose that $Y^n$ is obtained by observing a uniform Bernoulli random vector $X^n$ through a binary symmetric channel with crossover probability $\alpha$. The "most informative Boolean function" conjecture postulates that the maximal mutual information between $Y^n$ and any Boolean function $b(X^n)$ is attained by a dictator function. In this paper, we consider the "complementary" case in which the Boolean function is replaced by $f:\{0,1\}^n \to \{0,1\}^{n-1}$, namely, an $n-1$ bit quantizer, and show that $I(f(X^n);Y^n) \le (n-1)\cdot(1-h(\alpha))$ for any such $f$. Thus, in this case, the optimal function is of the form $f(x^n) = (x_1,\ldots,x_{n-1})$.
Wasim Huleihel
MIT
Or Ordentlich
MIT
The work of W. Huleihel and O. Ordentlich was supported by the MIT - Technion Postdoctoral Fellowship.
I Introduction
Let $X^n$ be an $n$-dimensional binary vector uniformly distributed over $\{0,1\}^n$, and let $Y^n$ be the output of passing $X^n$ through a binary symmetric channel (BSC) with crossover probability $\alpha \in [0,1/2]$. In other words, $Y^n = X^n \oplus Z^n$, where $Z^n$ is a sequence of $n$ independent and identically distributed (i.i.d.) $\mathrm{Bernoulli}(\alpha)$ random variables, statistically independent of $X^n$. The following conjecture [1] has recently received considerable attention.
Conjecture 1
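For any Boolean function $b:\{0,1\}^n \to \{0,1\}$, we have $I(b(X^n);Y^n) \le 1 - h(\alpha)$, where $h(\alpha) = -\alpha\log_2\alpha - (1-\alpha)\log_2(1-\alpha)$ is the binary entropy function.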
Since the dictator function $b(x^n) = x_i$ (for any $1 \le i \le n$) achieves this upper bound with equality, Conjecture 1 intuitively postulates that the dictator function is the most “informative” one-bit quantization of $X^n$ in terms of achieving the maximal $I(b(X^n);Y^n)$. Clearly, by the symmetry of the pair $(X^n,Y^n)$ we have that $I(b(X^n);Y^n) = I(X^n;b(Y^n))$ for any function $b$, so we can equivalently think of the problem at hand as seeking the optimal one-bit quantizer of $n$ outputs of the channel. Despite attempts in various directions [1, 2, 3, 4, 5, 6, 7], Conjecture 1 remains open in general. However, for the “very noisy” case, where $\alpha > 1/2 - \delta$ for some $\delta > 0$ independent of $n$, the validity of the conjecture was established by Samorodnitsky [8].
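As a small numerical illustration of Conjecture 1 (not part of the original paper), the sketch below evaluates $I(b(X^n);Y^n)$ by brute force for $n = 3$ and compares a dictator function with the majority function. The helper names `h`, `boolean_mi`, `dictator`, and `majority`, as well as the choice $\alpha = 0.1$, are illustrative assumptions.

```python
import itertools
import math


def h(p):
    """Binary entropy function in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def boolean_mi(b, n, alpha):
    """I(b(X^n); Y^n) in bits for X^n uniform and Y^n = X^n xor Z^n."""
    cube = list(itertools.product((0, 1), repeat=n))
    prior_one = sum(b(x) for x in cube) / len(cube)
    cond_entropy = 0.0
    for y in cube:
        # X^n and Y^n are both uniform, so P(x | y) = P(y | x).
        post_one = 0.0
        for x in cube:
            d = sum(xi != yi for xi, yi in zip(x, y))
            post_one += b(x) * alpha ** d * (1 - alpha) ** (n - d)
        cond_entropy += h(post_one) / len(cube)
    return h(prior_one) - cond_entropy


def dictator(x):
    """Hypothetical dictator function: returns the first input bit."""
    return x[0]


def majority(x):
    """Majority vote over the input bits (n odd)."""
    return int(sum(x) > len(x) / 2)


alpha = 0.1  # illustrative crossover probability
mi_dictator = boolean_mi(dictator, 3, alpha)
mi_majority = boolean_mi(majority, 3, alpha)
print(f"dictator: {mi_dictator:.4f}  majority: {mi_majority:.4f}  "
      f"1 - h(alpha): {1 - h(alpha):.4f}")
```

In this instance the dictator meets $1 - h(\alpha)$ exactly while the majority function falls below it, which is consistent with the conjecture but of course proves nothing beyond this one case.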
In this paper, we consider the “complementary” case in which the Boolean function in Conjecture 1 is replaced by an $n-1$ bit quantizer. Our main result is the following.
Theorem 1
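For any function $f:\{0,1\}^n \to \{0,1\}^{n-1}$ we have
$$I(f(X^n);Y^n) \le (n-1)\cdot\big(1-h(\alpha)\big),$$
and this bound is attained with equality by, e.g., $f(x^n) = (x_1,\ldots,x_{n-1})$.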
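Theorem 1 can be sanity-checked numerically for a small block length. The sketch below is an illustration under stated assumptions, not the authors' code: it computes $I(f(X^n);Y^n) = n - H(Y^n|f(X^n))$ by splitting $\{0,1\}^n$ into the cells of $f$, then compares the truncation quantizer and 200 randomly drawn quantizers against the bound for $n = 3$. The function names, the seed, and $\alpha = 0.11$ are hypothetical choices.

```python
import itertools
import math
import random


def h(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def noisy_subset_entropy(A, n, alpha):
    """H(X_A xor Z^n) in bits, with X_A uniform on the set A."""
    total = 0.0
    for y in itertools.product((0, 1), repeat=n):
        p = 0.0
        for x in A:
            d = sum(xi != yi for xi, yi in zip(x, y))
            p += alpha ** d * (1 - alpha) ** (n - d) / len(A)
        if p > 0.0:
            total -= p * math.log2(p)
    return total


def quantizer_mi(f, n, alpha):
    """I(f(X^n); Y^n) = n - H(Y^n | f(X^n)); Y^n is uniform, so H(Y^n) = n."""
    cells = {}
    for x in itertools.product((0, 1), repeat=n):
        cells.setdefault(f(x), []).append(x)
    cond = sum(len(A) / 2 ** n * noisy_subset_entropy(A, n, alpha)
               for A in cells.values())
    return n - cond


n, alpha = 3, 0.11  # illustrative parameters
bound = (n - 1) * (1 - h(alpha))
mi_truncation = quantizer_mi(lambda x: x[: n - 1], n, alpha)

# Random lookup tables from {0,1}^n to 2^(n-1) labels.
rng = random.Random(0)
cube = list(itertools.product((0, 1), repeat=n))
mi_random = []
for _ in range(200):
    table = {x: rng.randrange(2 ** (n - 1)) for x in cube}
    mi_random.append(quantizer_mi(table.__getitem__, n, alpha))

print(f"bound: {bound:.6f}  truncation: {mi_truncation:.6f}  "
      f"best random: {max(mi_random):.6f}")
```

The truncation quantizer meets the bound because each of its cells is a pair of vectors differing in one coordinate, so the conditional output entropy is $1 + (n-1)h(\alpha)$ in every cell.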
One may wonder whether for any $f:\{0,1\}^n \to \{0,1\}^k$ we have $I(f(X^n);Y^n) \le k\cdot(1-h(\alpha))$. However, for $k = nR$ with fixed $0 < R < 1$, the problem essentially reduces to remote source coding under the log-loss distortion measure, for which the maximal value of $I(f(X^n);Y^n)$ (as a function of $R$) can be determined up to $o(n)$ terms. Indeed, [9, 3] characterize this quantity, which turns out to be greater than $nR\cdot(1-h(\alpha))$. Conjecture 1 as well as Theorem 1 deal with the extreme cases of $k = 1$ and $k = n-1$, respectively, where neglecting the $o(n)$ terms leads to a non-informative characterization of the maximal $I(f(X^n);Y^n)$, and therefore [3, 9] do not suffice.
Theorem 1 can be generalized to a stronger statement concerning the entire class of binary-input memoryless output-symmetric (BMS) channels.
Definition 1 (BMS channels)
A memoryless channel with binary input and output is called binary-input memoryless output-symmetric (BMS) if there exists a sufficient statistic for , where are statistically independent of , and is a binary random variable with .
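Corollary 1 ([10])
Let $X^n$ be an $n$-dimensional binary vector uniformly distributed over $\{0,1\}^n$, and let $Y^n$ be the output of passing $X^n$ through a BMS channel with capacity $C$. Then for every $f:\{0,1\}^n \to \{0,1\}^{n-1}$, we have
$$I(f(X^n);Y^n) \le (n-1)\cdot C,$$
and this bound is attained with equality by, e.g., $f(x^n) = (x_1,\ldots,x_{n-1})$.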
Proof:
Let be a BMS channel with capacity . Let and be the outputs corresponding to the channel and a BSC with crossover probability , respectively, when the input to both channels is . Define . For any function , we can write
[TABLE]
We proceed by noting that as the capacity achieving input distribution of both channels is . Furthermore, recall the fact that the BSC is the least capable among all BMS channels with the same capacity [11, page 116],[12, Lemma 7.1]. To wit, for any input , the corresponding outputs of and the BSC will satisfy . This implies that
[TABLE]
for all . Thus, we get that for any function ,
[TABLE]
The corollary now follows by invoking Theorem 1. ∎
II Proof of Theorem 1
Since the vector $Y^n$ is uniformly distributed over $\{0,1\}^n$, we have
$$I(f(X^n);Y^n) = H(Y^n) - H(Y^n|f(X^n)) = n - H(Y^n|f(X^n)). \tag{3}$$
Our goal is therefore to lower bound $H(Y^n|f(X^n))$.
Consider the function , and define the sets
[TABLE]
which form a disjoint partition of . Further, define the sizes of these sets as
[TABLE]
and assume without loss of generality that , for all . To see why this assumption is valid, first note that there must exist some , for which . Let . Now if there exists some , such that , we can define a new function where , , and , for all . For this function we must have
[TABLE]
and consequently .
Next, for every define the quantity
[TABLE]
which counts the number of sets with cardinality $k$ in the partition induced by the function $f$. The next proposition expresses in terms of .
Footnote 1: In fact, since we have already assumed that for all , we have that and for .
Proposition 1
For any with for all , we have that
[TABLE]
Intuitively, this proposition states that since the average size of the sets is $2^n/2^{n-1} = 2$, every set of cardinality $k$ must be compensated for by $k-2$ sets of cardinality $1$.
Proof:
Using the definition of in (4), and the fact that forms a disjoint partition of , we have
[TABLE]
Multiplying (6) by and equating it with the left-hand side of (7), we get
[TABLE]
which implies
[TABLE]
Invoking our assumption that gives the desired result. ∎
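The counting behind Proposition 1 can be illustrated on a random partition. The sketch below assumes the following reading of the argument: with $2^{n-1}$ non-empty cells covering $2^n$ points, the number of singleton cells equals the sum of $(k-2)$ times the number of cells of size $k$ over $k \ge 3$. The exact display of the proposition is not reproduced here, and the variable names (`N`, `singletons`, `compensation`) and the seed are hypothetical.

```python
import itertools
import random
from collections import Counter

n = 4
rng = random.Random(1)
cube = list(itertools.product((0, 1), repeat=n))
labels = list(range(2 ** (n - 1)))

# Hypothetical random quantizer in which every cell is non-empty:
# the first 2^(n-1) shuffled points receive distinct labels,
# the remaining points receive arbitrary labels.
rng.shuffle(cube)
f = {x: j for x, j in zip(cube, labels)}
for x in cube[len(labels):]:
    f[x] = rng.choice(labels)

cell_sizes = Counter(f.values())   # label -> size of its cell
N = Counter(cell_sizes.values())   # k -> number of cells of size k

singletons = N[1]
compensation = sum((k - 2) * count for k, count in N.items() if k >= 3)
print(dict(N), singletons, compensation)
```

The two totals agree because the number of cells is $2^{n-1}$ and the cell sizes sum to $2^n$, so the sizes minus two sum to zero.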
Definition 2 (Minimal entropy of a noisy subset)
For a family of vectors let be a random vector uniformly distributed over , and let be a sequence of i.i.d. random variables, statistically independent of . For , we define the quantity
[TABLE]
Some properties of will be studied in the next section. In particular, we will prove the following lemma.
Lemma 1
For any ,
[TABLE]
We can now write
[TABLE]
[TABLE]
where in (10) follows from Proposition 1, in (11) we have used Lemma 1, and (12) follows from (6) and (7). Proposition 4, stated and proved in the next section, shows that . Combining this with (3) and (12) establishes the desired result.
III Properties of
The main goal of this section is to prove Lemma 1. To this end, we establish some properties of the function , which may be of independent interest.
Proposition 2 (Monotonicity in the subset size)
The minimal entropy of Definition 2 is monotonically non-decreasing as a function of the subset size.
Proof:
It suffices to show that for any natural number it holds that . To this end, let be a family of vectors, and let , for . Clearly, for all . Furthermore, the random vector can be generated by first drawing a random variable and then drawing a statistically independent random vector uniformly over . Thus, for any of size we have that
[TABLE]
and in particular . ∎
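We define the partial order “$\preceq$” on the hypercube $\{0,1\}^n$ as $x \preceq y$ iff $x_i \le y_i$ for all $i \in \{1,\ldots,n\}$.
Definition 3 (Monotone sets)
A set $A \subseteq \{0,1\}^n$ is monotone if $x \in A$ implies $y \in A$ for all $y \preceq x$.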
Let . We will prove the following result.
Lemma 2 (Sufficiency of monotone sets)
[TABLE]
Remark 1
Theorem 3 in [1] states that among all Boolean functions, is maximized by functions for which the induced set is monotone. (Footnote 2: In fact, [1, Theorem 3] provides a stronger statement about the structure of the induced .) While this statement is closely related to our Lemma 2, it does not imply it, although the proof technique is somewhat similar.
The proof of Lemma 2 is based on applying a procedure called shifting [13, 14, 15].
Definition 4 (Shifting)
For a set of binary vectors the shifting procedure is defined as follows. For and write for the vector obtained by setting , and define
[TABLE]
Find the smallest such that . If there is no such then we are done. Otherwise, replace with the set , where , and repeat. The output of this process is a monotone set, denoted by , with cardinality .
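The shifting procedure can be sketched in a few lines. The code below is a minimal illustration that assumes the standard down-shift reading of Definition 4: a vector with a 1 in coordinate $i$ is moved to the vector with that coordinate set to 0 whenever the latter is not already in the set, and the smallest coordinate that still allows a move is processed first. The names `lowered`, `shift_coordinate`, `shift`, and `is_monotone` are hypothetical.

```python
def lowered(x, i):
    """Copy of the binary tuple x with coordinate i set to 0."""
    return x[:i] + (0,) + x[i + 1:]


def shift_coordinate(A, i):
    """One shifting iteration on coordinate i (assumed down-shift)."""
    A = set(A)
    movable = {x for x in A if x[i] == 1 and lowered(x, i) not in A}
    return (A - movable) | {lowered(x, i) for x in movable}


def shift(A, n):
    """Repeat shifting on the smallest coordinate that still moves a vector."""
    A = set(A)
    while True:
        for i in range(n):
            B = shift_coordinate(A, i)
            if B != A:
                A = B
                break
        else:
            # No coordinate moves anything: the set is now monotone.
            return A


def is_monotone(A, n):
    """Down-closed check: turning any 1 into a 0 stays inside A."""
    return all(lowered(x, i) in A for x in A for i in range(n) if x[i] == 1)


A = {(1, 1, 0), (0, 1, 1), (1, 0, 1)}
shifted = shift(A, 3)
print(sorted(shifted), is_monotone(shifted, 3))
```

Each move lowers the total Hamming weight of the set, so the loop terminates, and since the moved vectors land on distinct positions outside the set, the cardinality is preserved.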
The proof of Lemma 2 hinges on the following result.
Lemma 3
Let be some subset of vectors, and be the result of applying one iteration of the shifting procedure, say, on the first coordinate. Let be some discrete memoryless channel with binary input, and let be its output when the input is and be its output when the input is . For every we have that , and
[TABLE]
Proof:
Let be the projection of onto the coordinates , and note that the projection of onto these coordinates is also , as the shifting operation does not affect these coordinates. Consequently, and have the same distribution, and therefore and have the same distribution.
Next, for any vector , we have
[TABLE]
The fact that and have the same distribution, implies that , and therefore
[TABLE]
We partition the set into three subsets:
- •
- •
- •
and we note that
[TABLE]
Letting
[TABLE]
we get
[TABLE]
By the definition of the shifting procedure in Definition 4,
[TABLE]
Thus,
[TABLE]
We can use this to see that is more biased than . Indeed
[TABLE]
as desired. ∎
Corollary 2 (Shifting decreases output entropy)
Let be some subset of vectors, and be the result of applying one iteration of the shifting procedure, say, on the first coordinate. Let be a sequence of i.i.d. random variables, statistically independent of and . Then,
[TABLE]
Proof:
By the chain rule,
[TABLE]
and
[TABLE]
where the last equality follows from the fact that due to Lemma 3. Thus, it suffices to show that
[TABLE]
For any let and . Then, we get
[TABLE]
where for any , the second equality follows since , and the inequality is because is more biased than , by Lemma 3. ∎
Applying Corollary 2 recursively, we see that for any we have
[TABLE]
In fact, it is easy to extend the above argument to show that for any BMS channel with inputs and and corresponding outputs and , respectively, we get . Inequality (15) immediately establishes Lemma 2.
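The effect of one shifting iteration on the output entropy can be checked numerically. The sketch below is an illustration under the same assumed down-shift reading of Definition 4. It draws random subsets of $\{0,1\}^4$, applies one shifting iteration on a random coordinate, and records the entropy difference at the BSC output. All names, the seed, and $\alpha = 0.2$ are hypothetical choices.

```python
import itertools
import math
import random


def noisy_subset_entropy(A, n, alpha):
    """H(X_A xor Z^n) in bits, with X_A uniform on the set A."""
    total = 0.0
    for y in itertools.product((0, 1), repeat=n):
        p = 0.0
        for x in A:
            d = sum(xi != yi for xi, yi in zip(x, y))
            p += alpha ** d * (1 - alpha) ** (n - d) / len(A)
        if p > 0.0:
            total -= p * math.log2(p)
    return total


def shift_coordinate(A, i):
    """One shifting iteration on coordinate i (assumed down-shift)."""
    A = set(A)

    def lowered(x):
        return x[:i] + (0,) + x[i + 1:]

    movable = {x for x in A if x[i] == 1 and lowered(x) not in A}
    return (A - movable) | {lowered(x) for x in movable}


n, alpha = 4, 0.2  # illustrative parameters
rng = random.Random(2)
cube = list(itertools.product((0, 1), repeat=n))

# Entropy before shifting minus entropy after shifting, per random trial.
gaps = []
for _ in range(100):
    A = rng.sample(cube, rng.randint(1, len(cube)))
    i = rng.randrange(n)
    before = noisy_subset_entropy(A, n, alpha)
    after = noisy_subset_entropy(shift_coordinate(A, i), n, alpha)
    gaps.append(before - after)

print(f"smallest entropy decrease over all trials: {min(gaps):.3e}")
```

A non-negative minimum (up to floating-point error) is what Corollary 2 predicts; the experiment only samples the claim and does not replace the proof.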
We now turn to finding for .
Proposition 3
.
Proof:
For any vector we have that . ∎
Proposition 4
.
Proof:
By Lemma 2, it suffices to minimize over . It is easy to see that consists of a single set , up to permuting the order of coordinates. Thus, direct calculation gives
[TABLE]
∎
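For small $n$ the minimal entropy of Definition 2 can be found by exhaustive search, which gives an independent check on the small cases. The sketch below assumes the values $n\,h(\alpha)$ for subsets of size 1 and $1 + (n-1)h(\alpha)$ for subsets of size 2 (the latter is the entropy obtained from a pair of vectors differing in one coordinate), and also checks the monotonicity of Proposition 2. The names `min_noisy_entropy` and `H_min` and the parameters are hypothetical.

```python
import itertools
import math


def h(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def noisy_subset_entropy(A, n, alpha):
    """H(X_A xor Z^n) in bits, with X_A uniform on the set A."""
    total = 0.0
    for y in itertools.product((0, 1), repeat=n):
        p = 0.0
        for x in A:
            d = sum(xi != yi for xi, yi in zip(x, y))
            p += alpha ** d * (1 - alpha) ** (n - d) / len(A)
        if p > 0.0:
            total -= p * math.log2(p)
    return total


def min_noisy_entropy(k, n, alpha):
    """Minimum of H(X_A xor Z^n) over all subsets A of {0,1}^n with |A| = k."""
    cube = list(itertools.product((0, 1), repeat=n))
    return min(noisy_subset_entropy(A, n, alpha)
               for A in itertools.combinations(cube, k))


n, alpha = 3, 0.11  # illustrative parameters
H_min = {k: min_noisy_entropy(k, n, alpha) for k in range(1, 2 ** n + 1)}
for k, value in H_min.items():
    print(k, round(value, 6))
```

Exhaustive search is only feasible for very small $n$, which is why the paper restricts the minimization to monotone sets via Lemma 2.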
Proposition 5
[TABLE]
Proof:
By Lemma 2, it suffices to minimize over . It is easy to see that consists of a single set , up to permuting the order of coordinates. Thus, (17) is obtained by direct calculation of . To obtain the lower bound (18) we write
[TABLE]
∎
Proposition 6
.
Proof:
By Lemma 2, it suffices to minimize over . It is easy to see that consists of two sets
[TABLE]
up to permuting the order of coordinates. In particular, is the -dimensional cube padded by zeros, whereas is the -dimensional Hamming ball of radius , padded by zeros. Thus,
[TABLE]
It is easy to verify that . We show that . Indeed,
[TABLE]
Direct calculation gives
[TABLE]
which together with (19) shows that . ∎
We are now in a position to prove Lemma 1.
Proof:
For any we have that , which implies that
[TABLE]
It then remains to verify (9) for . Using the lower bound (18) for , it suffices to verify that
[TABLE]
which is equivalent to
[TABLE]
Let . It is easy to check that and that . Thus, it suffices to show that is monotonically decreasing as a function of , namely, that , for any . We have
[TABLE]
which is negative for all . ∎
Acknowledgment
The authors are grateful to Yury Polyanskiy, Shlomo Shamai (Shitz), Ofer Shayevitz, and Omri Weinstein, for many discussions that helped prompt this work.
References
[1] T. Courtade and G. Kumar, “Which Boolean functions maximize mutual information on noisy inputs?” IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 4515–4525, Aug. 2014.
[2] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On hypercontractivity and the mutual information between Boolean functions,” in 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), Oct. 2013, pp. 13–19.
[3] V. Chandar and A. Tchamkerten, “Most informative quantization functions,” in Proc. ITA Workshop, San Diego, CA, USA, Feb. 2014, available online http://perso.telecom-paristech.fr/ tchamker/CTAT.pdf.
[4] O. Ordentlich, O. Shayevitz, and O. Weinstein, “An improved upper bound for the most informative boolean function conjecture,” in IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 500–504.
[5] G. Kindler, R. O’Donnell, and D. Witmer, “Continuous analogues of the most informative function problem,” 2015. [Online]. Available: http://arxiv.org/abs/1506.03167
[6] N. Weinberger and O. Shayevitz, “On the optimal boolean function for prediction under quadratic loss,” in IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 495–499.
[7] G. Pichler, G. Matz, and P. Piantanida, “A tight upper bound on the mutual information of two boolean functions,” in IEEE Information Theory Workshop (ITW), Sept. 2016, pp. 16–20.
[8] A. Samorodnitsky, “On the entropy of a noisy function,” IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5446–5464, Oct. 2016.
