Strong Converses Are Just Edge Removal Properties
Oliver Kosut, Joerg Kliewer

TL;DR
This paper establishes a fundamental link between edge removal properties and strong converses in network information theory, showing their equivalence under certain conditions and applying this to key network models.
Contribution
It introduces a novel causal blowing-up lemma and proves the equivalence between weak edge removal and exponentially strong converse for discrete memoryless networks.
Findings
Weak edge removal implies exponentially strong converse.
The exponentially strong converse holds for the discrete 2-user interference channel with strong interference.
Relations between various notions of edge removal and strong converses are characterized.
| Finite blocklength rate region for network | |
|---|---|
| Blocklength | |
| Average probability of error | |
| Number of bits carried by edge in the modified network as shown in Fig. 2. If omitted then the network is unmodified (i.e., ) | |
| Set of nodes in connected to extra nodes and . If omitted then ; i.e., and connect to all nodes | |
| Asymptotic capacity region for network | |
| Probability of error sequence as a function of blocklength . If replaced by then asymptotically vanishing error probability | |
| Bit-capacity sequence of edge as a function of blocklength . If omitted then the network is unmodified (i.e., for all ) | |
| See above |
Oliver Kosut and Jörg Kliewer

O. Kosut is with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287 USA (email: [email protected]). J. Kliewer is with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA (email: [email protected]). This work was presented in part at the 2016 IEEE International Symposium on Information Theory. This material is based upon work supported by the National Science Foundation under Grants No. CCF-1439465, CCF-1440014, CNS-1526547, and CCF-1453718.
Abstract
This paper explores the relationship between two ideas in network information theory: edge removal and strong converses. Edge removal properties state that if an edge of small capacity is removed from a network, the capacity region does not change too much. Strong converses state that, for rates outside the capacity region, the probability of error converges to 1 as the blocklength goes to infinity. Various notions of edge removal and strong converse are defined, depending on how edge capacity and error probability scale with blocklength, and relations between them are proved. Each class of strong converse implies a specific class of edge removal. The opposite directions are proved for deterministic networks. Furthermore, a technique based on a novel, causal version of the blowing-up lemma is used to prove that for discrete memoryless networks, the weak edge removal property—that the capacity region changes continuously as the capacity of an edge vanishes—is equivalent to the exponentially strong converse—that outside the capacity region, the probability of error goes to 1 exponentially fast. This result is used to prove exponentially strong converses for several examples, including the discrete 2-user interference channel with strong interference, with only a small variation from traditional weak converse proofs.
Index Terms: Strong converse, edge removal, network information theory, reduction results, blowing-up lemma.
I Introduction
Consider a general network communication scenario given an arbitrary collection of sources and sinks connected via an arbitrary network channel. The sources are independent and each source is demanded by a subset of sinks, where this subset can be different for each source. A general interest in network information theory is to determine the capacity of such networks, defined as the set of achievable rates for each source. As this problem is known to be challenging, we consider the simpler problem of how the capacity of these networks changes if only a single edge is removed from the network. This problem was first studied in [1, 2]. The authors showed that for acyclic noiseless networks and a variety of demand types for which the cut-set bound is tight, removing an edge of capacity $\delta$ reduces the capacity of each min-cut by at most $\delta$ in each dimension. Further, in [3] it was shown for a noiseless multiple multicast demand that this edge removal property also holds for the generalized network sharing outer bound [4]; for the linear programming outer bound [5], [3] shows that removing an edge of capacity $\delta$ reduces the capacity by at most $K\delta$, where $K$ depends only on the network. In addition, the existence of the edge removal property has, for example, been tied to the question of whether a network coding instance allows a reconstruction with $\epsilon$ or zero error [6, 7], respectively. Another example is the connection of edge removal to the equivalence between a network coding instance and a corresponding index coding problem [8]. Recently, it has been shown that for a multiple-access channel with a so-called “cooperation facilitator” [9, 10, 11, 12, 13] the edge removal property does not hold. In particular, for this setting the authors show the surprising result that adding a small-capacity edge can lead to a significant increase in network capacity. 
These results have also been extended to networks with state [14] and to edges which can carry only a single bit over all times under the maximal error criterion [15]. However, despite the significant progress that has been made to understand scenarios in which the edge removal property holds, the solution to the general problem is open.
In this work, we address the connection of edge removal to the existence of strong converses for networks subject to an average probability of error constraint. As far as we know, this connection has been explored in the literature only briefly in [16, Chap. 3, p. 48]. A strong converse states that the error probability converges to 1 for large blocklengths if the rate exceeds the capacity. This is in contrast to a weak converse, which only indicates that the error probability is bounded away from zero if we operate at a rate beyond capacity. The benefit of a strong converse is that it strengthens the interpretation of capacity as a sharp phase transition in achievable probability of error. It also allows for the following interesting interpretation: if a strong converse exists for a given network instance, reliable codes (i.e., codes which allow reconstruction with error) must have rate tuples within the capacity region for and large . Thus, a strong converse refines a capacity (or first-order) result, which provides only the limiting behavior as the probability of error vanishes and the blocklength goes to infinity. However, a strong converse does not provide as much refinement as a second-order (or dispersion) result [17], which clarifies the (usually ) backoff from capacity for small blocklengths and fixed probability of error. Therefore, strong converses constitute “one-and-a-half-th order” results. Strong converses have been established for numerous problems, including point-to-point settings, e.g., for discrete memoryless channels [18] and quantum channels [19, 20]. Recently it has been shown that a strong converse holds for discrete memoryless networks with tight cut-set bounds [21]. There has also been work establishing exponentially strong converses, which state that for any rate vector outside the asymptotically-zero error capacity region, the error probability approaches 1 exponentially fast. 
Exponentially strong converses have been considered for point-to-point channels in [22, 23], and for several network problems in [24, 25, 26, 27].
In the following, we categorize the notions of edge removal and strong converses into different classes depending on how edge capacity and error probability, resp., scale with blocklength, and demonstrate relations between these instances. See Fig. 1 for a summary of our results. In particular, our contributions are as follows:
1. We show that each specific class of strong converse always implies a specific class of edge removal. This implication holds in great generality: whether the network channel model is deterministic or probabilistic, discrete or continuous, or even whether it has memory.
2. We show that implications in the opposite direction (edge removal implies strong converse) hold in some cases. In particular, we show that each opposite direction holds for deterministic networks. However, these opposite directions do not always hold; for example, for a simple discrete memoryless point-to-point channel, each edge removal property holds, but the strongest form of the strong converse (the extremely strong converse) does not hold.
3. We further show that for all discrete memoryless stationary networks, the exponentially strong converse is equivalent to the weak edge removal property. The weak edge removal property states that if a small edge with rate growing sublinearly in the blocklength is removed, the asymptotically-zero error capacity region does not change. The proof is based on a novel, causal version of the blowing-up lemma [28].
4. We demonstrate that for networks composed of independent point-to-point links with acyclic topology, a similar equivalence holds under weaker conditions: between the ordinary strong converse and what we call the very weak edge removal property, wherein the edge carries an unbounded number of bits that grows very slowly with blocklength.
5. These results, particularly the equivalence between weak edge removal and the exponentially strong converse, enable us to strengthen, without much effort, many existing computable outer bounds or weak converses to prove that they hold in an exponentially strong sense. We demonstrate this for the cut-set bound, reproducing the result of [21] to show that for rates outside the region defined by the cut-set bound, the probability of error converges to 1 exponentially fast. We also prove exponentially strong converses for discrete broadcast channels, and for the discrete 2-user interference channel with strong interference.
All the above mentioned reduction results between edge removal and strong converses reveal the surprising fact that for many cases, satisfying edge removal—a condition related only to first-order capacity—implies a seemingly stronger “one-and-a-half-th order” property, namely the existence of a specific version of a strong converse indicated by the leftward arrows in Fig. 1. This highlights again the power of the edge removal property.
This paper is organized as follows. We first introduce the model and definitions of various strong converse and edge removal properties in Sec. II. After that, in Sec. III we prove that strong converses imply edge removal properties. The opposite directions for deterministic networks are then proven in Sec. IV. Then, in Sec. V we prove one of the main results in this paper, namely the equivalence between weak edge removal and the exponentially strong converse for discrete stationary memoryless networks. We then show equivalence between very weak edge removal and the ordinary strong converse for networks of independent point-to-point links in Sec. VI. After that, in Sec. VII we derive several applications of our results, including the cut-set bound, broadcast channels, and the interference channel. Finally, Sec. VIII offers the conclusions.
II Model and Definitions
We begin by introducing notation to be used throughout the paper. Subsequently we introduce our network model and formally define the notions of strong converse and edge removal that will be the main focus, while proving some simple properties of these definitions. There are a number of subtly different definitions of rate regions; we summarize them in Table I for convenience.
Notation: For an integer we define . All logarithms and exponentials have base . The notation represents an infinite sequence of values for each positive integer . For sequences , we write if and have the same limit as . Given two probability distributions and on the same alphabet , the relative entropy (for discrete distributions) is given by
[TABLE]
Given conditional distributions and , and marginal distribution , the conditional relative entropy is given by
[TABLE]
The total variational distance (for discrete distributions) is given by
[TABLE]
The Hamming distance between two sequences is denoted
[TABLE]
For a set , indicates the closure of with respect to the Euclidean distance. We denote the set of nonnegative real numbers by . Given a vector and a scalar , we denote the vector-scalar sum as
[TABLE]
Given sets we denote the set sum as
[TABLE]
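As a concrete companion to the notation above, the following is a minimal Python sketch of the relative entropy, conditional relative entropy, total variational distance, and Hamming distance for finite alphabets. The function names are illustrative and not taken from the paper. Two conventions are assumptions on our part: logarithms are taken to base 2, and the total variational distance carries a factor of 1/2. The displayed definitions in this copy may be normalized differently.

```python
import math


def relative_entropy(p, q):
    """D(p || q) in bits (base 2 is an assumption) for equal-length sequences."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # convention: 0 * log(0 / q) = 0
        if qx == 0:
            return math.inf  # p puts mass where q has none
        total += px * math.log2(px / qx)
    return total


def conditional_relative_entropy(p_cond, q_cond, p_x):
    """Average over x ~ p_x of D(p_cond[x] || q_cond[x]); rows are indexed by x."""
    return sum(
        px * relative_entropy(p_row, q_row)
        for px, p_row, q_row in zip(p_x, p_cond, q_cond)
        if px > 0
    )


def total_variation(p, q):
    """Total variational distance; the factor 1/2 is an assumed normalization."""
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))


def hamming_distance(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))
```

For example, `hamming_distance("10110", "10011")` counts the two positions in which the strings differ.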
II-A Network Model
We begin with a network model for an arbitrary causal network channel. Many of our results apply only for discrete memoryless networks or deterministic networks, but some basic results apply in much more generality.
Consider a network consisting of nodes, where node wishes to convey a message at rate to a set of destination nodes . (Footnote 1: We assume for simplicity that at most one message originates at each node; all results can be easily generalized to the scenario in which multiple messages originate at each node.) The channel model consists of:
- An input alphabet for each ,
- An output alphabet for each ,
- For each time step , a conditional probability measure
[TABLE]
Note that the channel outputs at time depend on all previous inputs up to time , and all previous outputs up to time .
Definition 1
A network is memoryless and stationary if the probability measure in (7) can be written as
[TABLE]
and these distributions are the same for all .
Definition 2
A network is deterministic if the channel outputs at time are fixed given the channel inputs up to time ; i.e., the conditional probability distribution in (7) takes values only in .
Definition 3
A network is discrete if all input and output alphabets are finite sets. (Footnote 2: While this is technically an incorrect use of “discrete”, we use it to mean “finite alphabet” as this is the usual convention in the literature; see for example [29, p. 39].)
For any , an code consists of:
- For each node and time , an encoding function
[TABLE]
- For each where , a decoding function
[TABLE]
Assume messages for are independent and each uniformly distributed over . The channel input from node at time is given by . For , the estimate of at node is given by . We write for the complete vector of messages, and for the complete vector of message estimates. Given an code, the average probability of error is
[TABLE]
where denotes the event that there exists a node and a message index such that node decodes message incorrectly; that is, for any , . For blocklength and , let be the set of rates for which there exists an code with average probability of error at most . (Footnote 3: We allow for any in our definitions for maximum generality, even though is a trivial case in which the rate region is unbounded.) Given a sequence where for all , we say a rate vector is *achievable with respect to * if there exists an integer such that for all , . The capacity region is given by the closure of the set of all achievable rate vectors with respect to . Alternatively, we may define
[TABLE]
Throughout the paper, we use to denote a finite blocklength region, and to denote an asymptotic region. (Table I summarizes this notation.) Note that is defined as a function of the single value , whereas is a function of the infinite sequence .
In principle is defined for any sequence . However, it will be useful to restrict ourselves to sequences for which has a limit; the following proposition, proved in Appendix A, shows that we may do this without loss of generality for memoryless stationary networks.
Proposition 1
Let be any memoryless stationary network. For any , let and be two sequences where
[TABLE]
Then .
As a consequence of Proposition 1, for any sequence where , . Thus, it is enough to focus on sequences where either for some , or . Note that the latter includes any sequence converging to a constant in .
For fixed , denotes the capacity region with asymptotic error probability . With some abuse of notation, define the usual asymptotically-zero-error capacity region as
[TABLE]
Equivalently we may write
[TABLE]
Remark 1
Using average probability of error rather than maximal probability of error in our definition of capacity region is not merely convenient; it is critical to many of our results. Indeed, it is illustrated in [15, 13] that edge removal characteristics are very different with maximal probability of error rather than average, and thus the relationship between edge removal and strong converses in the maximal probability of error context is likely to be different.
We proceed to define 7 different properties: 3 notions of a strong converse and 4 notions of the edge removal property. The relationships that we will prove among these properties are shown in Fig. 1.
II-B Strong Converses
Definition 4
Strong converses are defined in terms of whether, for a given constant and a sequence ,
[TABLE]
We say network satisfies:
- the extremely strong converse if for all , (16) holds if , where is a positive constant depending only on the network.
- the exponentially strong converse if for all , (16) holds for some where .
- the strong converse if for all , (16) holds for some where .
Remark 2
Statements similar to (16) will occur throughout this paper; this condition may be alternatively written as follows: for any , there exists such that for all .
Remark 3
One can see immediately that the strong converses are ordered by strength; i.e., the extremely strong converse implies the exponentially strong converse, which in turn implies the ordinary strong converse.
The following proposition gives some equivalent definitions for each of these strong converse properties. It is proved in Appendix B.
Proposition 2
1. Network satisfies the extremely strong converse if and only if there exists a constant depending only on such that either of the following hold:
   (a) For any , any sequence of codes has probability of error satisfying
   [TABLE]
   where is the smallest number such that .
   (b) For any sequence where , .
2. Network satisfies the exponentially strong converse if and only if either of the following hold:
   (a) For all , any sequence of codes has probability of error approaching 1 exponentially fast.
   (b) For any sequence for which , .
3. Network satisfies the strong converse if and only if any of the following hold:
   (a) For all , any sequence of codes has probability of error approaching 1 as .
   (b) For all , .
   (c) There exists a sequence where and .
Remark 4
Exponential bounds on the probability of success for rates above capacity for point-to-point channels were first considered in [22]. Later, [23] exactly characterized the optimal exponent of the success probability for rates above capacity. Similar results have been found for network problems in [24, 25, 26, 27]. For point-to-point channels, [23] showed that for a discrete-memoryless point-to-point channel with capacity , for all the optimal probability of error satisfies where
[TABLE]
where and are the marginal and conditional distributions derived from respectively, is the mutual information between and where , and represents the positive part. Intuitively, represents an empirical conditional distribution; correct decoding is possible if the channel behaves like one with capacity greater than (i.e. when the second term in (18) is zero), and the first term in (18) is the exponential rate of the probability that channel behaves like with input distribution .
This result constitutes an exponentially strong converse in our terminology, since for all , but interestingly it is not an extremely strong converse for many noisy channels. Note that an extremely strong converse is equivalent to $\frac{d\alpha(R)}{dR}\big|_{R=C}>0$. However, as we show in the following proposition (proved in Appendix C), this holds only for very specialized channels.
Proposition 3
Consider a discrete-memoryless point-to-point channel with capacity . Let be the (unique) capacity-achieving output distribution. If
[TABLE]
*then . Otherwise, $\frac{d\alpha(R)}{dR}\big|_{R=C}=0$.*
Examples of point-to-point channels that satisfy (19) include:
- essentially noiseless channels, i.e., where ,
- completely noisy channels, i.e., where is independent of ,
- noisy typewriter channels, i.e., where with summation over some group , where is uniform on a subset of and independent of .
Note also that (19) implies that the channel dispersion is 0 (cf. [17, Thm. 49]), but the converse is not true. In particular, the channel dispersion is 0 if and only if there exists a capacity-achieving input distribution such that for all and all with . However, (19) can fail to hold if for some pair even if for all capacity-achieving input distributions . (For example, this is the case for channels termed exotic in [17].)
However, most channels of interest do not satisfy (19), including binary symmetric channels and binary erasure channels. Thus, while we are able to show equivalence between the extremely strong converse and the strong edge removal property for deterministic networks (see Fig. 1), this equivalence cannot hold for many noisy networks, as the extremely strong converse simply does not hold.
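The behavior of the exponent near capacity can be explored numerically for a binary symmetric channel. The sketch below is a hedged illustration and not the paper's computation. It assumes that (18) has the two-term form described in Remark 4, a conditional relative entropy plus the positive part of the rate minus a mutual information. It also restricts the search to a uniform input and to test channels that are themselves binary symmetric, so it returns only an upper bound on the minimum. The crossover probability 0.11, the rate gap 0.01, and all function names are arbitrary choices for illustration.

```python
import math


def binary_entropy(q):
    """Binary entropy h(q) in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def binary_divergence(q, p):
    """Binary relative entropy D(q || p) in bits, for 0 < p < 1."""
    total = 0.0
    if q > 0.0:
        total += q * math.log2(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log2((1.0 - q) / (1.0 - p))
    return total


def bsc_exponent_restricted(rate, p, grid=20000):
    """Grid-minimize D(q || p) + max(0, rate - (1 - h(q))) over q in [0, 1/2].

    Assumption: uniform input and BSC(q) test channels only, so this is an
    upper bound on the two-term minimum rather than the minimum itself.
    """
    candidates = [p] + [0.5 * i / grid for i in range(grid + 1)]
    return min(
        binary_divergence(q, p) + max(0.0, rate - (1.0 - binary_entropy(q)))
        for q in candidates
    )


p = 0.11  # illustrative crossover probability
capacity = 1.0 - binary_entropy(p)
at_capacity = bsc_exponent_restricted(capacity, p)
just_above = bsc_exponent_restricted(capacity + 0.01, p)
```

Under these assumptions `at_capacity` is zero and `just_above` is positive but much smaller than the rate gap of 0.01. That is consistent with an exponent whose slope vanishes at capacity for the binary symmetric channel.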
II-C Edge Removal Properties
For a subset of nodes and an integer , we define a modified network , illustrated in Fig. 2, as follows: Start with , and add two nodes denoted and . (Footnote 4: These are special nodes in that messages do not originate at them. Thus the capacity region of has the same dimension as that of .) For each node , add an infinite-capacity link from to , and an infinite-capacity link from to . Finally, add a bit-pipe from to that can noiselessly transmit bits total across the -length coding block. In the case that is not an integer multiple of , this bit-pipe cannot be modeled as a stationary memoryless channel. Instead, we assume that the bits are scheduled such that after timesteps, have been transmitted; that is, at time , the link is allowed to transmit exactly
[TABLE]
bits. (Footnote 5: One could imagine other models, such as where the bit transmission schedule is flexible but chosen in advance by the code, or where the schedule can be chosen at run-time. These model variations are unlikely to impact results, but here we adopt the more restrictive model.) Let be the set of rate vectors such that there exists an code on with average probability of error at most . That is, . Given sequences and where and , we define to be the capacity region of the sequence of networks where determines the dependence between the capacity of the edge and the blocklength. Formally, we define
[TABLE]
For the most part we are interested in the case that , so we define for convenience and . We further define and analogously to (14)–(15). For any , it is certainly true that . Note also that
Roughly, edge removal properties state that for small , the capacity of network is not too different from that of . To be precise, we define four different versions of this property as follows.
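The fixed bit-pipe schedule in the construction of the modified network can be sketched in a few lines. This is a minimal illustration under an assumption: the cumulative number of bits sent after t time steps is taken to be the floor of k·t/n, where k is the total number of bits and n the blocklength. The rounding direction and the function name `bitpipe_schedule` are our own choices, since the displayed expression is not recoverable from this copy.

```python
def bitpipe_schedule(k, n):
    """Bits the extra edge may send at each time t = 1..n.

    Assumption: after t steps exactly floor(k * t / n) bits have been sent,
    so the per-step allowance is the difference of consecutive floors.
    """
    if k < 0 or n <= 0:
        raise ValueError("need k >= 0 and n >= 1")
    return [(k * t) // n - (k * (t - 1)) // n for t in range(1, n + 1)]


schedule = bitpipe_schedule(3, 7)
```

With this convention the per-step allowances always sum to k and differ from one another by at most one bit, so the schedule is as even as an integer schedule can be.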
Definition 5
Edge removal properties are defined in terms of whether, for a given constant and a sequence ,
[TABLE]
We say network satisfies:
- the strong edge removal property if for all , (22) holds for , where is a positive constant depending only on the network.
- the weak edge removal property if for all , (22) holds for some .
- the very weak edge removal property if for all , (22) holds for some .
- the extremely weak edge removal property if for all , (22) holds for all bounded .
Remark 5
One can again see immediately that the edge removal properties are ordered by strength; i.e., the strong property implies the weak property, which implies the very weak property, which implies the extremely weak property.
The following proposition gives several alternative definitions of each of the edge removal properties. It is proved in Appendix D.
Proposition 4
1. The strong edge removal property holds if and only if there exists a finite positive constant depending only on the network such that for all ,
[TABLE]
2. The weak edge removal property holds if and only if,
[TABLE]
and also if and only if
[TABLE]
3. The very weak edge removal property holds if and only if
[TABLE]
and also if and only if
[TABLE]
4. The extremely weak edge removal property holds if and only if
[TABLE]
Remark 6
Most works on the edge removal problem (e.g., [1, 2]) consider removing an arbitrary edge from the network, rather than the specific topology shown in Fig. 2. Most similar to this topology is the notion of a super-source network in [30], which was defined for source coding problems as a network containing a node that can view all sources, and has links to each other node. Another similar notion from the literature is that of the cooperation facilitator [9, 10, 11, 12, 13, 14], which connects to the transmitting nodes (but not the receiving node) in a multiple-access network. We choose the topology in Fig. 2 because it ensures that the link that is added/removed is at least as useful as any other link. That is, when , then node has complete knowledge of every signal sent in the network, so the link can be used to simulate any other small-capacity link. In particular, for any network consisting of supplemented by a link (or multiple links) with total capacity at most bits, then . One example of such a network is one that allows for rate-limited feedback. For this reason, one consequence of edge removal results are outer bounds on networks with rate-limited feedback.
Remark 7
The extremely weak edge removal property, wherein the extra edge carries a bounded number of bits as the blocklength grows, appears in none of our results proving relationships to strong converses. Nevertheless, we have chosen to include this definition because it is a natural one, and indeed the property seems tantalizingly likely to be true for all realistic systems. However, it was shown in [15] that for maximal error probability, there exists a network where the extremely weak property does not hold. This again points to the contrast between average and maximal error probability. In light of our other results, the extremely weak property also presents an interesting question: namely, is it equivalent to some version of a strong converse? Based on our results that for some networks, the very weak edge removal property is equivalent to the ordinary strong converse, if there is an equivalent converse to the extremely weak property, it appears that it would need to be weaker than the ordinary strong converse, but perhaps stronger than the ordinary weak converse. No such property has occurred to us.
III Deriving Edge Removal Properties from Strong Converses
The following theorem states that each of the three strong converse properties implies one of the edge removal properties. This result holds for any causal network channel given by (7).
Theorem 5
For any network , the following hold:
1. The strong converse implies very weak edge removal.
2. The exponentially strong converse implies weak edge removal.
3. The extremely strong converse implies strong edge removal.
Statement (2) of this theorem was proved for noiseless networks in [16, Sec. 3.3]. Our proof uses essentially the same principle as theirs, namely converting a code on a network with an extra edge to a code on a network without one by fixing a value sent along this edge, and assuming at all other nodes that this value was sent. The following lemma provides a refined version of this argument, relating the achievable rate regions for the network with and without the extra edge at finite blocklengths.
Lemma 6
For any integers and and any ,
[TABLE]
Proof:
Let , so there is an -length code with rate vector and probability of error at most on network . We convert this code to one on network as follows. Under the code on , let be the message sent on the link from node to node . Recall that . Let be the overall error event for network . We have
[TABLE]
There must be some for which
[TABLE]
Construct a code for network that behaves exactly like the original code on network , except that all nodes assume that node received the signal . Let be the probability of error for this code. Note that with probability , the code’s behavior will be just as if the code on were in effect. Thus
[TABLE]
Therefore . ∎
Proof of Theorem 5:
We first show statement (1). Assume the strong converse holds. Thus
[TABLE]
where (33) follows from Lemma 6; (34) follows from the strong converse, because for any and ; and (35) follows because is closed. Therefore, very weak edge removal holds by the equivalent definition in (27) of Proposition 4.
We now prove statement (2). Assume the exponentially strong converse holds. For any , we have
[TABLE]
where (36) follows from Lemma 6, (37) from the fact that , and (38) from the exponentially strong converse. Therefore weak edge removal holds.
We now prove statement (3). Assume the extremely strong converse holds. For any we have
[TABLE]
where (39) follows from Lemma 6. Note that . Thus if , then, by the extremely strong converse, for some constant . Therefore strong edge removal holds. ∎
IV Deterministic Networks
The following theorem states that for deterministic networks, each implication of Theorem 5 is also an equivalence.
Theorem 7
For any deterministic network , the following hold:
1. The very weak edge removal property holds if and only if the strong converse holds.
2. The weak edge removal property holds if and only if the exponentially strong converse holds.
3. The strong edge removal property holds if and only if the extremely strong converse holds.
To prove Theorem 7, we begin with several lemmas. The first is the well-known reverse Markov inequality, which will be instrumental in proving that edge removal properties imply strong converses.
Lemma 8
Let be a real-valued random variable where a.s. For any ,
[TABLE]
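For reference, one common form of the reverse Markov inequality follows from a one-line argument. The symbols $X$, $a$, and $b$ below are our own labels, assuming $X \le b$ almost surely and $a < b$; the statement of Lemma 8 may be normalized differently.

```latex
% Assumption: X \le b almost surely, and a < b.
\mathbb{E}[X]
  \;\le\; a\,\Pr(X \le a) + b\,\Pr(X > a)
  \;=\; a + (b - a)\,\Pr(X > a)
\quad\Longrightarrow\quad
\Pr(X > a) \;\ge\; \frac{\mathbb{E}[X] - a}{b - a}.
```

The first inequality bounds $X$ by $a$ on the event $\{X \le a\}$ and by $b$ on its complement.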
The following lemma provides the core result that is needed to prove Theorem 7. The proof is adapted from that of [31, Lemma 2].
Lemma 9
Let be a deterministic network. For any , any , and any ,
[TABLE]
where
[TABLE]
Proof:
Let . That is, there exists a code with rate vector and blocklength achieving probability of error . The key to the proof is to show that if the rates are reduced slightly from those in , then an extra edge allows achieving arbitrarily small probability of error. In particular, given a target probability of error , define a rate vector given by
[TABLE]
where we choose with hindsight (recall is the number of messages in the network)
[TABLE]
We proceed to prove that
[TABLE]
by constructing a code of rate on network . However, to prove the lemma we need to show that , rather than , is contained in the right-hand side (RHS) of (41). Given (45) and that , we may simply expand the edge from node to to carry additional bits, adding bits for each message, which implies
[TABLE]
This is now enough to prove the lemma, since where is defined in (42).
We now prove (45). For , let be the message set for the th message of the original code of rate and probability of error , and let
[TABLE]
be the set of complete message vectors . Let , so . Since the network is deterministic and the code is fixed, whether or not an error occurs depends entirely on the message vector that is chosen. Let be the subset of of message vectors that do not lead to errors. Thus the probability of error is precisely . By the assumption that the probability of error is at most , we have that
[TABLE]
Recall that if , so this message is not significant. For ease of notation, we assume for now that for all messages , so that . We employ a version of a random binning argument. For each , randomly choose the sets
[TABLE]
to be a partition of where for all , such that all such partitions are equally likely. Furthermore, let for be the set of message vectors such that for all . Given these partitions, the code proceeds as follows. Messages are all transmitted to node . Node then chooses a message vector from the set in an arbitrary manner. If this set is empty, then we declare an error. For each , let be the index of in the set . Node determines for each and transmits to node . Note that the number of bits required is .
At the originating source node for message , can be determined from and . Subsequently, the code proceeds as if were the true message vector. When a destination node produces a message estimate , it constructs the final message estimate as the such that . Since by assumption , there is no error as long as is not empty.
For let
[TABLE]
where the probability is with respect to the random choice of partitions . We proceed to show that for all . Thus, the probability of error averaged over both the message vector and the random choice of partitions is at most . This proves that there exists at least one deterministic code with average probability of error .
For each , define for all , the set
[TABLE]
Moreover, define
[TABLE]
We claim that for all , if is such that , then
[TABLE]
To prove this for , assume . Define the random variable
[TABLE]
where as usual is uniformly distributed on . Note that
[TABLE]
where the inequality follows from the assumption that . Hence
[TABLE]
where (59) follows from Lemma 8 and the fact that , and (60) follows from (57). This proves (53) for . For , note that if , then by the definitions of and ,
[TABLE]
This proves (53) for .
Fix . For each , define
[TABLE]
Note that for , certainly for all , so . Moreover, since , by definition . Thus , so
[TABLE]
To upper bound , suppose , so there exists some . If is empty, then . Recall that is one set of a random partition of , which is chosen independently of . In particular, is chosen uniformly among all subsets of of size , so
[TABLE]
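The counting fact behind this step can be checked numerically. The sketch below assumes a ground set of `total` elements containing `good` distinguished elements, and a bin drawn uniformly among all subsets of size `bin_size`. It computes the exact probability that the bin misses every distinguished element and compares it with the elementary bound (1 − good/total)^bin_size. All names and parameter values are hypothetical and are not taken from the proof.

```python
from math import comb

def prob_bin_misses_good_set(total, good, bin_size):
    """Exact probability that a uniformly random bin_size-subset of a
    total-element set contains none of the `good` distinguished elements."""
    return comb(total - good, bin_size) / comb(total, bin_size)

def product_bound(total, good, bin_size):
    """Elementary upper bound (1 - good/total) ** bin_size on that probability."""
    return (1 - good / total) ** bin_size

# Hypothetical sizes chosen only to make the numbers easy to read.
total, good, bin_size = 1024, 64, 128
exact = prob_bin_misses_good_set(total, good, bin_size)
bound = product_bound(total, good, bin_size)
print(f"exact = {exact:.3e}, bound = {bound:.3e}")
```

The bound holds because each factor (total − good − i)/(total − i) in the exact ratio is at most (total − good)/total.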
Since by assumption , we have , so we may apply (53) to bound
[TABLE]
Thus
[TABLE]
where (69) follows since for integers , (71) follows since , (72) follows from the choice of in (44), (73) follows by the assumption that for all , and (74) follows since for any . This last fact can be seen by noting that is decreasing in , which holds because its derivative is given by
[TABLE]
∎
Proof:
Theorem 5 proves that each strong converse property implies the corresponding edge removal property, so we only need to prove the opposite directions.
Suppose the very weak edge removal property holds. For any constant , applying Lemma 9 gives
[TABLE]
where the last equality holds by very weak edge removal. Therefore the strong converse holds.
Now suppose the weak edge removal property holds. For any sequence where , applying Lemma 9 gives
[TABLE]
where (80) follows since for any and , for sufficiently large ; and (82) follows from weak edge removal, since . Therefore the exponentially strong converse holds.
Finally, suppose the strong edge removal property holds. For any , let where . Applying Lemma 9 gives
[TABLE]
where (83) follows from Prop. 1, (84) follows from Lemma 9, (85) follows because for sufficiently large , (86) follows by the definition of , and (87) follows by the equivalent form of the strong edge removal property in (23), where is a finite positive constant depending only on the network. Therefore, this network satisfies the equivalent form of the extremely strong converse in Prop. 2 part (1b). ∎
V Discrete Stationary Memoryless Networks
The following is our main theorem for discrete stationary memoryless networks, connecting the exponentially strong converse to the weak edge removal property. In addition, we show that both these properties are equivalent to an even weaker form of the weak edge removal property, namely, where the nodes and connect only to transmitting nodes; i.e., those nodes where . (Recall the definition being the capacity region of the network with nodes and connected only to nodes in .) This is a generalization of the “cooperation facilitator” model from [9, 10, 11, 12, 13, 14], which connected only to the transmitters in a multiple-access channel, but not the receiver. The intuition behind connecting only to transmitting nodes is that the extra edge is useful when encoding but not decoding. The reason is that when decoding, a node attempts to reconstruct a message, which is available exactly at the message’s source node. Thus, any small amount of information sent from the omniscient node could equally well be sent from the source node. However, when encoding, the “ideal” transmission may be a function of multiple messages, which are simultaneously available only at the omniscient node . Therefore, even a small capacity link from to could in principle provide significant rate gain by connecting to an encoding node. However, if a node does not transmit, it only decodes and never encodes, so the connection from nodes and is not helpful.
Theorem 10
For any discrete stationary memoryless network , the following three statements are equivalent:
1. The exponentially strong converse holds.
2. The weak edge removal property holds.
3. For all ,
[TABLE]
for some sequence , where is the set of nodes such that .
Observe that statement 1 of the theorem implies statement 2 by Theorem 5. Note that statement 3 is identical to the definition of the weak edge removal, except that the left-hand side (LHS) of (88) is instead of as in (22); i.e., in the modified network, nodes and connect only to the set of transmitting nodes rather than all nodes. Since for any , , statement 2 of the theorem implies statement 3. Hence it remains only to show that statement 3 implies statement 1. The main tool in doing so will be a modified version of the blowing-up lemma. The blowing-up lemma, originally proved in [32] (see also [28, 33]), has been used in the proof of numerous strong converse results. In some sense our result is a generalization of this technique. The traditional blowing-up lemma is stated as follows.
Lemma 11
Let be a sequence of independent random variables. Fix where for a sequence . For any , define the blown-up version of as
[TABLE]
where is the Hamming distance. There exists a sequence where
[TABLE]
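To make the blow-up operation concrete, the following brute-force sketch enumerates binary sequences of a short length, forms the Hamming blow-up of a set, and compares probabilities under an i.i.d. Bernoulli distribution. The binary alphabet, the length, the radius, and the particular set are all assumptions chosen for illustration. The example shows the qualitative effect only and says nothing about the rates in the lemma.

```python
from itertools import product

def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))

def blow_up(A, n, radius):
    """All binary n-sequences within Hamming distance `radius` of some element of A."""
    return {x for x in product((0, 1), repeat=n)
            if any(hamming(x, a) <= radius for a in A)}

def prob(S, p):
    """Probability of the set S when coordinates are i.i.d. Bernoulli(p)."""
    return sum(p ** sum(x) * (1 - p) ** (len(x) - sum(x)) for x in S)

n = 10
# Hypothetical set: sequences with at least 7 ones.
A = {x for x in product((0, 1), repeat=n) if sum(x) >= 7}
B = blow_up(A, n, radius=2)
print(f"P(A) = {prob(A, 0.5):.4f}, P(blown-up A) = {prob(B, 0.5):.4f}")
```

Here a radius of 2 enlarges the set from sequences with at least 7 ones to sequences with at least 5 ones, which raises its probability under fair coin flips from about 0.17 to about 0.62.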
The following is a causal version of the blowing-up lemma. It is stronger than the usual blowing-up lemma, but it follows from a slight modification of Marton’s proof of the blowing-up lemma in [28]. One may view this lemma as a causal version of a transportation-cost inequality [33].
Lemma 12
Let be a random sequence, not necessarily independent. Fix . There exists a sequence of conditional distributions for such that, if we let have joint distribution
[TABLE]
then almost surely, and
[TABLE]
Proof:
Let be a random sequence with distribution that of conditioned on the set . That is,
[TABLE]
For any and , by [34, Theorem 1] there exists a pair of random variables with joint distribution such that the marginal distributions satisfy
[TABLE]
and their joint distribution satisfies
[TABLE]
We now define
[TABLE]
Let have distribution given by (91), where is defined in (97). Note that
[TABLE]
where (98) follows from (91), (99) follows from (94) and (97), and (100) follows from simple rules about joint distributions. Thus
[TABLE]
where (102) holds by (100), (103) holds simply because the summation in (102) represents the marginal distribution of , and (104) holds by (95). Thus and have the same distribution. In particular, since by construction almost surely, also almost surely. We now have
[TABLE]
where (107) holds by (100), (109) holds by (96), (110) holds by Pinsker’s inequality, (111) holds by concavity of the square root, (112) holds because and have the same distribution, (113) holds by the chain rule for relative entropy, and (114) holds because, by (93),
[TABLE]
∎
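One ingredient of the chain above, Pinsker's inequality, is easy to check numerically. The sketch below uses the standard form TV(P, Q) ≤ sqrt(D(P‖Q)/2), with relative entropy in nats and total variation defined as half the L1 distance. The constants in the paper's own steps may be normalized differently, and the two distributions are hypothetical.

```python
from math import log, sqrt

def kl_nats(p, q):
    """Relative entropy D(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance, defined as half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical distributions on a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
tv = total_variation(p, q)
pinsker = sqrt(kl_nats(p, q) / 2)
print(f"TV = {tv:.4f} <= sqrt(D/2) = {pinsker:.4f}")
```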
Remark 8
Lemma 11 can be derived from Lemma 12 as follows. If in Lemma 12, is a sequence of independent random variables, then by (91), has the same distribution as . Thus
[TABLE]
where (117) holds because almost surely, (118) holds by Markov’s inequality, and in (119) we have applied (92). Assuming where , if we choose, for example, , we have and
[TABLE]
This proves Lemma 11.
With Lemma 12 in hand, we complete the proof of Theorem 10 with the following lemma.
Lemma 13
For any discrete stationary memoryless network , statement 3 of Theorem 10 implies statement 1.
Proof:
By the same argument as in the proof of Proposition 4, statement 3 of Theorem 10 is equivalent to
[TABLE]
where again is the set of transmitting nodes. By Proposition 2, the exponentially strong converse holds if and only if, for any sequence where , . Thus, to prove the lemma it is enough to show that for any where , and any , . Let be achievable with respect to . Thus for sufficiently large there exists an -length code with average probability of error at most . Let be the encoding/decoding functions for this code (see (9)–(10)). We describe a new code, illustrated in Fig. 3, achieving the same rate vector with vanishing probability of error on the network . Note that for any , we have , so if the probability of success would be exponentially small; thus we must have .
Network stacking: We adopt the notion of network stacking from [35]. The motivation for our use of network stacking is that it allows us to convert an arbitrary coding operation at a single time instance into a coding operation across a long block, thereby taking advantage of the law of large numbers. In particular, we construct independent copies of the original -length code, each with its own messages, using a total of channel uses. Each copy is referred to as a “layer”, indexed by an integer . Unlike a block Markov approach [36], in which one would transmit an -length block corresponding to the original code in sequence, in the network stacking approach we transmit copies of a single time instance of the original code before moving on to the next one. Thus coding can be done “across the layers”, using the fact that the copies of any symbol are i.i.d., while maintaining the causal structure of the original code.
We use underlines to indicate symbols on the stacked network. In particular, is the transmitted symbol from node at time in layer ; refers to the -length sequence of symbols in layer ; refers to the -length sequence of symbols at time in all layers; refers to the full -length sequence of all layers and time instances. We define , etc. similarly. Moreover, is the message originating at node in layer , and is the complete vector of messages originating at node across all layers.
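The difference between the stacked ordering and a block-by-block ordering can be sketched as a schedule of (time, layer) pairs. The names `n` for the blocklength of the original code and `N` for the number of layers are assumptions introduced for this illustration, as are the function names.

```python
def stacked_schedule(n, N):
    """Network stacking order: all N layers send their time-t symbol
    before any layer moves on to time t + 1."""
    return [(t, layer) for t in range(1, n + 1) for layer in range(1, N + 1)]

def block_schedule(n, N):
    """Block-by-block order for comparison: each layer's full n-length
    block is sent before the next layer starts."""
    return [(t, layer) for layer in range(1, N + 1) for t in range(1, n + 1)]

# Hypothetical small example: blocklength 2, three layers.
print(stacked_schedule(2, 3))
print(block_schedule(2, 3))
```

Both schedules use the same n·N channel uses. Only the stacked order makes all N copies of the time-t symbols available together, which is what allows coding across the layers while keeping the causal structure of the original code.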
Code phases: Given the original -length code, we construct an -fold stacked code as follows, where the precise dependence between and is to be determined. The code consists of phases, each consisting of a number of timesteps. These phases are visualized in Fig. 3. First we have a message coordination phase, followed by transmission phases alternating with correction phases, and concluded with a hashing phase. In the message coordination phase, nodes coordinate to choose a message vector in each layer with a relatively large probability of success; this is done in exactly the same manner as for deterministic networks in Lemma 9. Each transmission phase corresponds to one timestep in the original code: the layers act independently, each performing the coding functions from the original code at time . In the following correction phase, node transmits data to node , describing replacements for certain received data in sub-network . Node then disperses this data to the nodes in ; in subsequent transmission phases, nodes in use this replaced data in their coding operations. In the final hashing phase, hashes of all messages are dispersed to all nodes, which allows nodes in to decode. This last phase is necessary because nodes and do not connect directly to nodes in ; thus the correction approach applied to the rest of the network does not work here, since node does not know what signals were received in . Instead, hashes are used to correct any remaining errors in messages decoded in .
The message coordination phase consists of timesteps. Each transmission phase consists of exactly timesteps, since each layer transmits exactly once. Correction phases have variable lengths, depending on how much correction data is required, but a total of timesteps are allocated for all correction phases, where
[TABLE]
The hashing phase consists of timesteps. Note that in total, the transmission phases consist of timesteps. Recalling that , as , so all other phases consist of a negligible number of timesteps.
Message coordination phase: For each message vector of the original code, let be the probability of correctly decoding . Let
[TABLE]
Defining , we may lower bound the cardinality of by
[TABLE]
where (125) holds by Lemma 8 and the fact that , and (126) holds since the average probability of error is at most .
In the message coordination phase, we use the same outer code as in Lemma 9 to ensure that, with high probability, only message vectors in are ever used. By the same binning argument as in the proof of Lemma 9, this requires only bits on the link for each layer. Note that nodes and are only required to contact the nodes in , since nodes in have no message originating at them. We may therefore assume throughout the rest of this argument that for each .
Correction codebook: Let be the probability of correct decoding given message vector , and channel outputs at nodes . That is,
[TABLE]
where again is the complete vector of message estimates. Since encoding and decoding functions are assumed to be deterministic (cf. (9)–(10)), channel inputs are deterministic functions of and . Thus, the only randomness in the probability in (128) is in the channel outputs given the inputs . Recalling that for , is an independent sequence given . For each message vector of the original -length code, let
[TABLE]
Note that for any ,
[TABLE]
Thus, applying Lemma 8 to the random variable gives
[TABLE]
We now apply Lemma 12 to the distribution and the set to find conditional distributions for all . Note that these distributions depend on the message vector . For each and , independently draw
[TABLE]
These functions constitute a codebook known to all nodes.
Hashing codebook: For each and each , independently and uniformly draw from . These hashing functions also constitute a codebook known to all nodes.
Transmission phases: Before the transmission phase at time , each node has determined , which represent the corrected versions of its received signals (see description below of the correction phases). For each , node determines and transmits
[TABLE]
For each , let be the corresponding received signals.
Correction phases: In the correction phase after the transmission phase at time , node learns from each , and determines, for each ,
[TABLE]
For each for which , node transmits to node a bit string with [math] followed by bits identifying the layer as well as the value of . After doing this for each layer where , node transmits the stop bit , signaling that all nodes should proceed to the next transmission phase. Node then forwards this data to each node . For all layers for which no correcting signal was sent, each node simply sets .
Hashing phase: Node computes for all , and transmits these values to node , which subsequently disperses them to nodes in . (Footnote 6: One could also compute the hash for message directly at node , and distribute the hash to all decoder nodes from there. We choose to compute the hash at node merely to make distribution of the hashes simpler to describe.) Note that these hashes consist of a total of bits, which is sub-linear in . Thus they can be transmitted over the link as long as . For each node , if there exists a node where the point-to-point channel from to has positive capacity, then we use a point-to-point channel code to transmit the hashes from node to node . If there is no such node , then all received signals at node are independent of the rest of the network, so node cannot decode any messages; in particular, if for any , it must be that . Since the hashes occupy a sub-linear number of bits, transmitting these hashes to each node in takes a sub-linear number of timesteps, and can be done with arbitrarily small probability of error.
Decoding: For each where and each , node determines
[TABLE]
Now consider and and each where . Given and , find the unique where and there exists where for each and
[TABLE]
If there is no such or more than one, declare an error.
Probability of error analysis: Consider the following error events
[TABLE]
and, for and ,
[TABLE]
Note that as long as does not occur, then by Lemma 12, for all . By the definition of , this ensures that for all and . Events cover all errors that can occur at nodes in . Hence the probability of error of the overall code, averaged over random coding choices, is
[TABLE]
We first consider . The number of bits transmitted across link during the correction phase at time is
[TABLE]
where the final accounts for the stop bit. Thus the number of bits transmitted during all correction phases is
[TABLE]
Recall link has capacity , meaning it can transmit a bit roughly every timesteps (cf. (20)). Thus we can bound by
[TABLE]
where (147) follows from Markov’s inequality, (148) follows from Lemma 12, where we have dropped the constant since it is less than , (149) from the assumption that for all , and (150) from the definition of in (122). If we choose , then
[TABLE]
which vanishes since as .
Now we consider events . Recall that if does not occur, then for all . By the definition of in (129), we have, for any
[TABLE]
Note that given and , is determined since coding functions are deterministic. Since for all , this conditioning also determines . Thus, the distribution is independent. Applying the blowing-up lemma to this distribution and the set of that cause all messages to be decoded correctly in , there exists a random sequence that causes all messages to be decoded correctly, and
[TABLE]
In particular, if we produce copies of this sequence for each layer, then Markov’s inequality gives
[TABLE]
In particular, for each and , with probability at least , there exists that satisfies the Hamming distance condition (138), and is decoded correctly to . Thus vanishes. We now consider . The number of messages that are considered is upper bounded by the number of sequences satisfying (138), which is given by
[TABLE]
where is the binary entropy function. The probability that any given agrees with the hash value is , so
[TABLE]
where (159) holds for sufficiently large , since and , and (160) holds again by the choice . Since as , vanishes. ∎
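The two counting steps used for the hashing analysis can be sketched numerically. The first is the standard bound on the volume of a Hamming ball in terms of the binary entropy function, valid for relative radius at most 1/2. The second is a union bound over candidates for a uniformly random hash of `hash_bits` bits. The alphabet size `q`, the length `N`, the relative radius `delta`, and `hash_bits` are hypothetical parameters, and the exact expressions in the proof may differ in their constants.

```python
from math import comb, log2

def binary_entropy(d):
    """Binary entropy function in bits."""
    if d in (0.0, 1.0):
        return 0.0
    return -d * log2(d) - (1 - d) * log2(1 - d)

def hamming_ball_size(N, radius, q=2):
    """Number of length-N sequences over a q-ary alphabet within
    Hamming distance `radius` of a fixed center sequence."""
    return sum(comb(N, i) * (q - 1) ** i for i in range(radius + 1))

def entropy_bound(N, delta, q=2):
    """Standard upper bound 2^(N h(delta)) (q-1)^(delta N), for delta <= 1/2."""
    return 2 ** (N * binary_entropy(delta)) * (q - 1) ** (delta * N)

# Hypothetical parameters: 100 layers, relative radius 0.1, 60 hash bits.
N, delta, hash_bits = 100, 0.1, 60
candidates = hamming_ball_size(N, 10)
false_match = candidates * 2.0 ** (-hash_bits)  # union bound over candidates
print(f"ball size = {candidates:.3e}, entropy bound = {entropy_bound(N, delta):.3e}")
print(f"union bound on a false hash match = {false_match:.3e}")
```

Because the ball grows only like 2 raised to N·h(delta), a number of hash bits slightly larger than N·h(delta) already drives the false-match probability to zero, which is the reason a sub-linear number of hash bits suffices when the relative radius vanishes.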
Remark 9
The blowing-up lemma does not appear to be strong enough to prove that the very weak edge removal property implies the ordinary strong converse. Were we to apply the same argument above to the case , in the key application of the blowing-up lemma in (148), we would have
[TABLE]
This suggests that at least bits per layer would be required on the extra link. However, very weak edge removal requires that we achieve the same capacity region using any sequence of bits converging to infinity, which includes sequences growing more slowly than .
VI Networks of Independent Point-to-Point Links
We now consider the setting of network equivalence [35], in which consists of a stationary memoryless network made up of independent point-to-point (noisy) links. Let be the same network in which each noisy point-to-point link is replaced by a noiseless bit-pipe of the same capacity. The basic result of network equivalence states that . Theorem 10 already asserts that for such networks, the weak edge removal property holds if and only if the exponentially strong converse holds. The following theorem proves that, for such networks with acyclic topology, the same holds for the “lower level” in Fig. 1; i.e., the very weak edge removal property and the ordinary strong converse. The proof, given in Appendix E, makes use of the network equivalence principle to connect codes on to codes on , and then applies Theorem 7 on .
Theorem 14
For a discrete stationary memoryless network consisting of independent point-to-point links with acyclic topology, the very weak edge removal property holds if and only if the strong converse holds.
VII Applications
VII-A Outer Bounds
Consider any outer bound for the memoryless stationary network ; i.e. where . Suppose we could show
[TABLE]
where as usual is the set of nodes where . In other words, the outer bound is continuous with respect to the capacity of the extra edge; that is, the outer bound satisfies a weak edge removal property. Then, applying Lemma 13, we immediately find
[TABLE]
This suggests that the outer bound holds in an exponentially strong sense; that is, for any rate vector outside , the probability of error approaches 1 exponentially fast.
An outer bound may also satisfy a strong edge removal property, meaning that for some constant and any ,
[TABLE]
We have no equivalence between the strong edge removal property and the extremely strong converse for general noisy networks, but we do for deterministic networks. Thus, applying Lemma 9, if a deterministic network satisfies (164), then the outer bound holds in an extremely strong sense; that is, for any rate vector outside , the probability of error approaches 1 at an exponential rate linear in the distance to the outer bound.
For many outer bounds (indeed, almost every computable outer bound that we know of), (162) can be proved without much difficulty, and in some cases the stronger statement (164) can be proved as well. This implies that most outer bounds for discrete memoryless networks hold in an exponentially strong sense, and many outer bounds for deterministic networks hold in an extremely strong sense. We illustrate this for several outer bounds (or weak converse arguments) in the next few subsections.
VII-B Cut-set Bound
Recall that the cut-set outer bound [37] is given by where
[TABLE]
In the following, we prove (164) for this bound. This allows us to reproduce the result of [21], that the cut-set bound holds in an exponentially strong sense: that is, for any rate vector outside , the probability of error goes to 1 exponentially fast. This further implies that any network with a tight cut-set bound (i.e., where ) satisfies the exponentially strong converse. Furthermore, we conclude that for deterministic networks, the cut-set bound holds in an extremely strong sense.
Fix some sequence , and let . Consider a code achieving this rate vector, and let be the symbol sent along edge at time , or if there is no symbol at time . Note . Fix any cut set , and let . Also let be the set of message flows that cross the cut; that is, the set of where . We may write
[TABLE]
where (167) follows from Fano’s inequality, where as ; (169) follows since is a function of and ; (172) follows from the memorylessness and causality of the network model; and (173) follows by defining , , and , and by the fact that . Recalling that , we have
[TABLE]
In particular, (164) holds with . This in turn implies (162). Therefore, for discrete memoryless stationary networks, the cut-set bound holds in an exponentially strong sense, and for deterministic networks, the cut-set bound holds in an extremely strong sense.
These facts allow us to immediately derive strong converse results for various problems for which the cut-set bound is tight. For example:
1. Since the cut-set bound is tight for relay channels that are degraded, reversely degraded [36], or semideterministic [38], the exponentially strong converse holds.
2. Since the cut-set bound is tight for linear finite-field deterministic multicast networks [39], the extremely strong converse holds.
VII-C Broadcast Channel
A broadcast channel is a network where , for all , and we allow multiple messages to originate at node 1, each to be decoded at a subset of nodes in . Note that this model includes scenarios where there are private messages, public messages, and/or messages intended for some decoders but not all. We claim that the weak edge removal property and the exponentially strong converse hold for discrete memoryless broadcast channels. Indeed, the set in Theorem 10 is simply . Thus, for any sequence (whether or not it is ), , simply because if the extra nodes and can only communicate with node , then any processing done at nodes and can simply be reproduced internally at node 1. Theorem 10 immediately proves the claim.
For degraded broadcast channels, the strong converse was proved in [32], and the exponentially strong converse in [40]. However, since the capacity of the broadcast channel in general is unknown, strong converses for general broadcast channels have received little attention. As far as we know, this is the first strong (or exponentially strong) converse that has been proved for a problem for which the capacity region has no known single-letter characterization. In [41], a strong converse was established for a common randomness generation problem for which a single-letter characterization was established in [42]; this strong converse generalizes to non-discrete alphabets, including sources where the single-letter characterization has no known computable characterization, because of an auxiliary random variable. Both the result of [41] and our result on the broadcast channel are examples of strong converses for problems with no known computable rate region. The simplicity of the above proof on the broadcast channel, once we have Theorem 10, is particularly noteworthy.
VII-D Discrete 2-User Interference Channel with Strong Interference
A 2-user interference channel, illustrated in Fig. 4, is a network with 4 nodes, where , , and . Note that, to be consistent with the notation in the rest of the paper, the received symbol by the node decoding the first message is , rather than , as it is typically denoted.
Recall that an interference channel has strong interference [43] if
[TABLE]
for all . The capacity region of the interference channel in this regime was found in [44] to be the set of rate pairs such that
[TABLE]
for some with .
The following proposition establishes the exponentially strong converse under strong interference. The strong converse for the interference channel with very strong interference (in addition to fixed-error second-order results) was derived in [45]. The strong converse for the Gaussian interference channel with strong interference was proved in [46].
Proposition 15
For an interference channel with strong interference, weak edge removal and the exponentially strong converse hold.
Proof:
Note that the only nodes in an interference channel where are the encoder nodes, i.e. nodes and . Thus, by Theorem 10, to prove the proposition it is enough to show that for any , , where is the region defined in (177)–(179).
We claim that an interference channel with strong interference also satisfies (176) for any joint distribution , even when are not independent. Consider any joint distribution . For fixed , define where and deterministically. Since is deterministic, and are trivially independent, so by (176) we have
[TABLE]
where represent the outputs of the channel with as inputs. Note that . Thus and , so by (180)
[TABLE]
Since (181) holds for any , we have
[TABLE]
Similar reasoning establishes the second inequality in (176) for any . This proves the claim.
Now, by the same proof as the lemma in [44] for the independent case, for any ,
[TABLE]
where
[TABLE]
Consider where . Thus, there exists a sequence of codes with rates , with vanishing probability of error, on the modified network with an extra edge carrying bits as a function of the blocklength . Given a code of blocklength , let be the signal sent on the edge at time . Note that, since , for most values of , no bit is transmitted across at time (cf. the transmission schedule in (20)); for these we simply take to be null. Certainly . Since for , is a function of message and , we have
[TABLE]
where (190) follows since the messages are assumed to be independent. Since node only has access to , we have the Markov chain
[TABLE]
We now write
[TABLE]
where in (195) we have used the fact that , and Fano’s inequality, where as , and (197) holds by the Markov chain in (192). Similarly
[TABLE]
We also have
[TABLE]
where in (205) we have again used the Markov chain in (192). Combining (198) with (207) gives
[TABLE]
where (209) follows from (185). We may also repeat this argument to find (210) with replaced by . To summarize,
[TABLE]
One can see that this is precisely the region for the interference channel when both messages are required to be decoded at both decoders, except that we have close-to-independence instead of exact independence. The difficulty with condition (214) is not just that are not perfectly independent, but that the dependence between individual letters may vary depending on . The method of Dueck in [47] (also similar to Ahlswede’s “wringing” technique [48]) allows us to show that for most , the letters are nearly independent. This will allow single-letterization of the region in (211)–(214). In particular, there exist some and , where for all
[TABLE]
where
[TABLE]
We reproduce the essential proof of this fact from [47] as follows. First, let
[TABLE]
If is empty, then we may take and we are done. Otherwise, let be any element of . We may write
[TABLE]
where (220) follows from (214) and the fact that as defined in (217). Next, let
[TABLE]
If is empty, then we may take and again we are done. Otherwise, take to be any element of , and proceed as above. This process must terminate after a finite number (say ) of steps, at which point (215) must hold for all . By a similar argument as in (218)–(220), for each
[TABLE]
and in particular
[TABLE]
Since the mutual information is nonnegative, we have .
We now have
[TABLE]
where
[TABLE]
Applying (211), and performing similar analyses for (212)–(213), combined with (215), we have
[TABLE]
Using standard tools to bound the cardinality of auxiliary random variables (e.g., [29, Appendix C]), for each , there exists a joint distribution with that preserves the value of each mutual information quantity in (231)–(234). Recall that we started with a different code for each blocklength , so the above procedure results in a different joint distribution for each . This constitutes a sequence of joint distributions on a compact set, so there exists a convergent subsequence, with limit . Since , , and mutual information is continuous for fixed alphabets, this limiting distribution must satisfy (177)–(179); moreover, since in the limit (234) implies that , we may factor the joint distribution as . Finally, we may further reduce the cardinality of the auxiliary random variable in (177)–(179) to . ∎
VIII Conclusions
This paper explored the relationship between edge removal properties and strong converses. Our main results are summarized in Fig. 1. We found three main levels of properties for both edge removal and strong converse, and showed that for a very large class of networks, the strong converse property implies the corresponding edge removal property. Implications in the opposite direction hold for deterministic networks and sometimes for memoryless stationary networks.
Our strongest results are those for the “middle” level in Fig. 1, connecting the weak edge removal property to the exponentially strong converse. In particular, we showed that these properties are equivalent for all discrete memoryless stationary networks. Thus, if an existing weak converse or outer bound can be strengthened to show that it still holds in the presence of an extra link carrying a sub-linear number of bits, then the converse or outer bound also holds in an exponentially strong sense, meaning that for any rate vector outside the region, the probability of error converges to 1 exponentially fast. It appears that many existing arguments can be strengthened in this sense with relatively little effort, thereby proving exponentially strong results. We believe that this middle level deserves more focus than it has received so far, because exponentially strong converses and weak edge removal properties seem to hold for so many problems (at least under average probability of error). Therefore, one should always ask whether a given converse proof can be strengthened in this sense.
Several open problems remain:
1. The most important question is whether edge removal and strong converse properties hold in general. In particular, we know of no memoryless stationary network for which the weak edge removal property or the exponentially strong converse does not hold under average probability of error. The techniques of Sec. VII seem to allow one to prove a weak edge removal property (and thus an exponentially strong converse) for most (perhaps all) existing single-letter outer bounds, but there is no apparent way to do this without an existing single-letter result. Our observation that the properties hold for the discrete broadcast channel suggests that it may be possible to prove such results even for problems without known single-letter characterizations of the capacity region, but we know of no other cases for which this has been done.
2. Many of our results (particularly those showing that edge removal implies a strong converse) apply only for discrete channel coding problems; generalizing these results to continuous systems, channel cost constraints, source coding contexts, and random channel state would allow applicability to many other important network information theory problems.
3. We conjecture that an equivalence holds for discrete memoryless networks on the “lower layer” in Fig. 1, between very weak edge removal and the ordinary strong converse, but we have only been able to prove this result for deterministic networks and acyclic networks of independent point-to-point links.
4. Finally, it would be interesting to find a strong converse property equivalent to the extremely weak edge removal property.
Acknowledgements
The authors would like to thank Vincent Y. F. Tan, Michelle Effros, and Silas L. Fong for helpful discussions and feedback.
Appendix A Proof of Proposition 1
We will show that ; the opposite direction follows by reversing the roles of and . Fix any rate vector
[TABLE]
We aim to show that . There exists such that for all , . By the assumption of the proposition, there exists a subsequence such that
[TABLE]
For sufficiently large , we have , so . That is, there exists an -length code with rate and probability of error at most . Fix integer , and form a new code on network of length and rate as follows. Roughly, reduce the overall probability of error by repeating the original code times, and introducing a small amount of error correction in the form of an outer maximum distance separable (MDS) code [49, Chap. 4]. In particular, for each node where , form a MDS code on symbols from the finite field of order . This code exists for sufficiently large (e.g., a Reed-Solomon code [49, Chap. 5]). Let the MDS codeword be denoted by . Repeat the original code times, where on the th repetition is treated as the message originating at node . Because each outer code is MDS, one error can be corrected, so if at most one of the repetitions results in an error, the full code will decode correctly. Because the network is memoryless and stationary, each repetition is independent and results in error with probability , so the probability of error for the full code is given by
[TABLE]
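As a numerical illustration of the repetition argument above, the following minimal sketch computes the probability that two or more of the repetitions fail, which is the event the single-error-correcting outer code cannot repair. The names `eps` (per-repetition error probability) and `m` (number of repetitions) are hypothetical labels chosen for this sketch, and independence of the repetitions is taken from the memoryless, stationary assumption stated above.

```python
from math import comb


def full_code_error(eps: float, m: int) -> float:
    """Probability that at least two of m independent repetitions fail.

    eps and m are illustrative names (assumptions), not the paper's notation.
    """
    p_no_failure = (1.0 - eps) ** m
    p_one_failure = m * eps * (1.0 - eps) ** (m - 1)
    return 1.0 - p_no_failure - p_one_failure


def binomial_tail(eps: float, m: int) -> float:
    """Same quantity, written as an explicit binomial sum for cross-checking."""
    return sum(
        comb(m, j) * eps**j * (1.0 - eps) ** (m - j) for j in range(2, m + 1)
    )


if __name__ == "__main__":
    # With a small per-repetition error, the residual error is much smaller.
    for m in (2, 5, 10):
        print(m, full_code_error(0.01, m))
```

For small `eps` the result scales roughly like `comb(m, 2) * eps**2`, which is why a single correctable error already buys a substantial reduction.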
Note that (236) and the assumption that imply that , meaning . Thus
[TABLE]
In particular, for sufficiently large , we have
[TABLE]
Hence, for any and sufficiently large ,
[TABLE]
Consider any blocklength where . We may convert a code with blocklength to one with blocklength simply by ignoring the additional symbols. This reduces the rate by a factor of , but does not change the probability of error. Thus we have
[TABLE]
By the liminf assumption on in (13), for sufficiently large we have
[TABLE]
Thus, if , we have
[TABLE]
where (245) holds by (244) for sufficiently large . Hence, for any , for all sufficiently large we have
[TABLE]
Thus
[TABLE]
Since (248) holds for all , and is closed, we have . Note that both and must go to infinity, but converges to infinity first for fixed in (240).
Appendix B Proof of Proposition 2
Extremely strong converse ⇔ (1b): By taking , the extremely strong converse holds if and only if, for any ,
[TABLE]
By Proposition 1, if . This proves that the extremely strong converse is equivalent to the condition in (1b).
(1a) ⇒ (1b). Consider any where , and any . If , then obviously . If , then by condition (1a) we have , and . Thus . This proves (1b).
(1b) ⇒ (1a). Consider any , and any sequence of codes with probability of error . By Proposition 1, this implies , where
[TABLE]
Hence, by condition (1b), . If is the smallest number such that , then we have . This proves (17), and hence (1c).
Exponentially strong converse ⇒ (2b). Let be a sequence where . By the exponentially strong converse, for any there exists where where (16) holds. For sufficiently large , , meaning . Thus
[TABLE]
As this holds for all , we have . This proves condition (2b).
(2b) ⇒ Exponentially strong converse. Specifically, we prove that if the exponentially strong converse does not hold, then condition (2b) does not hold. Suppose there exist such that for all where , . Specifically, for any integer , . Since the sets are sorted (decreasing as grows), there exists in the interior of for all integers such that . For all , there exists such that for all ,
[TABLE]
Define a sequence
[TABLE]
Note that for , so . Moreover, for any , there is some such that and , so by (252), for all . Thus . But since , (2b) does not hold.
(2a) ⇒ (2b). By (2a), for any , the probability of correct decoding must vanish exponentially fast, so for any sequence such that . Therefore , which proves (2b).
(2b) ⇒ (2a). For any and any sequence for which , it cannot be that , or else by (2b) we would have . Therefore must approach 1 exponentially fast, which proves (2a).
Strong converse ⇒ (3b). Note that the condition in the definition of the strong converse that can be more simply written as . Consider any . By the strong converse, for any , there exists a sequence where . Noting that for sufficiently large , we have . As this holds for all , we have , which proves (3b).
(3b) ⇒ (3c). By (3b), for any integer , . In particular, there exists such that for all ,
[TABLE]
Define a sequence
[TABLE]
Certainly for , meaning . Moreover, if are such that , then
[TABLE]
Since , we have
[TABLE]
This proves (3c).
(3c) ⇒ Strong converse. By (3c), there exists a sequence where for all . This proves the strong converse.
(3c) ⇒ (3a). By (3c), there exists where for any . This implies that any sequence of codes must have probability of error exceeding for sufficiently large , so the probability of error must approach 1, which proves (3a).
(3a) ⇒ (3b). For any , by (3a) any has probability of error approaching 1, so . Therefore, , which proves (3b).
Appendix C Proof of Proposition 3
Consider a channel where (19) holds. For any , we may write
[TABLE]
where (261) follows from (19), and the fact that relative entropy is non-negative. Thus, we may lower bound by
[TABLE]
where (263) holds because for any real numbers . This lower bound is achievable by setting , where is any capacity-achieving input distribution, so indeed .
Now consider a channel where (19) does not hold. That is, there exists some where
[TABLE]
Let be any capacity-achieving input distribution. Thus,
[TABLE]
In particular, there exists some where
[TABLE]
and . For parameter , define a joint distribution where
[TABLE]
As long as , this is a valid distribution. If we marginalize out , we see that
[TABLE]
By [51, Lemma 17.3.3], the first term in the Taylor expansion for around is
[TABLE]
By [50, Cor. 1 in Sec. 4.5], for all that are reachable from some input symbol. Note that (264) implies that , and also by assumption . That is, both and are reachable output symbols, so . Thus in (269) the coefficient on is finite, and so
[TABLE]
Noting that
[TABLE]
we have
[TABLE]
where we have used the assumptions in (264) and (266).
Applying the derivation in (258)–(260), we have
[TABLE]
where we have used (270), (272), and the fact that is also the derivative of the second term in (274).
Given small enough so that is a valid distribution, we may upper bound
[TABLE]
Thus,
[TABLE]
where in (279) we have used the fact that , so ; and (280) follows from the definition of in (272), as well as (275). Note also that this derivation is valid only because , as shown in (272). Since is non-decreasing in , we must have $\frac{d\alpha(R)}{dR}\big|_{R=C}=0$.
Appendix D Proof of Proposition 4
Statement 1 follows immediately from the definition of the strong edge removal property.
We now prove statement 2. Suppose the weak edge removal property holds. Thus, for any , there exists a sequence satisfying (22). Let
[TABLE]
Note that , and so for any , we have for sufficiently large . Thus
[TABLE]
Hence, the LHS of (24) is contained in . Since this holds for all , this proves (24).
Now we show that (24) implies the weak edge removal property. For any , by (24) there exists such that . Thus, setting satisfies (22). This proves the weak edge removal property.
To prove that the weak edge removal property is also equivalent to (25), we will show that
[TABLE]
To show ⊆ in (283), we need to show that for all , is contained in the RHS of (283), or that for all . Indeed this holds because for any and any , for sufficiently large . To show ⊇ in (283), let be in the RHS of (283). Thus, for all , for sufficiently large we have . In particular, for any fixed integer , we may let , so there exists such that for all we have
[TABLE]
Let
[TABLE]
By (284), for any we have
[TABLE]
Letting , we may rewrite (286) as
[TABLE]
Note that for any integer , if , then , so . Thus ; i.e., . From (287), we have . This proves ⊇ in (283).
We now prove statement 3. Note that the very weak edge removal property is equivalent to the statement that for all ,
[TABLE]
This is easily seen to be equivalent to (26).
To show that the very weak edge removal property is also equivalent to (27), we show that
[TABLE]
Noting that
[TABLE]
it is enough to show that for all ,
[TABLE]
For any and any sequence , for sufficiently large . Thus
[TABLE]
Taking a closure yields in (291), since the LHS of (291) is already closed. To prove the opposite direction, let be a positive sequence where . For fixed and , by the definition of in (21), there exists such that for all , we have
[TABLE]
Now define a sequence
[TABLE]
Note that for any , for all , so as , because for any , for all . Thus the LHS of (291) is contained in . Moreover
[TABLE]
where (295) holds by definition, (296) follows from (293), (297) holds because , and (298) holds because for any , is some integer. This proves in (291).
We now prove statement 4. The definition of the extremely weak edge removal property may be equivalently written
[TABLE]
Note that for any bounded , for some constant integer . Thus the LHS (299) can be written
[TABLE]
Moreover, the RHS of (299) is simply . Therefore the extremely weak edge removal property is equivalent to (28).
Appendix E Proof of Theorem 14
A significant technical tool in proving network equivalence (cf. the discussion in Sec. VI, and the original result in [35]) is the idea of channel simulation, in which a point-to-point channel is accurately simulated by any other channel with higher capacity. This idea was at the heart of the proof in [35]. A version of it appears in [53] as the universal channel simulation lemma, reproduced below as Lemma 16. This lemma states that two nodes with shared randomness (represented by ) can use a noiseless link to accurately simulate a noisy channel, as long as the capacity of the noiseless link is greater than the capacity of the noisy channel. While [53] did not provide a proof, we presented a proof in the appendix of [54].
Lemma 16
Let be a discrete memoryless channel with capacity . Given a rate , a channel simulation code consists of
- •
,
- •
.
Let be the conditional pmf of given where and
[TABLE]
There exists a sequence of length- simulation codes where
[TABLE]
We now proceed to prove Theorem 14. By Theorem 5, we only need to show that the very weak edge removal property implies the ordinary strong converse. The basic approach is to use network equivalence to convert a code for noisy network into a code on the noiseless version, then apply Lemma 9 on this noiseless network, and then again use network equivalence to convert back to the noisy network.
Let be the set of pairs of nodes connected by point-to-point links. Recall that by assumption, the directed graph is acyclic. Thus, by [55, Prop. 19.1] we may assign each node a distinct integer where if . For any , let be the capacity of the link from to . Assume without loss of generality that for all . Let , so in particular . Denote and as the input and output respectively of the link . Thus the transmitted symbol from node can be written
[TABLE]
and the received symbol at node can be written
[TABLE]
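The node labelling invoked above (distinct integers that increase along every link of the acyclic graph) is a topological ordering. The following is a minimal sketch of one way to compute such a labelling, using Kahn's algorithm; the function name `topological_labels` and the representation of the graph as a list of node names plus a list of directed pairs are assumptions made for illustration, not the construction in [55].

```python
from collections import deque


def topological_labels(nodes, edges):
    """Assign each node a distinct integer so that label[u] < label[v]
    for every directed edge (u, v). Raises ValueError on a cyclic graph.

    Illustrative helper (hypothetical name), based on Kahn's algorithm.
    """
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Nodes with no incoming links can be labelled immediately.
    ready = deque(v for v in nodes if indegree[v] == 0)
    label = {}
    while ready:
        u = ready.popleft()
        label[u] = len(label) + 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if len(label) != len(nodes):
        raise ValueError("graph contains a cycle; no such labelling exists")
    return label


if __name__ == "__main__":
    links = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    print(topological_labels(["d", "c", "b", "a"], links))
```

This ordering is what makes the later pipelining step causal: a node only needs outputs from links whose transmitting node carries a smaller label.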
Let be achievable with respect to fixed . Thus, for sufficiently large , there exists a length- code for network with rate and probability of error . By (9)–(10), this code is defined by encoding functions for each node and time , and decoding functions for each node . It will be useful to work with coding functions on -length blocks rather than single time instances, so we define the block-wise encoding function at node
[TABLE]
as
[TABLE]
Using the notation in (304), we may notate the arguments to this function as
[TABLE]
Because the network is acyclic, we may form a pipelined block-Markov version of this code as follows. Given integer , we form a code with length and rate . The outer blocklength serves a similar function as it did for network stacking, but here it represents the number of message blocks transmitted successively, rather than the number of stacks. Note that message consists of bits, which we denote , each consisting of bits. We then pipeline copies of the original code, encoding -length blocks at a time. In particular, we introduce the notation
[TABLE]
Now, we define the coding operations at node by, for all ,
[TABLE]
Recall that if , then , meaning that the arguments of in (310) are causally available. Note that (310) does not specify all channel inputs, namely for ; these channel inputs can be arbitrary, as the corresponding channel outputs will be ignored. To decode at node , for all let
[TABLE]
Observe that the variables associated with a given index interact only with one another, and behave exactly like the original -length code. Thus, an error occurs in this pipelined code if and only if any of the copies makes an error, so the probability of error is
[TABLE]
Thus we have
[TABLE]
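To make the cost of pipelining concrete, the sketch below evaluates the probability that at least one copy fails, under the assumption that the copies fail independently (as one would expect for independent message blocks on a memoryless network). The names `eps` (error probability of the original code) and `k` (number of pipelined copies) are hypothetical labels for this sketch, and the union bound `k * eps` is included only as a sanity check.

```python
def pipelined_error(eps: float, k: int) -> float:
    """Probability that at least one of k independent copies makes an error.

    eps and k are illustrative names (assumptions), not the paper's notation.
    """
    return 1.0 - (1.0 - eps) ** k


if __name__ == "__main__":
    eps = 0.01
    for k in (1, 10, 100):
        exact = pipelined_error(eps, k)
        union_bound = min(1.0, k * eps)
        print(k, exact, union_bound)
```

The error grows with the number of copies but stays bounded away from 1 for any fixed number of copies, which is the regime in which the strong converse argument operates.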
Note that in this pipelined code, encoding operations are performed on -length blocks at a time. Thus, the pipelined code on can be converted to one on a deterministic network using channel simulation codes. In particular, fix and let be the network of noiseless links where link is replaced by a noiseless link with capacity . By Lemma 16, for each link there exists a channel simulation code for link of rate and total variational distance at most , where as . For each link , we use copies of the associated channel simulation code to simulate the behavior of link in network using the corresponding link on . We analyze the impact on the overall probability of error from replacing these noisy channels by channel simulation codes as follows. Let be the joint distribution of all channel inputs , channel outputs , messages , and message estimates for the pipelined code on noisy network . Similarly, let be the joint distribution of the same random variables on the code on noiseless network constructed out of channel simulation codes. Note that in the latter, and are not real channel inputs and outputs, but rather simulated inputs and outputs that feed into the channel simulation codes, used to simulate noisy links with noiseless links. Since each channel simulation code used on an -length block for link results in total variational distance at most , we may bound
[TABLE]
The probability of error for the code on the noiseless network differs from that on the original noisy network by at most the quantity in (314). Because total variational distance is an upper bound on the difference in the probability of any event between the two distributions, the probability of error of the resulting code on is at most
[TABLE]
where the inequality holds for sufficiently large , since each sequence vanishes with . Recall that the channel simulation codes described in Lemma 16 employ common randomness between the transmitter and receiver of each link. Thus, a direct application of Lemma 16 implies only the existence of a code achieving the probability in (315) if nodes are allowed common randomness. However, we may treat this common randomness as a randomized codebook, and employ a usual random coding argument to show that there exists at least one deterministic code achieving (315). Hence, for sufficiently large ,
[TABLE]
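The step above relies on the fact that total variational distance upper-bounds the difference in the probability of any event under two distributions. The following minimal sketch checks this numerically on small alphabets by enumerating every event. It assumes the normalization with a factor of 1/2 (the definition used in Lemma 16 is not reproduced here, so this is an assumption), and the function names and the toy distributions are hypothetical.

```python
from itertools import chain, combinations


def total_variation(p: dict, q: dict) -> float:
    """Total variational distance between two pmfs given as dicts,
    using the 1/2 * L1 normalization (an assumption of this sketch)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def max_event_gap(p: dict, q: dict) -> float:
    """Largest |P(A) - Q(A)| over all events A, by brute-force enumeration."""
    support = sorted(set(p) | set(q))
    events = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(
        abs(sum(p.get(x, 0.0) for x in a) - sum(q.get(x, 0.0) for x in a))
        for a in events
    )


if __name__ == "__main__":
    # Toy stand-ins for the noisy-network and simulated distributions.
    p_noisy = {"correct": 0.90, "error": 0.10}
    p_simulated = {"correct": 0.85, "error": 0.15}
    print(total_variation(p_noisy, p_simulated))
    print(max_event_gap(p_noisy, p_simulated))
```

In particular, taking the event to be "a decoding error occurs" shows that the error probabilities on the two networks differ by no more than the distance between the joint distributions.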
We now apply Lemma 9 on , to find that for any and for sufficiently large , we have
[TABLE]
where is defined in (42).
Let be the noiseless network where each link is replaced by a noiseless one with capacity . By the assumption that , we always have . We may convert the code on to one on by stretching each block of to one of length
[TABLE]
Thus
[TABLE]
Now we use ordinary noisy channel codes to convert this code back to one on , again one block (now of length ) at a time. For any and sufficiently large , the probability of an error occurring on any of these channel codes can be made at most . Thus we have
[TABLE]
As the above holds for any , we may write
[TABLE]
Since we may take to be arbitrarily large, and arbitrarily small, and we chose to be any achievable vector with respect to , by closure we have
[TABLE]
By the equivalent form of the very weak edge removal property in (27) of Proposition 4, if very weak edge removal holds, then the RHS of (323) equals , so the strong converse holds.
References
- [1] T. Ho, M. Effros, and S. Jalali, “On equivalence between network topologies,” in Proc. Forty-Eighth Annual Allerton Conference, Monticello, IL, Oct. 2010.
- [2] S. Jalali, M. Effros, and T. Ho, “On the impact of a single edge on the network coding capacity,” in Proc. Information Theory and Applications Workshop (ITA), San Diego, CA, Feb. 2011, pp. 1–5.
- [3] E. J. Lee, M. Langberg, and M. Effros, “Outer bounds and a functional study of the edge removal problem,” in Proc. IEEE Information Theory Workshop, Sevilla, Spain, Sep. 2013, pp. 1–5.
- [4] S. U. Kamath, D. N. C. Tse, and V. Anantharam, “Generalized network sharing outer bound and the two-unicast problem,” in Proc. International Symposium on Network Coding (NetCod), Beijing, China, Jul. 2011.
- [5] R. W. Yeung, “A framework for linear information inequalities,” IEEE Trans. Inf. Theory, vol. 43, no. 6, pp. 1924–1934, Nov. 1997.
- [6] M. Langberg and M. Effros, “Network coding: Is zero error always possible?” in Proc. Forty-Ninth Annual Allerton Conference, Monticello, IL, Sep. 2011, pp. 1–8.
- [7] T. H. Chan and A. Grant, “Network coding capacity regions via entropy functions,” IEEE Trans. Inf. Theory, vol. 60, no. 9, pp. 5347–5374, Sep. 2014.
- [8] M. F. Wong, M. Langberg, and M. Effros, “On a capacity equivalence between network and index coding and the edge removal problem,” in 2013 IEEE International Symposium on Information Theory, Jul. 2013, pp. 972–976.
