Strong Converses Are Just Edge Removal Properties

Oliver Kosut; Joerg Kliewer

arXiv:1706.08172·cs.IT·December 13, 2018

TL;DR

This paper establishes a fundamental link between edge removal properties and strong converses in network information theory, showing their equivalence under certain conditions and applying this to key network models.

Contribution

It introduces a novel causal blowing-up lemma and proves the equivalence between weak edge removal and exponentially strong converse for discrete memoryless networks.

Findings

1. Weak edge removal implies the exponentially strong converse.
2. The exponentially strong converse holds for the 2-user interference channel with strong interference.
3. Relations between various notions of edge removal and strong converses are characterized.

TABLE I: Summary of capacity region definitions

Symbol	Meaning
$\mathcal{R}_{\mathcal{V}}(\mathcal{N},n,\epsilon,k)$	Finite blocklength rate region for network $\mathcal{N}$
$n$	Blocklength
$\epsilon$	Average probability of error
$k$	Number of bits carried by edge $(a,b)$ in the modified network as shown in Fig. 2. If omitted then the network is unmodified (i.e., $k=0$)
$\mathcal{V}$	Set of nodes in $\mathcal{N}$ connected to extra nodes $a$ and $b$. If omitted then $\mathcal{V}=[1:d]$; i.e., $a$ and $b$ connect to all nodes
$\mathcal{C}_{\mathcal{V}}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})$	Asymptotic capacity region for network $\mathcal{N}$
$(\epsilon_{n})_{n}$	Probability of error sequence as a function of blocklength $n$. If replaced by $0^{+}$ then asymptotically vanishing error probability
$(k_{n})_{n}$	Bit-capacity sequence of edge $(a,b)$ as a function of blocklength $n$. If omitted then the network is unmodified (i.e., $k_{n}=0$ for all $n$)
$\mathcal{V}$	See above

Equations

$$D(P\|Q)=\sum_{x\in\mathcal{X}}P(x)\log\frac{P(x)}{Q(x)}.$$

$$D(P_{Y|X}\|Q_{Y|X}|R_{X})=\sum_{x,y}R_{X}(x)P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{Q_{Y|X}(y|x)}.$$

$$d_{TV}(P,Q)=\frac{1}{2}\sum_{x\in\mathcal{X}}|P(x)-Q(x)|.$$

$$d_{H}(x^{n},y^{n})=|\{t\in[1:n]:x_{t}\neq y_{t}\}|.$$

$$\mathbf{x}+\gamma=(x_{1}+\gamma,\ldots,x_{n}+\gamma).$$

$$\mathcal{A}+\mathcal{B}=\{\mathbf{x}+\mathbf{y}:\mathbf{x}\in\mathcal{A},\mathbf{y}\in\mathcal{B}\}.$$

$$P_{Y_{1t},\ldots,Y_{dt}|Y_{1}^{t-1},\ldots,Y_{d}^{t-1},X_{1}^{t},\ldots,X_{d}^{t}}.$$

$$P_{Y_{1t},\ldots,Y_{dt}|X_{1t},\ldots,X_{dt}}$$

$$\phi_{it}:[1:2^{nR_{i}}]\times\mathcal{Y}_{i}^{t-1}\to\mathcal{X}_{i},$$

$$\psi_{ij}:[1:2^{nR_{j}}]\times\mathcal{Y}_{j}^{n}\to[1:2^{nR_{i}}].$$

$$P_{e}^{(n)}=\mathbb{P}(\hat{\mathbf{W}}\neq\mathbf{W})$$

$$\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})=\overline{\bigcup_{n_{0}\in\mathbb{N}}\bigcap_{n\geq n_{0}}\mathcal{R}(\mathcal{N},n,\epsilon_{n})}.$$

$$\alpha=\liminf_{n\to\infty}-\frac{1}{n}\log(1-\epsilon_{n})=\liminf_{n\to\infty}-\frac{1}{n}\log(1-\tilde{\epsilon}_{n}).$$

$$\mathcal{C}(\mathcal{N},0^{+})=\bigcap_{\epsilon>0}\mathcal{C}(\mathcal{N},(\epsilon)_{n}).$$

$$\mathcal{C}(\mathcal{N},0^{+})=\bigcup_{\epsilon_{n}=o(1)}\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n}).$$

$$\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}.$$

$$\liminf_{n\to\infty}-\frac{1}{n}\log(1-\epsilon_{n})\geq\frac{\beta}{K}$$

$$\alpha(R)=\min_{Q_{X,Y}}\Big[D\big(Q_{Y|X}\|P_{Y|X}|Q_{X}\big)+|R-I_{Q_{X,Y}}(X;Y)|^{+}\Big]$$

$$\log\frac{P_{Y|X}(y|x)}{P_{Y}(y)}\leq C\quad\text{for all }x,y$$

$$\left\lfloor\frac{k}{n}\,t\right\rfloor-\left\lfloor\frac{k}{n}(t-1)\right\rfloor$$

$$\mathcal{C}_{\mathcal{V}}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})=\overline{\bigcup_{n_{0}\in\mathbb{N}}\bigcap_{n\geq n_{0}}\mathcal{R}_{\mathcal{V}}(\mathcal{N},n,\epsilon_{n},k_{n})}.$$

$$\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}.$$

$$\mathcal{C}(\mathcal{N},0^{+},(\delta n)_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,K\delta]^{d}.$$

$$\bigcap_{\delta>0}\mathcal{C}(\mathcal{N},0^{+},(\delta n)_{n})=\mathcal{C}(\mathcal{N},0^{+})$$

$$\bigcup_{k_{n}=o(n)}\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})=\mathcal{C}(\mathcal{N},0^{+}).$$

$$\bigcap_{k_{n}:\,k_{n}\to\infty}\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})=\mathcal{C}(\mathcal{N},0^{+})$$

$$\bigcap_{\epsilon>0}\overline{\bigcup_{k\in\mathbb{N}}\mathcal{C}(\mathcal{N},(\epsilon)_{n},(k)_{n})}=\mathcal{C}(\mathcal{N},0^{+}).$$

$$\bigcup_{k\in\mathbb{N}}\mathcal{C}(\mathcal{N},0^{+},(k)_{n})=\mathcal{C}(\mathcal{N},0^{+}).$$

$$\mathcal{R}(\mathcal{N},n,\epsilon,k)\subseteq\mathcal{R}(\mathcal{N},n,1-(1-\epsilon)2^{-k}).$$

$$1-\epsilon\leq\mathbb{P}(\mathcal{E}^{c})=\sum_{x_{ab}\in\{0,1\}^{k}}\mathbb{P}(X_{ab}=x_{ab})\,\mathbb{P}(\mathcal{E}^{c}|X_{ab}=x_{ab}).$$

Full text

Oliver Kosut and Jörg Kliewer

O. Kosut is with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287 USA. J. Kliewer is with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA. This work was presented in part at the 2016 IEEE International Symposium on Information Theory. This material is based upon work supported by the National Science Foundation under Grant No. CCF-1439465, CCF-1440014, CNS-1526547, CCF-1453718.

Abstract

This paper explores the relationship between two ideas in network information theory: edge removal and strong converses. Edge removal properties state that if an edge of small capacity is removed from a network, the capacity region does not change too much. Strong converses state that, for rates outside the capacity region, the probability of error converges to 1 as the blocklength goes to infinity. Various notions of edge removal and strong converse are defined, depending on how edge capacity and error probability scale with blocklength, and relations between them are proved. Each class of strong converse implies a specific class of edge removal. The opposite directions are proved for deterministic networks. Furthermore, a technique based on a novel, causal version of the blowing-up lemma is used to prove that for discrete memoryless networks, the weak edge removal property—that the capacity region changes continuously as the capacity of an edge vanishes—is equivalent to the exponentially strong converse—that outside the capacity region, the probability of error goes to 1 exponentially fast. This result is used to prove exponentially strong converses for several examples, including the discrete 2-user interference channel with strong interference, with only a small variation from traditional weak converse proofs.

Index Terms: Strong converse, edge removal, network information theory, reduction results, blowing-up lemma.

I Introduction

Consider a general network communication scenario given by an arbitrary collection of sources and sinks connected via an arbitrary network channel. The sources are independent and each source is demanded by a subset of sinks, where this subset can be different for each source. A general interest in network information theory is to determine the capacity of such networks, defined as the set of achievable rates for each source. As this problem is known to be challenging, we consider the simpler problem of how the capacity of these networks changes if only a single edge is removed from the network. This problem was first studied in [1, 2]. The authors showed that for acyclic noiseless networks and a variety of demand types for which the cut-set bound is tight, removing an edge of capacity $\delta$ reduces the capacity of each min-cut by at most $\delta$ in each dimension. Further, [3] showed for noiseless networks with multiple multicast demands that this edge removal property also holds for the generalized network sharing outer bound [4]; for the linear programming outer bound [5], [3] shows that removing an edge of capacity $\delta$ reduces the capacity by at most $K\delta$, where $K$ depends only on the network. In addition, the existence of the edge removal property has been tied to the question of whether a network coding instance allows reconstruction with $\epsilon$ error or with zero error [6, 7]. Another example is the connection of edge removal to the equivalence between a network coding instance and a corresponding index coding problem [8]. Recently, it has been shown that for a multiple-access channel with a so-called “cooperation facilitator” [9, 10, 11, 12, 13] the edge removal property does not hold. In particular, for this setting the authors show the surprising result that adding a small capacity edge can lead to a significant increase in network capacity. These results have also been extended to networks with state [14] and to edges which can carry only a single bit over all times under the maximal error criterion [15]. However, despite the significant progress that has been made to understand scenarios in which the edge removal property holds, the solution to the general problem is open.

In this work, we address the connection of edge removal to the existence of strong converses for networks subject to an average probability of error constraint. As far as we know, this connection has been explored in the literature only briefly in [16, Chap. 3, p. 48]. The strong converse theorem states that the error probability converges to 1 for large blocklengths $n$ if the rate exceeds the capacity. This is in contrast to a weak converse, which only indicates that the error probability is bounded away from zero if we operate at a rate beyond capacity. The benefit of a strong converse is that it strengthens the interpretation of capacity as a sharp phase transition in achievable probability of error. It also allows for the following interesting interpretation: if a strong converse exists for a given network instance, $\epsilon$-reliable codes (i.e., codes which allow reconstruction with $\epsilon$ error) must have rate tuples within the capacity region for $\epsilon\in[0,1)$ and large $n$. Thus, a strong converse refines a capacity (or first-order) result, which provides only the limiting behavior as the probability of error vanishes and the blocklength goes to infinity. However, a strong converse does not provide as much refinement as a second-order (or dispersion) result [17], which clarifies the (usually $O(1/\sqrt{n})$) backoff from capacity for small blocklengths and fixed probability of error. Therefore, strong converses constitute “one-and-a-half-th order” results. Strong converses have been established for numerous problems, including point-to-point settings, e.g., for discrete memoryless channels [18] and quantum channels [19, 20]. Recently it has been shown that a strong converse holds for discrete memoryless networks with tight cut-set bounds [21]. There has also been work establishing exponentially strong converses, which state that for any rate vector outside the asymptotically-zero error capacity region, the error probability approaches 1 exponentially fast. Exponentially strong converses have been considered for point-to-point channels in [22, 23], and for several network problems in [24, 25, 26, 27].

In the following, we categorize the notions of edge removal and strong converses into different classes depending on how edge capacity and error probability, resp., scale with blocklength, and demonstrate relations between these instances. See Fig. 1 for a summary of our results. In particular, our contributions are as follows:

1. We show that each specific class of strong converse always implies a specific class of edge removal. This implication holds in great generality: whether the network channel model is deterministic or probabilistic, discrete or continuous, or even whether it has memory.

2. We show that implications in the opposite direction (edge removal implies strong converse) hold in some cases. In particular, we show that each opposite direction holds for deterministic networks. However, these opposite directions do not always hold; for example, for a simple discrete memoryless point-to-point channel, each edge removal property holds, but the strongest form of the strong converse (the extremely strong converse) does not hold.

3. We further show that for all discrete memoryless stationary networks, the exponentially strong converse is equivalent to the weak edge removal property. The weak edge removal property states that if a small edge with rate growing sublinearly in the blocklength is removed, the asymptotically-zero error capacity region does not change. The proof is based on a novel, causal version of the blowing-up lemma [28].

4. We demonstrate that for networks composed of independent point-to-point links with acyclic topology, a similar equivalence holds for weaker conditions, namely between the ordinary strong converse and what we call the very weak edge removal property, wherein the edge carries an unbounded number of bits that grows very slowly with blocklength.

5. These results, particularly the equivalence between weak edge removal and the exponentially strong converse, enable us to, without much effort, strengthen many existing computable outer bounds or weak converses to prove that they hold in an exponentially strong sense. We demonstrate this for the cut-set bound, reproducing the result of [21] to show that for rates outside the region defined by the cut-set bound, the probability of error converges to 1 exponentially fast. We also prove exponentially strong converses for discrete broadcast channels, and for the discrete 2-user interference channel with strong interference.

All the above mentioned reduction results between edge removal and strong converses reveal the surprising fact that for many cases, satisfying edge removal—a condition related only to first-order capacity—implies a seemingly stronger “one-and-a-half-th order” property, namely the existence of a specific version of a strong converse indicated by the leftward arrows in Fig. 1. This highlights again the power of the edge removal property.

This paper is organized as follows. We first introduce the model and definitions of various strong converse and edge removal properties in Sec. II. After that, in Sec. III we prove that strong converses imply edge removal properties. The opposite directions for deterministic networks are then proven in Sec. IV. Then, in Sec. V we prove one of the main results in this paper, namely equivalence between weak edge removal and the exponentially strong converse for discrete memoryless stationary networks. We then show equivalence between very weak edge removal and the ordinary strong converse for networks of independent point-to-point links in Sec. VI. After that, in Sec. VII we derive several applications of our results, including the cut-set bound, broadcast channels, and the interference channel. Finally, Sec. VIII offers the conclusions.

II Model and Definitions

We begin by introducing notation to be used throughout the paper. Subsequently we introduce our network model, and formally define the notions of strong converse and edge removal that will be the main focus, while proving some simple properties of these definitions. There are a number of subtly different definitions of rate regions: we summarize them in Table I for convenience.

Notation: For an integer $k$ we define $[1:k]=\{1,\ldots,k\}$ . All logarithms and exponentials have base $2$ . The notation $(a_{n})_{n}$ represents an infinite sequence of values $a_{n}$ for each positive integer $n$ . For sequences $(a_{n})_{n},(b_{n})_{n}$ , we write $a_{n}\doteq b_{n}$ if $\log(a_{n})/n$ and $\log(b_{n})/n$ have the same limit as $n\to\infty$ . Given two probability distributions $P$ and $Q$ on the same alphabet $\mathcal{X}$ , the relative entropy (for discrete distributions) is given by

$$D(P\|Q)=\sum_{x\in\mathcal{X}}P(x)\log\frac{P(x)}{Q(x)}. \tag{1}$$

Given conditional distributions $P_{Y|X}$ and $Q_{Y|X}$ , and marginal distribution $R_{X}$ , the conditional relative entropy is given by

$$D(P_{Y|X}\|Q_{Y|X}|R_{X})=\sum_{x,y}R_{X}(x)P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{Q_{Y|X}(y|x)}. \tag{2}$$

The total variation distance (for discrete distributions) is given by

$$d_{TV}(P,Q)=\frac{1}{2}\sum_{x\in\mathcal{X}}|P(x)-Q(x)|. \tag{3}$$

The Hamming distance between two sequences $x^{n},y^{n}\in\mathcal{X}^{n}$ is denoted

$$d_{H}(x^{n},y^{n})=|\{t\in[1:n]:x_{t}\neq y_{t}\}|. \tag{4}$$

For a set $\mathcal{A}\subseteq\mathbb{R}^{n}$ , $\overline{\mathcal{A}}$ indicates the closure of $\mathcal{A}$ with respect to the Euclidean distance. We denote the set of nonnegative real numbers by $\mathbb{R}_{+}$ . Given a vector $\mathbf{x}=(x_{1},\ldots,x_{n})\in\mathbb{R}^{n}$ and a scalar $\gamma\in\mathbb{R}$ , we denote the vector-scalar sum as

$$\mathbf{x}+\gamma=(x_{1}+\gamma,\ldots,x_{n}+\gamma). \tag{5}$$

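Given sets $\mathcal{A},\mathcal{B}\subseteq\mathbb{R}^{n}$ we denote the set sum as

$$\mathcal{A}+\mathcal{B}=\{\mathbf{x}+\mathbf{y}:\mathbf{x}\in\mathcal{A},\mathbf{y}\in\mathcal{B}\}. \tag{6}$$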
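The notation above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: distributions are represented as dictionaries over a finite alphabet, sets are finite collections of tuples, and the function names (`relative_entropy`, `total_variation`, `hamming_distance`, `set_sum`) are hypothetical helpers, not part of the paper.

```python
import math


def relative_entropy(p, q):
    """D(P || Q) in bits, for distributions given as dicts on a finite alphabet."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # convention: 0 log 0 = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # P puts mass where Q has none
        total += px * math.log2(px / qx)
    return total


def total_variation(p, q):
    """d_TV(P, Q) = (1/2) * sum_x |P(x) - Q(x)|."""
    alphabet = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in alphabet)


def hamming_distance(xs, ys):
    """Number of positions t where x_t differs from y_t."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(xs, ys) if a != b)


def set_sum(set_a, set_b):
    """Set sum A + B = {x + y : x in A, y in B} for finite sets of tuples."""
    return {
        tuple(a_i + b_i for a_i, b_i in zip(a, b))
        for a in set_a
        for b in set_b
    }


P = {0: 0.5, 1: 0.5}
Q = {0: 0.25, 1: 0.75}
print(relative_entropy(P, Q), total_variation(P, Q))
print(hamming_distance("10110", "10011"))
print(set_sum({(0, 0), (1, 0)}, {(0, 1)}))
```

All logarithms are base 2, matching the convention stated in the notation paragraph.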

II-A Network Model

We begin with a network model for an arbitrary causal network channel. Many of our results apply only for discrete memoryless networks or deterministic networks, but some basic results apply in much more generality.

Consider a network consisting of $d$ nodes, where node $i\in[1:d]$ wishes to convey a message $W_{i}$ at rate $R_{i}$ to a set of destination nodes $\mathcal{D}_{i}\subseteq[1:d]$. (Footnote 1: We assume for simplicity that at most one message originates at each node; all results can be easily generalized to the scenario in which multiple messages originate at each node.) The channel model consists of:

• An input alphabet $\mathcal{X}_{i}$ for each $i\in[1:d]$,

• An output alphabet $\mathcal{Y}_{i}$ for each $i\in[1:d]$,

• For each time step $t$, a conditional probability measure

$$P_{Y_{1t},\ldots,Y_{dt}|Y_{1}^{t-1},\ldots,Y_{d}^{t-1},X_{1}^{t},\ldots,X_{d}^{t}}. \tag{7}$$

Note that the channel outputs at time $t$ depend on all previous inputs up to time $t$ , and all previous outputs up to time $t-1$ .

Definition 1

A network is memoryless and stationary if the probability measure in (7) can be written as

$$P_{Y_{1t},\ldots,Y_{dt}|X_{1t},\ldots,X_{dt}} \tag{8}$$

and these distributions are the same for all $t$ .

Definition 2

A network is deterministic if the channel outputs at time $t$ are fixed given the channel inputs up to time $t$ ; i.e., the conditional probability distribution in (7) takes values only in $\{0,1\}$ .

Definition 3

A network is discrete if all input and output alphabets are finite sets. (Footnote 2: While this is technically an incorrect use of “discrete”, we use it to mean “finite alphabet” as this is the usual convention in the literature; see for example [29, p. 39].)

For any $\mathbf{R}=(R_{1},\ldots,R_{d})\in\mathbb{R}_{+}^{d}$ , an $(\mathbf{R},n)$ code consists of:

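• For each node $i\in[1:d]$ and time $t\in[1:n]$, an encoding function

$$\phi_{it}:[1:2^{nR_{i}}]\times\mathcal{Y}_{i}^{t-1}\to\mathcal{X}_{i}, \tag{9}$$

• For each $i,j\in[1:d]$ where $j\in\mathcal{D}_{i}$, a decoding function

$$\psi_{ij}:[1:2^{nR_{j}}]\times\mathcal{Y}_{j}^{n}\to[1:2^{nR_{i}}]. \tag{10}$$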

Assume messages $W_{i}$ for $i=1,\ldots,d$ are independent and each uniformly distributed over $[1:2^{nR_{i}}]$ . The channel input from node $i$ at time $t$ is given by $X_{it}=\phi_{it}(W_{i},Y_{i}^{t-1})$ . For $j\in\mathcal{D}_{i}$ , the estimate of $W_{i}$ at node $j$ is given by $\hat{W}_{ij}=\psi_{ij}(W_{j},Y_{j}^{n})$ . We write $\mathbf{W}$ for the complete vector of messages, and $\hat{\mathbf{W}}$ for the complete vector of message estimates. Given an $(\mathbf{R},n)$ code, the average probability of error is

$$P_{e}^{(n)}=\mathbb{P}(\hat{\mathbf{W}}\neq\mathbf{W}) \tag{11}$$

where $\hat{\mathbf{W}}\neq\mathbf{W}$ denotes the event that at least one destination node decodes at least one of its demanded messages incorrectly; that is, $\hat{W}_{ij}\neq W_{i}$ for some $i\in[1:d]$, $j\in\mathcal{D}_{i}$. For blocklength $n$ and $\epsilon\in[0,1]$, let $\mathcal{R}(\mathcal{N},n,\epsilon)\subseteq\mathbb{R}_{+}^{d}$ be the set of rates $\mathbf{R}$ for which there exists an $(\mathbf{R},n)$ code with average probability of error at most $\epsilon$. (Footnote 3: We allow for any $\epsilon\in[0,1]$ in our definitions for maximum generality, even though $\epsilon=1$ is a trivial case in which the rate region is unbounded.) Given a sequence $(\epsilon_{n})_{n}$ where $\epsilon_{n}\in[0,1]$ for all $n\in\mathbb{N}$, we say a rate vector $\mathbf{R}$ is achievable with respect to $(\epsilon_{n})_{n}$ if there exists an integer $n_{0}$ such that for all $n\geq n_{0}$, $\mathbf{R}\in\mathcal{R}(\mathcal{N},n,\epsilon_{n})$. The capacity region $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ is given by the closure of the set of all achievable rate vectors with respect to $(\epsilon_{n})_{n}$. Alternatively, we may define

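$$\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})=\overline{\bigcup_{n_{0}\in\mathbb{N}}\bigcap_{n\geq n_{0}}\mathcal{R}(\mathcal{N},n,\epsilon_{n})}. \tag{12}$$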

Throughout the paper, we use $\mathcal{R}$ to denote a finite blocklength region, and $\mathcal{C}$ to denote an asymptotic region. (Table I summarizes this notation.) Note that $\mathcal{R}(\mathcal{N},n,\epsilon)$ is defined as a function of the single value $\epsilon$ , whereas $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ is a function of the infinite sequence $(\epsilon_{n})_{n}$ .
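To make the definition of the average probability of error and of the finite blocklength region concrete, the following sketch evaluates $P_{e}^{(n)}$ exactly for one small code. Everything specific here is an assumption chosen for illustration and does not come from the paper: a point-to-point binary symmetric channel with crossover probability `p`, a length-3 repetition encoder that ignores feedback, and a majority decoder.

```python
import itertools


def average_error_probability(n, num_messages, encode, decode, p):
    """Exact P(W_hat != W) over a BSC(p), with W uniform on range(num_messages)."""
    total = 0.0
    for w in range(num_messages):
        codeword = [encode(w, t) for t in range(n)]
        # Enumerate every noise pattern of the memoryless channel.
        for noise in itertools.product((0, 1), repeat=n):
            prob = 1.0
            for z in noise:
                prob *= p if z else 1.0 - p
            received = tuple(x ^ z for x, z in zip(codeword, noise))
            if decode(received) != w:
                total += prob / num_messages
    return total


def repetition_encode(w, t):
    """Hypothetical encoder: send the one-bit message at every time step."""
    return w


def majority_decode(received):
    """Hypothetical decoder: output the majority bit."""
    return 1 if 2 * sum(received) > len(received) else 0


p = 0.1
pe = average_error_probability(3, 2, repetition_encode, majority_decode, p)
print(pe)
```

With two messages and $n=3$ the rate is $R=1/3$, so this particular code shows that $R=1/3$ lies in $\mathcal{R}(\mathcal{N},3,\epsilon)$ for every $\epsilon$ at least as large as the computed error probability; a better code could only enlarge the region.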

In principle $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ is defined for any sequence $(\epsilon_{n})_{n}$ . However, it will be useful to restrict ourselves to sequences for which $-\frac{1}{n}\log(1-\epsilon_{n})$ has a limit; the following proposition, proved in Appendix A, shows that we may do this without loss of generality for memoryless stationary networks.

Proposition 1

Let $\mathcal{N}$ be any memoryless stationary network. For any $\alpha>0$ , let $(\epsilon_{n})_{n}$ and $(\tilde{\epsilon}_{n})_{n}$ be two sequences where

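$$\alpha=\liminf_{n\to\infty}-\frac{1}{n}\log(1-\epsilon_{n})=\liminf_{n\to\infty}-\frac{1}{n}\log(1-\tilde{\epsilon}_{n}). \tag{13}$$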

Then $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})=\mathcal{C}(\mathcal{N},(\tilde{\epsilon}_{n})_{n})$ .

As a consequence of Proposition 1, for any sequence $(\epsilon_{n})_{n}$ where $\alpha=\liminf_{n\to\infty}-\frac{1}{n}\log(1-\epsilon_{n})>0$, $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})=\mathcal{C}(\mathcal{N},(1-\exp\{-n\alpha\})_{n})$. Thus, it is enough to focus on sequences $(\epsilon_{n})_{n}$ where either $\epsilon_{n}=1-\exp\{-n\alpha\}$ for some $\alpha>0$, or $-\log(1-\epsilon_{n})=o(n)$. Note that the latter includes any sequence converging to a constant in $[0,1)$.
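The two families of error sequences singled out above can be compared numerically. The sketch below is an illustration only: it works with the success probability $1-\epsilon_{n}$ directly (to avoid floating-point cancellation), and the three example sequences and the helper name `success_exponent` are assumptions made for this example.

```python
import math


def success_exponent(success_prob, n):
    """-(1/n) log2(1 - eps_n), where success_prob = 1 - eps_n."""
    return -math.log2(success_prob) / n


alpha = 0.05
for n in (10, 100, 1000, 10000):
    exponential = success_exponent(2.0 ** (-n * alpha), n)  # eps_n = 1 - 2^{-n alpha}
    polynomial = success_exponent(1.0 / n, n)               # eps_n = 1 - 1/n
    constant = success_exponent(0.5, n)                     # eps_n = 1/2 for all n
    print(n, exponential, polynomial, constant)
```

The first sequence keeps a normalized exponent equal to $\alpha$ at every blocklength, while the other two have $-\log(1-\epsilon_{n})=o(n)$, so their normalized exponents decay toward zero even though $\epsilon_{n}=1-1/n$ still converges to 1.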

For fixed $\epsilon$ , $\mathcal{C}(\mathcal{N},(\epsilon)_{n})$ denotes the capacity region with asymptotic error probability $\epsilon$ . With some abuse of notation, define the usual asymptotically-zero-error capacity region as

$$\mathcal{C}(\mathcal{N},0^{+})=\bigcap_{\epsilon>0}\mathcal{C}(\mathcal{N},(\epsilon)_{n}). \tag{14}$$

Equivalently we may write

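$$\mathcal{C}(\mathcal{N},0^{+})=\bigcup_{\epsilon_{n}=o(1)}\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n}). \tag{15}$$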

Remark 1

Using average probability of error rather than maximal probability of error in our definition of capacity region is not merely convenient; it is critical to many of our results. Indeed, it is illustrated in [15, 13] that edge removal characteristics are very different with maximal probability of error rather than average, and thus the relationship between edge removal and strong converses in the maximal probability of error context is likely to be different.

We proceed to define 7 different properties: 3 notions of a strong converse and 4 notions of the edge removal property. The relationships that we will prove among these properties are shown in Fig. 1.

II-B Strong Converses

Definition 4

Strong converses are defined in terms of whether, for a given constant $\gamma>0$ and a sequence $(\epsilon_{n})_{n}$ ,

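$$\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}. \tag{16}$$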

We say network $\mathcal{N}$ satisfies:

• the extremely strong converse if for all $\gamma>0$, (16) holds if $-\log(1-\epsilon_{n})=\frac{\gamma n}{K}$, where $K$ is a positive constant depending only on the network.

• the exponentially strong converse if for all $\gamma>0$, (16) holds for some $(\epsilon_{n})_{n}$ where $-\log(1-\epsilon_{n})=\Theta(n)$.

• the strong converse if for all $\gamma>0$, (16) holds for some $(\epsilon_{n})_{n}$ where $-\log(1-\epsilon_{n})\to\infty$.

Remark 2

Statements similar to (16) will occur throughout this paper; this condition may be alternatively written as follows: for any $\mathbf{R}\in\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ , there exists $\mathbf{R}^{\prime}\in\mathcal{C}(\mathcal{N},0^{+})$ such that $R_{i}\leq R^{\prime}_{i}+\gamma$ for all $i\in[1:d]$ .
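The coordinate-wise reading in Remark 2 leads to a simple membership test when the reference region is closed under decreasing rates, which is assumed here: a vector $\mathbf{R}$ then has some $\mathbf{R}^{\prime}$ in the region with $R_{i}\leq R^{\prime}_{i}+\gamma$ exactly when the clipped vector $(\max(R_{i}-\gamma,0))_{i}$ is itself in the region. The sketch below uses a made-up two-dimensional polytope as a stand-in for $\mathcal{C}(\mathcal{N},0^{+})$; both the region and the function names are hypothetical.

```python
def in_toy_region(rates):
    """Hypothetical stand-in for C(N, 0+): a down-closed polytope in R_+^2."""
    r1, r2 = rates
    return r1 >= 0 and r2 >= 0 and r1 <= 1.0 and r2 <= 1.0 and r1 + r2 <= 1.5


def in_enlarged_region(rates, gamma, in_region):
    """Test R in C + [0, gamma]^d, ASSUMING the region C is down-closed.

    Shrinking every coordinate by gamma (and clipping at zero) gives the
    smallest candidate R' with R_i <= R'_i + gamma for all i.
    """
    shifted = tuple(max(r - gamma, 0.0) for r in rates)
    return in_region(shifted)


print(in_toy_region((1.05, 0.3)))
print(in_enlarged_region((1.05, 0.3), 0.1, in_toy_region))
print(in_enlarged_region((1.3, 0.3), 0.1, in_toy_region))
```

The rate pair $(1.05, 0.3)$ lies outside the toy region but inside its enlargement by $\gamma=0.1$, whereas $(1.3, 0.3)$ is outside both.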

Remark 3

One can see immediately that the strong converses are ordered by strength; i.e., the extremely strong converse implies the exponentially strong converse, which in turn implies the ordinary strong converse.

The following proposition gives some equivalent definitions for each of these strong converse properties. It is proved in Appendix B.

Proposition 2

1. Network $\mathcal{N}$ satisfies the extremely strong converse if and only if there exists a constant $K$ depending only on $\mathcal{N}$ such that either of the following hold:

(a) For any $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$, any sequence of $(\mathbf{R},n)$ codes has probability of error $(\epsilon_{n})_{n}$ satisfying

$$\liminf_{n\to\infty}-\frac{1}{n}\log(1-\epsilon_{n})\geq\frac{\beta}{K} \tag{17}$$

where $\beta$ is the smallest number such that $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})+\beta$.

(b) For any sequence $(\epsilon_{n})_{n}$ where $1-\epsilon_{n}\doteq 2^{-n\alpha}$, $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,K\alpha]^{d}$.

2. Network $\mathcal{N}$ satisfies the exponentially strong converse if and only if either of the following hold:

(a) For all $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$, any sequence of $(\mathbf{R},n)$ codes has probability of error approaching 1 exponentially fast.

(b) For any sequence $(\epsilon_{n})_{n}$ for which $-\log(1-\epsilon_{n})=o(n)$, $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})$.

3. Network $\mathcal{N}$ satisfies the strong converse if and only if any of the following hold:

(a) For all $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$, any sequence of $(\mathbf{R},n)$ codes has probability of error approaching 1 as $n\to\infty$.

(b) For all $\epsilon\in(0,1)$, $\mathcal{C}(\mathcal{N},(\epsilon)_{n})=\mathcal{C}(\mathcal{N},0^{+})$.

(c) There exists a sequence $(\epsilon_{n})_{n}$ where $\epsilon_{n}\to 1$ and $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})=\mathcal{C}(\mathcal{N},0^{+})$.

Remark 4

Exponential bounds on the probability of success for rates above capacity for point-to-point channels were first considered in [22]. Later, [23] exactly characterized the optimal exponent of the success probability for rates above capacity. Similar results have been found for network problems in [24, 25, 26, 27]. For point-to-point channels, [23] showed that for a discrete-memoryless point-to-point channel $P_{Y|X}$ with capacity $C$ , for all $R>C$ the optimal probability of error $\epsilon_{n}$ satisfies $1-\epsilon_{n}\doteq 2^{-\alpha(R)n}$ where

$$\alpha(R)=\min_{Q_{X,Y}}\Big[D\big(Q_{Y|X}\|P_{Y|X}|Q_{X}\big)+|R-I_{Q_{X,Y}}(X;Y)|^{+}\Big] \tag{18}$$

where $Q_{X}$ and $Q_{Y|X}$ are the marginal and conditional distributions derived from $Q_{X,Y}$ respectively, $I_{Q_{X,Y}}(X;Y)$ is the mutual information between $X$ and $Y$ where $(X,Y)\sim Q_{X,Y}$ , and $|\cdot|^{+}$ represents the positive part. Intuitively, $Q_{Y|X}$ represents an empirical conditional distribution; correct decoding is possible if the channel behaves like one with capacity greater than $R$ (i.e. when the second term in (18) is zero), and the first term in (18) is the exponential rate of the probability that channel $P_{Y|X}$ behaves like $Q_{Y|X}$ with input distribution $Q_{X}$ .
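The minimization in (18) can be explored numerically for a small channel. The sketch below is an illustration, not the method of [23]: it assumes a binary symmetric channel with crossover probability `p` and replaces the exact minimization over $Q_{X,Y}$ by a coarse grid search, so the returned value is only an upper bound on $\alpha(R)$ that is reasonably tight for fine grids. The parametrization of $Q_{X,Y}$ and all function names are choices made for this example.

```python
import math


def xlog2_ratio(a, b):
    """a * log2(a / b) with the conventions 0 log 0 = 0 and a log(a/0) = inf."""
    if a == 0.0:
        return 0.0
    if b == 0.0:
        return math.inf
    return a * math.log2(a / b)


def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def objective(q_x1, q_flip0, q_flip1, p, rate):
    """D(Q_{Y|X} || P_{Y|X} | Q_X) + |R - I_Q(X;Y)|^+ for binary X and Y."""
    q_x = (1.0 - q_x1, q_x1)
    # q_cond[x][y] is Q_{Y|X}(y|x); p_cond[x][y] is the BSC(p) transition law.
    q_cond = ((1.0 - q_flip0, q_flip0), (q_flip1, 1.0 - q_flip1))
    p_cond = ((1.0 - p, p), (p, 1.0 - p))
    q_y = tuple(sum(q_x[x] * q_cond[x][y] for x in (0, 1)) for y in (0, 1))
    divergence = 0.0
    mutual_info = 0.0
    for x in (0, 1):
        if q_x[x] == 0.0:
            continue  # unused inputs contribute nothing
        for y in (0, 1):
            divergence += q_x[x] * xlog2_ratio(q_cond[x][y], p_cond[x][y])
            mutual_info += q_x[x] * xlog2_ratio(q_cond[x][y], q_y[y])
    return divergence + max(rate - mutual_info, 0.0)


def alpha(rate, p, steps=20):
    """Grid-search approximation (an upper bound) of alpha(R) in (18) for a BSC(p)."""
    grid = sorted({i / steps for i in range(steps + 1)} | {p, 0.5})
    return min(
        objective(a, b, c, p, rate) for a in grid for b in grid for c in grid
    )


capacity = 1.0 - binary_entropy(0.1)
print(capacity, alpha(0.4, 0.1), alpha(1.0, 0.1))
```

Because the grid contains the point where $Q_{Y|X}=P_{Y|X}$ and $Q_{X}$ is uniform, the search returns 0 for rates below capacity and at most $R-C$ for rates above it. For the noiseless binary channel ($p=0$) the divergence term is infinite unless $Q_{Y|X}=P_{Y|X}$, and the search returns $R-1$ for $R>1$.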

This result constitutes an exponentially strong converse in our terminology, since $\alpha(R)>0$ for all $R>C$ , but interestingly it is not an extremely strong converse for many noisy channels. Note that an extremely strong converse is equivalent to $\frac{d\alpha(R)}{dR}\big{|}_{R=C}>0$ . However, as we show in the following proposition (proved in Appendix C) this holds only for very specialized channels.

Proposition 3

Consider a discrete-memoryless point-to-point channel $P_{Y|X}$ with capacity $C$ . Let $P_{Y}$ be the (unique) capacity-achieving output distribution. If

$$\log\frac{P_{Y|X}(y|x)}{P_{Y}(y)}\leq C\quad\text{for all }x,y \tag{19}$$

then $\alpha(R)=R-C$. Otherwise, $\frac{d\alpha(R)}{dR}\big|_{R=C}=0$.

Examples of point-to-point channels that satisfy (19) include:

• essentially noiseless channels, i.e., where $C=\log\min\{|\mathcal{X}|,|\mathcal{Y}|\}$,

• completely noisy channels, i.e., where $Y$ is independent of $X$,

• noisy typewriter channels, i.e., where $Y=X+Z$ with summation over some group $\mathcal{G}$, where $Z$ is uniform on a subset of $\mathcal{G}$ and independent of $X$.

Note also that (19) implies that the channel dispersion is 0 (cf. [17, Thm. 49]), but the converse is not true. In particular, the channel dispersion is 0 if and only if there exists a capacity-achieving input distribution $P_{X}$ such that $\log\frac{P_{Y|X}(y|x)}{P_{Y}(y)}\leq C$ for all $y$ and all $x$ with $P_{X}(x)>0$ . However, (19) can fail to hold if $\log\frac{P_{Y|X}(y|x)}{P_{Y}(y)}>C$ for some pair $x,y$ even if $P_{X}(x)=0$ for all capacity-achieving input distributions $P_{X}$ . (For example, this is the case for channels termed exotic in [17].)

However, most channels of interest do not satisfy (19), including binary symmetric channels and binary erasure channels. Thus, while we are able to show equivalence between the extremely strong converse and the strong edge removal property for deterministic networks (see Fig. 1), this equivalence cannot hold for many noisy networks, as the extremely strong converse simply does not hold.
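Condition (19) is easy to check numerically for the channels mentioned above. The following sketch assumes that the uniform input distribution is capacity-achieving, which holds for each of the symmetric examples used here but not for arbitrary channels; the helper name `check_condition_19` and the specific channel parameters are choices made for illustration.

```python
import math


def check_condition_19(channel, tol=1e-9):
    """Return (C, holds) for a channel matrix channel[x][y] = P_{Y|X}(y|x).

    ASSUMPTION: the uniform input is capacity-achieving, so P_Y is the
    average of the rows and C is the mutual information under that input.
    """
    num_inputs = len(channel)
    num_outputs = len(channel[0])
    p_y = [sum(row[y] for row in channel) / num_inputs for y in range(num_outputs)]
    capacity = 0.0
    worst = -math.inf
    for row in channel:
        for y, pyx in enumerate(row):
            if pyx == 0.0:
                continue  # log 0 = -inf can never violate the bound
            density = math.log2(pyx / p_y[y])
            capacity += pyx * density / num_inputs
            worst = max(worst, density)
    return capacity, worst <= capacity + tol


bsc = [[0.9, 0.1], [0.1, 0.9]]                       # binary symmetric channel
bec = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]             # outputs: 0, erasure, 1
noiseless = [[1.0, 0.0], [0.0, 1.0]]
useless = [[0.3, 0.7], [0.3, 0.7]]                   # Y independent of X
typewriter = [[0.5 if y in (x, (x + 1) % 4) else 0.0 for y in range(4)]
              for x in range(4)]                     # Y = X + Z over Z_4

for name, ch in [("BSC", bsc), ("BEC", bec), ("noiseless", noiseless),
                 ("useless", useless), ("typewriter", typewriter)]:
    print(name, check_condition_19(ch))
```

Under these assumptions the noiseless, completely noisy, and noisy typewriter channels satisfy (19), while the binary symmetric and binary erasure channels do not, in line with the discussion above.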

II-C Edge Removal Properties

For a subset of nodes $\mathcal{V}\subseteq[1:d]$ and an integer $k$ , we define a modified network $\mathcal{N}(\mathcal{V},k)$ , illustrated in Fig. 2, as follows: Start with $\mathcal{N}$ , and add two nodes denoted $a$ and $b$ .444These are special nodes in that messages do not originate at them. Thus the capacity region of $\mathcal{N}(\mathcal{V},k)$ has the same dimension as that of $\mathcal{N}$ . For each node $i\in\mathcal{V}$ , add an infinite capacity link from $i$ to $a$ , and an infinite capacity link from $b$ to $i$ . Finally, add a bit-pipe from $a$ to $b$ that can noiselessly transmit $k$ bits total across the $n$ -length coding block. In the case that $k$ is not an integer multiple of $n$ , this bit-pipe cannot be modeled as a stationary memoryless channel. Instead, we assume that the $k$ bits are scheduled such that after $t$ timesteps, $\lfloor\frac{k}{n}\,t\rfloor$ have been transmitted; that is, at time $t$ , the link is allowed to transmit exactly

$$\left\lfloor\frac{k}{n}\,t\right\rfloor-\left\lfloor\frac{k}{n}\,(t-1)\right\rfloor \tag{20}$$

bits. (One could imagine other models, such as where the bit transmission schedule is flexible but chosen in advance by the code, or where the schedule can be chosen at run-time. These model variations are unlikely to impact results, but here we adopt the more restrictive model.) Let $\mathcal{R}_{\mathcal{V}}(\mathcal{N},n,\epsilon,k)$ be the set of rate vectors $\mathbf{R}$ such that there exists an $(\mathbf{R},n)$ code on $\mathcal{N}(\mathcal{V},k)$ with average probability of error at most $\epsilon$ . That is, $\mathcal{R}_{\mathcal{V}}(\mathcal{N},n,\epsilon,k)=\mathcal{R}(\mathcal{N}(\mathcal{V},k),n,\epsilon)$ . Given sequences $(\epsilon_{n})_{n}$ and $(k_{n})_{n}$ where $\epsilon_{n}\in[0,1]$ and $k_{n}\in\mathbb{N}$ , we define $\mathcal{C}_{\mathcal{V}}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})$ to be the capacity region of the sequence of networks $(\mathcal{N}(\mathcal{V},k_{n}))_{n}$ where $(k_{n})_{n}$ determines the dependence between the capacity of the edge $(a,b)$ and the blocklength. Formally, we define

[TABLE]

For the most part we are interested in the case that $\mathcal{V}=[1:d]$ , so we define for convenience $\mathcal{R}(\mathcal{N},n,\epsilon,k)=\mathcal{R}_{[1:d]}(\mathcal{N},n,\epsilon,k)$ and $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})=\mathcal{C}_{[1:d]}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})$ . We further define $\mathcal{C}_{\mathcal{V}}(\mathcal{N},0^{+},(k_{n})_{n})$ and $\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})$ analogously to (14)–(15). For any $(k_{n})_{n}$ , it is certainly true that $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})$ . Note also that $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n},(0)_{n})=\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n}).$
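The bit-pipe schedule described above can be sketched in a few lines. This is a minimal illustration, assuming the per-timestep allowance is the difference of consecutive cumulative counts $\lfloor kt/n\rfloor$; the function name `bit_schedule` is hypothetical and not part of the paper.

```python
def bit_schedule(k: int, n: int) -> list[int]:
    """Bits the (a, b) bit-pipe may carry at each timestep t = 1..n.

    Assumes the cumulative number of bits sent after t timesteps is
    floor(k * t / n), so the allowance at time t is the increment.
    """
    return [(k * t) // n - (k * (t - 1)) // n for t in range(1, n + 1)]


# Example: k = 7 bits spread over a block of n = 5 timesteps.
schedule = bit_schedule(k=7, n=5)
print(schedule, sum(schedule))
```

The allowances always sum to $k$ because the cumulative count telescopes to $\lfloor k\rfloor=k$ at $t=n$.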

Roughly, edge removal properties state that for small $k$ , the capacity of network $\mathcal{N}(\mathcal{V},k)$ is not too different from that of $\mathcal{N}$ . To be precise, we define four different versions of this property as follows.

Definition 5

Edge removal properties are defined in terms of whether, for a given constant $\gamma>0$ and a sequence $(k_{n})_{n}$ ,

[TABLE]

We say network $\mathcal{N}$ satisfies:

•

the strong edge removal property if for all $\gamma>0$ , (22) holds for $k_{n}=\frac{\gamma n}{K}$ , where $K$ is a positive constant depending only on the network.

•

the weak edge removal property if for all $\gamma>0$ , (22) holds for some $k_{n}=\Theta(n)$ .

•

the very weak edge removal property if for all $\gamma>0$ , (22) holds for some $k_{n}\to\infty$ .

•

the extremely weak edge removal property if for all $\gamma>0$ , (22) holds for all bounded $k_{n}$ .

Remark 5

One can again see immediately that the edge removal properties are ordered by strength; i.e., the strong property implies the weak property, which implies the very weak property, which implies the extremely weak property.

The following proposition gives several alternative definitions of each of the edge removal properties. It is proved in Appendix D.

Proposition 4

1. The strong edge removal property holds if and only if there exists a finite positive constant $K$ depending only on the network $\mathcal{N}$ such that for all $\delta>0$ ,

[TABLE]

2. The weak edge removal property holds if and only if

[TABLE]

and also if and only if

[TABLE]

3. The very weak edge removal property holds if and only if

[TABLE]

and also if and only if

[TABLE]

4. The extremely weak edge removal property holds if and only if

[TABLE]

Remark 6

Most works on the edge removal problem (e.g., [1, 2]) consider removing an arbitrary edge from the network, rather than the specific topology shown in Fig. 2. Most similar to this topology is the notion of a super-source network in [30], which was defined for source coding problems as a network containing a node that can view all sources, and has links to each other node. Another similar notion from the literature is that of the cooperation facilitator [9, 10, 11, 12, 13, 14], which connects to the transmitting nodes (but not the receiving node) in a multiple-access network. We choose the topology in Fig. 2 because it ensures that the link that is added/removed is at least as useful as any other link. That is, when $\mathcal{V}=[1:d]$ , node $a$ has complete knowledge of every signal sent in the network, so the link $(a,b)$ can be used to simulate any other small-capacity link. In particular, for any network $\mathcal{N}^{\prime}$ consisting of $\mathcal{N}$ supplemented by a link (or multiple links) with total capacity at most $k_{n}$ bits, we have $\mathcal{C}(\mathcal{N}^{\prime},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})$ . One example of such a network $\mathcal{N}^{\prime}$ is one that allows for rate-limited feedback. For this reason, one consequence of edge removal results is a set of outer bounds on networks with rate-limited feedback.

Remark 7

The extremely weak edge removal property, wherein the extra edge carries a bounded number of bits as the blocklength grows, appears in none of our results proving relationships to strong converses. Nevertheless, we have chosen to include this definition because it is a natural one, and indeed the property seems tantalizingly likely to be true for all realistic systems. However, it was shown in [15] that for maximal error probability, there exists a network where the extremely weak property does not hold. This again points to the contrast between average and maximal error probability. In light of our other results, the extremely weak property also presents an interesting question: namely, is it equivalent to some version of a strong converse? Based on our results that for some networks, the very weak edge removal property is equivalent to the ordinary strong converse, if there is an equivalent converse to the extremely weak property, it appears that it would need to be weaker than the ordinary strong converse, but perhaps stronger than the ordinary weak converse. No such property has occurred to us.

III Deriving Edge Removal Properties from Strong Converses

The following theorem states that each of the three strong converse properties implies one of the edge removal properties. This result holds for any causal network channel given by (7).

Theorem 5

For any network $\mathcal{N}$ , the following hold:

1. The strong converse implies very weak edge removal.

2. The exponentially strong converse implies weak edge removal.

3. The extremely strong converse implies strong edge removal.

Statement (2) of this theorem was proved for noiseless networks in [16, Sec. 3.3]. Our proof uses essentially the same principle as theirs, namely converting a code on a network with an extra edge to a code on a network without one by fixing a value sent along this edge, and assuming at all other nodes that this value was sent. The following lemma provides a refined version of this argument, relating the achievable rate regions for the network with and without the extra edge at finite blocklengths.

Lemma 6

For any integers $n$ and $k$ and any $\epsilon\in[0,1]$ ,

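$$\mathcal{R}(\mathcal{N},n,\epsilon,k)\subseteq\mathcal{R}\big(\mathcal{N},n,1-(1-\epsilon)2^{-k}\big). \tag{29}$$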

Proof:

Let $\mathbf{R}\in\mathcal{R}(\mathcal{N},n,\epsilon,k)$ , so there is an $n$ -length code with rate vector $\mathbf{R}$ and probability of error at most $\epsilon$ on network $\mathcal{N}([1:d],k)$ . We convert this code to one on network $\mathcal{N}$ as follows. Under the code on $\mathcal{N}([1:d],k)$ , let $X_{ab}$ be the message sent on the link from node $a$ to node $b$ . Recall that $X_{ab}\in\{0,1\}^{k}$ . Let $\mathcal{E}$ be the overall error event for network $\mathcal{N}([1:d],k)$ . We have

[TABLE]

There must be some $x^{*}_{ab}\in\{0,1\}^{k}$ for which

[TABLE]

Construct a code for network $\mathcal{N}$ that behaves exactly like the original code on network $\mathcal{N}([1:d],k)$ , except that all nodes assume that node $b$ received the signal $x^{*}_{ab}$ . Let $\mathrm{P}_{\mathrm{e}}$ be the probability of error for this code. Note that with probability $\mathbb{P}(X_{ab}=x^{*}_{ab})$ , the code’s behavior will be just as if the code on $\mathcal{N}([1:d],k)$ were in effect. Thus

[TABLE]

Therefore $\mathbf{R}\in\mathcal{R}(\mathcal{N},n,1-(1-\epsilon)2^{-k})$ . ∎
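The pigeonhole step in this proof can be checked numerically. The sketch below assumes a toy joint distribution over the edge value $X_{ab}\in\{0,1\}^{k}$ and the no-error event, with total success mass $1-\epsilon$; the function names and the random toy distribution are illustrative assumptions, not part of the paper.

```python
import itertools
import random


def best_fixed_edge_value(joint_success: dict[str, float]) -> tuple[str, float]:
    """joint_success[x] = P(X_ab = x, no error). Return the maximizing x and its mass."""
    x_star = max(joint_success, key=joint_success.get)
    return x_star, joint_success[x_star]


def demo(k: int = 3, eps: float = 0.2, seed: int = 0) -> tuple[str, float, float]:
    rng = random.Random(seed)
    values = ["".join(bits) for bits in itertools.product("01", repeat=k)]
    weights = [rng.random() for _ in values]
    total = sum(weights)
    # Toy joint masses P(X_ab = x, no error), summing to 1 - eps.
    joint = {x: (1 - eps) * w / total for x, w in zip(values, weights)}
    x_star, mass = best_fixed_edge_value(joint)
    # The maximum over 2^k values is at least the average (1 - eps) * 2^{-k}.
    return x_star, mass, (1 - eps) * 2 ** (-k)


x_star, mass, bound = demo()
print(x_star, mass >= bound)
```

Fixing the edge value to `x_star` therefore leaves a success probability of at least $(1-\epsilon)2^{-k}$, which is the error level $1-(1-\epsilon)2^{-k}$ stated at the end of the proof.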

Proof of Theorem 5:

We first show statement (1). Assume the strong converse holds. Thus

[TABLE]

where (33) follows from Lemma 6; (34) follows from the strong converse, because $1-(1-\epsilon)2^{-k}\in(0,1)$ for any $\epsilon\in(0,1)$ and $k\in\mathbb{N}$ ; and (35) follows because $\mathcal{C}(\mathcal{N},0^{+})$ is closed. Therefore, very weak edge removal holds by the equivalent definition in (27) of Proposition 4.

We now prove statement (2). Assume the exponentially strong converse holds. For any $k_{n}=o(n)$ , we have

[TABLE]

where (36) follows from Lemma 6, (37) from the fact that $k_{n}=o(n)$ , and (38) from the exponentially strong converse. Therefore weak edge removal holds.

We now prove statement (3). Assume the extremely strong converse holds. For any $\delta>0$ we have

[TABLE]

where (39) follows from Lemma 6. Note that $(1-\epsilon)2^{-\delta n}\doteq 2^{-\delta n}$ . Thus if $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+},\delta n)$ , then, by the extremely strong converse, $\mathbf{R}-K\delta\in\mathcal{C}(\mathcal{N},0^{+})$ for some constant $K$ . Therefore strong edge removal holds. ∎
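The three cases of this proof differ only in how fast the success probability guaranteed by Lemma 6 decays. As a rough numerical sketch (the helper name `success_exponent` is hypothetical), the error level $1-(1-\epsilon)2^{-k}$ corresponds to a normalized exponent $\frac{1}{n}\big(k-\log_2(1-\epsilon)\big)$:

```python
import math


def success_exponent(n: int, k: int, eps: float) -> float:
    """-(1/n) * log2 of the success probability (1 - eps) * 2^{-k} from Lemma 6."""
    return (k - math.log2(1 - eps)) / n


eps = 0.1
for n in (10**2, 10**4, 10**6):
    bounded = success_exponent(n, k=5, eps=eps)                     # k_n bounded
    sublinear = success_exponent(n, k=math.isqrt(n), eps=eps)       # k_n = o(n)
    linear = success_exponent(n, k=n // 10, eps=eps)                # k_n = 0.1 n
    print(n, round(bounded, 6), round(sublinear, 6), round(linear, 6))
```

With $k_n=o(n)$ the exponent vanishes, which is the regime covered by the exponentially strong converse; with $k_n=\delta n$ it tends to $\delta$, which is the regime requiring the extremely strong converse.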

IV Deterministic Networks

The following theorem states that for deterministic networks, each implication of Theorem 5 is also an equivalence.

Theorem 7

For any deterministic network $\mathcal{N}$ , the following hold:

1. The very weak edge removal property holds if and only if the strong converse holds.

2. The weak edge removal property holds if and only if the exponentially strong converse holds.

3. The strong edge removal property holds if and only if the extremely strong converse holds.

To prove Theorem 7, we begin with several lemmas. The first is the well-known reverse Markov inequality, which will be instrumental in proving that edge removal properties imply strong converses.

Lemma 8

Let $X$ be a real-valued random variable where $X\leq x_{\max}$ a.s. For any $\tau\leq\mathbb{E}X$ ,

$$\mathbb{P}(X>\tau)\geq\frac{\mathbb{E}X-\tau}{x_{\max}-\tau}. \tag{40}$$
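As a sanity check, one common form of the reverse Markov inequality is $\mathbb{P}(X>\tau)\geq\frac{\mathbb{E}X-\tau}{x_{\max}-\tau}$ for $\tau\leq\mathbb{E}X$ with $\tau<x_{\max}$. The sketch below tests this form on random discrete distributions; it is an assumption that this matches the displayed statement of the lemma exactly, and the function names are illustrative.

```python
import random


def reverse_markov_bound(mean: float, tau: float, x_max: float) -> float:
    """Lower bound (E[X] - tau) / (x_max - tau) on P(X > tau)."""
    return (mean - tau) / (x_max - tau)


def check(seed: int = 0, trials: int = 200) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-1, 1) for _ in range(6)]   # support points
        ws = [rng.random() for _ in xs]
        total = sum(ws)
        ps = [w / total for w in ws]                  # probabilities
        mean = sum(p * x for p, x in zip(ps, xs))
        x_max = max(xs)
        tau = mean - rng.random()                     # tau < E[X] <= x_max
        tail = sum(p for p, x in zip(ps, xs) if x > tau)
        if tail < reverse_markov_bound(mean, tau, x_max) - 1e-12:
            return False
    return True


print(check())
```

The bound follows from $\mathbb{E}X\leq\tau\,\mathbb{P}(X\leq\tau)+x_{\max}\,\mathbb{P}(X>\tau)$, which is the only fact the check exercises.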

The following lemma provides the core result that is needed to prove Theorem 7. The proof is adapted from that of [31, Lemma 2].

Lemma 9

Let $\mathcal{N}$ be a deterministic network. For any $\epsilon\in[0,1)$ , any $n\in\mathbb{N}$ , and any $\tilde{\epsilon}\in(0,1)$ ,

[TABLE]

where

[TABLE]

Proof:

Let $\mathbf{R}\in\mathcal{R}(\mathcal{N},n,\epsilon)$ . That is, there exists a code with rate vector $\mathbf{R}$ and blocklength $n$ achieving probability of error $\epsilon$ . The key to the proof is to show that if the rates are reduced slightly from those in $\mathbf{R}$ , then an extra edge allows achieving arbitrarily small probability of error. In particular, given a target probability of error $\tilde{\epsilon}$ , define a rate vector $\tilde{\mathbf{R}}=(\tilde{R}_{1},\ldots,\tilde{R}_{d})$ given by

[TABLE]

where we choose with hindsight (recall $d$ is the number of messages in the network)

[TABLE]

We proceed to prove that

[TABLE]

by constructing a code of rate $\tilde{\mathbf{R}}$ on network $\mathcal{N}([1:d],dk)$ . However, to prove the lemma we need to show that $\mathbf{R}$ , rather than $\tilde{\mathbf{R}}$ , is contained in the right-hand side (RHS) of (41). Given (45) and that $nR_{i}-n\tilde{R}_{i}\leq 2k$ , we may simply expand the edge from node $a$ to $b$ to carry $2dk$ additional bits, adding $2k$ bits for each message, which implies

[TABLE]

This is now enough to prove the lemma, since $3dk\leq\eta(\tilde{\epsilon},d)-3d\log(1-\epsilon)$ where $\eta(\tilde{\epsilon},d)$ is defined in (42).

We now prove (45). For $i=1,\ldots,d$ , let $\mathcal{W}_{i}=[2^{nR_{i}}]$ be the message set for the $i$ th message of the original code of rate $\mathbf{R}$ and probability of error $\epsilon$ , and let

[TABLE]

be the set of complete message vectors $\mathbf{w}=(w_{1},\ldots,w_{d})$ . Let $R=\sum_{i}R_{i}$ , so $|\mathcal{W}|=2^{nR}$ . Since the network is deterministic and the code is fixed, whether or not an error occurs depends entirely on the message vector $\mathbf{w}\in\mathcal{W}$ that is chosen. Let $\Gamma$ be the subset of $\mathcal{W}$ of message vectors that do not lead to errors. Thus the probability of error is precisely $1-2^{-nR}|\Gamma|$ . By the assumption that the probability of error is at most $\epsilon$ , we have that

[TABLE]

Recall that $\tilde{R}_{i}=0$ if $nR_{i}<2k$ , so this message is not significant. For ease of notation, we assume for now that $nR_{i}\geq 2k$ for all messages $i$ , so that $\tilde{R}_{i}=R_{i}-\frac{k}{n}$ . We employ a version of a random binning argument. For each $i$ , randomly choose the sets

[TABLE]

to be a partition of $\mathcal{W}_{i}$ where $|\mathcal{P}_{i}(\tilde{w}_{i})|=2^{k}$ for all $\tilde{w}_{i}\in[1:2^{n\tilde{R}_{i}}]$ , such that all such partitions are equally likely. Furthermore, let $\mathcal{P}(\tilde{\mathbf{w}})$ for $\tilde{\mathbf{w}}=(\tilde{w}_{1},\ldots,\tilde{w}_{d})$ be the set of message vectors $\mathbf{w}\in\mathcal{W}$ such that $w_{i}\in\mathcal{P}_{i}(\tilde{w}_{i})$ for all $i\in[1:d]$ . Given these partitions, the code proceeds as follows. Messages $\tilde{W}_{1},\ldots,\tilde{W}_{d}$ are all transmitted to node $a$ . Node $a$ then chooses a message vector $\mathbf{W}=(W_{1},\ldots,W_{d})$ from the set $\Gamma\cap\mathcal{P}(\tilde{\mathbf{W}})$ in an arbitrary manner. If this set is empty, then we declare an error. For each $i$ , let $I_{i}\in\{1,\ldots,2^{k}\}$ be the index of $W_{i}$ in the set $\mathcal{P}_{i}(\tilde{W}_{i})$ . Node $a$ determines $I_{i}$ for each $i$ and transmits $(I_{1},\ldots,I_{d})$ to node $b$ . Note that the number of bits required is $dk$ .

At the originating source node for message $i$ , $W_{i}$ can be determined from $\tilde{W}_{i}$ and $I_{i}$ . Subsequently, the code proceeds as if $\mathbf{W}$ were the true message vector. When a destination node $j$ produces a message estimate $\hat{W}_{ij}$ , it constructs the final message estimate as the $\widehat{\tilde{W}}_{ij}\in[1:2^{n\tilde{R}_{i}}]$ such that $\hat{W}_{ij}\in\mathcal{P}_{i}\left(\widehat{\tilde{W}}_{ij}\right)$ . Since by assumption $\mathbf{W}\in\Gamma$ , there is no error as long as $\Gamma\cap\mathcal{P}(\tilde{\mathbf{W}})$ is not empty.
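The binning construction can be illustrated for a single message ($d=1$), which is a simplification of the setting above. The sketch draws a uniformly random partition of the message set into bins of size $2^{k}$ and computes the probability that a given bin contains no error-free message; the names `miss_probability`, `random_partition`, and `empty_bin_fraction` are hypothetical.

```python
import math
import random


def miss_probability(num_messages: int, num_good: int, bin_size: int) -> float:
    """P(a uniformly random bin of `bin_size` messages contains no good message)."""
    return math.comb(num_messages - num_good, bin_size) / math.comb(num_messages, bin_size)


def random_partition(num_messages: int, bin_size: int, rng: random.Random) -> list[list[int]]:
    """Uniformly random partition of range(num_messages) into equal-size bins."""
    msgs = list(range(num_messages))
    rng.shuffle(msgs)
    return [msgs[j:j + bin_size] for j in range(0, num_messages, bin_size)]


def empty_bin_fraction(num_messages: int, good: set[int], bin_size: int,
                       rng: random.Random) -> float:
    """Fraction of bins with no good message, i.e. the error rate of the binned code."""
    bins = random_partition(num_messages, bin_size, rng)
    return sum(1 for b in bins if not good.intersection(b)) / len(bins)


# Toy numbers: 64 messages, only 16 are error-free (eps = 0.75), bins of size 2^3.
rng = random.Random(0)
good = set(range(16))
print(miss_probability(64, 16, 8), empty_bin_fraction(64, good, 8, rng))
```

Even when three quarters of the original messages lead to errors, a bin of size $2^{k}$ misses the good set with probability at most $\epsilon^{2^{k}}$, which is why a modest $k$ suffices.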

For $\tilde{\mathbf{w}}=(\tilde{w}_{1},\ldots,\tilde{w}_{d})$ let

[TABLE]

where the probability is with respect to the random choice of partitions $\mathcal{P}_{i}$ . We proceed to show that $q(\tilde{\mathbf{w}})\leq\tilde{\epsilon}$ for all $\tilde{\mathbf{w}}$ . Thus, the probability of error averaged over both the message vector $\mathbf{W}$ and the random choice of partitions is at most $\tilde{\epsilon}$ . This proves that there exists at least one deterministic code with average probability of error $\tilde{\epsilon}$ .

For each $i\in[1:d-1]$ , define for all $w_{1},\ldots,w_{i-1}$ , the set

[TABLE]

Moreover, define

[TABLE]

We claim that for all $i\in[1:d]$ , if $w_{1},\ldots,w_{i-1}$ is such that $w_{i-1}\in\mathcal{A}_{i-1}(w_{1},\ldots,w_{i-2})$ , then

[TABLE]

To prove this for $i\in[1:d-1]$ , assume $w_{i-1}\in\mathcal{A}_{i-1}(w_{1},\ldots,w_{i-2})$ . Define the random variable

[TABLE]

where as usual $W_{i}$ is uniformly distributed on $[1:2^{nR_{i}}]$ . Note that

[TABLE]

where the inequality follows from the assumption that $w_{i-1}\in\mathcal{A}_{i-1}(w_{1},\ldots,w_{i-2})$ . Hence

[TABLE]

where (59) follows from Lemma 8 and the fact that $X(\cdot)\leq 2^{n(R_{i+1}+\cdots+R_{d})}$ , and (60) follows from (57). This proves (53) for $i\in[1:d-1]$ . For $i=d$ , note that if $w_{d-1}\in\mathcal{A}_{d-1}(w_{1},\ldots,w_{d-2})$ , then by the definitions of $\mathcal{A}_{d-1}$ and $\mathcal{A}_{d}$ ,

[TABLE]

This proves (53) for $i=d$ .

Fix $\tilde{\mathbf{w}}=(\tilde{w}_{1},\ldots,\tilde{w}_{d})$ . For each $i=1,\ldots,d$ , define

[TABLE]

Note that for $\mathbf{w}\in\mathcal{Q}_{d}$ , certainly $w_{i}\in\mathcal{P}_{i}(\tilde{w}_{i})$ for all $i\in[1:d]$ , so $\mathbf{w}\in\mathcal{P}(\tilde{\mathbf{w}})$ . Moreover, since $w_{d}\in\mathcal{A}_{d}(w_{1},\ldots,w_{d-1})$ , by definition $\mathbf{w}\in\Gamma$ . Thus $\mathcal{Q}_{d}\subseteq\Gamma\cap\mathcal{P}(\tilde{\mathbf{w}})$ , so

[TABLE]

To upper bound $\mathbb{P}(\mathcal{Q}_{i}=\emptyset|\mathcal{Q}_{i-1}\neq\emptyset)$ , suppose $\mathcal{Q}_{i-1}\neq\emptyset$ , so there exists some $(w_{1},\ldots,w_{i-1})\in\mathcal{Q}_{i-1}$ . If $\mathcal{Q}_{i}$ is empty, then $\mathcal{P}_{i}(\tilde{w}_{i})\cap\mathcal{A}_{i}(w_{1},\ldots,w_{i-1})=\emptyset$ . Recall that $\mathcal{P}_{i}(\tilde{w}_{i})$ is one set of a random partition of $\mathcal{W}_{i}$ , which is chosen independently of $w_{1},\ldots,w_{i-1}$ . In particular, $\mathcal{P}_{i}(\tilde{w}_{i})$ is chosen uniformly among all subsets of $\mathcal{W}_{i}=[1:2^{nR_{i}}]$ of size $2^{k}$ , so

[TABLE]

Since by assumption $(w_{1},\ldots,w_{i-1})\in\mathcal{Q}_{i-1}$ , we have $w_{i-1}\in\mathcal{A}_{i-1}(w_{1},\ldots,w_{i-2})$ , so we may apply (53) to bound

[TABLE]

Thus

[TABLE]

where (69) follows since $a!/b!\leq a^{a-b}$ for integers $a\geq b$ , (71) follows since $1+x\leq e^{x}$ , (72) follows from the choice of $k$ in (44), (73) follows by the assumption that $R_{i}\geq\frac{2k}{n}$ for all $i$ , and (74) follows since $(1-2^{-k})^{-2^{k}}\leq 4$ for any $k\geq 1$ . This last fact can be seen by noting that $f(x)=-x\ln(1-x^{-1})$ is decreasing in $x$ , which holds because its derivative is given by

[TABLE]

∎
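The final numerical fact used in this proof, $(1-2^{-k})^{-2^{k}}\leq 4$ for $k\geq 1$, is easy to confirm directly. The short check below is only a floating-point illustration over a finite range of $k$ and $x$, not a proof; the function names are illustrative.

```python
import math


def g(k: int) -> float:
    """(1 - 2^{-k})^{-2^k}, the quantity bounded by 4 in the proof."""
    return (1 - 2.0 ** (-k)) ** (-(2 ** k))


def f(x: float) -> float:
    """f(x) = -x * ln(1 - 1/x), so that g(k) = exp(f(2^k))."""
    return -x * math.log(1 - 1 / x)


for k in range(1, 8):
    print(k, g(k))
```

Equality holds at $k=1$, and the values decrease toward $e$ as $k$ grows, consistent with $f$ being decreasing with limit $1$.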

Proof of Theorem 7:

Theorem 5 proves that each strong converse property implies the corresponding edge removal property, so we only need to prove the opposite directions.

Suppose the very weak edge removal property holds. For any constant $\epsilon$ , applying Lemma 9 gives

[TABLE]

where the last equality holds by very weak edge removal. Therefore the strong converse holds.

Now suppose the weak edge removal property holds. For any sequence $(\epsilon_{n})_{n}$ where $-\log(1-\epsilon_{n})=o(n)$ , applying Lemma 9 gives

[TABLE]

where (80) follows since for any $\tilde{\epsilon}$ and $d$ , $\eta(\tilde{\epsilon},d)\leq\sqrt{n}$ for sufficiently large $n$ ; and (82) follows from weak edge removal, since $\sqrt{n}-3d\log(1-\epsilon_{n})=o(n)$ . Therefore the exponentially strong converse holds.

Finally, suppose the strong edge removal property holds. For any $\alpha>0$ , let $(\epsilon_{n})_{n}$ be a sequence where $1-\epsilon_{n}\doteq 2^{-n\alpha}$ . Applying Lemma 9 gives

[TABLE]

where (83) follows from Prop. 1, (84) follows from Lemma 9, (85) follows because $\eta(\tilde{\epsilon},d)\leq\alpha n$ for sufficiently large $n$ , (86) follows by the definition of $\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})$ , and (87) follows by the equivalent form of the strong edge removal property in (23), where $K$ is a finite positive constant depending only on the network. Therefore, this network satisfies the equivalent form of the extremely strong converse in Prop. 2 part (1b). ∎

V Discrete Stationary Memoryless Networks

The following is our main theorem for discrete stationary memoryless networks, connecting the exponentially strong converse to the weak edge removal property. In addition, we show that both these properties are equivalent to an even weaker form of the weak edge removal property, namely, one where the nodes $a$ and $b$ connect only to transmitting nodes; i.e., those nodes $i$ where $\mathcal{X}_{i}\neq\emptyset$ . (Recall the definition of $\mathcal{C}_{\mathcal{V}}(\mathcal{N},(\epsilon_{n})_{n},(k_{n})_{n})$ as the capacity region of the network with nodes $a$ and $b$ connected only to nodes in $\mathcal{V}$ .) This is a generalization of the “cooperation facilitator” model from [9, 10, 11, 12, 13, 14], which connected only to the transmitters in a multiple-access channel, but not the receiver. The intuition behind connecting only to transmitting nodes is that the extra edge is useful when encoding but not decoding. The reason is that when decoding, a node attempts to reconstruct a message, which is available exactly at the message’s source node. Thus, any small amount of information sent from the omniscient node $a$ could equally well be sent from the source node. However, when encoding, the “ideal” transmission may be a function of multiple messages, which are simultaneously available only at the omniscient node $a$ . Therefore, even a small capacity link from $a$ to $b$ could in principle provide significant rate gain by connecting to an encoding node. However, if a node does not transmit, it only decodes and never encodes, so the connection from nodes $a$ and $b$ is not helpful.

Theorem 10

For any discrete stationary memoryless network $\mathcal{N}$ , the following three statements are equivalent:

1. The exponentially strong converse holds.

2. The weak edge removal property holds.

3. For all $\gamma>0$ ,

[TABLE]

for some sequence $k_{n}=\Theta(n)$ , where $\mathcal{V}$ is the set of nodes $i$ such that $\mathcal{X}_{i}\neq\emptyset$ .

Observe that statement 1 of the theorem implies statement 2 by Theorem 5. Note that statement 3 is identical to the definition of weak edge removal, except that the left-hand side (LHS) of (88) is $\mathcal{C}_{\mathcal{V}}(\mathcal{N},0^{+},(k_{n})_{n})$ instead of $\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})$ as in (22); i.e., in the modified network, nodes $a$ and $b$ connect only to the set $\mathcal{V}$ of transmitting nodes rather than all nodes. Since for any $\mathcal{V}\subseteq[1:d]$ , $\mathcal{C}_{\mathcal{V}}(\mathcal{N},0^{+},(k_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})$ , statement 2 of the theorem implies statement 3. Hence it remains only to show that statement 3 implies statement 1. The main tool in doing so will be a modified version of the blowing-up lemma. The blowing-up lemma, originally proved in [32] (see also [28, 33]), has been used in the proof of numerous strong converse results. In some sense our result is a generalization of this technique. The traditional blowing-up lemma is stated as follows.

Lemma 11

Let $X^{n}\in\mathcal{X}^{n}$ be a sequence of independent random variables. Fix $\mathcal{A}\subseteq\mathcal{X}^{n}$ where $P_{X^{n}}(\mathcal{A})=\exp\{-n\gamma_{n}\}$ for a sequence $\gamma_{n}\to 0$ . For any $\ell$ , define the blown-up version of $\mathcal{A}$ as

[TABLE]

where $d_{\text{H}}$ is the Hamming distance. There exists a sequence $\delta_{n}\to 0$ where

[TABLE]
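To make the blow-up operation concrete, the sketch below enumerates the Hamming blow-up of a small set in $\{0,1\}^{n}$ and evaluates its probability under an i.i.d. Bernoulli source. It assumes the blown-up set consists of all sequences within Hamming distance $\ell$ of some element of $\mathcal{A}$; the function names are illustrative, and brute-force enumeration is only feasible for small $n$.

```python
import itertools


def hamming(x: tuple[int, ...], y: tuple[int, ...]) -> int:
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def blow_up(A: set[tuple[int, ...]], ell: int, n: int) -> set[tuple[int, ...]]:
    """All binary n-sequences within Hamming distance ell of some element of A."""
    return {x for x in itertools.product((0, 1), repeat=n)
            if any(hamming(x, y) <= ell for y in A)}


def prob(S: set[tuple[int, ...]], p: float, n: int) -> float:
    """Probability of S under n i.i.d. Bernoulli(p) symbols."""
    return sum(p ** sum(x) * (1 - p) ** (n - sum(x)) for x in S)


n = 10
A = {(0,) * n}                      # a single sequence, probability 2^{-10}
for ell in (0, 2, 5):
    print(ell, prob(blow_up(A, ell, n), 0.5, n))
```

Even from a set of probability $2^{-10}$, a blow-up radius of half the blocklength already captures most of the probability mass, which is the qualitative behavior the lemma formalizes.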

The following is a causal version of the blowing-up lemma. It is stronger than the usual blowing-up lemma, but it follows from a slight modification of Marton’s proof of the blowing-up lemma in [28]. One may view this lemma as a causal version of a transportation-cost inequality [33].

Lemma 12

Let $X^{n}\in\mathcal{X}^{n}$ be a random sequence, not necessarily independent. Fix $\mathcal{A}\subseteq\mathcal{X}^{n}$ . There exists a sequence of conditional distributions $P_{Z_{t}|Y_{t},Z^{t-1}}$ for $t=1,\ldots,n$ such that, if we let $Y^{n}\in\mathcal{X}^{n},Z^{n}\in\mathcal{X}^{n}$ have joint distribution

[TABLE]

then $Z^{n}\in\mathcal{A}$ almost surely, and

[TABLE]

Proof:

Let $\tilde{X}^{n}$ be a random sequence with distribution that of $X^{n}$ conditioned on the set $\mathcal{A}$ . That is,

[TABLE]

For any $t\in[1:n]$ and $z^{t-1}\in\mathcal{X}^{t-1}$ , by [34, Theorem 1] there exists a pair of random variables $X_{t}(z^{t-1}),\tilde{X}_{t}(z^{t-1})$ with joint distribution $P_{X_{t}(z^{t-1}),\tilde{X}_{t}(z^{t-1})}$ such that the marginal distributions satisfy

[TABLE]

and their joint distribution satisfies

[TABLE]

We now define

[TABLE]

Let $Y^{n},Z^{n}$ have distribution given by (91), where $P_{Z_{t}|Y_{t},Z^{t-1}}$ is defined in (97). Note that

[TABLE]

where (98) follows from (91), (99) follows from (94) and (97), and (100) follows from simple rules about joint distributions. Thus

[TABLE]

where (102) holds by (100), (103) holds simply because the summation in (102) represents the marginal distribution of $\tilde{X}_{t}(z^{t-1})$ , and (104) holds by (95). Thus $Z^{n}$ and $\tilde{X}^{n}$ have the same distribution. In particular, since by construction $\tilde{X}^{n}\in\mathcal{A}$ almost surely, also $Z^{n}\in\mathcal{A}$ almost surely. We now have

[TABLE]

where (107) holds by (100), (109) holds by (96), (110) holds by Pinsker’s inequality, (111) holds by concavity of the square root, (112) holds because $Z^{n}$ and $\tilde{X}^{n}$ have the same distribution, (113) holds by the chain rule for relative entropy, and (114) holds because, by (93),

[TABLE]

∎
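The per-timestep coupling in this proof can be illustrated with the standard maximal coupling of two distributions, whose mismatch probability equals their total variation distance. It is an assumption here that the result cited as [34, Theorem 1] provides a coupling of this kind; the function names are hypothetical, and the Pinsker check uses relative entropy in nats.

```python
import math


def maximal_coupling(p: dict, q: dict) -> tuple[dict, float]:
    """Return (joint, tv): joint[(x, y)] has marginals p and q and P(X != Y) = tv."""
    support = sorted(set(p) | set(q))
    overlap = {x: min(p.get(x, 0.0), q.get(x, 0.0)) for x in support}
    tv = 1.0 - sum(overlap.values())
    # Put as much mass as possible on the diagonal X = Y.
    joint = {(x, x): m for x, m in overlap.items() if m > 0}
    if tv > 0:
        p_rest = {x: p.get(x, 0.0) - overlap[x] for x in support}
        q_rest = {x: q.get(x, 0.0) - overlap[x] for x in support}
        # The residuals have disjoint supports, so this mass is off-diagonal.
        for x, a in p_rest.items():
            for y, b in q_rest.items():
                if a > 0 and b > 0:
                    joint[(x, y)] = joint.get((x, y), 0.0) + a * b / tv
    return joint, tv


def kl_nats(p: dict, q: dict) -> float:
    """Relative entropy D(p || q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.8, "b": 0.2}
joint, tv = maximal_coupling(p, q)
print(joint, tv, math.sqrt(kl_nats(p, q) / 2))
```

Pinsker's inequality then bounds the mismatch probability by $\sqrt{D(p\|q)/2}$, which is the step that converts the coupling into a relative-entropy bound in the chain of inequalities above.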

Remark 8

Lemma 11 can be derived from Lemma 12 as follows. If in Lemma 12, $X^{n}$ is a sequence of independent random variables, then by (91), $Y^{n}$ has the same distribution as $X^{n}$ . Thus

[TABLE]

where (117) holds because $Z^{n}\in\mathcal{A}$ almost surely, (118) holds by Markov’s inequality, and in (119) we have applied (92). Assuming $P_{X^{n}}(\mathcal{A})=\exp\{-n\gamma_{n}\}$ where $\gamma_{n}\to 0$ , if we choose, for example, $\delta_{n}=\gamma_{n}^{1/4}$ , we have $\delta_{n}\to 0$ and

[TABLE]

This proves Lemma 11.

With Lemma 12 in hand, we complete the proof of Theorem 10 with the following lemma.

Lemma 13

For any discrete stationary memoryless network $\mathcal{N}$ , statement 3 of Theorem 10 implies statement 1.

Proof:

By the same argument as in the proof of Proposition 4, statement 3 of Theorem 10 is equivalent to

[TABLE]

where again $\mathcal{V}$ is the set of transmitting nodes. By Proposition 2, the exponentially strong converse holds if and only if, for any sequence $(\epsilon_{n})_{n}$ where $-\log(1-\epsilon_{n})=o(n)$ , $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})$ . Thus, to prove the lemma it is enough to show that for any $(\epsilon_{n})_{n}$ where $-\log(1-\epsilon_{n})=o(n)$ , and any $\delta>0$ , $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}_{\mathcal{V}}(\mathcal{N},0^{+},(\delta n)_{n})$ . Let $\mathbf{R}$ be achievable with respect to $\epsilon_{n}$ . Thus for sufficiently large $n$ there exists an $n$ -length code with average probability of error at most $\epsilon_{n}$ . Let $(\phi_{it},\psi_{ij})$ be the encoding/decoding functions for this code (see (9)–(10)). We describe a new code, illustrated in Fig. 3, achieving the same rate vector with vanishing probability of error on the network $\mathcal{N}(\mathcal{V},\delta n)$ . Note that for any $i\in\mathcal{V}^{c}$ , we have $\mathcal{X}_{i}=\emptyset$ , so if $R_{i}>0$ the probability of success would be exponentially small; thus we must have $R_{i}=0$ .

Network stacking: We adopt the notion of network stacking from [35]. The motivation for our use of network stacking is that it allows us to convert an arbitrary coding operation at a single time instance into a coding operation across a long block, thereby taking advantage of the law of large numbers. In particular, we construct $N$ independent copies of the original $n$ -length code, each with its own messages, using a total of $nN$ channel uses. Each copy is referred to as a “layer”, indexed by an integer $\ell\in[1:N]$ . Unlike a block Markov approach [36], in which one would transmit an $n$ -length block corresponding to the original code in sequence, in the network stacking approach we transmit $N$ copies of a single time instance $t\in[1:n]$ of the original code before moving on to the next one. Thus coding can be done “across the layers”, using the fact that the $N$ copies of any symbol are i.i.d., while maintaining the causal structure of the original code.
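The difference between the stacked transmission order and a block-by-block order can be sketched as follows. This is a simplified illustration that ignores the coordination, correction, and hashing phases; the function names are hypothetical.

```python
def stacked_order(n: int, N: int) -> list[tuple[int, int]]:
    """Network stacking: send time instance t in all N layers before moving to t + 1."""
    return [(t, layer) for t in range(1, n + 1) for layer in range(1, N + 1)]


def block_order(n: int, N: int) -> list[tuple[int, int]]:
    """Block-by-block order: finish the whole n-length block of one layer first."""
    return [(t, layer) for layer in range(1, N + 1) for t in range(1, n + 1)]


# Each pair is (time instance t of the original code, layer index).
print(stacked_order(2, 3))
print(block_order(2, 3))
```

Both orders use the same $nN$ channel uses; in the stacked order the $N$ copies of the symbol at time $t$ are adjacent, so they can be processed jointly before any layer proceeds to time $t+1$.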

We use underlines to indicate symbols on the stacked network. In particular, $\underline{X}_{it}(\ell)$ is the transmitted symbol from node $i$ at time $t$ in layer $\ell$ ; $\underline{X}_{i}^{n}(\ell)$ refers to the $n$ -length sequence of symbols in layer $\ell$ ; $\underline{X}_{it}$ refers to the $N$ -length sequence of symbols at time $t$ in all layers; $\underline{X}_{i}^{n}$ refers to the full $nN$ -length sequence of all layers and time instances. We define $\underline{Y}_{it}(\ell)$ , etc. similarly. Moreover, $\underline{W}_{i}(\ell)$ is the message originating at node $i$ in layer $\ell$ , and $\underline{W}_{i}$ is the complete vector of messages originating at node $i$ across all $N$ layers.

Code phases: Given the original $n$ -length code, we construct an $N$ -fold stacked code as follows, where the precise dependence between $n$ and $N$ is to be determined. The code consists of $2n+2$ phases, each consisting of a number of timesteps. These phases are visualized in Fig. 3. First we have a message coordination phase, followed by $n$ transmission phases alternating with $n$ correction phases, and concluded with a hashing phase. In the message coordination phase, nodes coordinate to choose a message vector in each layer with a relatively large probability of success; this is done in exactly the same manner as for deterministic networks in Lemma 9. Each transmission phase corresponds to one timestep $t\in[1:n]$ in the original code: the layers act independently, each performing the coding functions from the original code at time $t$ . In the following correction phase, node $a$ transmits data to node $b$ , describing replacements for certain received data in sub-network $\mathcal{V}$ . Node $b$ then disperses this data to the nodes in $\mathcal{V}$ ; in subsequent transmission phases, nodes in $\mathcal{V}$ use this replaced data in their coding operations. In the final hashing phase, hashes of all messages are dispersed to all nodes, which allows nodes in $\mathcal{V}^{c}$ to decode. This last phase is necessary because nodes $a$ and $b$ do not connect directly to nodes in $\mathcal{V}^{c}$ ; thus the correction approach applied to the rest of the network does not work here, since node $a$ does not know what signals were received in $\mathcal{V}^{c}$ . Instead, hashes are used to correct any remaining errors in messages decoded in $\mathcal{V}^{c}$ .

The message coordination phase consists of $O(N(-\log(1-\epsilon_{n})+\log n))$ timesteps. Each transmission phase consists of exactly $N$ timesteps, since each layer transmits exactly once. Correction phases have variable lengths, depending on how much correction data is required, but a total of $Nn\gamma_{n}$ timesteps are allocated for all correction phases, where

[TABLE]

The hashing phase consists of $O(\sqrt{\gamma_{n}}nN)$ timesteps. Note that in total, the transmission phases consist of $nN$ timesteps. Recalling that $-\log(1-\epsilon_{n})=o(n)$ , $\gamma_{n}\to 0$ as $n\to\infty$ , so all other phases consist of a negligible number of timesteps.

Message coordination phase: For each message vector $\mathbf{w}$ of the original code, let $P_{c}(\mathbf{w})$ be the probability of correctly decoding $\mathbf{w}$ . Let

[TABLE]

Defining $R=\sum_{i=1}^{d}R_{i}$ , we may lower bound the cardinality of $\Gamma$ by

[TABLE]

where (125) holds by Lemma 8 and the fact that $P_{c}(\mathbf{W})\leq 1$ , and (126) holds since the average probability of error is at most $\epsilon_{n}$ .

In the message coordination phase, we use the same outer code as in Lemma 9 to ensure that, with high probability, only message vectors in $\Gamma$ are ever used. By the same binning argument as in the proof of Lemma 9, this requires only $O(-\log(1-\epsilon_{n})+\log n)$ bits on the link $(a,b)$ for each layer. Note that nodes $a$ and $b$ are only required to contact the nodes in $\mathcal{V}$ , since nodes in $\mathcal{V}^{c}$ have no message originating at them. We may therefore assume throughout the rest of this argument that $\underline{\mathbf{W}}(\ell)\in\Gamma$ for each $\ell\in[1:N]$ .

Correction codebook: Let $P_{c}(\mathbf{w},y_{\mathcal{V}}^{n})$ be the probability of correct decoding given message vector $\mathbf{w}$ , and channel outputs $y_{\mathcal{V}}^{n}$ at nodes $\mathcal{V}$ . That is,

[TABLE]

where again $\hat{\mathbf{W}}$ is the complete vector of message estimates. Since encoding and decoding functions are assumed to be deterministic (cf. (9)–(10)), channel inputs $X_{\mathcal{V}}^{n}$ are deterministic functions of $Y_{\mathcal{V}}^{n}$ and $\mathbf{W}$ . Thus, the only randomness in the probability in (128) is that of the channel outputs $Y_{\mathcal{V}^{c}}^{n}$ given the inputs $X_{\mathcal{V}}^{n}$ . Recalling that $\mathcal{X}_{i}=\emptyset$ for $i\in\mathcal{V}^{c}$ , $Y_{\mathcal{V}^{c}}^{n}$ is an independent sequence given $X_{\mathcal{V}}^{n}$ . For each message vector $\mathbf{w}$ of the original $n$ -length code, let

[TABLE]

Note that for any $\mathbf{w}\in\Gamma$ ,

[TABLE]

Thus, applying Lemma 8 to the random variable $P_{c}(\mathbf{w},Y_{\mathcal{V}}^{n})$ gives

[TABLE]

We now apply Lemma 12 to the distribution $P_{Y_{\mathcal{V}}^{n}|\mathbf{W}=\mathbf{w}}$ and the set $\mathcal{Q}(\mathbf{w})$ to find conditional distributions $P_{Z_{\mathcal{V},t}|Y_{\mathcal{V},t},Z_{\mathcal{V}}^{t-1}}$ for all $t\in[1:n]$ . Note that these distributions depend on the message vector $\mathbf{w}$ . For each $y_{\mathcal{V},t}\in\mathcal{Y}_{\mathcal{V}}$ and $z^{t-1}\in\mathcal{Y}_{\mathcal{V}}^{t-1}$ , independently draw

[TABLE]

These functions constitute a codebook known to all nodes.

Hashing codebook: For each $i\in\mathcal{V}$ and each $\underline{w}_{i}\in[1:2^{nR_{i}}]^{N}$ , independently and uniformly draw $g_{i}(\underline{w}_{i})$ from $[1:2^{nN\sqrt{\gamma_{n}}}]$ . These hashing functions also constitute a codebook known to all nodes.

Transmission phases: Before the transmission phase at time $t$ , each node $i\in\mathcal{V}$ has determined $\underline{Z}_{i}^{t-1}\in\mathcal{Y}_{i}^{t-1}$ , which represent the corrected versions of its received signals (see description below of the correction phases). For each $\ell\in[1:N]$ , node $i$ determines and transmits

[TABLE]

For each $i\in[1:d]$ , let $\underline{Y}_{i,t}(\ell)$ be the corresponding received signals.

Correction phases: In the correction phase after the transmission phase at time $t$ , node $a$ learns $\underline{Y}_{i,t}$ from each $i\in\mathcal{V}$ , and determines, for each $\ell\in[1:N]$ ,

[TABLE]

For each $\ell$ for which $\underline{Z}_{\mathcal{V},t}(\ell)\neq\underline{Y}_{\mathcal{V},t}(\ell)$ , node $a$ transmits to node $b$ a bit string with $0$ followed by $\lceil\log N|\mathcal{Y}|\rceil$ bits identifying the layer $\ell\in[1:N]$ as well as the value of $\underline{Z}_{\mathcal{V},t}(\ell)\in\mathcal{Y}_{\mathcal{V}}$ . After doing this for each layer where $\underline{Z}_{\mathcal{V},t}(\ell)\neq\underline{Y}_{\mathcal{V},t}(\ell)$ , node $a$ transmits the stop bit $1$ , signaling that all nodes should proceed to the next transmission phase. Node $b$ then forwards this data to each node $i\in\mathcal{V}$ . For all layers $\ell$ for which no correcting signal was sent, each node $i\in\mathcal{V}$ simply sets $\underline{Z}_{i,t}(\ell)=\underline{Y}_{i,t}(\ell)$ .
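The bit string sent in one correction phase can be sketched as follows. This is an illustrative encoding, not the authors' implementation. It assumes that each correction is flagged by a leading 0 bit (so that it is distinguishable from the stop bit 1), that symbols are integers in `range(alphabet_size)`, and that the layer index and replacement symbol are packed into a single integer. The function names are hypothetical.

```python
def field_width(num_layers, alphabet_size):
    """ceil(log2(N * |Y|)) bits, computed in integer arithmetic (at least 1)."""
    return max(1, (num_layers * alphabet_size - 1).bit_length())

def encode_corrections(y, z, alphabet_size):
    """Bit string from node a to node b: one flagged record per corrected
    layer, followed by the stop bit."""
    width = field_width(len(y), alphabet_size)
    records = []
    for layer, (y_l, z_l) in enumerate(zip(y, z)):
        if z_l != y_l:
            index = layer * alphabet_size + z_l   # joint (layer, symbol) index
            records.append("0" + format(index, "b").zfill(width))
    records.append("1")                           # stop bit
    return "".join(records)

def apply_corrections(y, bits, alphabet_size):
    """Recover the corrected symbols: uncorrected layers keep Z = Y."""
    width = field_width(len(y), alphabet_size)
    z = list(y)
    pos = 0
    while bits[pos] == "0":
        index = int(bits[pos + 1:pos + 1 + width], 2)
        layer, symbol = divmod(index, alphabet_size)
        z[layer] = symbol
        pos += 1 + width
    return z

received = [0, 1, 2, 1]      # Y values in N = 4 layers, alphabet size 3
corrected = [0, 2, 2, 0]     # Z values chosen by node a
message = encode_corrections(received, corrected, alphabet_size=3)
```

In this toy instance two layers are corrected, so the message has $2(1+4)+1=11$ bits, consistent with counting one flag bit plus $\lceil\log N|\mathcal{Y}|\rceil$ bits per correction and one final stop bit.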

Hashing phase: Node $a$ computes $g_{i}=g_{i}(\underline{w}_{i})$ for all $i\in\mathcal{V}$ , and transmits these values to node $b$ , which subsequently disperses them to nodes in $\mathcal{V}$ . (Footnote 6: One could also compute the hash for message $i$ directly at node $i$ , and distribute the hash to all decoder nodes from there. We choose to compute the hash at node $a$ merely to make distribution of the hashes simpler to describe.) Note that these hashes consist of a total of $d\sqrt{\gamma_{n}}nN$ bits, which is sub-linear in $nN$ . Thus they can be transmitted over the link $(a,b)$ as long as $\delta>0$ . For each node $i\in\mathcal{V}^{c}$ , if there exists a node $j\in\mathcal{V}$ where the point-to-point channel from $X_{j}$ to $Y_{i}$ has positive capacity, then we use a point-to-point channel code to transmit the hashes from node $j$ to node $i$ . If there is no such node $j\in\mathcal{V}$ , then all received signals at node $i$ are independent of the rest of the network, so node $i$ cannot decode any messages; in particular, if $i\in\mathcal{D}_{k}$ for any $k\in[1:d]$ , it must be that $R_{k}=0$ . Since the hashes occupy a sub-linear number of bits, transmitting these hashes to each node in $\mathcal{V}^{c}$ takes a sub-linear number of timesteps, and can be done with arbitrarily small probability of error.

Decoding: For each $i,j\in\mathcal{V}$ where $j\in\mathcal{D}_{i}$ and each $\ell\in[1:N]$ , node $j$ determines

[TABLE]

Now consider each $i\in[1:d]$ and $j\in\mathcal{V}^{c}\cap\mathcal{D}_{i}$ . Given $\underline{Y}_{j}^{n}$ and $g_{i}$ , find the unique $\underline{\hat{w}}_{i}$ where $g_{i}=g_{i}(\underline{\hat{w}}_{i})$ and there exists $\underline{\tilde{y}}_{j}^{n}$ where $\psi_{ij}(\underline{W}_{j}(\ell),\underline{\tilde{y}}_{j}^{n}(\ell))=\hat{\underline{w}}_{i}(\ell)$ for each $\ell\in[1:N]$ and

[TABLE]

If there is no such $\underline{\hat{w}}_{i}$ or more than one, declare an error.
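The decoding rule at nodes in $\mathcal{V}^{c}$ can be illustrated with a toy Python sketch. It is a simplification under stated assumptions: a single layer, a small alphabet so that the Hamming ball can be enumerated by brute force, a decoding function `psi` that takes only the candidate sequence, and a fixed hash table `g` in place of the uniformly random hash. All names are hypothetical.

```python
import itertools

def hamming_ball(y, radius, alphabet):
    """All sequences over `alphabet` within Hamming distance `radius` of y."""
    ball = set()
    for r in range(radius + 1):
        for positions in itertools.combinations(range(len(y)), r):
            for symbols in itertools.product(alphabet, repeat=r):
                candidate = list(y)
                for p, s in zip(positions, symbols):
                    candidate[p] = s
                ball.add(tuple(candidate))
    return ball

def decode_with_hash(y, hash_value, psi, g, radius, alphabet):
    """Return the unique message that (i) psi assigns to some sequence near y
    and (ii) agrees with the received hash; return None to declare an error."""
    candidates = {psi(seq) for seq in hamming_ball(tuple(y), radius, alphabet)}
    matches = {w for w in candidates if g[w] == hash_value}
    return matches.pop() if len(matches) == 1 else None

# Toy original decoder: the message is the integer formed by the first 3 bits.
psi = lambda seq: 4 * seq[0] + 2 * seq[1] + seq[2]
g = [0, 1, 2, 3, 0, 1, 2, 3]      # fixed stand-in for a random hash into 4 bins

# True message 5 corresponds to (1, 0, 1, 0); one symbol is received in error.
decoded = decode_with_hash((1, 1, 1, 0), g[5], psi, g, radius=1, alphabet=(0, 1))
ambiguous = decode_with_hash((1, 1, 1, 0), 3, psi, g, radius=1, alphabet=(0, 1))
```

Here the hash resolves the ambiguity among the candidate messages in the ball; when two candidates share the received hash value, the decoder declares an error, which corresponds to the event analyzed below via the hash collision probability.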

Probability of error analysis: Consider the following error events

[TABLE]

and, for $i\in[1:d]$ and $j\in\mathcal{V}^{c}\cap\mathcal{D}_{i}$ ,

[TABLE]

Note that as long as $\mathcal{E}_{1}$ does not occur, then by Lemma 12, $\underline{Z}_{\mathcal{V}}^{n}(\ell)\in\mathcal{Q}(\underline{\mathbf{W}}(\ell))$ for all $\ell$ . By the definition of $\mathcal{Q}(\mathbf{w})$ , this ensures that $W_{ji}=w_{i}$ for all $j\in[1:d]$ and $i\in\mathcal{V}$ . Events $\mathcal{E}_{2ij},\mathcal{E}_{3ij}$ cover all errors that can occur at nodes in $\mathcal{V}^{c}$ . Hence the probability of error of the overall code, averaged over random coding choices, is

[TABLE]

We first consider $\mathcal{E}_{1}$ . The number of bits transmitted across link $(a,b)$ during the correction phase at time $t$ is

[TABLE]

where the final $+1$ accounts for the stop bit. Thus the number of bits transmitted during all $n$ correction phases is

[TABLE]

Recall link $(a,b)$ has capacity $\delta>0$ , meaning it can transmit a bit roughly every $1/\delta$ timesteps (cf. (20)). Thus we can bound $\mathcal{E}_{1}$ by

[TABLE]

where (147) follows from Markov’s inequality, (148) follows from Lemma 12, where we have dropped the constant $\frac{1}{2\log e}$ since it is less than $1$ , (149) from the assumption that $\mathbf{W}(\ell)\in\Gamma$ for all $\ell$ , and (150) from the definition of $\gamma_{n}$ in (122). If we choose $N=\gamma_{n}^{-2}$ , then

[TABLE]

which vanishes since $-\gamma_{n}\log\gamma_{n}\to 0$ as $\gamma_{n}\to 0$ .

Now we consider events $\mathcal{E}_{2ij},\mathcal{E}_{3ij}$ . Recall that if $\mathcal{E}_{1}$ does not occur, then $\underline{Z}_{\mathcal{V}}^{n}(\ell)\in\mathcal{Q}(\underline{\mathbf{W}}(\ell))$ for all $\ell$ . By the definition of $\mathcal{Q}(\mathbf{w})$ in (129), we have, for any $y_{\mathcal{V}}^{n}\in\mathcal{Q}(\mathbf{w})$

[TABLE]

Note that given $Y_{\mathcal{V}}^{n}=y_{\mathcal{V}}^{n}$ and $\mathbf{W}=\mathbf{w}$ , $X_{\mathcal{V}}^{n}$ is determined since coding functions are deterministic. Since $\mathcal{X}_{i}=\emptyset$ for all $i\in\mathcal{V}^{c}$ , this conditioning also determines $X_{1:d}^{n}$ . Thus, the distribution $P_{Y_{\mathcal{V}^{c}}^{n}|Y_{\mathcal{V}}^{n}=y_{\mathcal{V}}^{n},\mathbf{W}=\mathbf{w}}$ is independent. Applying the blowing up lemma to this distribution and the set of $y_{\mathcal{V}^{c}}$ that cause all messages to be decoded correctly in $\mathcal{V}^{c}$ , there exists a random sequence $Z_{\mathcal{V}^{c}}^{n}\in\mathcal{Y}_{\mathcal{V}^{c}}^{n}$ that causes all messages to be decoded correctly, and

[TABLE]

In particular, if we produce $N$ copies of this $Z_{\mathcal{V}^{c}}^{n}$ sequence for each layer, then Markov’s inequality gives

[TABLE]

In particular, for each $i\in[1:d]$ and $j\in\mathcal{V}^{c}\cap\mathcal{D}_{i}$ , with probability at least $1-\gamma_{n}$ , there exists $\underline{\tilde{y}}_{j}^{n}$ that satisfies the Hamming distance condition (138), and is decoded correctly to $w_{i}$ . Thus $\mathbb{P}(\mathcal{E}_{2ij}|\mathcal{E}_{1}^{c})$ vanishes. We now consider $\mathcal{E}_{3ij}$ . The number of messages $\underline{w}_{j}^{\prime}$ that are considered is upper bounded by the number of sequences $\underline{\tilde{y}}^{n}$ satisfying (138), which is given by

[TABLE]

where $H(\cdot)$ is the binary entropy function. The probability that any given $\underline{w}_{j}^{\prime}\neq\underline{W}_{j}$ agrees with the hash value $g_{j}$ is $2^{-nN\sqrt{\gamma_{n}}}$ , so

[TABLE]

where (159) holds for sufficiently large $n$ , since $\gamma_{n}\to 0$ and $\lim_{p\to 0}H(p)/\sqrt{p}=0$ , and (160) holds again by the choice $N=\gamma_{n}^{-2}$ . Since $n\gamma_{n}^{-3/2}\to\infty$ as $n\to\infty$ , $\mathbb{P}(\mathcal{E}_{3ij}|\mathcal{E}_{1}^{c})$ vanishes. ∎
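The two limits used in the error analysis, $-\gamma\log\gamma\to 0$ and $H(p)/\sqrt{p}\to 0$, can be checked numerically. The following Python sketch is only a sanity check, assuming base-2 logarithms and an arbitrary grid of test values.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

grid = (1e-2, 1e-4, 1e-6, 1e-8)

# H(p) / sqrt(p): the hash length sqrt(gamma) * n * N eventually dominates
# the exponent n * N * H(.) of the Hamming-ball size.
ratios = [binary_entropy(p) / math.sqrt(p) for p in grid]

# -gamma * log(gamma): controls the bound on the probability of E_1.
gamma_log = [-g * math.log2(g) for g in grid]
```

Both lists decrease toward zero along the grid, in line with the limits invoked for (159) and for the bound on $\mathcal{E}_{1}$.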

Remark 9

The blowing-up lemma does not appear to be strong enough to prove that the very weak edge removal property implies the ordinary strong converse. Were we to apply the same argument above to the case $\epsilon_{n}=\epsilon\in(0,1)$ , in the key application of the blowing-up lemma in (148), we would have

[TABLE]

This suggests that at least $O(\sqrt{n})$ bits per layer would be required on the extra link. However, very weak edge removal requires that we achieve the same capacity region using any $k_{n}$ sequence of bits converging to infinity, which includes sequences growing more slowly than $\sqrt{n}$ .

VI Networks of Independent Point-to-Point Links

We now consider the setting of network equivalence [35], in which $\mathcal{N}$ consists of a stationary memoryless network made up of independent point-to-point (noisy) links. Let $\bar{\mathcal{N}}$ be the same network in which each noisy point-to-point link is replaced by a noiseless bit-pipe of the same capacity. The basic result of network equivalence states that $\mathcal{C}(\mathcal{N},0^{+})=\mathcal{C}(\bar{\mathcal{N}},0^{+})$ . Theorem 10 already asserts that for such networks, the weak edge removal property holds if and only if the exponentially strong converse holds. The following theorem proves that, for such networks with acyclic topology, the same holds for the “lower level” in Fig. 1; i.e., the very weak edge removal property and the ordinary strong converse. The proof, given in Appendix E, makes use of the network equivalence principle to connect codes on $\mathcal{N}$ to codes on $\bar{\mathcal{N}}$ , and then applies Theorem 7 on $\bar{\mathcal{N}}$ .

Theorem 14

For a discrete stationary memoryless network $\mathcal{N}$ consisting of independent point-to-point links with acyclic topology, the very weak edge removal property holds if and only if the strong converse holds.

VII Applications

VII-A Outer Bounds

Consider any outer bound $\mathcal{R}_{\text{out}}(\mathcal{N})$ for the memoryless stationary network $\mathcal{N}$ ; i.e. where $\mathcal{C}(\mathcal{N},0^{+})\subseteq\mathcal{R}_{\text{out}}(\mathcal{N})$ . Suppose we could show

[TABLE]

where as usual $\mathcal{V}$ is the set of nodes $i$ where $\mathcal{X}_{i}\neq\emptyset$ . In other words, the outer bound is continuous with respect to the capacity of the extra edge; that is, the outer bound satisfies a weak edge removal property. Then, applying Lemma 13, we immediately find

[TABLE]

This suggests that the outer bound holds in an exponentially strong sense; that is, for any rate vector outside $\mathcal{R}_{\text{out}}(\mathcal{N})$ , the probability of error approaches 1 exponentially fast.

An outer bound may also satisfy a strong edge removal property, meaning that for some constant $K$ and any $\delta$ ,

[TABLE]

We have no equivalence between the strong edge removal property and the extremely strong converse for general noisy networks, but we do for deterministic networks. Thus, applying Lemma 9, if a deterministic network satisfies (164), then the outer bound holds in an extremely strong sense; that is, for any rate vector outside $\mathcal{R}_{\text{out}}(\mathcal{N})$ , the probability of error approaches 1 at an exponential rate linear in the distance to the outer bound.

For many outer bounds (indeed, almost every computable outer bound that we know of), (162) can be proved without much difficulty, and in some cases the stronger statement (164) can be proved as well. This implies that most outer bounds for discrete memoryless networks hold in an exponentially strong sense, and many outer bounds for deterministic networks hold in an extremely strong sense. We illustrate this for several outer bounds (or weak converse arguments) in the next few subsections.

VII-B Cut-set Bound

Recall that the cut-set outer bound [37] is given by $\mathcal{C}(\mathcal{N},0^{+})\subseteq\mathcal{R}_{\text{cut-set}}(\mathcal{N})$ where

[TABLE]

In the following, we prove (164) for this bound. This allows us to reproduce the result of [21], that the cut-set bound holds in an exponentially strong sense: that is, for any rate vector outside $\mathcal{R}_{\text{cut-set}}(\mathcal{N})$ , the probability of error goes to 1 exponentially fast. This further implies that any network with a tight cut-set bound (i.e., where $\mathcal{C}(\mathcal{N},0^{+})=\mathcal{R}_{\text{cut-set}}(\mathcal{N})$ ) satisfies the exponentially strong converse. Furthermore, we conclude that for deterministic networks, the cut-set bound holds in an extremely strong sense.

Fix some sequence $(k_{n})_{n}$ , and let $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})$ . Consider a code achieving this rate vector, and let $Z_{t}$ be the symbol sent along edge $(a,b)$ at time $t$ , or $\emptyset$ if there is no symbol at time $t$ . Note $H(Z^{n})\leq k_{n}$ . Fix any cut set $\mathcal{S}\subseteq[1:d]$ , and let $\mathcal{S}^{c}=[1:d]\setminus\mathcal{S}$ . Also let $\mathcal{T}$ be the set of message flows that cross the cut; that is, the set of $i\in\mathcal{S}$ where $\mathcal{D}_{i}\cap\mathcal{S}^{c}\neq\emptyset$ . We may write

[TABLE]

where (167) follows from Fano’s inequality, where $\epsilon_{n}\to 0$ as $n\to\infty$ ; (169) follows since $X_{\mathcal{S}^{c},t}$ is a function of $Y_{\mathcal{S}^{c}}^{t-1}$ and $Z^{t-1}$ ; (172) follows from the memorylessness and causality of the network model; and (173) follows by defining $Q\sim\text{Unif}[1:n]$ , $X_{i}=X_{i,Q}$ , and $Y_{i}=Y_{i,Q}$ , and by the fact that $H(Z^{n})\leq k_{n}$ . Recalling that $\epsilon_{n}\to 0$ , we have

[TABLE]

In particular, (164) holds with $K=1$ . This in turn implies (162). Therefore, for discrete memoryless stationary networks, the cut-set bound holds in an exponentially strong sense, and for deterministic networks, the cut-set bound holds in an extremely strong sense.
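For intuition, the statement that an extra edge of capacity $\delta$ enlarges the cut-set bound by at most $\delta$ can be checked on a toy example. The sketch below covers only a special case chosen for illustration: a single unicast session over noiseless bit pipes, where the cut-set bound reduces to the minimum cut. The network, its capacities, and the function name are hypothetical.

```python
from itertools import combinations

def min_cut(nodes, capacity, source, sink):
    """Brute-force minimum cut: minimize the total capacity of edges leaving S
    over all node sets S that contain the source but not the sink."""
    others = [v for v in nodes if v not in (source, sink)]
    best = float("inf")
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            S = {source, *subset}
            value = sum(c for (u, v), c in capacity.items()
                        if u in S and v not in S)
            best = min(best, value)
    return best

nodes = ["s", "a", "b", "t"]
links = {("s", "a"): 1.0, ("s", "b"): 0.5, ("a", "t"): 0.5, ("b", "t"): 1.0}
delta = 0.25

base = min_cut(nodes, links, "s", "t")
extended = min_cut(nodes, {**links, ("a", "b"): delta}, "s", "t")
```

In this instance the minimum cut grows from 1.0 to 1.25, that is, by exactly $\delta$, which matches the constant $K=1$ in the strong edge removal property for the cut-set bound.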

These facts allow us to immediately derive strong converse results for various problems for which the cut-set bound is tight. For example:

1. Since the cut-set bound is tight for relay channels that are degraded, reversely degraded [36], or semideterministic [38], the exponentially strong converse holds for these channels.

2. Since the cut-set bound is tight for linear finite-field deterministic multicast networks [39], the extremely strong converse holds for these networks.

VII-C Broadcast Channel

A broadcast channel is a network where $\mathcal{Y}_{1}=\emptyset$ , $\mathcal{X}_{i}=\emptyset$ for all $i>1$ , and we allow multiple messages to originate at node 1, each to be decoded at a subset of nodes in $[2:d]$ . Note that this model includes scenarios where there are private messages, public messages, and/or messages intended for some decoders but not all. We claim that the weak edge removal property and the exponentially strong converse hold for discrete memoryless broadcast channels. Indeed, the $\mathcal{V}$ set in Theorem 10 is simply $\{1\}$ . Thus, for any sequence $(k_{n})_{n}$ (whether or not it is $o(n)$ ), $\mathcal{C}_{\{1\}}(\mathcal{N},0^{+},(k_{n})_{n})=\mathcal{C}(\mathcal{N},0^{+})$ , simply because if the extra nodes $a$ and $b$ can only communicate with node $1$ , then any processing done at nodes $a$ and $b$ can simply be reproduced internally at node 1. Theorem 10 immediately proves the claim.

For degraded broadcast channels, the strong converse was proved in [32], and the exponentially strong converse in [40]. However, since the capacity of the broadcast channel in general is unknown, strong converses for general broadcast channels have received little attention. As far as we know, this is the first strong (or exponentially strong) converse that has been proved for a problem for which the capacity region has no known single-letter characterization. In [41], a strong converse was established for a common randomness generation problem for which a single-letter characterization was established in [42]; this strong converse generalizes to non-discrete alphabets, including sources where the single-letter characterization has no known computable characterization, because of an auxiliary random variable. Both the result of [41] and our result on the broadcast channel are examples of strong converses for problems with no known computable rate region. The simplicity of the above proof on the broadcast channel, once we have Theorem 10, is particularly noteworthy.

VII-D Discrete 2-User Interference Channel with Strong Interference

A 2-user interference channel, illustrated in Fig. 4, is a network with 4 nodes, where $\mathcal{Y}_{1}=\mathcal{Y}_{2}=\mathcal{X}_{3}=\mathcal{X}_{4}=\emptyset$ , $\mathcal{D}_{1}=\{3\}$ , and $\mathcal{D}_{2}=\{4\}$ . Note that, to be consistent with the notation in the rest of the paper, the symbol received by the node decoding the first message is $Y_{3}$ , rather than $Y_{1}$ , as it is typically denoted.

Recall that an interference channel has strong interference [43] if

[TABLE]

for all $P_{X_{1}}(x_{1})P_{X_{2}}(x_{2})$ . The capacity region of the interference channel in this regime was found in [44] to be the set of rate pairs $(R_{1},R_{2})$ such that

[TABLE]

for some $P_{Q}(q)P_{X_{1}|Q}(x_{1}|q)P_{X_{2}|Q}(x_{2}|q)$ with $|\mathcal{Q}|\leq 4$ .
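As a small worked example, the Python sketch below evaluates the three mutual information bounds of this region for one fixed input distribution. It rests on several assumptions: the region is taken in its standard form from [44], namely $R_{1}\leq I(X_{1};Y_{3}|X_{2},Q)$, $R_{2}\leq I(X_{2};Y_{4}|X_{1},Q)$, and $R_{1}+R_{2}\leq\min\{I(X_{1},X_{2};Y_{3}|Q),I(X_{1},X_{2};Y_{4}|Q)\}$; time sharing is omitted ($|\mathcal{Q}|=1$); and the channel is a hypothetical toy in which both receivers observe $X_{1}\oplus X_{2}$, so the strong interference condition holds with equality.

```python
import math

def entropy(joint, idx):
    """Entropy in bits of the coordinates `idx` of a joint pmf
    given as {outcome_tuple: probability}."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def cond_mi(joint, a, b, c=()):
    """Conditional mutual information I(A;B|C) = H(AC) + H(BC) - H(ABC) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

# Coordinates: (x1, x2, y3, y4); inputs are independent and uniform.
joint = {(x1, x2, x1 ^ x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}

r1_bound = cond_mi(joint, (0,), (2,), (1,))          # I(X1; Y3 | X2)
r2_bound = cond_mi(joint, (1,), (3,), (0,))          # I(X2; Y4 | X1)
sum_bound = min(cond_mi(joint, (0, 1), (2,)),        # I(X1, X2; Y3)
                cond_mi(joint, (0, 1), (3,)))        # I(X1, X2; Y4)
```

For this toy channel each individual bound and the sum-rate bound all equal 1 bit, so the sum-rate constraint is the binding one.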

The following proposition establishes the exponentially strong converse under strong interference. The strong converse for the interference channel with very strong interference (in addition to fixed-error second-order results) was derived in [45]. The strong converse for the Gaussian interference channel with strong interference was proved in [46].

Proposition 15

For an interference channel with strong interference, weak edge removal and the exponentially strong converse hold.

Proof:

Note that the only nodes $i$ in an interference channel where $\mathcal{X}_{i}\neq\emptyset$ are the encoder nodes, i.e. nodes $1$ and $2$ . Thus, by Theorem 10, to prove the proposition it is enough to show that for any $k_{n}=o(n)$ , $\mathcal{C}_{\{1,2\}}(\mathcal{N},0^{+},(k_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})$ , where $\mathcal{C}(\mathcal{N},0^{+})$ is the region defined in (177)–(179).

We claim that an interference channel with strong interference also satisfies (176) for any joint distribution $P_{X_{1},X_{2}}$ , even when $X_{1},X_{2}$ are not independent. Consider any joint distribution $P_{X_{1},X_{2}}$ . For fixed $x_{2}$ , define $\tilde{X}_{1},\tilde{X}_{2}$ where $\tilde{X}_{1}\sim P_{X_{1}|X_{2}=x_{2}}$ and $\tilde{X}_{2}=x_{2}$ deterministically. Since $\tilde{X}_{2}$ is deterministic, $\tilde{X}_{1}$ and $\tilde{X}_{2}$ are trivially independent, so by (176) we have

[TABLE]

where $\tilde{Y}_{3},\tilde{Y}_{4}$ represent the outputs of the channel with $\tilde{X}_{1},\tilde{X}_{2}$ as inputs. Note that $P_{\tilde{X}_{1},\tilde{Y}_{3},\tilde{Y}_{4}}=P_{X_{1},Y_{3},Y_{4}|X_{2}=x_{2}}$ . Thus $I(\tilde{X}_{1};\tilde{Y}_{3}|\tilde{X}_{2})=I(X_{1};Y_{3}|X_{2}=x_{2})$ and $I(\tilde{X}_{1};\tilde{Y}_{4}|\tilde{X}_{2})=I(X_{1};Y_{4}|X_{2}=x_{2})$ , so by (180)

[TABLE]

Since (181) holds for any $x_{2}$ , we have

[TABLE]

Similar reasoning establishes the second inequality in (176) for any $P_{X_{1},X_{2}}$ . This proves the claim.

Now, by the same proof as the lemma in [44] for the independent case, for any $P_{X_{1}^{n},X_{2}^{n}}$ ,

[TABLE]

where

[TABLE]

Consider $(R_{1},R_{2})\in\mathcal{C}_{\{1,2\}}(\mathcal{N},0^{+},(k_{n})_{n})$ where $k_{n}=o(n)$ . Thus, there exists a sequence of codes with rates $(R_{1},R_{2})$ , with vanishing probability of error, on the modified network with an extra edge carrying $k_{n}$ bits as a function of the blocklength $n$ . Given a code of blocklength $n$ , let $Z_{t}$ be the signal sent on the edge $(a,b)$ at time $t\in[1:n]$ . Note that, since $k_{n}=o(n)$ , for most values of $t\in[1:n]$ , no bit is transmitted across $(a,b)$ at time $t$ (cf. the transmission schedule in (20)); for these $t$ we simply take $Z_{t}$ to be null. Certainly $H(Z^{n})\leq k_{n}$ . Since for $j=1,2$ , $X_{j}^{n}$ is a function of message $W_{j}$ and $Z^{n}$ , we have

[TABLE]

where (190) follows since the messages are assumed to be independent. Since node $a$ only has access to $W_{1},W_{2}$ , we have the Markov chain

[TABLE]

We now write

[TABLE]

where in (195) we have used the fact that $H(Z^{n})\leq k_{n}$ , and Fano’s inequality, where $\epsilon_{n}\to 0$ as $n\to\infty$ , and (197) holds by the Markov chain in (192). Similarly

[TABLE]

We also have

[TABLE]

where in (205) we have again used the Markov chain in (192). Combining (198) with (207) gives

[TABLE]

where (209) follows from (185). We may also repeat this argument to find (210) with $Y_{3}$ replaced by $Y_{4}$ . To summarize,

[TABLE]

One can see that this is precisely the region for the interference channel when both messages are required to be decoded at both decoders, except that we have close-to-independence instead of exact independence. The difficulty with condition (214) is not just that $X_{1}^{n},X_{2}^{n}$ are not perfectly independent, but that the dependence between individual letters $X_{1,t},X_{2,t}$ may vary depending on $t$ . The method of Dueck in [47] (also similar to Ahlswede’s “wringing” technique [48]) allows us to show that for most $t\in[1:n]$ , the letters $X_{1,t},X_{2,t}$ are nearly independent. This will allow single-letterization of the region in (211)–(214). In particular, there exist some $m\leq\sqrt{nk_{n}}$ and $t_{1},\ldots,t_{m}\in[1:n]$ , where for all $t\in[1:n]$

[TABLE]

where

[TABLE]

We reproduce the essential proof of this fact from [47] as follows. First, let

[TABLE]

If $\mathcal{T}_{1}$ is empty, then we may take $m=0$ and we are done. Otherwise, let $t_{1}$ be any element of $\mathcal{T}_{1}$ . We may write

[TABLE]

where (220) follows from (214) and the fact that $t_{1}\in\mathcal{T}_{1}$ as defined in (217). Next, let

[TABLE]

If $\mathcal{T}_{2}$ is empty, then we may take $m=1$ and again we are done. Otherwise, take $t_{2}$ to be any element of $\mathcal{T}_{2}$ , and proceed as above. This process must terminate after a finite number (say $m$ ) of steps, at which point (215) must hold for all $t$ . By a similar argument as in (218)–(220), for each $i\in[1:m]$

[TABLE]

and in particular

[TABLE]

Since the mutual information is nonnegative, we have $m\leq\sqrt{nk_{n}}$ .

We now have

[TABLE]

where

[TABLE]

Applying (211), and performing similar analyses for (212)–(213), combined with (215), we have

[TABLE]

Using standard tools to bound the cardinality of auxiliary random variables (e.g., [29, Appendix C]), for each $n$ , there exists a joint distribution $P^{(n)}_{QX_{1}X_{2}}$ with $|\mathcal{Q}|\leq 5$ that preserves the value of each mutual information quantity in (231)–(234). Recall that we started with a different code for each blocklength $n$ , so the above procedure results in a different joint distribution $P^{(n)}_{QX_{1}X_{2}}$ for each $n$ . This constitutes a sequence of joint distributions on a compact set, so there exists a convergent subsequence, with limit $P_{QX_{1}X_{2}}$ . Since $k_{n}=o(n)$ , $\epsilon_{n}\to 0$ , and mutual information is continuous for fixed alphabets, this limiting distribution must satisfy (177)–(179); moreover, in the limit (234) implies that $I(X_{1};X_{2}|Q)=0$ , so we may factor the joint distribution as $P_{Q}P_{X_{1}|Q}P_{X_{2}|Q}$ . Finally, we may further reduce the cardinality of the auxiliary random variable in (177)–(179) to $|\mathcal{Q}|\leq 4$ . ∎

VIII Conclusions

This paper explored the relationship between edge removal properties and strong converses. Our main results are summarized in Fig. 1. We found three main levels of properties for both edge removal and strong converse, and showed that for a very large class of networks, the strong converse property implies the corresponding edge removal property. Implications in the opposite direction hold for deterministic networks and sometimes for memoryless stationary networks.

Our strongest results are those for the “middle” level in Fig. 1, connecting the weak edge removal property to the exponentially strong converse. In particular, we showed that these properties are equivalent for all discrete memoryless stationary networks. Thus, if an existing weak converse or outer bound can be strengthened to show that it still holds in the presence of an extra link carrying a sub-linear number of bits, then the converse or outer bound also holds in an exponentially strong sense, meaning that for any rate vector outside the region, the probability of error converges to 1 exponentially fast. It appears that many existing arguments can be strengthened in this sense with relatively little effort, thereby proving exponentially strong results. We believe that this middle level deserves more focus than it has received so far, because exponentially strong converses and weak edge removal properties seem to hold for so many problems (at least under average probability of error). Therefore, one should always ask whether a given converse proof can be strengthened in this sense.

Several open problems remain:

1. The most important question is whether edge removal and strong converse properties hold in general. In particular, we know of no memoryless stationary network for which the weak edge removal property or the exponentially strong converse does not hold under average probability of error. The techniques of Sec. VII seem to allow one to prove a weak edge removal property (and thus an exponentially strong converse) for most (perhaps all) existing single-letter outer bounds, but there is no apparent way to do this without an existing single-letter result. Our observation that the properties hold for the discrete broadcast channel suggests that it may be possible to prove such results even for problems without known single-letter characterizations of the capacity region, but we know of no other cases for which this has been done.

2. Many of our results (particularly those showing that edge removal implies a strong converse) apply only for discrete channel coding problems; generalizing these results to continuous systems, channel cost constraints, source coding contexts, and random channel state would allow applicability to many other important network information theory problems.

3. We conjecture that an equivalence holds for discrete memoryless networks on the “lower layer” in Fig. 1, between very weak edge removal and the ordinary strong converse, but we have only been able to prove this result for deterministic networks and acyclic networks of independent point-to-point links.

4. Finally, it would be interesting to find a strong converse property equivalent to the extremely weak edge removal property.

Acknowledgements

The authors would like to thank Vincent Y. F. Tan, Michelle Effros, and Silas L. Fong for helpful discussions and feedback.

Appendix A Proof of Proposition 1

We will show that $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},(\tilde{\epsilon}_{n})_{n})$ ; the opposite direction follows by reversing the roles of $\epsilon_{n}$ and $\tilde{\epsilon}_{n}$ . Fix any rate vector

[TABLE]

We aim to show that $\mathbf{R}\in\mathcal{C}(\mathcal{N},(\tilde{\epsilon}_{n})_{n})$ . There exists $n_{0}\in\mathbb{N}$ such that for all $n\geq n_{0}$ , $\mathbf{R}\in\mathcal{R}(\mathcal{N},n,\epsilon_{n})$ . By the assumption of the proposition, there exists a subsequence $n_{i}$ such that

[TABLE]

For sufficiently large $i$ , we have $n_{i}\geq n_{0}$ , so $\mathbf{R}\in\mathcal{R}(\mathcal{N},n_{i},\epsilon_{n_{i}})$ . That is, there exists an $n_{i}$ -length code with rate $\mathbf{R}$ and probability of error at most $\epsilon_{n_{i}}$ . Fix integer $N$ , and form a new code on network $\mathcal{N}$ of length $n_{i}N$ and rate $\frac{N-2}{N}\mathbf{R}$ as follows. Roughly, reduce the overall probability of error by repeating the original code $N$ times, and introducing a small amount of error correction in the form of an outer maximum distance separable (MDS) code [49, Chap. 4]. In particular, for each node $v\in[1:d]$ where $R_{v}>0$ , form an $(N,N-2)$ MDS code on symbols from the finite field of order $2^{\lfloor n_{i}R_{v}\rfloor}$ . This code exists for sufficiently large $i$ (e.g., a Reed-Solomon code [49, Chap. 5]). Let the MDS codeword be denoted by $(W_{v}(1),\ldots,W_{v}(N))$ . Repeat the original code $N$ times, where on the $\ell$ th repetition $W_{v}(\ell)$ is treated as the message originating at node $v$ . Because each outer code is MDS, one error can be corrected, so if at most one of the $N$ repetitions results in an error, the full code will decode correctly. Because the network is memoryless and stationary, each repetition is independent and results in error with probability $\epsilon_{n_{i}}$ , so the probability of error for the full code is given by

[TABLE]
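The effect of the outer code on the error probability can be illustrated numerically. The sketch below assumes, as in the argument above, that the $N$ repetitions fail independently, each with probability exactly `eps`, and that the stacked code succeeds whenever at most one repetition fails. The function name is hypothetical.

```python
def stacked_success_probability(eps, N):
    """Probability that at most one of N independent repetitions fails,
    when each fails with probability eps."""
    none_fail = (1.0 - eps) ** N
    one_fails = N * eps * (1.0 - eps) ** (N - 1)
    return none_fail + one_fails

# Regime of interest: eps close to 1, so success is rare. The success
# probability then behaves like N * (1 - eps)^(N - 1), since the factor
# (1 - eps + N * eps) tends to N.
eps, N = 0.999, 4
success = stacked_success_probability(eps, N)
ratio = success / (N * (1.0 - eps) ** (N - 1))
```

In this example `ratio` equals $(1-\epsilon+N\epsilon)/N\approx 0.99925$, so the exponent of the success probability improves from $1$ to roughly $N-1$ copies of $-\log(1-\epsilon)$ at the cost of an $N$-fold longer blocklength.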

Note that (236) and the assumption that $\alpha>0$ imply that $\epsilon_{n_{i}}\to 1$ , meaning $1-\epsilon_{n_{i}}+N\epsilon_{n_{i}}\to N$ . Thus

[TABLE]

In particular, for sufficiently large $i$ , we have

[TABLE]

Hence, for any $N$ and sufficiently large $i$ ,

[TABLE]

Consider any blocklength $m$ where $n_{i}N\leq m\leq n_{i}(N+1)$ . We may convert a code with blocklength $n_{i}N$ to one with blocklength $m$ simply by ignoring the additional $m-n_{i}N$ symbols. This reduces the rate by a factor of $\frac{n_{i}N}{m}\geq\frac{N}{N+1}$ , but does not change the probability of error. Thus we have

[TABLE]

By the liminf assumption on $\tilde{\epsilon}_{n}$ in (13), for sufficiently large $m$ we have

[TABLE]

Thus, if $m\geq n_{i}N$ , we have

[TABLE]

where (245) holds by (244) for sufficiently large $i$ . Hence, for any $N$ , for all $m$ sufficiently large we have

[TABLE]

Thus

[TABLE]

Since (248) holds for all $N$ , and $\mathcal{C}(\mathcal{N},(\tilde{\epsilon}_{n})_{n})$ is closed, we have $\mathbf{R}\in\mathcal{C}(\mathcal{N},(\tilde{\epsilon}_{n})_{n})$ . Note that both $i$ and $N$ must go to infinity, but $i$ converges to infinity first for fixed $N$ in (240).

Appendix B Proof of Proposition 2

Extremely strong converse $\Leftrightarrow$ (1b): By taking $\gamma=K\alpha$ , the extremely strong converse holds if and only if, for any $\alpha\geq 0$ ,

[TABLE]

By Proposition 1, $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})=\mathcal{C}(\mathcal{N},(1-2^{-n\alpha})_{n})$ if $1-\epsilon_{n}\doteq 2^{-n\alpha}$ . This proves that the extremely strong converse is equivalent to the condition in (1b).

(1a) $\Rightarrow$ (1b). Consider any $\epsilon_{n}$ where $1-\epsilon_{n}\doteq 2^{-n\alpha}$ , and any $\mathbf{R}\in\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ . If $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})$ , then obviously $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})+[0,K\alpha]^{d}$ . If $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ , then by condition (1a) we have $\alpha\geq\beta/K$ , and $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})+[0,\beta]^{d}$ . Thus $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})+[0,K\alpha]^{d}$ . This proves (1b).

(1b) $\Rightarrow$ (1a). Consider any $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ , and any sequence of $(\mathbf{R},n)$ codes with probability of error $\epsilon_{n}$ . By Proposition 1, this implies $\mathbf{R}\in\mathcal{C}(\mathcal{N},(1-2^{-n\alpha})_{n})$ , where

[TABLE]

Hence, by condition (1b), $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})+[0,K\alpha]^{d}$ . If $\beta$ is the smallest number such that $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})+[0,\beta]^{d}$ , then we have $\beta\leq K\alpha$ . This proves (17), and hence (1a).

Exponentially strong converse $\Rightarrow$ (2b). Let $\epsilon_{n}$ be a sequence where $-\log(1-\epsilon_{n})=o(n)$ . By the exponentially strong converse, for any $\gamma>0$ there exists $\epsilon^{\prime}_{n}$ where $-\log(1-\epsilon^{\prime}_{n})=\Theta(n)$ where (16) holds. For sufficiently large $n$ , $-\log(1-\epsilon_{n})\leq-\log(1-\epsilon^{\prime}_{n})$ , meaning $\epsilon_{n}\leq\epsilon^{\prime}_{n}$ . Thus

[TABLE]

As this holds for all $\gamma>0$ , we have $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})$ . This proves condition (2b).

(2b) $\Rightarrow$ Exponentially strong converse. We prove the contrapositive: if the exponentially strong converse does not hold, then condition (2b) does not hold. Suppose there exists $\gamma>0$ such that for all $\epsilon_{n}$ where $-\log(1-\epsilon_{n})=\Theta(n)$ , $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\not\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}$ . In particular, for any integer $r$ , $\mathcal{C}(\mathcal{N},(1-\exp\{-n/r\})_{n})\not\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}$ . Since the sets $\mathcal{C}(\mathcal{N},(1-\exp\{-n/r\})_{n})$ are nested (decreasing as $r$ grows), there exists $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ that lies in the interior of $\mathcal{C}(\mathcal{N},(1-\exp\{-n/r\})_{n})$ for all integers $r$ . For all $r$ , there exists $n_{0}(r)$ such that for all $n\geq n_{0}(r)$ ,

[TABLE]

Define a sequence

[TABLE]

Note that $-\log(1-\epsilon_{n})\leq n/r$ for $n\geq n_{0}(r)$ , so $-\log(1-\epsilon_{n})=o(n)$ . Moreover, for any $n$ , there is some $r$ such that $n\geq n_{0}(r)$ and $\epsilon_{n}=1-\exp\{-n/r\}$ , so by (252), $\mathbf{R}\in\mathcal{R}(\mathcal{N},n,\epsilon_{n})$ for all $n$ . Thus $\mathbf{R}\in\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ . But since $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ , (2b) does not hold.
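The diagonal construction of $\epsilon_{n}$ can be sketched in code. This is only an illustration under stated assumptions: the thresholds $n_{0}(r)$ are taken to be nondecreasing and unbounded in $r$, the index is chosen as the largest $r$ with $n_{0}(r)\leq n$ (one choice consistent with the properties used above, not necessarily the exact definition in the omitted display), and `n0_example` is a hypothetical threshold function.

```python
def diagonal_r(n: int, n0) -> int:
    """Largest r >= 1 with n0(r) <= n.

    Assumes n0 is nondecreasing and unbounded, and that n >= n0(1).
    """
    if n < n0(1):
        raise ValueError("n must be at least n0(1)")
    r = 1
    while n0(r + 1) <= n:
        r += 1
    return r


def normalized_exponent(n: int, n0) -> float:
    """Value of -(1/n) * log(1 - eps_n) when eps_n = 1 - exp(-n / r_n)."""
    return 1.0 / diagonal_r(n, n0)


def n0_example(r: int) -> int:
    # Hypothetical thresholds, for illustration only.
    return r * r
```

With `n0_example`, the normalized exponent at blocklength $n$ is $1/\lfloor\sqrt{n}\rfloor$, which vanishes as $n$ grows, matching $-\log(1-\epsilon_{n})=o(n)$.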

(2a) $\Rightarrow$ (2b). By (2a), for any $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ , the probability of correct decoding must vanish exponentially fast, so $\mathbf{R}\notin\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ for any sequence $\epsilon_{n}$ such that $-\log(1-\epsilon_{n})=o(n)$ . Therefore $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})$ , which proves (2b).

(2b) $\Rightarrow$ (2a). For any $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ and any sequence $\epsilon_{n}$ for which $\mathbf{R}\in\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ , it cannot be that $-\log(1-\epsilon_{n})=o(n)$ , or else by (2b) we would have $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+})$ . Therefore $\epsilon_{n}$ must approach 1 exponentially fast, which proves (2a).

Strong converse $\Rightarrow$ (3b). Note that the condition in the definition of the strong converse that $-\log(1-\epsilon_{n})\to\infty$ can be more simply written as $\epsilon_{n}\to 1$ . Consider any $\epsilon\in(0,1)$ . By the strong converse, for any $\gamma>0$ , there exists a sequence $\epsilon_{n}\to 1$ where $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}$ . Noting that $\epsilon\leq\epsilon_{n}$ for sufficiently large $n$ , we have $\mathcal{C}(\mathcal{N},(\epsilon)_{n})\subseteq\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}$ . As this holds for all $\gamma>0$ , we have $\mathcal{C}(\mathcal{N},(\epsilon)_{n})=\mathcal{C}(\mathcal{N},0^{+})$ , which proves (3b).

(3b) $\Rightarrow$ (3c). By (3b), for any integer $r$ , $\mathcal{C}(\mathcal{N},(1-1/r)_{n})=\mathcal{C}(\mathcal{N},0^{+})$ . In particular, there exists $n_{0}(r)$ such that for all $n\geq n_{0}(r)$ ,

[TABLE]

Define a sequence

[TABLE]

Certainly $\epsilon_{n}\geq 1-1/r$ for $n\geq n_{0}(r)$ , meaning $\epsilon_{n}\to 1$ . Moreover, if $n,r$ are such that $\epsilon_{n}=1-\frac{1}{r}$ , then

[TABLE]

Since $1-\epsilon_{n}\to 0$ , we have

[TABLE]

This proves (3c).

(3c) $\Rightarrow$ Strong converse. By (3c), there exists a sequence $\epsilon_{n}\to 1$ where $\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})=\mathcal{C}(\mathcal{N},0^{+})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}$ for all $\gamma>0$ . This proves the strong converse.

(3c) $\Rightarrow$ (3a). By (3c), there exists $\epsilon_{n}\to 1$ where $\mathbf{R}\notin\mathcal{C}(\mathcal{N},(\epsilon_{n})_{n})$ for any $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ . This implies that any sequence of $(\mathbf{R},n)$ codes must have probability of error exceeding $\epsilon_{n}$ for sufficiently large $n$ , so the probability of error must approach 1, which proves (3a).

(3a) $\Rightarrow$ (3b). For any $\epsilon\in(0,1)$ , by (3a) any $\mathbf{R}\notin\mathcal{C}(\mathcal{N},0^{+})$ has probability of error approaching 1, so $\mathbf{R}\notin\mathcal{C}(\mathcal{N},(\epsilon)_{n})$ . Therefore, $\mathcal{C}(\mathcal{N},(\epsilon)_{n})=\mathcal{C}(\mathcal{N},0^{+})$ , which proves (3b).

Appendix C Proof of Proposition 3

Consider a channel where (19) holds. For any $Q_{X,Y}$ , we may write

[TABLE]

where (261) follows from (19), and the fact that relative entropy is non-negative. Thus, we may lower bound $\alpha(R)$ by

[TABLE]

where (263) holds because $x+|y-x|^{+}\geq y$ for any real numbers $x,y$ . This lower bound is achievable by setting $Q_{X,Y}=P_{X}\times P_{Y|X}$ , where $P_{X}$ is any capacity-achieving input distribution, so indeed $\alpha(R)=R-C$ .
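The elementary inequality invoked for (263) follows from a two-case check, assuming the standard notation $|z|^{+}=\max\{z,0\}$:

```latex
x + |y-x|^{+} =
\begin{cases}
x + (y-x) = y, & y \ge x,\\
x + 0 = x > y, & y < x,
\end{cases}
\qquad\text{so}\qquad x + |y-x|^{+} = \max\{x,y\} \ge y .
```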

Now consider a channel where (19) does not hold. That is, there exists some $x_{0},y_{0}$ where

[TABLE]

Let $P_{X}$ be any capacity-achieving input distribution. Thus,

[TABLE]

In particular, there exists some $x_{1},y_{1}$ where

[TABLE]

and $P_{X}(x_{1})P_{Y|X}(y_{1}|x_{1})>0$ . For parameter $\lambda\geq 0$ , define a joint distribution $Q_{X,Y}^{(\lambda)}$ where

[TABLE]

As long as $0\leq\lambda\leq P_{X}(x_{1})P_{Y|X}(y_{1}|x_{1})$ , this is a valid distribution. If we marginalize out $X$ , we see that

[TABLE]

By [51, Lemma 17.3.3], the first term in the Taylor expansion for $D(Q_{Y}^{(\lambda)}\|P_{Y})$ around $\lambda=0$ is

[TABLE]

By [50, Cor. 1 in Sec. 4.5], $P_{Y}(y)>0$ for all $y$ that are reachable from some input symbol. Note that (264) implies that $P_{Y|X}(y_{0}|x_{0})>0$ , and also by assumption $P_{Y|X}(y_{1}|x_{1})>0$ . That is, both $y_{0}$ and $y_{1}$ are reachable output symbols, so $P_{Y}(y_{0}),P_{Y}(y_{1})>0$ . Thus in (269) the coefficient on $\lambda^{2}$ is finite, and so

[TABLE]
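The quadratic behavior of the relative entropy near $\lambda=0$ can be checked numerically. The sketch below assumes, for illustration only, that the perturbed output distribution moves mass $\lambda$ from $y_{1}$ to $y_{0}$ (the exact form of $Q_{Y}^{(\lambda)}$ is in the omitted display), and it compares the relative entropy in nats with half the $\chi^{2}$ distance. The helper names are hypothetical.

```python
import math


def kl(q, p):
    """Relative entropy D(q || p) in nats for pmfs given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def chi2_half(q, p):
    """Half the chi-squared distance: the leading Taylor term of D(q || p)."""
    return 0.5 * sum((qi - pi) ** 2 / pi for qi, pi in zip(q, p))


def perturb(p, y0, y1, lam):
    """Assumed perturbation: move mass lam from symbol y1 to symbol y0."""
    q = list(p)
    q[y0] += lam
    q[y1] -= lam
    return q
```

For a perturbation of this form the leading term is $\frac{\lambda^{2}}{2}\big(\frac{1}{P_{Y}(y_{0})}+\frac{1}{P_{Y}(y_{1})}\big)$, which is finite exactly when both output symbols have positive probability.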

Noting that

[TABLE]

we have

[TABLE]

where we have used the assumptions in (264) and (266).

Applying the derivation in (258)–(260), we have

[TABLE]

where we have used (270), (272), and the fact that $\zeta$ is also the derivative of the second term in (274).

Given $\lambda$ small enough so that $Q^{(\lambda)}_{X,Y}$ is a valid distribution, we may upper bound

[TABLE]

Thus,

[TABLE]

where in (279) we have used the fact that $Q_{X,Y}^{(0)}=P_{X}\times P_{Y|X}$ , so $I_{Q^{(0)}_{X,Y}}(X;Y)=C$ ; and (280) follows from the definition of $\zeta$ in (272), as well as (275). Note also that this derivation is valid only because $\zeta>0$ , as shown in (272). Since $\alpha(R)$ is non-decreasing in $R$ , we must have $\frac{d\alpha(R)}{dR}\big{|}_{R=C}=0$ .

Appendix D Proof of Proposition 4

Statement 1 follows immediately from the definition of the strong edge removal property.

We now prove statement 2. Suppose the weak edge removal property holds. Thus, for any $\gamma>0$ , there exists a sequence $k_{n}=\Theta(n)$ satisfying (22). Let

[TABLE]

Note that $\delta^{\prime}>0$ because $k_{n}=\Theta(n)$ , and so for any $0<\delta<\delta^{\prime}$ , we have $\delta n\leq k_{n}$ for sufficiently large $n$ . Thus

[TABLE]

Hence, the LHS of (24) is contained in $\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}$ . Since this holds for all $\gamma>0$ , this proves (24).

Now we show that (24) implies the weak edge removal property. For any $\gamma>0$ , by (24) there exists $\delta>0$ such that $\mathcal{C}(\mathcal{N},0^{+},(\delta n)_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+})+[0,\gamma]^{d}$ . Thus, setting $k_{n}=\delta n$ satisfies (22). This proves the weak edge removal property.

To prove that the weak edge removal property is also equivalent to (25), we will show that

[TABLE]

To show $\subseteq$ in (283), we need to show that for all $k_{n}=o(n)$ , $\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})$ is contained in the RHS of (283), or that $\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+},(\delta n)_{n})$ for all $\delta>0$ . Indeed this holds because for any $k_{n}=o(n)$ and any $\delta>0$ , $k_{n}\leq\delta n$ for sufficiently large $n$ . To show $\supseteq$ in (283), let $\mathbf{R}$ be in the RHS of (283). Thus, for all $\epsilon,\delta,\gamma>0$ , for sufficiently large $n$ we have $\mathbf{R}\in\mathcal{R}(\mathcal{N},n,\epsilon,n\delta)+[0,\gamma]^{d}$ . In particular, for any fixed integer $r$ , we may let $\epsilon=\delta=\gamma=1/r$ , so there exists $n_{0}(r)$ such that for all $n\geq n_{0}(r)$ we have

[TABLE]

Let

[TABLE]

By (284), for any $n$ we have

[TABLE]

Letting $k_{n}=\frac{n}{r_{n}}$ , we may rewrite (286) as

[TABLE]

Note that for any integer $r$ , if $n\geq n_{0}(r)$ , then $r_{n}\geq r$ , so $k_{n}\leq n/r$ . Thus $k_{n}/n\to 0$ ; i.e., $k_{n}=o(n)$ . From (287), we have $\mathbf{R}\in\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})$ . This proves $\supseteq$ in (283).
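For concreteness, one choice of $r_{n}$ consistent with the properties used above is sketched below. This is an assumption about the omitted definition, not a quotation of it; it presumes that $n_{0}(r)$ is nondecreasing and unbounded in $r$ (which can be arranged without loss of generality) and that $n\geq n_{0}(1)$.

```latex
r_n = \max\{\, r\in\mathbb{N} : n_0(r)\le n \,\}, \qquad
k_n = \frac{n}{r_n}, \qquad
n\ge n_0(r)\ \Longrightarrow\ r_n\ge r\ \Longrightarrow\ \frac{k_n}{n}=\frac{1}{r_n}\le\frac{1}{r}.
```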

We now prove statement 3. Note that the very weak edge removal property is equivalent to the statement that for all $\gamma>0$ ,

[TABLE]

This is easily seen to be equivalent to (26).

To show that the very weak edge removal property is also equivalent to (27), we show that

[TABLE]

Noting that

[TABLE]

it is enough to show that for all $\epsilon>0$ ,

[TABLE]

For any $k\in\mathbb{N}$ and any sequence $k_{n}\to\infty$ , $k\leq k_{n}$ for sufficiently large $n$ . Thus

[TABLE]

Taking a closure yields $\supseteq$ in (291), since the LHS of (291) is already closed. To prove the opposite direction, let $\gamma_{k}$ be a positive sequence where $\lim_{k\to\infty}\gamma_{k}=0$ . For fixed $\epsilon\in(0,1)$ and $k\in\mathbb{N}$ , by the definition of $\mathcal{C}(\mathcal{N},(\epsilon)_{n},(k)_{n})$ in (21), there exists $n_{0}(k)$ such that for all $n\geq n_{0}(k)$ , we have

[TABLE]

Now define a sequence

[TABLE]

Note that for any $k\in\mathbb{N}$ , $k_{n}\geq k$ for all $n\geq n_{0}(k)$ , so $k_{n}\to\infty$ as $n\to\infty$ . Thus the LHS of (291) is contained in $\mathcal{C}(\mathcal{N},(\epsilon)_{n},(k_{n})_{n})$ . Moreover

[TABLE]

where (295) holds by definition, (296) follows from (293), (297) holds because $\gamma_{k}\to 0$ , and (298) holds because for any $n^{\prime}$ , $k_{n^{\prime}}$ is some integer. This proves $\subseteq$ in (291).

We now prove statement 4. The definition of the extremely weak edge removal property may be equivalently written

[TABLE]

Note that for any bounded $k_{n}$ , $\mathcal{C}(\mathcal{N},0^{+},(k_{n})_{n})\subseteq\mathcal{C}(\mathcal{N},0^{+},(k)_{n})$ for some constant integer $k$ . Thus the LHS of (299) can be written

[TABLE]

Moreover, the RHS of (299) is simply $\mathcal{C}(\mathcal{N},0^{+})$ . Therefore the extremely weak edge removal property is equivalent to (28).

Appendix E Proof of Theorem 14

A significant technical tool in proving network equivalence (cf. the discussion in Sec. VI and the original result in [35]) is channel simulation, in which a point-to-point channel is accurately simulated by any other channel with higher capacity. This idea was at the heart of the proof in [35]. A version of it appears in [53] as the universal channel simulation lemma, restated below as Lemma 16. The lemma states that two nodes with shared randomness (represented by $U$ ) can use a noiseless link to accurately simulate a noisy channel, as long as the capacity of the noiseless link is greater than the capacity of the noisy channel. While [53] did not provide a proof, we presented a proof in the appendix of [54].

Lemma 16

Let $(\mathcal{X},Q_{Y|X},\mathcal{Y})$ be a discrete memoryless channel with capacity $C$ . Given a rate $R>C$ , a channel simulation code $(f,g)$ consists of

• $f:\mathcal{X}^{n}\times[0,1]\to\{0,1\}^{nR}$ ,

• $g:\{0,1\}^{nR}\times[0,1]\to\mathcal{Y}^{n}$ .

Let $P_{Y^{n}|X^{n}}$ be the conditional pmf of $Y^{n}$ given $X^{n}$ where $U\sim\text{Unif}[0,1]$ and

[TABLE]

There exists a sequence of length- $n$ simulation codes where

[TABLE]

We now proceed to prove Theorem 14. By Theorem 5, we only need to show that the very weak edge removal property implies the ordinary strong converse. The basic approach is to use network equivalence to convert a code for noisy network $\mathcal{N}$ into a code on the noiseless version, then apply Lemma 9 on this noiseless network, and then again use network equivalence to convert back to the noisy network.

Let $\mathcal{E}\subset[1:d]\times[1:d]$ be the set of pairs of nodes connected by point-to-point links. Recall that by assumption, the directed graph $([1:d],\mathcal{E})$ is acyclic. Thus, by [55, Prop. 19.1] we may assign each node $i$ a distinct integer $\pi_{i}\in[1:d]$ where $\pi_{i}<\pi_{j}$ if $(i,j)\in\mathcal{E}$ . For any $(i,j)\in\mathcal{E}$ , let $C_{i\to j}$ be the capacity of the link from $i$ to $j$ . Assume without loss of generality that $C_{i\to j}>0$ for all $(i,j)\in\mathcal{E}$ . Let $C_{\min}=\min_{(i,j)\in\mathcal{E}}C_{i\to j}$ , so in particular $C_{\min}>0$ . Denote $X_{i\to j}$ and $Y_{i\to j}$ as the input and output respectively of the link $(i,j)$ . Thus the transmitted symbol from node $i$ can be written

[TABLE]

and the received symbol at node $j$ can be written

[TABLE]
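The labeling $\pi_{i}$ is a topological ordering of the acyclic graph $([1:d],\mathcal{E})$ . A minimal sketch of how such labels could be computed uses the standard-library `graphlib` module (Python 3.9 or later); the function name `topological_labels` is hypothetical.

```python
from graphlib import TopologicalSorter


def topological_labels(d, edges):
    """Return {node: pi_node} with distinct labels in 1..d such that
    pi_i < pi_j for every directed edge (i, j).

    Nodes are numbered 1..d. TopologicalSorter expects a mapping from each
    node to its predecessors and raises graphlib.CycleError on a cycle.
    """
    preds = {node: set() for node in range(1, d + 1)}
    for i, j in edges:
        preds[j].add(i)
    order = TopologicalSorter(preds).static_order()
    return {node: rank for rank, node in enumerate(order, start=1)}
```

Any valid topological order works for the pipelining argument; the particular order returned here is not significant.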

Let $\mathbf{R}$ be achievable with respect to fixed $\epsilon\in(0,1)$ . Thus, for sufficiently large $n$ , there exists a length- $n$ code for network $\mathcal{N}$ with rate $\mathbf{R}$ and probability of error $\epsilon$ . By (9)–(10), this code is defined by encoding functions $\phi_{it}$ for each node $i\in[1:d]$ and time $t\in[1:n]$ , and decoding functions $\psi_{i}$ for each node $i\in[1:d]$ . It will be useful to work with coding functions on $n$ -length blocks rather than single time instances, so we define the block-wise encoding function at node $i$

[TABLE]

as

[TABLE]

Using the notation in (304), we may notate the arguments to this function as

[TABLE]

Due to the network being acyclic, we may form a pipelined block-Markov version of this code as follows. Given integer $N$ , we form a code with length $n(N+d)$ and rate $\frac{N}{N+d}\mathbf{R}$ . The outer blocklength $N$ serves a similar function as it did for network stacking, but here it represents the number of message blocks transmitted in succession, rather than the number of stacks. Note that message $i$ consists of $NnR_{i}$ bits, which we denote $W_{i}(1),\ldots,W_{i}(N)$ , each consisting of $nR_{i}$ bits. We then pipeline $N$ copies of the original code, encoding one $n$ -length block at a time. In particular, we introduce notation

[TABLE]

Now, we define the coding operations at node $j$ by, for all $\ell\in[1:N]$ ,

[TABLE]

Recall that if $(i,j)\in\mathcal{E}$ , then $\pi_{i}<\pi_{j}$ , meaning that the arguments of $\phi_{j}^{n}$ in (310) are causally available. Note that (310) does not specify all channel inputs, namely $X_{j}^{n}(\ell^{\prime})$ for $\ell^{\prime}\in[1:\pi_{j}]\cup[N+\pi_{j}+1:N+d]$ ; these channel inputs can be arbitrary, as the corresponding channel outputs will be ignored. To decode at node $i$ , for all $\ell\in[1:N]$ let

[TABLE]

Observe that the variables associated with a given index $\ell\in[1:N]$ interact only with one another, and behave exactly like the original $n$ -length code. Thus, an error occurs on this pipelined code if and only if any of the $N$ copies makes an error, so the probability of error is

[TABLE]

Thus we have

[TABLE]
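The bookkeeping of the pipelined code can be illustrated with a short sketch. It rests on assumptions not spelled out in the surviving text: each of the $N$ copies errs with probability exactly $\epsilon$ , the copies err independently (they use disjoint messages and disjoint channel uses), and node $j$ sends its block for message index $\ell$ in slot $\ell+\pi_{j}$ , which is consistent with the unspecified slots $[1:\pi_{j}]\cup[N+\pi_{j}+1:N+d]$ noted above. All function names are hypothetical.

```python
def pipelined_parameters(n: int, N: int, d: int, eps: float):
    """Blocklength, rate scaling, and error probability of the pipelined code.

    Assumes N independent copies, each failing with probability eps.
    """
    blocklength = n * (N + d)
    rate_scale = N / (N + d)
    error = 1.0 - (1.0 - eps) ** N  # at least one of the N copies fails
    return blocklength, rate_scale, error


def transmit_slot(ell: int, pi_j: int) -> int:
    """Assumed slot (of length n) in which node j sends block ell."""
    return ell + pi_j


def idle_slots(N: int, d: int, pi_j: int):
    """Slots in [1 : N+d] where node j's channel input is arbitrary."""
    used = {transmit_slot(ell, pi_j) for ell in range(1, N + 1)}
    return sorted(set(range(1, N + d + 1)) - used)
```

As $N$ grows the rate scaling $N/(N+d)$ approaches 1 while the error probability approaches 1, which is why the argument relies on a result that tolerates error probabilities close to 1.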

Note that in this pipelined code, encoding operations are performed on one $n$ -length block at a time. Thus, the pipelined code on $\mathcal{N}$ can be converted to one on a deterministic network using channel simulation codes. In particular, fix $\Delta\in(0,C_{\min})$ and let $\bar{\mathcal{N}}_{\Delta}$ be the network in which each link $(i,j)$ is replaced by a noiseless link with capacity $C_{i\to j}+\Delta$ . By Lemma 16, for each link $(i,j)$ there exists a channel simulation code for link $(i,j)$ of rate $C_{i\to j}+\Delta$ and total variational distance at most $d^{(i\to j)}_{n}$ , where $d^{(i\to j)}_{n}\to 0$ as $n\to\infty$ . For each link $(i,j)\in\mathcal{E}$ , we use $N$ copies of the associated channel simulation code to simulate the behavior of link $(i,j)$ in network $\mathcal{N}$ using the corresponding link on $\bar{\mathcal{N}}_{\Delta}$ . We analyze the impact on the overall probability of error from replacing these noisy channels by channel simulation codes as follows. Let $P_{\mathbf{X},\mathbf{Y},\mathbf{W},\hat{\mathbf{W}}}$ be the joint distribution of all channel inputs $\mathbf{X}$ , channel outputs $\mathbf{Y}$ , messages $\mathbf{W}$ , and message estimates $\hat{\mathbf{W}}$ for the pipelined code on noisy network $\mathcal{N}$ . Similarly, let $Q_{\mathbf{X},\mathbf{Y},\mathbf{W},\hat{\mathbf{W}}}$ be the joint distribution of the same random variables for the code on noiseless network $\bar{\mathcal{N}}_{\Delta}$ constructed out of channel simulation codes. Note that in the latter, $\mathbf{X}$ and $\mathbf{Y}$ are not real channel inputs and outputs, but rather simulated inputs and outputs that feed into the channel simulation codes, used to simulate noisy links with noiseless links. Since each channel simulation code used on an $n$ -length block for link $(i,j)$ results in total variational distance at most $d^{(i\to j)}_{n}$ , we may bound

[TABLE]

The probability of error for the code on the noiseless network $\bar{\mathcal{N}}_{\Delta}$ differs from that on the original noisy network by at most the quantity in (314). Because total variational distance is an upper bound on the difference in the probability of any event between the two distributions, the probability of error of the resulting code on $\bar{\mathcal{N}}_{\Delta}$ is at most

[TABLE]

where the inequality holds for sufficiently large $n$ , since each sequence $d^{(i\to j)}_{n}$ vanishes with $n$ . Recall that the channel simulation codes described in Lemma 16 employ common randomness $U$ between the transmitter and receiver of each link. Thus, a direct application of Lemma 16 implies only the existence of a code achieving the probability in (315) if nodes are allowed common randomness. However, we may treat this common randomness as a randomized codebook, and employ a usual random coding argument to show that there exists at least one deterministic code achieving (315). Hence, for sufficiently large $n$ ,

[TABLE]
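The fact used above, that total variational distance bounds the change in the probability of any event, can be verified by brute force on a small alphabet. This is an illustrative sketch with hypothetical function names; it enumerates all events, so it is only practical for tiny supports.

```python
from itertools import combinations


def total_variation(p, q):
    """Total variational distance between two pmfs on the same finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def max_event_gap(p, q):
    """Maximum of |P(A) - Q(A)| over all events A, by exhaustive search."""
    indices = range(len(p))
    best = 0.0
    for size in range(len(p) + 1):
        for event in combinations(indices, size):
            gap = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, gap)
    return best
```

The maximum is attained by the event $\{x:P(x)>Q(x)\}$ , so the two quantities coincide. In particular, the error event has probabilities under $P$ and $Q$ that differ by at most the total variational distance.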

We now apply Lemma 9 on $\bar{\mathcal{N}}_{\Delta}$ , to find that for any $\tilde{\epsilon}>0$ and for sufficiently large $n$ , we have

[TABLE]

where $\eta(\tilde{\epsilon},d)$ is defined in (42).

Let $\bar{\mathcal{N}}_{-\Delta}$ be the noiseless network where each link $(i,j)$ is replaced by a noiseless one with capacity $C_{i\to j}-\Delta$ . By the assumption that $\Delta<C_{\min}$ , we always have $C_{i\to j}-\Delta>0$ . We may convert the code on $\bar{\mathcal{N}}_{\Delta}$ to one on $\bar{\mathcal{N}}_{-\Delta}$ by stretching each block of $n$ to one of length

[TABLE]

Thus

[TABLE]
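The stretching step can be made concrete as follows. The exact expression for $n^{\prime}$ is in the omitted display; the sketch below assumes one natural choice, namely the smallest integer $n^{\prime}$ such that every link of capacity $C_{i\to j}-\Delta$ carries in $n^{\prime}$ uses at least as many bits as a link of capacity $C_{i\to j}+\Delta$ carries in $n$ uses. The function name is hypothetical.

```python
import math


def stretched_blocklength(n: int, capacities, delta: float) -> int:
    """Smallest n' with n' * (C - delta) >= n * (C + delta) for every link.

    `capacities` lists the link capacities C_{i->j}; requires
    0 < delta < min(capacities).
    """
    c_min = min(capacities)
    if not 0 < delta < c_min:
        raise ValueError("delta must lie in (0, C_min)")
    # (C + delta) / (C - delta) is decreasing in C, so the smallest
    # capacity dictates the stretch; the max below makes that explicit.
    return max(math.ceil(n * (c + delta) / (c - delta)) for c in capacities)
```

Under this assumption the rate is scaled by roughly $\frac{C_{\min}-\Delta}{C_{\min}+\Delta}$ , which tends to 1 as $\Delta\to 0$ .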

Now we use ordinary noisy channel codes to convert this code back to one on $\mathcal{N}$ , again one block (now of length $n^{\prime}$ ) at a time. For any $N$ and sufficiently large $n$ , the probability of an error occurring on any of these channel codes can be made at most $\tilde{\epsilon}$ . Thus we have

[TABLE]

As the above holds for any $\tilde{\epsilon}>0$ , we may write

[TABLE]

Since we may take $N$ to be arbitrarily large, and $\Delta$ arbitrarily small, and we chose $\mathbf{R}$ to be any achievable vector with respect to $\epsilon$ , by closure we have

[TABLE]

By the equivalent form of the very weak edge removal property in (27) of Proposition 4, if very weak edge removal holds, then the RHS of (323) equals $\mathcal{C}(\mathcal{N},0^{+})$ , so the strong converse holds.

Bibliography

[1] T. Ho, M. Effros, and S. Jalali, “On equivalence between network topologies,” in Proc. Forty-Eighth Annual Allerton Conference, Monticello, IL, Oct. 2010.
[2] S. Jalali, M. Effros, and T. Ho, “On the impact of a single edge on the network coding capacity,” in Proc. Information Theory and Applications Workshop (ITA), San Diego, CA, Feb. 2011, pp. 1–5.
[3] E. J. Lee, M. Langberg, and M. Effros, “Outer bounds and a functional study of the edge removal problem,” in Proc. IEEE Information Theory Workshop, Sevilla, Spain, Sep. 2013, pp. 1–5.
[4] S. U. Kamath, D. N. C. Tse, and V. Anantharam, “Generalized network sharing outer bound and the two-unicast problem,” in Proc. International Symposium on Network Coding (NetCod), Beijing, China, Jul. 2011.
[5] R. W. Yeung, “A framework for linear information inequalities,” IEEE Trans. Inf. Theory, vol. 43, no. 6, pp. 1924–1934, Nov. 1997.
[6] M. Langberg and M. Effros, “Network coding: Is zero error always possible?” in Proc. Forty-Ninth Annual Allerton Conference, Monticello, IL, Sep. 2011, pp. 1–8.
[7] T. H. Chan and A. Grant, “Network coding capacity regions via entropy functions,” IEEE Trans. Inf. Theory, vol. 60, no. 9, pp. 5347–5374, Sep. 2014.
[8] M. F. Wong, M. Langberg, and M. Effros, “On a capacity equivalence between network and index coding and the edge removal problem,” in 2013 IEEE International Symposium on Information Theory, Jul. 2013, pp. 972–976.