Error Exponents of Typical Random Trellis Codes

Neri Merhav

arXiv:1903.01120·cs.IT·March 5, 2019


TL;DR

This paper derives error exponents for typical random trellis codes over discrete memoryless channels, providing formulas that help identify good codes and error events, and extends results to channels with memory and mismatch.

Contribution

It introduces a Csiszar-style and Gallager-style error exponent analysis for typical random trellis codes, extending previous work on block codes to structured trellis codes.

Findings

01. Derived a Csiszár-style error exponent formula for trellis codes.

02. Established a Gallager-style error exponent related to the expurgated exponent.

03. Extended analysis to channels with memory and mismatch.

Abstract

In continuation of an earlier work, where error exponents of typical random codes were studied in the context of general block coding, with no underlying structure, here we carry out a parallel study on typical random, time-varying trellis codes for general discrete memoryless channels, focusing on a certain range of low rates. By analyzing an upper bound to the error probability of the typical random trellis code, using the method of types, we first derive a Csiszár-style error exponent formula (with respect to the constraint length), which allows one to easily identify and characterize properties of good codes and dominant error events. We also derive a Gallager-style form of this error exponent, which turns out to be related to the expurgated error exponent. The main result is further extended to channels with memory and mismatch.

Equations

$$\mbox{\boldmath$x$}_{t}=f_{t}(\mbox{\boldmath$u$}_{t},\mbox{\boldmath$u$}_{t-1},\ldots,\mbox{\boldmath$u$}_{t-k+1}),\qquad t=1,2,\ldots$$

$$\mbox{Pr}\{Y_{1}=y_{1},Y_{2}=y_{2},\ldots,Y_{r}=y_{r}\,|\,X_{1}=x_{1},X_{2}=x_{2},\ldots,X_{r}=x_{r}\}=\prod_{t=1}^{r}W(y_{t}|x_{t}).$$

$$f_{t}(\mbox{\boldmath$u$}_{t},\ldots,\mbox{\boldmath$u$}_{t-k+1})=\mbox{\boldmath$x$}_{0,t}\oplus\sum_{j=0}^{k-1}\mbox{\boldmath$u$}_{t-j}G_{j}(t),$$

$${\cal E}_{\mbox{\tiny rtc}}(R,Q)=\liminf_{K\to\infty}\left\{-\frac{\log\mbox{\boldmath$E$}P_{\mbox{\tiny e}}}{K}\right\},$$

$${\cal E}_{\mbox{\tiny rtc}}(R,Q)\geq E_{\mbox{\tiny rtc}}(R,Q)\stackrel{\Delta}{=}\left\{\begin{array}{ll}R_{0}(Q)/R&R<R_{0}(Q)\\ E_{0}(\rho_{\mbox{\tiny rtc}}(R),Q)/R&R>R_{0}(Q)\end{array}\right.$$

$$E_{0}(\rho,Q)=-\log\sum_{y}\left[\sum_{x}Q(x)W(y|x)^{1/(1+\rho)}\right]^{1+\rho},$$

$$E_{\mbox{\tiny cex}}(R,Q)=\frac{E_{\mbox{\tiny x}}(\rho_{\mbox{\tiny cex}}(R),Q)}{R},$$

$$E_{\mbox{\tiny x}}(\rho,Q)=-\rho\log\sum_{x,x^{\prime}}Q(x)Q(x^{\prime})\left(\sum_{y}\sqrt{W(y|x)W(y|x^{\prime})}\right)^{1/\rho}.$$

$$\left(\frac{2L}{1-2^{-\epsilon/\rho R}}\right)^{\rho}\cdot\exp\{-KE_{\mbox{\tiny cex}}(\rho,Q)\},$$

$${\cal E}_{\mbox{\tiny trtc}}(R,Q)\stackrel{\Delta}{=}\liminf_{K\to\infty}\left\{-\frac{\mbox{\boldmath$E$}\log P_{\mbox{\tiny e}}}{K}\right\},$$

$${\cal E}_{\mbox{\tiny trtc}}(R,Q)\geq E_{\mbox{\tiny trtc}}(R,Q)\stackrel{\Delta}{=}\frac{E_{\mbox{\tiny x}}(\rho_{\mbox{\tiny trtc}}(R),Q)}{R},$$

$$R=\frac{E_{\mbox{\tiny x}}(\rho,Q)}{2\rho-1}.$$

$${\cal E}_{\mbox{\tiny trcc}}(R,Q)\geq E_{\mbox{\tiny cex}}(R,Q).$$

$$\mbox{\boldmath$E$}\log P_{\mbox{\tiny e}}({\cal C}_{k})$$

$$\mbox{\boldmath$v$}_{j},\mbox{\boldmath$v$}_{j+1},\ldots,\mbox{\boldmath$v$}_{j+\ell},\mbox{\boldmath$u$}_{j+\ell+1},\mbox{\boldmath$u$}_{j+\ell+2},\ldots,\mbox{\boldmath$u$}_{j+\ell+k-1},$$

$$P_{\mbox{\tiny e}}({\cal C}_{k})\leq\sum_{\ell\geq 1}\frac{1}{2^{m\ell}}\sum_{\mbox{\boldmath$x$}\in{\cal X}^{k+\ell}}\sum_{\mbox{\boldmath$x$}^{\prime}\in{\cal X}^{k+\ell}}\mbox{Pr}\left\{W(\mbox{\boldmath$y$}|\mbox{\boldmath$x$}^{\prime})\geq W(\mbox{\boldmath$y$}|\mbox{\boldmath$x$})\right\},$$

$$\mbox{Pr}\left\{W(\mbox{\boldmath$y$}|\mbox{\boldmath$x$}^{\prime})\geq W(\mbox{\boldmath$y$}|\mbox{\boldmath$x$})\right\}$$

$$d_{s}(x,x^{\prime})=-\log_{2}\left[\sum_{y}W^{1-s}(y|x)W^{s}(y|x^{\prime})\right],$$

$$P_{\mbox{\tiny e}}({\cal C}_{k})\leq\sum_{\ell\geq 1}2^{-m\ell}\sum_{\hat{P}_{XX^{\prime}}}N_{\ell}(\hat{P}_{XX^{\prime}})\cdot\exp\{-n(k+\ell)\Delta(\hat{P}_{XX^{\prime}})\},$$

$$D(\hat{P}_{XX^{\prime}}\|Q\times Q)=\sum_{x,x^{\prime}\in{\cal X}}\hat{P}_{XX^{\prime}}(x,x^{\prime})\log_{2}\frac{\hat{P}_{XX^{\prime}}(x,x^{\prime})}{Q(x)Q(x^{\prime})}.$$

$$\mbox{\boldmath$E$}\{N_{\ell}(\hat{P}_{XX^{\prime}})\}$$

$$\mbox{Pr}\{N_{\ell}(\hat{P}_{XX^{\prime}})\geq 1\}\leq\mbox{\boldmath$E$}\{N_{\ell}(\hat{P}_{XX^{\prime}})\}<(2^{m}-1)\cdot 2^{-n(k+\ell)\epsilon},$$

$$\mbox{Pr}\left\{N_{\ell}(\hat{P}_{XX^{\prime}})>2^{n(k+\ell)\epsilon}\cdot\mbox{\boldmath$E$}\{N_{\ell}(\hat{P}_{XX^{\prime}})\}\right\}\leq 2^{-n(k+\ell)\epsilon}<(2^{m}-1)\cdot 2^{-n(k+\ell)\epsilon}.$$

$$\mbox{Pr}\{{\cal T}_{k}^{\mbox{\tiny c}}\}$$

$$\frac{J^{2}\log(n\ell+1)}{n\ell}\leq\frac{J^{2}\log[n(k+1)+1]}{n(k+1)}\leq\frac{\epsilon}{2},$$

$${\cal S}_{\ell}^{\prime}$$

$$P_{\mbox{\tiny e}}({\cal C}_{k})$$

$${\cal S}_{\ell,i}={\cal S}_{\ell}\cap\left\{\hat{P}_{XX^{\prime}}\in{\cal P}^{n(k+\ell)}:~R_{i-1}\leq D(\hat{P}_{XX^{\prime}}\|Q\times Q)<R_{i}\right\},\qquad R_{i}=i\epsilon,\quad i=1,2,\ldots,\lceil 2R/\epsilon\rceil$$

$$\ell\geq\frac{k(R_{i-1}-\epsilon)}{2R-R_{i-1}+\epsilon}\stackrel{\Delta}{=}k\theta(R_{i-1}).$$

$$P_{\mbox{\tiny e}}({\cal C}_{k})$$


Full text

Error Exponents of Typical Random Trellis Codes

This research was supported by the Israel Science Foundation (ISF), grant no. 137/18.

Neri Merhav


Index Terms: trellis codes, convolutional codes, typical error exponent, constraint length, expurgated bound, mismatch, channels with memory.

The Andrew & Erna Viterbi Faculty of Electrical Engineering

Technion - Israel Institute of Technology

Technion City, Haifa 32000, ISRAEL

E–mail: [email protected]

1 Introduction

Following the work of Barg and Forney [1], Nazari [13] and Nazari et al. [14], in a recent work [11], the error exponent of the typical random block code for a general discrete memoryless channel (DMC) was studied. The error exponent of the typical random code (TRC) was defined as the long–block limit of the negative normalized expectation of the logarithm of the error probability, as opposed to the classical random coding exponent, defined as the negative normalized logarithm of the expectation of the error probability. The investigation of error exponents for TRCs was motivated in [11, Introduction] by a few points: (i) Owing to Jensen’s inequality, it cannot be smaller than the random coding error exponent, and so, it is a more optimistic performance measure than the ordinary random coding exponent, especially at low rates. (ii) Given that a certain measure concentration property holds, it is more relevant as a performance metric, since the code is normally assumed to be randomly selected just once, and then used repeatedly. (iii) It captures correctly the behavior of random–like codes [2], which are well known to be very good codes.

In [11], an exact single–letter expression was derived for the error exponent function of the TRC assuming a general discrete memoryless channel (DMC) and an ensemble of fixed composition codes. Among other things, it was shown in [11] (similarly as in [1] and [13]), that the TRC error exponent is: (i) the same as the expurgated exponent at zero rate, (ii) below the expurgated exponent, but above the random coding exponent for low positive rates, and (iii) the same as the random coding exponent beyond a certain rate.

In view of the practical importance and the rich literature on trellis codes, and convolutional codes in particular (see, e.g., [3], [7], [8], [9], [10], [15], [16], [17], [18], just to name a few, as well as many references therein), the purpose of this paper is to study the behavior and the performance of typical random trellis codes. More specifically, we aim at an investigation parallel to that of [11], in the realm of ensembles of time–varying trellis codes. The main motivation is to compare the error exponent of the typical random trellis code to that of the typical block code on the basis of similar decoding complexity, in the spirit of the similar comparison in [17, Chap. 5], which was carried out for the ordinary random coding exponents of the two classes of codes. Technically speaking, our main result is that the error exponent of the typical random, time–varying trellis code is lower bounded by a certain expression that is related to the expurgated exponent, and its value lies between those of the convolutional random coding error exponent and the convolutional–coding expurgated exponent functions [16], [17, Sect. 5]. For the subclass of linear trellis codes, namely, time–varying convolutional codes, the result is improved: the typical time–varying convolutional code achieves the convolutional–coding expurgated exponent, provided that the channel is binary–input, output–symmetric (see also [16]). In other words, in the limit of large constraint length, a randomly selected time–varying convolutional code achieves the convolutional expurgated exponent with an overwhelmingly high probability. This is parallel to a similar behavior in the context of ordinary random block codes (without structure), where the error exponent of the typical random code is inferior to the corresponding expurgated exponent, and superior to the random coding error exponent (at low rates), but when it comes to linear random codes, the typical–code error exponent coincides with the expurgated exponent.

These results both sharpen and generalize some earlier statements on the fraction of time–varying (or periodically time–varying) convolutional codes with certain properties (see, for example, [9, Lemma 3.33, Lemma 4.15]), and in particular, the fact that (at least) half of the convolutional codes achieve the convolutional coding exponent [16, Theorem]. Beyond this, our contributions are in several aspects.

1. Our analysis provides a fairly clear insight into the behavior of the typical codes, i.e., their free distances and their distance enumerators.

2. Thanks to the use of the method of types, we are able to characterize the dominant error events, that is, typical lengths of error bursts and joint types of incorrect trellis paths together with the correct path, which are even more informative than distances.

3. Our analysis is considerably general: we address general trellis codes (not merely convolutional codes) with a general random coding distribution (not necessarily the uniform distribution) and a general discrete memoryless channel (DMC), not merely binary–input, output–symmetric channels.

4. We further extend the results in two directions simultaneously, allowing both channels with input memory and mismatch.

The outline of the remaining part of this paper is the following. In Section 2, we establish notation conventions, define the problem setting, provide some background, and spell out the objectives of the paper more formally. In Section 3, we state the main result, and in Section 4 we prove it. Section 5 is devoted to some discussion, and finally, in Section 6, we extend the main result to channels with memory and mismatch.

2 Notation, Problem Setting, Background and Objectives

2.1 Notation

Throughout the paper, random variables will be denoted by capital letters, specific values they may take will be denoted by the corresponding lower case letters, and their alphabets will be denoted by calligraphic letters. Random vectors and their realizations will be denoted, respectively, by capital letters and the corresponding lower case letters, both in the bold face font. Their alphabets will be superscripted by their dimensions. For example, the random vector $\mbox{\boldmath$ X $}=(X_{1},\ldots,X_{r})$ ($r$ being a positive integer) may take a specific vector value $\mbox{\boldmath$ x $}=(x_{1},\ldots,x_{r})$ in ${\cal X}^{r}$, the $r$–th order Cartesian power of ${\cal X}$, which is the alphabet of each component of this vector. The probability of an event ${\cal E}$ will be denoted by $\mbox{Pr}\{{\cal E}\}$, and the expectation operator will be denoted by $\mbox{\boldmath$ E $}\{\cdot\}$. For two positive sequences $\{a_{k}\}$ and $\{b_{k}\}$, the notation $a_{k}\stackrel{{\scriptstyle\cdot}}{{=}}b_{k}$ will stand for equality in the exponential scale, that is, $\lim_{k\to\infty}\frac{1}{k}\log\frac{a_{k}}{b_{k}}=0$. Similarly, $a_{k}\stackrel{{\scriptstyle\cdot}}{{\leq}}b_{k}$ means that $\limsup_{k\to\infty}\frac{1}{k}\log\frac{a_{k}}{b_{k}}\leq 0$, and so on. The indicator function of an event ${\cal E}$ will be denoted by ${\cal I}\{{\cal E}\}$.

The empirical distribution of a string of symbols in a finite alphabet ${\cal X}$ , denoted by $\hat{P}_{X}$ , is the vector of relative frequencies $\hat{P}_{X}(x)$ of each symbol $x\in{\cal X}$ along the string. Here $X$ denotes an auxiliary random variable (RV) distributed according to this distribution. Information measures associated with empirical distributions will be denoted with ‘hats’. For example, the entropy associated with the empirical distribution $\hat{P}_{X}$ , namely, the empirical entropy, will be denoted by $\hat{H}(X)$ . Similar conventions will apply to the joint empirical distribution, the joint type class, the conditional empirical distributions and the conditional type classes associated with pairs (and multiples) of sequences of length $r$ . Accordingly, $\hat{P}_{XX^{\prime}}$ will be the joint empirical distribution associated with a pair of strings of the same length, $\hat{H}(X,X^{\prime})$ will designate the empirical joint entropy, and $\hat{H}(X|X^{\prime})$ will be the empirical conditional entropy.

2.2 Problem Setting

Consider the system configuration depicted in Fig. 1. Let the information source, $U_{1},U_{2},\ldots$ , be the binary symmetric source (BSS), i.e., an infinite sequence of binary random variables taking on values in ${\cal U}=\{0,1\}$ , independently of each other, and with equal probabilities for ‘0’ and ‘1’. We shall group the bits of this information source in blocks of length $m$ , and denote $\mbox{\boldmath$ U $}_{t}=(U_{m(t-1)+1},U_{m(t-1)+2},\ldots,U_{mt})$ , $\mbox{\boldmath$ U $}_{t}\in{\cal U}^{m}$ , $t=1,2,\ldots$ .

A time–varying trellis code of rate $R=m/n$ and with memory size $k$ is a sequence of functions $f_{1},f_{2},\ldots$, $f_{t}:{\cal U}^{mk}\to{\cal X}^{n}$, $t=1,2,\ldots$, where ${\cal X}$ is the finite channel input alphabet of size $J$. When fed with an input information sequence, $\mbox{\boldmath$ u $}_{1},\mbox{\boldmath$ u $}_{2},\ldots$, which is a realization of $\mbox{\boldmath$ U $}_{1},\mbox{\boldmath$ U $}_{2},\ldots$, the time–varying trellis code outputs a code sequence, $\mbox{\boldmath$ x $}_{1},\mbox{\boldmath$ x $}_{2},\ldots$, according to

$$\mbox{\boldmath$x$}_{t}=f_{t}(\mbox{\boldmath$u$}_{t},\mbox{\boldmath$u$}_{t-1},\ldots,\mbox{\boldmath$u$}_{t-k+1}),\qquad t=1,2,\ldots$$

The product $mk$ designates the constraint length of the trellis code, and it will henceforth be denoted by $K$ . As is well known, a trellis code is a special case of a finite–state encoder whose total number of states is $2^{K}$ . On the other hand, a convolutional code is a special case of a trellis code where $\{f_{t}\}$ are linear functions over the relevant field.
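The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `make_random_trellis_encoder`, `q_weights` and `seed` are hypothetical, the encoder is assumed to start in the all-zero state, each table entry of $f_{t}$ is drawn i.i.d. from a distribution $Q$ only when it is first needed, and $k-1$ all-zero blocks are appended to reset the state.

```python
import random


def make_random_trellis_encoder(m, n, k, alphabet, q_weights, seed=0):
    """Draw one time-varying trellis code and return its encoder.

    Each value f_t(u_t, ..., u_{t-k+1}) consists of n symbols drawn i.i.d.
    from Q. Entries are drawn lazily and memoized, so repeated calls to the
    returned encoder always see the same code.
    """
    rng = random.Random(seed)
    table = {}  # (t, window) -> tuple of n channel symbols
    zero = (0,) * m

    def f(t, window):
        if (t, window) not in table:
            table[(t, window)] = tuple(rng.choices(alphabet, weights=q_weights, k=n))
        return table[(t, window)]

    def encode(bits):
        if len(bits) % m != 0:
            raise ValueError("number of input bits must be a multiple of m")
        blocks = [tuple(bits[i:i + m]) for i in range(0, len(bits), m)]
        blocks += [zero] * (k - 1)  # k-1 all-zero blocks reset the encoder state
        out = []
        for t in range(len(blocks)):
            # window = (u_t, u_{t-1}, ..., u_{t-k+1}); inputs before the start
            # are taken to be zero (an assumption of this sketch)
            window = tuple(blocks[t - j] if t - j >= 0 else zero for j in range(k))
            out.extend(f(t, window))
        return out

    return encode


encode = make_random_trellis_encoder(m=1, n=2, k=3, alphabet=[0, 1],
                                     q_weights=[0.5, 0.5], seed=1)
codeword = encode([1, 0, 1, 1])
print(len(codeword))  # n * (number of input blocks + k - 1)
```

Two input sequences whose last $k$ blocks agree at some time $t$ share the same window there, and hence the same output branch, which is how re-merging of trellis paths shows up in this table-based view.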

A discrete memoryless channel (DMC) $W$ is defined by a set of single–letter conditional probabilities (or probability density functions), $\{W(y|x),~x\in{\cal X},~y\in{\cal Y}\}$, where ${\cal X}$ is as before and ${\cal Y}$ is the channel output alphabet, which may be discrete or continuous. (Throughout the sequel, we will treat ${\cal Y}$ as a discrete alphabet, with the understanding that in the continuous case, all summations over ${\cal Y}$ should be replaced by integrals.) When the channel is fed by a sequence, $x_{1},x_{2},\ldots$, $x_{t}\in{\cal X}$, $t=1,2,\ldots$ (a realization of a random process, $X_{1},X_{2},\ldots$), it responds by generating a corresponding output sequence, $y_{1},y_{2},\ldots$, $y_{t}\in{\cal Y}$, $t=1,2,\ldots$ (a realization of a random process, $Y_{1},Y_{2},\ldots$), according to

$$\mbox{Pr}\{Y_{1}=y_{1},Y_{2}=y_{2},\ldots,Y_{r}=y_{r}\,|\,X_{1}=x_{1},X_{2}=x_{2},\ldots,X_{r}=x_{r}\}=\prod_{t=1}^{r}W(y_{t}|x_{t}).$$

As customary, we assume that the trellis code is decoded in long blocks using the maximum–likelihood (ML) decoder, which is implementable by the Viterbi algorithm, and by terminating each block with $m(k-1)$ zero input bits in order to reset the state of the encoder. As mentioned earlier, we also extend the results to channels with input memory (inter-symbol interference) along with mismatched decoding metrics, which are still implementable by the Viterbi Algorithm.

We consider the ensemble of time–varying trellis codes where for every $t=1,2,\ldots$ and every possible value of $(\mbox{\boldmath$ u $}_{t},\mbox{\boldmath$ u $}_{t-1},\ldots,\mbox{\boldmath$ u $}_{t-k+1})\in{\cal U}^{K}$, the value of $f_{t}(\mbox{\boldmath$ u $}_{t},\mbox{\boldmath$ u $}_{t-1},\ldots,\mbox{\boldmath$ u $}_{t-k+1})\in{\cal X}^{n}$ is selected independently at random under the i.i.d. distribution $Q^{n}$, namely, each one of the $n$ components of $f_{t}(\mbox{\boldmath$ u $}_{t},\mbox{\boldmath$ u $}_{t-1},\ldots,\mbox{\boldmath$ u $}_{t-k+1})\in{\cal X}^{n}$ is randomly drawn independently under a fixed distribution $Q$ over ${\cal X}$. For the case of time–varying convolutional codes, the symbols $\{x_{t}\}$ are assumed binary ($J=2$), and $\{f_{t}\}$ are assumed linear functions over $\mbox{GF}(2)$, namely,

$$f_{t}(\mbox{\boldmath$u$}_{t},\ldots,\mbox{\boldmath$u$}_{t-k+1})=\mbox{\boldmath$x$}_{0,t}\oplus\sum_{j=0}^{k-1}\mbox{\boldmath$u$}_{t-j}G_{j}(t),$$

where $\{\mbox{\boldmath$ u $}_{t-j}\}$ are considered row–vectors of dimension $m$, $\{\mbox{\boldmath$ x $}_{0,t}\}$ are binary vectors of dimension $n$, $\{G_{j}(t)\}$ are binary $m\times n$ matrices, the operations $\oplus$ and $\sum$ both designate summations modulo 2, and the channel is assumed binary–input, output–symmetric. The entries of $\{\mbox{\boldmath$ x $}_{0,t}\}$ and $\{G_{j}(t)\}$ are randomly and independently selected with equal probabilities of $0$ and $1$.
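For the convolutional ensemble just described, a minimal Python sketch of the encoder may help fix ideas. It assumes, for illustration only, that the generator matrices $G_{j}(t)$ and offsets $\mbox{\boldmath$ x $}_{0,t}$ are drawn lazily from a seeded generator and that inputs before the first block are zero; the function name `make_random_conv_encoder` is hypothetical.

```python
import random


def make_random_conv_encoder(m, n, k, seed=0):
    """Draw one time-varying convolutional code over GF(2); return its encoder."""
    rng = random.Random(seed)
    G = {}   # (t, j) -> m x n binary matrix G_j(t)
    x0 = {}  # t -> binary n-vector x_{0,t}

    def gen_matrix(t, j):
        if (t, j) not in G:
            G[(t, j)] = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]
        return G[(t, j)]

    def offset(t):
        if t not in x0:
            x0[t] = [rng.randint(0, 1) for _ in range(n)]
        return x0[t]

    def encode(bits):
        if len(bits) % m != 0:
            raise ValueError("number of input bits must be a multiple of m")
        blocks = [list(bits[i:i + m]) for i in range(0, len(bits), m)]
        blocks += [[0] * m for _ in range(k - 1)]  # zero tail resets the state
        out = []
        for t in range(len(blocks)):
            x = list(offset(t))  # start from x_{0,t}
            for j in range(min(k, t + 1)):  # blocks before the start are zero
                u, Gj = blocks[t - j], gen_matrix(t, j)
                for row in range(m):
                    if u[row]:
                        for col in range(n):
                            x[col] ^= Gj[row][col]  # addition modulo 2
            out.extend(x)
        return out

    return encode


encode = make_random_conv_encoder(m=1, n=2, k=3, seed=0)
print(encode([1, 0, 1, 1]))
```

Because each $f_{t}$ is affine over GF(2), the codewords of inputs $a$, $b$, $a\oplus b$ and the all-zero input satisfy $\mbox{enc}(a)\oplus\mbox{enc}(b)\oplus\mbox{enc}(0)=\mbox{enc}(a\oplus b)$, which is a convenient sanity check for any implementation.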

2.3 Background

The traditional ensemble performance metric is the exponential decay rate (as a function of $K$ ) of the expectation of the first–error event probability, or the per–node error probability [17, p. 243], as well as the related bit error probability,

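$${\cal E}_{\mbox{\tiny rtc}}(R,Q)=\liminf_{K\to\infty}\left\{-\frac{\log\mbox{\boldmath$E$}P_{\mbox{\tiny e}}}{K}\right\},$$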

where the subscript “rtc” stands for “random trellis code” and accordingly, the expectation is w.r.t. the randomness of the time–varying trellis code; see, e.g., [17, Chap. 5]. As shown in [17, Sect. 5.1], the result for random time–varying convolutional codes, which easily extends to random time–varying trellis codes, is that this error exponent is essentially (the actual exponent is slightly smaller, but by an amount $\epsilon$ that can be made arbitrarily small; here and in the sequel, we will ignore this very small loss) given by

$${\cal E}_{\mbox{\tiny rtc}}(R,Q)\geq E_{\mbox{\tiny rtc}}(R,Q)\stackrel{\Delta}{=}\left\{\begin{array}{ll}R_{0}(Q)/R&R<R_{0}(Q)\\ E_{0}(\rho_{\mbox{\tiny rtc}}(R),Q)/R&R>R_{0}(Q)\end{array}\right.$$

where $\rho_{\mbox{\tiny rtc}}(R)$ is the solution $\rho$ to the equation $R=E_{0}(\rho,Q)/\rho$, $E_{0}(\rho,Q)$ being the Gallager function,

$$E_{0}(\rho,Q)=-\log\sum_{y}\left[\sum_{x}Q(x)W(y|x)^{1/(1+\rho)}\right]^{1+\rho},$$

and $R_{0}(Q)=E_{0}(1,Q)$. The best result is obtained, of course, upon maximizing over $Q$, in which case, for $R>R_{0}=\max_{Q}R_{0}(Q)$, the resulting error exponent is the best achievable error exponent, as it meets the converse bound of [17, Theorem 5.4.1]. (Although the converse bound in [17, Sect. 5.4] is proved with convolutional codes in mind, the linearity of convolutional codes is not really used there, and so the very same proof applies also to non–linear trellis codes.) It follows then that there is room for improvement only for rates below $R_{0}$. Indeed, an improvement in this range is accomplished, for binary–input, output–symmetric channels [17, p. 86], by an expurgated bound, derived in [16], [17, Sect. 5.3], and given by

$$E_{\mbox{\tiny cex}}(R,Q)=\frac{E_{\mbox{\tiny x}}(\rho_{\mbox{\tiny cex}}(R),Q)}{R},$$

where $\rho_{\mbox{\tiny cex}}(R)$ is the solution $\rho\geq 1$ to the equation $R=E_{\mbox{\tiny x}}(\rho,Q)/\rho$ , with $E_{\mbox{\tiny x}}(\rho,Q)$ being defined as

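$$E_{\mbox{\tiny x}}(\rho,Q)=-\rho\log\sum_{x,x^{\prime}}Q(x)Q(x^{\prime})\left(\sum_{y}\sqrt{W(y|x)W(y|x^{\prime})}\right)^{1/\rho}.$$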
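The two functions of this subsection are easy to evaluate numerically. The sketch below is a plain-Python illustration under two assumptions that are not stated explicitly in the text: logarithms are taken to base 2, and $E_{\mbox{\tiny x}}$ is computed in its standard Gallager form, with the Bhattacharyya sum $\sum_{y}\sqrt{W(y|x)W(y|x^{\prime})}$ inside the power $1/\rho$. The channel is passed as a hypothetical nested list `W[x][y]` holding $W(y|x)$.

```python
import math


def gallager_e0(rho, Q, W):
    """E_0(rho, Q) in bits, with W[x][y] = W(y|x)."""
    total = 0.0
    for y in range(len(W[0])):
        inner = sum(Q[x] * W[x][y] ** (1.0 / (1.0 + rho)) for x in range(len(Q)))
        total += inner ** (1.0 + rho)
    return -math.log2(total)


def expurgated_ex(rho, Q, W):
    """E_x(rho, Q) in bits (standard Gallager form, Bhattacharyya sum inside)."""
    total = 0.0
    for x in range(len(Q)):
        for xp in range(len(Q)):
            bhatt = sum(math.sqrt(W[x][y] * W[xp][y]) for y in range(len(W[0])))
            total += Q[x] * Q[xp] * bhatt ** (1.0 / rho)
    return -rho * math.log2(total)


# Binary symmetric channel with crossover probability p and uniform Q.
p = 0.1
W_bsc = [[1.0 - p, p], [p, 1.0 - p]]
Q_uniform = [0.5, 0.5]

r0 = gallager_e0(1.0, Q_uniform, W_bsc)  # R_0(Q) = E_0(1, Q)
print(r0, expurgated_ex(1.0, Q_uniform, W_bsc))
```

For the binary symmetric channel the two printed values coincide and equal $1-\log_{2}(1+2\sqrt{p(1-p)})$, reflecting the identity $E_{\mbox{\tiny x}}(1,Q)=E_{0}(1,Q)=R_{0}(Q)$.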

More precisely, in [16] the main theorem asserts that for at least half of the rate– $1/n$ time–varying convolutional codes, the probability of error does not exceed

$$\left(\frac{2L}{1-2^{-\epsilon/\rho R}}\right)^{\rho}\cdot\exp\{-KE_{\mbox{\tiny cex}}(\rho,Q)\},\tag{9}$$

where $Q$ is the binary symmetric source (which in our notation, means the uniform distribution over the binary alphabet ${\cal X}$), $L$ is the block length, $\epsilon=E_{\mbox{\tiny x}}(\rho,Q)-\rho R>0$ is an arbitrarily small positive real, and $\rho\geq 1$ is any number that satisfies $R=[E_{\mbox{\tiny x}}(\rho,Q)-\epsilon]/\rho<R_{0}$. It is clear from the proof of this theorem that choosing to refer to exactly half of the codes is quite arbitrary, and a similar bound, with the same exponential rate (assuming that $L$ is sub–exponential in $K$), would apply to any, arbitrarily large, fraction of the codes, at the expense of increasing the pre–exponential factor of (9) accordingly. For example, if the factor $2L$ at the numerator of the pre–exponent of (9) is replaced by $100L$, then the bound would apply to at least 99% of the time–varying convolutional codes with block length $L$, and so on. This indicates that the ensemble of convolutional codes obeys a measure concentration property concerning their error exponent. (As mentioned in the Introduction, several assertions in the same spirit can be found also in [9]; see, for example, Lemmas 3.33 and 4.15 therein.)

2.4 Objectives

The purpose of this work is to study the above–mentioned measure concentration property in a systematic manner and to broaden the scope in several directions at the same time, as will be specified shortly. In this context, similarly as in [11], we refer to the error exponent of the typical random trellis code, and as discussed in [11, Introduction], if the ensemble of codes possesses the relevant measure concentration property associated with exponential error bounds, then the error exponent of the typical random trellis code is captured by the quantity

$${\cal E}_{\mbox{\tiny trtc}}(R,Q)\stackrel{\Delta}{=}\liminf_{K\to\infty}\left\{-\frac{\mbox{\boldmath$E$}\log P_{\mbox{\tiny e}}}{K}\right\},$$

which is similar to the above definition of ${\cal E}_{\mbox{\tiny rtc}}(R,Q)$, except that the expectation operator and the logarithmic function are commuted. It will be understood that the limit of $K=mk\to\infty$ will be taken under the regime where $m$ and $n$ (and hence also $R=m/n$) are held fixed whereas $k\to\infty$. A similar definition will apply to the smaller ensemble of time–varying convolutional codes and it will be denoted by ${\cal E}_{\mbox{\tiny trcc}}(R,Q)$, where the subscript stands for typical random convolutional code.
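The effect of commuting the expectation and the logarithm can be seen on a toy example. The following Python sketch uses a made-up list of per-code error probabilities (an assumption for illustration, not data from the paper): most codes are good, a few are poor, and the poor ones dominate $\mbox{\boldmath$ E $}P_{\mbox{\tiny e}}$ but barely affect $\mbox{\boldmath$ E $}\log P_{\mbox{\tiny e}}$.

```python
import math


def ensemble_exponents(error_probs, K):
    """Return (random-coding style, typical-code style) exponents in bits.

    The first is -log2(mean of P_e) / K, the second is -mean(log2 P_e) / K.
    By Jensen's inequality the second is never smaller than the first.
    """
    mean_pe = sum(error_probs) / len(error_probs)
    mean_log_pe = sum(math.log2(p) for p in error_probs) / len(error_probs)
    return -math.log2(mean_pe) / K, -mean_log_pe / K


K = 20
# Hypothetical ensemble: 99 codes with P_e = 2^-20 and one poor code with P_e = 2^-4.
toy_probs = [2.0 ** -20] * 99 + [2.0 ** -4]
rc_exponent, typical_exponent = ensemble_exponents(toy_probs, K)
print(rc_exponent, typical_exponent)
```

In this toy ensemble the single poor code pulls the first exponent down to roughly half of the second one, which is the sense in which the typical-code exponent is the more optimistic and, under concentration, the more relevant figure of merit.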

3 Main Result

Our main theorem has two parts, where the second part actually follows directly from [16] (as discussed in Subsection 2.3) and is included here for completeness.

Theorem 1

Consider the problem setting defined in Subsection 2.2. Then, for $R<R_{0}(Q)$ ,

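(a)

$${\cal E}_{\mbox{\tiny trtc}}(R,Q)\geq E_{\mbox{\tiny trtc}}(R,Q)\stackrel{\Delta}{=}\frac{E_{\mbox{\tiny x}}(\rho_{\mbox{\tiny trtc}}(R),Q)}{R},$$

where $\rho_{\mbox{\tiny trtc}}(R)$ is the solution, $\rho\geq 1$, to the equation

$$R=\frac{E_{\mbox{\tiny x}}(\rho,Q)}{2\rho-1}.$$

(b) For the ensemble of time–varying convolutional codes and the binary–input, output–symmetric channel (with $Q(0)=Q(1)=\frac{1}{2}$),

$${\cal E}_{\mbox{\tiny trcc}}(R,Q)\geq E_{\mbox{\tiny cex}}(R,Q).$$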
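To get a feel for the three exponents involved, the sketch below evaluates them for a binary symmetric channel with uniform $Q$. It is an illustration under assumptions: logarithms are base 2, $E_{\mbox{\tiny x}}$ is taken in its standard form, which for this channel reduces to $-\rho\log_{2}[(1+z^{1/\rho})/2]$ with $z=2\sqrt{p(1-p)}$, and the equations $R=E_{\mbox{\tiny x}}(\rho,Q)/\rho$ and $R=E_{\mbox{\tiny x}}(\rho,Q)/(2\rho-1)$ are solved by bisection. The helper names are hypothetical.

```python
import math


def ex_bsc(rho, p):
    """E_x(rho, Q) in bits for a BSC(p) with uniform Q."""
    z = 2.0 * math.sqrt(p * (1.0 - p))
    return -rho * math.log2(0.5 * (1.0 + z ** (1.0 / rho)))


def solve_rho(R, p, denom):
    """Find rho >= 1 with ex_bsc(rho, p) / denom(rho) = R by bisection.

    Both ratios used below decrease in rho, starting from R_0 at rho = 1.
    """
    def gap(rho):
        return ex_bsc(rho, p) / denom(rho) - R

    if gap(1.0) <= 0.0:
        raise ValueError("this sketch needs 0 < R < R_0")
    lo, hi = 1.0, 2.0
    while gap(hi) > 0.0:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p, R = 0.1, 0.2
rho_cex = solve_rho(R, p, lambda rho: rho)               # R = E_x / rho
rho_trtc = solve_rho(R, p, lambda rho: 2.0 * rho - 1.0)  # R = E_x / (2 rho - 1)

e_rtc = ex_bsc(1.0, p) / R        # R_0 / R, the random coding exponent for R < R_0
e_trtc = ex_bsc(rho_trtc, p) / R  # part (a) of the theorem
e_cex = ex_bsc(rho_cex, p) / R    # part (b), the convolutional expurgated exponent
print(e_rtc, e_trtc, e_cex)
```

For these parameters the three values come out in increasing order, consistent with the statement that the typical trellis code exponent lies between the convolutional random coding exponent and the convolutional expurgated exponent.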

We emphasize that here the setup is considerably extended relative to that of [16], especially in part (a). This extension takes place in several dimensions at the same time:

1. Allowing general rational coding rates, $R=m/n$, rather than $R=1/n$.

2. Using ensembles with a general random coding distribution $Q$, instead of just the uniform distribution. In this case, assertions about fractions of codes with certain properties are replaced by parallel assertions concerning (high) probabilities of possessing these properties.

3. Assuming a general DMC, not necessarily a binary–input, output–symmetric channel.

4. As was mentioned already, we are referring to general trellis codes, as an extension to convolutional codes, which are linear.

5. A further extension is for mismatched decoding for a channel with input memory.

Furthermore, our analysis, which is strongly based on the method of types, will provide some insights on the character of two ingredients of interest:

1. Structure and distance enumeration (or more generally, type class enumeration) of the typical random trellis code that achieves the convolutional coding expurgated exponent.

2. Error events that dominate the error probability: joint types of decoded trellis paths and the correct paths, along with the lengths of the typical error bursts.

These points, among others, will be discussed in more detail in Section 5.

4 Proof of Theorem 1

Here we prove part (a) only, because part (b) can be obtained in a very similar manner by a small modification in a few places. Also, as discussed in Subsection 2.3, part (b) was actually proved already in [16] (at least for rate– $1/n$ codes, but the extension to $m/n$ –codes is not difficult).

Clearly, in order to derive a bound on ${\cal E}_{\mbox{\tiny trtc}}(R,Q)$ , we have to assess $\mbox{\boldmath$ E $}\log P_{\mbox{\tiny e}}({\cal C}_{k})$ , where ${\cal C}_{k}$ designates a randomly selected trellis code with memory $k$ (and constraint length $K=mk$ ) in the ensemble described in Subsection 2.2. Our first observation is the following: suppose we can define, for every $k\geq 1$ , a subset ${\cal T}_{k}$ of codes $\{{\cal C}_{k}\}$ whose probability, $1-\epsilon_{k}\stackrel{{\scriptstyle\Delta}}{{=}}\mbox{Pr}\{{\cal T}_{k}\}$ , tends to unity as $k\to\infty$ . Then,

[TABLE]

Thus, if we can define a subset of codes ${\cal T}_{k}$ , which on the one hand, has very high probability, and on the other hand, there is a uniform upper bound on $P_{\mbox{\tiny e}}({\cal C}_{k})$ for every ${\cal C}_{k}\in{\cal T}_{k}$ , this would yield a lower bound on the error exponent of the typical random trellis code. We will use this simple observation shortly after we define the subset ${\cal T}_{k}$ .
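One way to spell out this observation is sketched below in LaTeX. It is a reconstruction under stated assumptions rather than the paper's own display: $\bar{P}_{k}$ is a hypothetical name for a uniform upper bound on $P_{\mbox{\tiny e}}({\cal C}_{k})$ over ${\cal C}_{k}\in{\cal T}_{k}$, and the only other fact used is $\log P_{\mbox{\tiny e}}({\cal C}_{k})\leq 0$ for every code.

```latex
\begin{align*}
\mathbf{E}\log P_{\mathrm{e}}(\mathcal{C}_k)
  &= \mathbf{E}\bigl[\log P_{\mathrm{e}}(\mathcal{C}_k)\cdot
       \mathcal{I}\{\mathcal{C}_k\in\mathcal{T}_k\}\bigr]
   + \mathbf{E}\bigl[\log P_{\mathrm{e}}(\mathcal{C}_k)\cdot
       \mathcal{I}\{\mathcal{C}_k\notin\mathcal{T}_k\}\bigr] \\
  &\le \Pr\{\mathcal{T}_k\}\cdot\log\bar{P}_k + 0
   = (1-\epsilon_k)\log\bar{P}_k ,
\end{align*}
% so that, dividing by -K and letting k grow (epsilon_k -> 0),
\[
\liminf_{K\to\infty}\left\{-\frac{\mathbf{E}\log P_{\mathrm{e}}(\mathcal{C}_k)}{K}\right\}
  \;\ge\; \liminf_{K\to\infty}\left\{-\frac{\log\bar{P}_k}{K}\right\}.
\]
```

The second term is dropped because the logarithm of a probability is non-positive, which is why only an upper bound on the good codes is needed and nothing has to be known about the atypical ones.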

As mentioned earlier, we are assuming that each transmitted block is terminated by $k-1$ all–zero input vectors (each of dimension $m$ ) in order to reset the state of the shift register of the trellis encoder. Similarly as in linear convolutional codes, here too, every incorrect path $\{\mbox{\boldmath$ v $}_{t}\}$ , diverging from the correct path, $\{\mbox{\boldmath$ u $}_{t}\}$ , at a given node $j$ and re-merging with the correct path exactly after $k+\ell$ branches, must have the form

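$$\mbox{\boldmath$v$}_{j},\mbox{\boldmath$v$}_{j+1},\ldots,\mbox{\boldmath$v$}_{j+\ell},\mbox{\boldmath$u$}_{j+\ell+1},\mbox{\boldmath$u$}_{j+\ell+2},\ldots,\mbox{\boldmath$u$}_{j+\ell+k-1},$$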

where $\mbox{\boldmath$ v $}_{j}$ and $\mbox{\boldmath$ v $}_{j+\ell}$ can be any one of the $2^{m}-1$ incorrect input $m$–vectors at nodes $j$ and $j+\ell$, respectively. Between $j$ and $j+\ell$ there should be no sub-strings of $k-1$ consecutive correct inputs. Thus, overall there are no more than $(2^{m}-1)2^{m\ell}$ such incorrect paths [17, p. 311]. (Note that here, unlike in [17], in part (a) of Theorem 1, we are considering general trellis codes, not convolutional codes, which are linear. Therefore, we cannot assume, without loss of generality, that the all–zero message was sent, but rather average over all input messages. In part (b), on the other hand, this averaging is not needed. This difference causes certain modifications in the analysis, which eventually yield $E_{\mbox{\tiny cex}}(R,Q)$.) Following a line of thought similar to that of the derivations of [17], for a given trellis code ${\cal C}_{k}$, the probability of an error event beginning at any given node is upper bounded by

$$P_{\mbox{\tiny e}}({\cal C}_{k})\leq\sum_{\ell\geq 1}\frac{1}{2^{m\ell}}\sum_{\mbox{\boldmath$x$}\in{\cal X}^{k+\ell}}\sum_{\mbox{\boldmath$x$}^{\prime}\in{\cal X}^{k+\ell}}\mbox{Pr}\left\{W(\mbox{\boldmath$y$}|\mbox{\boldmath$x$}^{\prime})\geq W(\mbox{\boldmath$y$}|\mbox{\boldmath$x$})\right\},$$

where $\mbox{\boldmath$ x $}$ designates the codeword associated with the correct path and $\mbox{\boldmath$ x $}^{\prime}$ stands for any incorrect path diverging from the correct path at node $j$ and re-merging at $j+k+\ell$. Since $\mbox{\boldmath$ x $}$ and $\mbox{\boldmath$ x $}^{\prime}$ may disagree at no more than $n(k+\ell)$ channel uses, the summand is actually the pairwise error probability associated with two vectors of length $n(k+\ell)$, and it depends only on the joint empirical distribution of these two $n(k+\ell)$–vectors, which we denote by $\hat{P}_{XX^{\prime}}$. In particular, by the Chernoff bound, it is readily seen that for a given pair $(\mbox{\boldmath$ x $},\mbox{\boldmath$ x $}^{\prime})$,

[TABLE]

where

[TABLE]

is the Chernoff distance between $\mbox{\boldmath$ x $}$ and $\mbox{\boldmath$ x $}^{\prime}$. It follows then that

[TABLE]

where $N_{\ell}(\hat{P}_{XX^{\prime}})$ is the number of pairs $\{(\mbox{\boldmath$ x $},\mbox{\boldmath$ x $}^{\prime})\}\in{\cal X}^{2n(k+\ell)}$ having joint empirical distribution that is given by $\hat{P}_{XX^{\prime}}$ . Here, the inner summation over $\{\hat{P}_{XX^{\prime}}\}$ is defined over the set ${\cal P}^{n(k+\ell)}$ of all possible empirical distributions of pairs of vectors in ${\cal X}^{n(k+\ell)}$ . For a given joint empirical distribution $\hat{P}_{XX^{\prime}}$ , we denote

[TABLE]

We note that

[TABLE]
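As a sanity check on the path-counting bound quoted earlier in this proof from [17, p. 311], the following minimal Python sketch enumerates, by brute force, the admissible incorrect paths for small parameters. It is an illustration under stated assumptions, not part of the original derivation: each branch is represented by the difference between the incorrect and the correct input $m$–vector (symbol 0 meaning "equal to the correct input"), and the function name `count_incorrect_paths` is hypothetical.

```python
from itertools import product

def count_incorrect_paths(m: int, k: int, ell: int) -> int:
    """Count difference patterns on nodes j..j+ell (length ell+1) such that
    the first and last symbols are incorrect (non-zero) and no run of k-1
    consecutive correct (zero) symbols appears in between.
    Assumption: symbol 0 stands for 'input equals the correct input'."""
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    q = 2 ** m  # number of possible input m-vectors per branch
    count = 0
    for pattern in product(range(q), repeat=ell + 1):
        if pattern[0] == 0 or pattern[-1] == 0:
            continue  # both end nodes must carry incorrect inputs
        run, admissible = 0, True
        for sym in pattern:
            run = run + 1 if sym == 0 else 0
            if run >= k - 1:  # k-1 consecutive correct inputs: earlier re-merge
                admissible = False
                break
        if admissible:
            count += 1
    return count

if __name__ == "__main__":
    for m, k, ell in [(1, 3, 2), (2, 2, 1), (2, 3, 4)]:
        bound = (2 ** m - 1) * 2 ** (m * ell)
        print(m, k, ell, count_incorrect_paths(m, k, ell), "<=", bound)
```

The enumeration is exponential in $\ell$ and is meant only for small cases; its purpose is to make the structure of the admissible paths concrete.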

We now define ${\cal T}_{k}$ as the subset of codes, henceforth referred to as the typical trellis codes, with the following property for a given arbitrarily small $\epsilon>0$ : for every $\ell\geq 1$ and every empirical joint distribution $\hat{P}_{XX^{\prime}}$ derived from $n(k+\ell)$ –vectors:

• $N_{\ell}(\hat{P}_{XX^{\prime}})=0$ whenever $\mbox{\boldmath$ E $}\{N_{\ell}(\hat{P}_{XX^{\prime}})\}<(2^{m}-1)\cdot 2^{-n(k+\ell)\epsilon}$, and

• $N_{\ell}(\hat{P}_{XX^{\prime}})\leq 2^{n(k+\ell)\epsilon}\cdot\mbox{\boldmath$ E $}\{N_{\ell}(\hat{P}_{XX^{\prime}})\}$ whenever $\mbox{\boldmath$ E $}\{N_{\ell}(\hat{P}_{XX^{\prime}})\}\geq(2^{m}-1)\cdot 2^{-n(k+\ell)\epsilon}$.

Obviously, by the Markov inequality, for every $\ell$ and $\hat{P}_{XX^{\prime}}$ in the first category, we have

[TABLE]

and similarly, for $\ell$ and $\hat{P}_{XX^{\prime}}$ in the second category, we have

[TABLE]

It follows by the union bound that

[TABLE]

The sequence $\{\frac{\log(n\ell+1)}{n\ell}\}$ is monotonically decreasing and so, since $\ell\geq k+1$ , we have, for large enough $k$ ,

[TABLE]

and then the last line of (23) cannot exceed the sum of the geometric series, $(2^{m}-1)\cdot 2^{-n(k+1)\epsilon/2}/(1-2^{-n\epsilon/2})$ , which tends to zero as $k\to\infty$ . Thus, $\mbox{Pr}\{{\cal T}_{k}\}$ tends to unity as $k\to\infty$ . Denoting

[TABLE]

it now follows that for every typical trellis code, ${\cal C}_{k}\in{\cal T}_{k}$ ,

[TABLE]

In order to address this summation over ${\cal S}_{\ell}$ , let us partition it as the disjoint union of the subsets

[TABLE]

and observe that for a given $i$ , ${\cal S}_{\ell,i}$ is non–empty only when $2\ell\geq(k+\ell)(R_{i-1}-\epsilon)/R$ , or equivalently,

[TABLE]

Then,

[TABLE]

where we have defined

[TABLE]

Now observe that

[TABLE]

where the commutation of the minimization and the maximization is allowed by convexity–concavity of the objective, and the final minimization over $s$ is achieved by $s=1/2$ due to the convexity and the symmetry of the function $\sum_{x,x^{\prime},y}Q(x)Q(x^{\prime})W^{s}(y|x)W^{1-s}(y|x^{\prime})$ around $s=1/2$ . Thus, the series in the last line of (27) is convergent as long as $R<R_{0}(Q)-2\epsilon$ , and its exponential order as a function of $K$ (ignoring $\epsilon$ –terms) is given by

[TABLE]
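The claim that the minimization over $s$ is achieved at $s=1/2$ can be checked numerically. The sketch below is an illustration only: it evaluates $\sum_{x,x^{\prime},y}Q(x)Q(x^{\prime})W^{s}(y|x)W^{1-s}(y|x^{\prime})$ on a grid of $s$ values for a small, strictly positive channel and computes the cutoff rate as minus the base-2 logarithm of its value at $s=1/2$, which is the standard definition of $R_{0}(Q)$ and is assumed here. The channel, the input distribution, and the function names are arbitrary choices made for the example.

```python
import math

def chernoff_sum(Q, W, s):
    """sum over x, x', y of Q(x) Q(x') W(y|x)^s W(y|x')^(1-s).
    W[x][y] is the channel transition probability (assumed strictly positive)."""
    total = 0.0
    for x, qx in enumerate(Q):
        for xp, qxp in enumerate(Q):
            for y in range(len(W[0])):
                total += qx * qxp * W[x][y] ** s * W[xp][y] ** (1.0 - s)
    return total

def cutoff_rate(Q, W):
    """R_0(Q) in bits, using the value of the sum at s = 1/2."""
    return -math.log2(chernoff_sum(Q, W, 0.5))

if __name__ == "__main__":
    W = [[0.9, 0.1], [0.3, 0.7]]  # arbitrary asymmetric binary channel
    Q = [0.4, 0.6]                # arbitrary input distribution
    for i in range(1, 10):
        s = i / 10
        print(f"s={s:.1f}  sum={chernoff_sum(Q, W, s):.6f}")
    print("R_0(Q) =", cutoff_rate(Q, W))
```

The printed values are symmetric around $s=1/2$ and smallest there, in agreement with the convexity and symmetry argument.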

Thus, we have shown that the typical random trellis code error exponent is lower bounded by

[TABLE]

We next show that this expression is equivalent to the one asserted in part (a) of Theorem 1. First, observe that since $\Delta_{s}(P_{XX^{\prime}})$ is a linear functional of $P_{XX^{\prime}}$, the function $\Delta(P_{XX^{\prime}})=\max_{0\leq s\leq 1}\Delta_{s}(P_{XX^{\prime}})$, being a pointwise maximum of linear functionals, is convex in $P_{XX^{\prime}}$. We argue that the minimizer, $P_{XX^{\prime}}^{*}$, of $\Delta(P_{XX^{\prime}})$ within the set $\{P_{XX^{\prime}}:~{}D(P_{XX^{\prime}}\|Q\times Q)\leq 2\hat{R}\}$ must be a symmetric distribution, namely, $P_{XX^{\prime}}^{*}(x,x^{\prime})=P_{XX^{\prime}}^{*}(x^{\prime},x)$ for all $x,x^{\prime}\in{\cal X}$. To see why this is true, given any $P_{XX^{\prime}}$ that satisfies the divergence constraint, define its transpose, $\tilde{P}_{XX^{\prime}}$, by $\tilde{P}_{XX^{\prime}}(x,x^{\prime})=P_{XX^{\prime}}(x^{\prime},x)$ for all $x,x^{\prime}\in{\cal X}$. Obviously, $\Delta(\tilde{P}_{XX^{\prime}})=\Delta(P_{XX^{\prime}})$ because if $s^{*}$ achieves $\Delta(P_{XX^{\prime}})$, then $1-s^{*}$ achieves $\Delta(\tilde{P}_{XX^{\prime}})$ and the value of the maximum is the same (just by swapping $x$ and $x^{\prime}$). Next, define $\bar{P}_{XX^{\prime}}=\frac{1}{2}P_{XX^{\prime}}+\frac{1}{2}\tilde{P}_{XX^{\prime}}$. Then,

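$$\Delta(\bar{P}_{XX^{\prime}})\leq\frac{1}{2}\Delta(P_{XX^{\prime}})+\frac{1}{2}\Delta(\tilde{P}_{XX^{\prime}})=\Delta(P_{XX^{\prime}}),$$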

and at the same time,

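$$D(\bar{P}_{XX^{\prime}}\|Q\times Q)\leq\frac{1}{2}D(P_{XX^{\prime}}\|Q\times Q)+\frac{1}{2}D(\tilde{P}_{XX^{\prime}}\|Q\times Q)=D(P_{XX^{\prime}}\|Q\times Q)\leq 2\hat{R},$$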

so the divergence constraint is satisfied. It follows then that the symmetric distribution $\bar{P}_{XX^{\prime}}$ is never worse than $P_{XX^{\prime}}$ in terms of minimizing $\Delta(\cdot)$ under the divergence constraint. Thus, it is sufficient to seek the minimizing $P_{XX^{\prime}}$ among the symmetric distributions. However, given that $P_{XX^{\prime}}$ is symmetric, the maximizing $s$ is $s^{*}=1/2$ , because then $\Delta_{1-s}(P_{XX^{\prime}})=\Delta_{s}(P_{XX^{\prime}})$ . Thus, the r.h.s. of eq. (31) is equivalent to
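The symmetrization argument can be illustrated numerically. The sketch below is a minimal check under stated assumptions, not the paper's derivation: the Chernoff distance is taken as $\Delta_{s}(P_{XX^{\prime}})=-\sum_{x,x^{\prime}}P_{XX^{\prime}}(x,x^{\prime})\log_{2}\sum_{y}W^{1-s}(y|x)W^{s}(y|x^{\prime})$ (the placement of $s$ versus $1-s$ is an assumed convention), the maximization over $s$ is done on a uniform grid, and the channel, $Q$, and $P_{XX^{\prime}}$ are arbitrary example values.

```python
import math

def delta_s(P, W, s):
    """Assumed form: -sum_{x,x'} P(x,x') log2 sum_y W(y|x)^(1-s) W(y|x')^s."""
    total = 0.0
    for x, row in enumerate(P):
        for xp, prob in enumerate(row):
            if prob > 0.0:
                inner = sum(W[x][y] ** (1.0 - s) * W[xp][y] ** s
                            for y in range(len(W[0])))
                total -= prob * math.log2(inner)
    return total

def delta(P, W, grid=201):
    """Grid approximation of max over 0 <= s <= 1 (grid is symmetric about 1/2)."""
    return max(delta_s(P, W, i / (grid - 1)) for i in range(grid))

def divergence(P, Q):
    """D(P || Q x Q) in bits."""
    return sum(prob * math.log2(prob / (Q[x] * Q[xp]))
               for x, row in enumerate(P)
               for xp, prob in enumerate(row) if prob > 0.0)

def symmetrize(P):
    """Average of P and its transpose."""
    n = len(P)
    return [[0.5 * (P[a][b] + P[b][a]) for b in range(n)] for a in range(n)]

if __name__ == "__main__":
    W = [[0.9, 0.1], [0.3, 0.7]]      # arbitrary strictly positive channel
    Q = [0.5, 0.5]
    P = [[0.10, 0.50], [0.15, 0.25]]  # arbitrary asymmetric joint distribution
    P_bar = symmetrize(P)
    print("Delta:", delta(P, W), "->", delta(P_bar, W))
    print("D:    ", divergence(P, Q), "->", divergence(P_bar, Q))
```

For the symmetrized distribution, neither quantity increases, and the grid maximum of $\Delta_{s}$ is attained at $s=1/2$.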

[TABLE]

Now,

[TABLE]

and so,

[TABLE]

Formally, this proves Theorem 1, but as a final remark, to complete the picture, we also argue that there is no loss of tightness in the passage from the right–hand side of the first line of eq. (35) to $E_{\mbox{\tiny trtc}}(R,Q)$. This follows from the following matching upper bound on the first line of (35). Let $\tilde{R}$ be such that the maximizer of $E_{\mbox{\tiny x}}(\rho,Q)-(2\rho-1)\tilde{R}$ is $\rho_{\mbox{\tiny trtc}}(R)$. This is feasible due to the concavity of $E_{\mbox{\tiny x}}(\rho,Q)$ in $\rho$ [17, Theorem 3.3.2],

[TABLE]

Thus,

[TABLE]

5 Discussion

Several comments are in order concerning Theorem 1 and its proof.

Relations among the exponents. It is easy to see that $E_{\mbox{\tiny trtc}}(0)$ is equal to the zero–rate expurgated exponent, $E_{\mbox{\tiny ex}}(0,Q)=E_{\mbox{\tiny cex}}(0,Q)=\lim_{\rho\to\infty}E_{\mbox{\tiny x}}(\rho,Q)$ , and that for all $R<R_{0}(Q)$ ,

[TABLE]

In other words, the typical random trellis code exponent is between the convolutional coding random coding exponent and the convolutional coding expurgated exponent. This is parallel to the ordering among the corresponding block code exponents [11]. These relations are displayed graphically in Fig. 2, where the concave curve of $E_{\mbox{\tiny x}}(\rho,Q)$ is plotted as a function of $\rho$, along with the straight lines, $\rho R$ and $(2\rho-1)R$. For $\rho=1$, we have $E_{\mbox{\tiny x}}(1,Q)=E_{0}(1,Q)=R_{0}(Q)$. The straight lines $\rho R$ and $(2\rho-1)R$ intersect at the point $(1,R)$, which is below the point $(1,R_{0}(Q))$ on the curve (as $R$ is assumed smaller than $R_{0}(Q)$). The straight lines $\rho R$ and $(2\rho-1)R$ meet the curve $E_{\mbox{\tiny x}}(\rho,Q)$ at the points $(\rho_{\mbox{\tiny cex}}(R),R\cdot E_{\mbox{\tiny cex}}(R,Q))$ and $(\rho_{\mbox{\tiny trtc}}(R),R\cdot E_{\mbox{\tiny trtc}}(R,Q))$, respectively. As can be seen, $R\cdot E_{\mbox{\tiny cex}}(R,Q)\geq R\cdot E_{\mbox{\tiny trtc}}(R,Q)\geq R_{0}(Q)$.
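The graphical construction just described can be reproduced numerically. The following Python sketch is an illustration under assumptions rather than the computation behind the paper's figures: it takes the binary symmetric channel with uniform $Q$, for which Gallager's function has the closed form $E_{\mbox{\tiny x}}(\rho,Q)=-\rho\log_{2}\left[\frac{1}{2}+\frac{1}{2}Z^{1/\rho}\right]$ with $Z=2\sqrt{p(1-p)}$ (assumed here, in bits), and locates the two crossing points by bisection. The function names are hypothetical.

```python
import math

def ex_bsc(rho: float, p: float) -> float:
    """E_x(rho, Q) in bits for a BSC(p) with uniform Q (assumed closed form)."""
    z = 2.0 * math.sqrt(p * (1.0 - p))  # Bhattacharyya parameter Z
    return -rho * math.log2(0.5 + 0.5 * z ** (1.0 / rho))

def crossing(R: float, p: float, line) -> float:
    """Return rho >= 1 at which the curve E_x(rho, Q) meets line(rho, R)."""
    if not 0.0 < R < ex_bsc(1.0, p):
        raise ValueError("need 0 < R < R_0(Q)")
    gap = lambda rho: ex_bsc(rho, p) - line(rho, R)
    lo, hi = 1.0, 2.0
    while gap(hi) > 0.0:   # E_x is bounded and the line is not, so this stops
        hi *= 2.0
    for _ in range(200):   # plain bisection on the sign change of the gap
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exponents(R: float, p: float):
    """Return (R*E_cex, R*E_trtc, R_0) read off the crossing points."""
    rho_cex = crossing(R, p, lambda rho, r: rho * r)
    rho_trtc = crossing(R, p, lambda rho, r: (2.0 * rho - 1.0) * r)
    return ex_bsc(rho_cex, p), ex_bsc(rho_trtc, p), ex_bsc(1.0, p)

if __name__ == "__main__":
    for R in (0.05, 0.1, 0.2, 0.3):
        r_cex, r_trtc, r0 = exponents(R, 0.1)
        print(f"R={R:.2f}  R*E_cex={r_cex:.4f}  R*E_trtc={r_trtc:.4f}  R_0={r0:.4f}")
```

Since the line $(2\rho-1)R$ lies above $\rho R$ for $\rho>1$, it meets the increasing curve earlier, which is how the ordering of the three quantities shows up in the output.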

Properties of the typical random trellis codes. For typical randomly selected trellis codes, we are able to characterize the features that make them achieve $E_{\mbox{\tiny trtc}}(R,Q)$ . This is, in fact, spelled out explicitly in the definition of the subset of typical codes, ${\cal T}_{k}$ . We know that for these codes, joint types that correspond to empirical distributions that are too far from $Q\times Q$ (e.g., those that exhibit too strong empirical dependency between the incorrect path and the correct one), are not populated. For the other types, we know the distance spectrum, or more precisely, the population profile of the various joint types.

Dominant error events. In the process of proving Theorem 1 in Section 4, we have seen also alternative forms of the error exponent expression, like the Csiszár–style expression (31). While this expression may not be easier to calculate numerically (due to the nested optimizations involved), it is nevertheless useful for gaining some insight. We learn the following from the first part of the derivation: the error probability is dominated by a sub–exponential number of incorrect paths whose joint empirical distribution with the correct path is given by

[TABLE]

and whose total unmerged length, $k+\ell$ (a.k.a. the critical length), spans

[TABLE]

branches. (Footnote 6: Interestingly, this is different from the total critical length that dominates ordinary average error probability, which for $R<R_{0}$, is $k$ branches long [17, Theorem 5.5.1].) The error exponent expression (31) is therefore essentially the same as that of a zero–rate block code (Footnote 7: The zero rate is because of the sub–exponential number of dominant incorrect paths.) of block length $K/[2R-D(P_{XX^{\prime}}^{*}\|Q\times Q)]$, where the competing trellis paths are at normalized Bhattacharyya distance $\Delta_{1/2}(P_{XX^{\prime}}^{*})$ from the correct path, hence the product, $\Delta_{1/2}(P_{XX^{\prime}}^{*})/[2R-D(P_{XX^{\prime}}^{*}\|Q\times Q)]$. For time–varying convolutional codes over the binary–input, output–symmetric channel, better performance is obtained (as discussed above), since one obtains [17, Corollary 5.3.1],

[TABLE]

with $Z=\sum_{y}\sqrt{W(y|0)W(y|1)}$, which has the simple interpretation of the Costello lower bound on the free distance [3] multiplied by the corresponding Bhattacharyya bound (see also [18, p. 1652]). In other words, the typical time–varying convolutional code achieves the Costello bound. Note that the parameter $\rho$ in (37) controls the similarity (and hence the dominant distance) between $P_{XX^{\prime}}^{*}$ and the product distribution $Q\times Q$. When $\rho$ is very large (at low rates), the dominant distance is large, and when $\rho$ is very small (at high rates), the distance is very small.

A numerical example. In [17, Chap. 5], there is a comparison of the performance–complexity trade-off between unstructured block codes and convolutional codes, where the performance is measured according to the traditional random coding error exponents. As explained therein, the idea is that for block codes of length $N$ and rate $R$, the complexity is $G=2^{NR}$ and the error probability decays exponentially as $2^{-NE_{\mbox{\tiny block}}(R)}=G^{-E_{\mbox{\tiny block}}(R)/R}$. For convolutional codes, decoded by the Viterbi algorithm, the complexity is about $G=2^{K}$ and the error probability decays like $2^{-KE_{\mbox{\tiny conv}}(R)}=G^{-E_{\mbox{\tiny conv}}(R)}$, and so, it makes sense to compare $E_{\mbox{\tiny block}}(R)/R$ with $E_{\mbox{\tiny conv}}(R)$, or more conveniently, to compare $E_{\mbox{\tiny block}}(R)$ with $R\cdot E_{\mbox{\tiny conv}}(R)$. It is interesting to conduct a similar comparison when the performance of both classes of codes is measured according to error exponents of the typical random codes. In Fig. 3, this is done for the binary symmetric channel with crossover parameter $p=0.1$ and the uniform random coding distribution. For reference, the ordinary random coding exponent of convolutional codes, $R\cdot E_{\mbox{\tiny rtc}}(R,Q)\equiv R_{0}(Q)$, is also plotted in the displayed range of rates. As can be seen, the typical code exponent of the ensemble of time–varying convolutional codes is much larger than that of block codes for the same decoding complexity.

6 Channels with Memory and Mismatch

In this section, we extend our main results in two directions at the same time. The first direction is that instead of assuming memoryless channels, we now allow channels that memorize a finite number of the most recent past inputs, with the clear motivation of channels with intersymbol interference (see also [17, Sect. 5.8]). For the sake of simplicity, we consider the case where the memory contains the one most recent past input only, in other words, the channel model (2) is replaced by

[TABLE]

The extension to any fixed number $p$ of the most recent past inputs is conceptually straightforward by redefining the channel input at time $t$ as $\bar{x}_{t}=(x_{t},\ldots,x_{t-p+1})$ and taking into account that in the sequence $\{\bar{x}_{t}\}$ not all $(J^{p})^{2}$ state transitions $\bar{x}_{t}\to\bar{x}_{t+1}$ are allowed, but only those in which the two states are consistent with each other. Using this transformation, we are back to the model (38), except that $\{x_{t}\}$ are replaced by $\{\bar{x}_{t}\}$. The other direction of extension is that we allow mismatch. The decoding metric is assumed to be $\prod_{t}\tilde{W}(y_{t}|x_{t},x_{t-1})$ for some channel $\tilde{W}$ that may differ from $W$. To avoid further complications, the ensemble of time–varying trellis codes continues to be defined exactly as in Section 2 (without any attempt at introducing memory). These model assumptions are motivated by the facts that: (i) they are practically relevant, and (ii) the Viterbi algorithm is still implementable, although the number of states is now larger than before. In the remaining part of this section, we will not repeat all the derivations of Section 4, but only highlight the differences and state the results.
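The consistency requirement on the super-state transitions $\bar{x}_{t}\to\bar{x}_{t+1}$ can be made concrete with a small enumeration. In the sketch below (an illustration only; the function name is hypothetical), a state is the tuple $(x_{t},\ldots,x_{t-p+1})$ over an alphabet of size $J$, and a transition is taken to be allowed when the next state, with its newest symbol dropped, equals the current state with its oldest symbol dropped.

```python
from itertools import product

def consistent_transitions(J: int, p: int):
    """Enumerate super-states (x_t, ..., x_{t-p+1}) and the allowed transitions.
    A transition cur -> nxt is consistent when nxt = (x_{t+1}, x_t, ..., x_{t-p+2}),
    i.e. nxt without its first entry equals cur without its last entry."""
    states = list(product(range(J), repeat=p))
    allowed = [(cur, nxt) for cur in states for nxt in states
               if nxt[1:] == cur[:-1]]
    return states, allowed

if __name__ == "__main__":
    for J, p in [(2, 1), (2, 3), (3, 2)]:
        states, allowed = consistent_transitions(J, p)
        print(f"J={J}, p={p}: {len(states)} states, "
              f"{len(allowed)} of {len(states) ** 2} transitions allowed")
```

Each state has exactly $J$ consistent successors, so only $J^{p+1}$ of the $(J^{p})^{2}$ transitions survive; for $p=1$ every transition is allowed, which recovers the model (38).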

The first basic difference, relative to the derivation in Section 4, is associated with the pairwise error probability: given the correct trellis path $\mbox{\boldmath$ x $}$ and a competing path $\mbox{\boldmath$ x $}^{\prime}$, both of length $n(k+\ell)$ channel uses, the pairwise average error probability is upper bounded using the Chernoff bound as follows:

[TABLE]

where we have defined

[TABLE]

Note that here, it is no longer necessarily true that the optimal choice of $s$ is $s=1/2$, as the symmetry properties that were valid in the memoryless matched case of Section 4, do not continue to hold here, in general. To make the derivation more tractable, in the sequel, we interchange the optimization over $s$ with the summation over $\{\mbox{\boldmath$ x $},\mbox{\boldmath$ x $}^{\prime}\}$, at the possible risk of losing exponential tightness. (Footnote 8: Of course, one may always select $s=1/2$, as in Section 4, and then Theorem 1 will still be obtained as a special case.) The expression $\sum_{t=1}^{n(k+\ell)}d_{s}(x_{t},x_{t-1};x_{t}^{\prime},x_{t-1}^{\prime})$ depends on $(\mbox{\boldmath$ x $},\mbox{\boldmath$ x $}^{\prime})$ only via their joint “Markov type”, defined by the joint empirical distribution,

[TABLE]

ignoring edge effects. Let us denote

[TABLE]
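To make the notion of a joint Markov type concrete, the following minimal Python sketch computes the empirical distribution of the quadruples $(x_{t},x_{t}^{\prime},x_{t-1},x_{t-1}^{\prime})$ for a pair of sequences. It is an illustration only: the cyclic wrap-around at $t=1$ is an assumed convention for "ignoring edge effects", and the function names are hypothetical.

```python
from collections import Counter

def joint_markov_type(x, xp):
    """Empirical distribution of (x_t, x'_t, x_{t-1}, x'_{t-1}).
    Assumption: the predecessor of the first symbol is the last symbol
    (cyclic convention), which removes edge effects."""
    if len(x) != len(xp) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    counts = Counter((x[t], xp[t], x[t - 1], xp[t - 1]) for t in range(n))
    return {key: c / n for key, c in counts.items()}

def marginal(ptype, idx):
    """Marginal of the joint type over the coordinates listed in idx."""
    out = Counter()
    for key, prob in ptype.items():
        out[tuple(key[i] for i in idx)] += prob
    return out

if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 0]
    xp = [1, 1, 0, 0, 1, 0]
    ptype = joint_markov_type(x, xp)
    print("current pair  :", dict(marginal(ptype, (0, 1))))
    print("previous pair :", dict(marginal(ptype, (2, 3))))
```

With the cyclic convention, the marginals of the current pair and of the previous pair coincide exactly.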

Using the extension of the method of types to Markov types (see, e.g., [4, Sect. VII.A], [5], [6, Sect. 3.1], [12]), we find that

[TABLE]

where $\hat{H}(X,X^{\prime}|X_{-},X_{-}^{\prime})$ is the empirical conditional entropy of $(X,X^{\prime})$ given $(X_{-},X_{-}^{\prime})$ , derived from $\hat{P}_{XX^{\prime}X_{-}X_{-}^{\prime}}$ ,

[TABLE]

$\hat{P}_{XX^{\prime}|X_{-}X_{-}^{\prime}}$ being the conditional distribution induced by $\hat{P}_{XX^{\prime}X_{-}X_{-}^{\prime}}$ , and the minimization over $\{\hat{P}_{XX^{\prime}X_{-}X_{-}^{\prime}}\}$ is confined to joint distributions where the marginals of $(X,X^{\prime})$ and $(X_{-},X_{-}^{\prime})$ are the same. Repeating the same steps as in Section 4, and assuming that

[TABLE]

the resulting error exponent of the typical random trellis code is lower bounded by

[TABLE]

As for the inner–most minimization, let us define the functions

[TABLE]

and

[TABLE]

From large deviations theory [6, Sect. 3.1], we know that an alternative expression for $F_{s}(d)$ is given by

[TABLE]

where $G_{s}(r)=-\log\lambda_{s}(r)$ , $\lambda_{s}(r)$ being the Perron–Frobenius eigenvalue of the $J^{2}\times J^{2}$ matrix

[TABLE]

whose rows and columns are indexed by the pairs $(x,x^{\prime})$ and $(x_{-},x_{-}^{\prime})$, respectively. (Footnote 9: This equivalence between the two forms of $F_{s}(d)$ follows from the fact that they are both expressions of the large deviations rate function [6, Sect. 3.1] of the probability of the event $\{\sum_{t=1}^{N}d_{s}(X_{t},X_{t-1};X_{t}^{\prime},X_{t-1}^{\prime})\leq Nd\}$, where $\{X_{t}\}$ and $\{X_{t}^{\prime}\}$ are independent i.i.d. processes, both governed by $Q$.) Thus, given $d$,

[TABLE]

Equivalently, given that $2\hat{R}=F_{s}(d)$ ,

[TABLE]

But

[TABLE]

and so, similarly as in Section 4, the error exponent of the typical random trellis code is lower bounded by

[TABLE]

where $\rho_{R,s}$ is the solution to the equation $(2\rho-1)R=\rho G_{s}(1/\rho)$ . Note that $\rho G_{s}(1/\rho)$ is an extension of $E_{\mbox{\tiny x}}(\rho,Q)$ to a channel with both memory and mismatch. Using similar considerations, it is easy to see that $R_{0}(Q)$ of eq. (44) is equal to $\sup_{s\geq 0}G_{s}(1)$ .
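The remark that $\rho G_{s}(1/\rho)$ extends $E_{\mbox{\tiny x}}(\rho,Q)$ can be checked numerically in the memoryless, matched special case with $s=1/2$. The sketch below rests on explicit assumptions, since the exact definitions of $d_{s}$ and $A_{s}(r)$ are not reproduced here: it takes $d_{1/2}(x,x_{-};x^{\prime},x_{-}^{\prime})=-\log_{2}\sum_{y}\sqrt{W(y|x,x_{-})W(y|x^{\prime},x_{-}^{\prime})}$ and matrix entries $Q(x)Q(x^{\prime})2^{-r\,d_{1/2}}$, which is the standard tilted-kernel form from large deviations theory for i.i.d. inputs, and it computes the Perron–Frobenius eigenvalue by power iteration. All function names are hypothetical.

```python
import math
from itertools import product

def bhattacharyya_d(W, a, b):
    """Assumed d_{1/2}: -log2 sum_y sqrt(W[a][y] * W[b][y]),
    where a and b are (x, x_prev) pairs indexing the channel rows."""
    return -math.log2(sum(math.sqrt(wa * wb) for wa, wb in zip(W[a], W[b])))

def perron_eigenvalue(A, iters=500):
    """Power iteration for the largest eigenvalue of a positive matrix."""
    n = len(A)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)              # v sums to one, so sum(w) tends to lambda
        v = [wi / lam for wi in w]
    return lam

def G_half(W, Q, r):
    """Assumed G_{1/2}(r) = -log2 of the Perron-Frobenius eigenvalue of the
    matrix with entries Q(x) Q(x') 2^(-r d), rows (x, x'), columns (x_-, x'_-)."""
    pairs = list(product(range(len(Q)), repeat=2))
    A = [[Q[x] * Q[xp] * 2.0 ** (-r * bhattacharyya_d(W, (x, xm), (xp, xpm)))
          for (xm, xpm) in pairs]
         for (x, xp) in pairs]
    return -math.log2(perron_eigenvalue(A))

if __name__ == "__main__":
    p = 0.1
    # Memoryless BSC written in the (x, x_prev) format: x_prev is ignored.
    W = {(x, xm): ([1 - p, p] if x == 0 else [p, 1 - p])
         for x in (0, 1) for xm in (0, 1)}
    Q = [0.5, 0.5]
    z = 2.0 * math.sqrt(p * (1 - p))
    for rho in (1.0, 2.0, 5.0):
        ex = -rho * math.log2(0.5 + 0.5 * z ** (1.0 / rho))
        print(f"rho={rho}: rho*G(1/rho)={rho * G_half(W, Q, 1 / rho):.6f}  E_x={ex:.6f}")
```

For a memoryless channel the matrix has rank one, so the eigenvalue reduces to a plain sum and the two columns of the output agree; a channel with genuine memory can be supplied through the same `(x, x_prev)` indexing.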

Referring to the comment on the extension to channels with memory of the $p$ most recent past channel inputs (see the introductory paragraph of this section), the only difference is that in such a case, the matrix $A_{s}(r)$ has larger dimensions, $J^{2p}\times J^{2p}$ , but it is rather sparse: all entries vanish except those where both pairs $(x,x_{-})$ and $(x^{\prime},x_{-}^{\prime})$ are consistent.

Bibliography


[1] A. Barg and G. D. Forney, Jr., “Random codes: minimum distances and error exponents,” IEEE Trans. Inform. Theory, vol. 48, no. 9, pp. 2568–2573, September 2002.
[2] G. Battail, “On random–like codes,” Proc. 4th Canadian Workshop on Information Theory, pp. 76–94, Lac Delage, Quebec, Canada, May 1995.
[3] D. J. Costello, Jr., “Free distance bounds for convolutional codes,” IEEE Trans. Inform. Theory, vol. IT–20, no. 3, pp. 356–365, May 1974.
[4] I. Csiszár, “The method of types,” IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2505–2523, October 1998.
[5] L. D. Davisson, G. Longo, and A. Sgarro, “The error exponent for noiseless encoding of finite ergodic Markov sources,” IEEE Trans. Inform. Theory, vol. IT–27, no. 4, pp. 431–438, July 1981.
[6] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, Jones and Bartlett Publishers, London, 1993.
[7] G. D. Forney, Jr., “Convolutional codes II. maximum–likelihood decoding,” Information and Control, vol. 25, pp. 222–266, 1974.
[8] R. Johannesson, “On the error probability of general trellis codes with applications to sequential decoding,” IEEE Trans. Inform. Theory, vol. IT–23, no. 5, pp. 609–611, September 1977.