Deep Learning Methods for Improved Decoding of Linear Codes

Eliya Nachmani; Elad Marciano; Loren Lugosch; Warren J. Gross; David Burshtein; Yair Be’ery

arXiv:1706.07043·cs.IT·March 14, 2018


TL;DR

This paper explores how deep learning techniques can enhance the decoding of linear codes, improving performance and reducing complexity of traditional algorithms like belief propagation and min-sum.

Contribution

It introduces neural network-based modifications to standard decoders, including recurrent architectures and parameter tying, achieving better decoding accuracy with fewer parameters.

Findings

01 Deep learning improves belief propagation decoding performance.

02 Recurrent neural decoders match results with fewer parameters.

03 Neural decoders enhance BCH code decoding efficiency.

Abstract

The problem of low complexity, close to optimal, channel decoding of linear codes with short to moderate block length is considered. It is shown that deep learning methods can be used to improve a standard belief propagation decoder, despite the large example space. Similar improvements are obtained for the min-sum algorithm. It is also shown that tying the parameters of the decoders across iterations, so as to form a recurrent neural network architecture, can be implemented with comparable results. The advantage is that significantly fewer parameters are required. We also introduce a recurrent neural decoder architecture based on the method of successive relaxation. Improvements over standard belief propagation are also observed on sparser Tanner graph representations of the codes. Furthermore, we demonstrate that the neural belief propagation decoder can be used to improve the…

Equations

Channel LLR:

$$l_{v}=\log\frac{\Pr\left(C_{v}=1\mid y_{v}\right)}{\Pr\left(C_{v}=0\mid y_{v}\right)}$$

Belief propagation (variable-to-check message, check-to-variable message, output marginal):

$$x_{i,e=(v,c)}=l_{v}+\sum_{e^{\prime}=(v,c^{\prime}),\,c^{\prime}\neq c}x_{i-1,e^{\prime}} \tag{1}$$

$$x_{i,e=(v,c)}=2\tanh^{-1}\left(\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}\tanh\left(\frac{x_{i-1,e^{\prime}}}{2}\right)\right) \tag{2}$$

$$o_{v}=l_{v}+\sum_{e^{\prime}=(v,c^{\prime})}x_{2L,e^{\prime}} \tag{3}$$

Neural belief propagation (feed-forward) and its loss:

$$x_{i,e=(v,c)}=\tanh\left(\frac{1}{2}\left(w_{i,v}l_{v}+\sum_{e^{\prime}=(v,c^{\prime}),\,c^{\prime}\neq c}w_{i,e,e^{\prime}}x_{i-1,e^{\prime}}\right)\right) \tag{4}$$

$$x_{i,e=(v,c)}=2\tanh^{-1}\left(\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}x_{i-1,e^{\prime}}\right) \tag{5}$$

$$o_{v}=\sigma\left(w_{2L+1,v}l_{v}+\sum_{e^{\prime}=(v,c^{\prime})}w_{2L+1,v,e^{\prime}}x_{2L,e^{\prime}}\right) \tag{6}$$

$$L(o,y)=-\frac{1}{N}\sum_{v=1}^{N}y_{v}\log(o_{v})+(1-y_{v})\log(1-o_{v}) \tag{7}$$

Min-sum, normalized min-sum (NMS), neural NMS, offset min-sum (OMS) and neural OMS check-to-variable messages:

$$x_{i,e=(v,c)}=\min_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right) \tag{8}$$

$$x_{i,e=(v,c)}=w\cdot\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right)\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{9}$$

$$x_{i,e=(v,c)}=w_{i,e=(v,c)}\cdot\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right)\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{10}$$

$$x_{i,e=(v,c)}=\max\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|-\beta,\,0\right)\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{11}$$

$$x_{i,e=(v,c)}=\max\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|-\beta_{i,e=(v,c)},\,0\right)\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{12}$$

BP-RNN messages, output and multiloss:

$$x_{t,e=(v,c)}=\tanh\left(\frac{1}{2}\left(w_{v}l_{v}+\sum_{e^{\prime}=(c^{\prime},v),\,c^{\prime}\neq c}w_{e,e^{\prime}}x_{t-1,e^{\prime}}\right)\right) \tag{13}$$

$$x_{t,e=(c,v)}=2\tanh^{-1}\left(\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}x_{t,e^{\prime}}\right) \tag{14}$$

$$o_{v,t}=\sigma\left(\tilde{w}_{v}l_{v}+\sum_{e^{\prime}=(c^{\prime},v)}\tilde{w}_{v,e^{\prime}}x_{t,e^{\prime}}\right) \tag{15}$$

$$L(o,y)=-\frac{1}{N}\sum_{t=1}^{T}\sum_{v=1}^{N}y_{v}\log(o_{v,t})+(1-y_{v})\log(1-o_{v,t}) \tag{16}$$

Min-sum decoders with tied parameters:

$$x_{i,e=(v,c)}=w_{e=(v,c)}\cdot\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right)\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{17}$$

$$x_{i,e=(v,c)}=\max\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|-\beta_{e=(v,c)},\,0\right)\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{18}$$

Relaxation:

$$m_{t}^{\prime}=\gamma m_{t-1}^{\prime}+(1-\gamma)m_{t} \tag{19}$$

$$m_{t}^{\prime}=m_{t-1}^{\prime}+0.125\left(m_{t}-m_{t-1}^{\prime}\right)$$

Further equations:

$$s_{v}=l_{v}+\sum_{e^{\prime}=(v,c^{\prime})}x_{i-1,e^{\prime}}$$

$$x_{i,e}=s_{v}-x_{i-1,e}$$

Full text

Deep Learning Methods for Improved Decoding of Linear Codes

Eliya Nachmani, Elad Marciano, Loren Lugosch, Warren J. Gross, David Burshtein, and Yair Be’ery

E. Nachmani, E. Marciano, D. Burshtein and Y. Be’ery are with the School of Electrical Engineering, Tel-Aviv University, Tel-Aviv, 6997801 Israel. L. Lugosch and W. J. Gross are with the Department of Electrical and Computer Engineering, McGill University, Montréal, QC H3A 0G4, Canada. This work was presented in part at the Allerton 2016 conference and at ISIT 2017.

Abstract

The problem of low complexity, close to optimal, channel decoding of linear codes with short to moderate block length is considered. It is shown that deep learning methods can be used to improve a standard belief propagation decoder, despite the large example space. Similar improvements are obtained for the min-sum algorithm. It is also shown that tying the parameters of the decoders across iterations, so as to form a recurrent neural network architecture, can be implemented with comparable results. The advantage is that significantly fewer parameters are required. We also introduce a recurrent neural decoder architecture based on the method of successive relaxation. Improvements over standard belief propagation are also observed on sparser Tanner graph representations of the codes. Furthermore, we demonstrate that the neural belief propagation decoder can be used to improve the performance, or alternatively reduce the computational complexity, of a close to optimal decoder of short BCH codes.

Index Terms:

Deep learning, error correcting codes, belief propagation, min-sum decoding.

I Introduction

In recent years, deep learning methods have demonstrated remarkable performance in various tasks. These methods surpass human-level performance in some object detection tasks [1], achieve state-of-the-art results in machine translation [2] and speech processing [3], and attain record-breaking performance in challenging games such as Go [4].

In this paper we suggest an application of deep learning methods to the problem of low complexity channel decoding. A well-known family of linear error correcting codes is that of the linear low-density parity-check (LDPC) codes [5]. LDPC codes achieve near Shannon channel capacity with the belief propagation (BP) decoding algorithm, but typically only for relatively large block lengths. For short to moderate length high density parity check (HDPC) codes [6, 7, 8, 9, 10], such as common powerful linear algebraic codes, the regular BP algorithm obtains poor results compared to the optimal maximum likelihood (ML) decoder. On the other hand, the importance of close to optimal, low complexity, low latency and low power decoders for short to moderate length codes has grown with the emergence of applications driven by the Internet of Things.

Recently, it was shown in [11] that deep learning methods can improve the BP decoding of HDPC codes using a weighted BP decoder. The BP algorithm is formulated as a neural network, and it is shown that it can improve the decoding by $0.9{\rm dB}$ in the high SNR regime. A key property of the method is that it is sufficient to train the neural network decoder using a single codeword (e.g., the all-zero codeword), since the architecture guarantees the same error rate for any chosen transmitted codeword. Later, Lugosch & Gross [12] proposed an improved neural network architecture that achieves similar results to [11] with fewer parameters and reduced complexity. The main difference compared to [11] is that the offset min-sum algorithm is used instead of the sum-product algorithm, thus eliminating the need for multiplications. Gruber et al. [13] proposed a neural network decoder with an unconstrained graph (i.e., a fully connected network) and showed that the network gets close to the ML performance for very small block codes, $N=16$. Also, O’Shea & Hoydis [14] proposed to use an autoencoder as a communication system for a small block code with $N=7$. In [15] it was suggested to improve an iterative decoding algorithm of polar codes by using neural network decoders of sub-blocks. In [16] deep learning-based detection algorithms were used when the channel model is unknown, and in [17] deep learning was used for MIMO detection. Deep learning was also applied to quantum error correcting codes [18].

In this work we elaborate on our work in [11] and [12] and extend it as follows (see the preprint [19]). First, we apply tying to the decoder parameters by using a recurrent neural network (RNN) architecture, and show that it can achieve up to $1.5{\rm dB}$ improvement over the standard belief propagation algorithm in the high SNR regime. The advantage over the feed-forward architecture in our initial work [11] is that it reduces the number of parameters. Similar improvements were obtained when applying tying to the neural min-sum algorithms. We introduce a new RNN decoder architecture based on the successive relaxation technique and show that it can achieve excellent performance with just a single learnable parameter. We also investigate the performance of the RNN decoder on parity check matrices with lower densities and fewer short cycles, and show that even though we start with a cycle-reduced matrix, the network can improve the performance by up to $1.0{\rm dB}$. The output of the training algorithm can be interpreted as a soft Tanner graph that replaces the original one. State of the art decoding algorithms of short to moderate algebraic codes, such as [20, 7, 10], utilize the BP algorithm as a component in their solution. Thus, it is natural to replace the standard BP decoder with our trained RNN decoder, in an attempt to improve either the decoding performance or its complexity. In this work we demonstrate, for a BCH(63,36) code, that such improvements can be realized by using RNN decoders in the mRRD algorithm [7].

II Trellis representation of belief propagation

The renowned BP decoder [5], [21] can be constructed from the Tanner graph, which is a graphical representation of some parity check matrix that describes the code. In this algorithm, messages are transmitted over edges. Each node calculates its outgoing transmitted message over some edge, based on all incoming messages it receives over all the other edges. We start by providing an alternative graphical representation to the BP algorithm with $L$ full iterations when using parallel (flooding) scheduling. Our alternative representation is a trellis in which the nodes in the hidden layers correspond to edges in the Tanner graph. Denote by $N$ the code block length (i.e., the number of variable nodes in the Tanner graph), and by $E$ the number of edges in the Tanner graph. Then the input layer of our trellis representation of the BP decoder is a vector of size $N$ that consists of the log-likelihood ratios (LLRs) of the channel outputs. The LLR value of variable node $v$, $v=1,2,\ldots,N$, is given by

$$l_{v}=\log\frac{\Pr\left(C_{v}=1\mid y_{v}\right)}{\Pr\left(C_{v}=0\mid y_{v}\right)}$$

where $y_{v}$ is the channel output corresponding to the $v$ th codebit, $C_{v}$ .
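As a small illustration of the LLR definition above, the following sketch evaluates $l_{v}$ for a binary-input AWGN channel. The BPSK mapping (bit 0 to $+1$, bit 1 to $-1$), equiprobable bits, and the noise standard deviation `sigma` are assumptions of this sketch, not details given at this point in the text; `channel_llr` is a hypothetical helper name.

```python
import numpy as np

def channel_llr(y, sigma):
    """LLR l_v = log Pr(C_v=1|y_v) / Pr(C_v=0|y_v) for equiprobable bits.

    Assumptions of this sketch: BPSK mapping 0 -> +1, 1 -> -1 and real
    AWGN with standard deviation `sigma`.
    """
    y = np.asarray(y, dtype=float)
    # log N(y; -1, sigma^2) - log N(y; +1, sigma^2) simplifies to -2 y / sigma^2
    return -2.0 * y / sigma**2

y = np.array([0.9, -0.2, 1.4])      # example channel outputs
llr = channel_llr(y, sigma=0.8)     # negative entries favour bit 0 under this mapping
```

Under these assumptions a channel output close to $+1$ yields a negative LLR, that is, evidence for bit 0.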

All the following layers in the trellis, except for the last one (i.e., all the hidden layers), have size $E$. For each hidden layer, each processing element in that layer is associated with the message transmitted over some edge in the Tanner graph. The last (output) layer of the trellis consists of $N$ processing elements that output the final decoded codeword. Consider the $i$th hidden layer, $i=1,2,\ldots,2L$. For odd (even, respectively) values of $i$, each processing element in this layer outputs the message transmitted by the BP decoder over the corresponding edge in the graph, from the associated variable (check) node to the associated check (variable) node. A processing element in the first hidden layer ($i=1$), corresponding to the edge $e=(v,c)$, is connected to a single node in the input layer: the variable node $v$ associated with that edge. Now consider the $i$th ($i>1$) hidden layer. For odd (even, respectively) values of $i$, the processing node corresponding to the edge $e=(v,c)$ is connected to all processing elements in layer $i-1$ associated with the edges $e^{\prime}=(v,c^{\prime})$ for $c^{\prime}\neq c$ ($e^{\prime}=(v^{\prime},c)$ for $v^{\prime}\neq v$, respectively). For odd $i$, a processing node in layer $i$, corresponding to the edge $e=(v,c)$, is also connected to the $v$th input node.

The BP messages transmitted over the trellis graph are the following. Consider hidden layer $i$ , $i=1,2,\ldots,2L$ , and let $e=(v,c)$ be the index of some processing element in that layer. We denote by $x_{i,e}$ , the output message of this processing element. For odd (even, respectively), $i$ , this is the message produced by the BP algorithm after $\lfloor(i-1)/2\rfloor$ iterations, from variable to check (check to variable) node.

For odd $i$ and $e=(v,c)$ we have (recall that the self LLR message of $v$ is $l_{v}$ ),

$$x_{i,e=(v,c)}=l_{v}+\sum_{e^{\prime}=(v,c^{\prime}),\,c^{\prime}\neq c}x_{i-1,e^{\prime}} \tag{1}$$

under the initialization, $x_{0,e^{\prime}}=0$ for all edges $e^{\prime}$ (in the beginning there is no information at the parity check nodes). The summation in (1) is over all edges $e^{\prime}=(v,c^{\prime})$ with variable node $v$ except for the target edge $e=(v,c)$ . Recall that this is a fundamental property of message passing algorithms [21].

Similarly, for even $i$ and $e=(v,c)$ we have,

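$$x_{i,e=(v,c)}=2\tanh^{-1}\left(\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}\tanh\left(\frac{x_{i-1,e^{\prime}}}{2}\right)\right) \tag{2}$$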

The final $v$ th output of the network is given by

$$o_{v}=l_{v}+\sum_{e^{\prime}=(v,c^{\prime})}x_{2L,e^{\prime}} \tag{3}$$

which is the final marginalization of the BP algorithm.
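The trellis view of BP described in this section can be sketched with edge-indexed message vectors and Boolean connectivity masks. The code below is a minimal illustration, not the authors' implementation: the function names, the Hamming(7,4) example matrix, the example LLR values and the clipping constant are all assumptions of the sketch. It keeps the sign convention of $l_{v}$ as defined above (negative values favour bit 0); every check of the example matrix has even degree, for which the product rule in the check update is exact under that convention.

```python
import numpy as np

def build_trellis(H):
    """Edge list and connectivity masks of the trellis for parity-check matrix H."""
    H = np.asarray(H)
    checks, variables = np.nonzero(H)            # one (c, v) pair per Tanner-graph edge
    not_self = ~np.eye(len(checks), dtype=bool)
    # v2c_mask[e, e'] is True when e' = (v, c') shares the variable of e and c' != c
    v2c_mask = (variables[:, None] == variables[None, :]) & not_self
    # c2v_mask[e, e'] is True when e' = (v', c) shares the check of e and v' != v
    c2v_mask = (checks[:, None] == checks[None, :]) & not_self
    # out_mask[v, e'] is True when edge e' touches variable node v
    out_mask = np.arange(H.shape[1])[:, None] == variables[None, :]
    return variables, v2c_mask, c2v_mask, out_mask

def bp_decode(H, llr, n_iter=5, clip=1.0 - 1e-12):
    """Run n_iter flooding iterations (2 hidden layers each) and return the marginals."""
    variables, v2c_mask, c2v_mask, out_mask = build_trellis(H)
    llr = np.asarray(llr, dtype=float)
    x = np.zeros(len(variables))                 # initialization x_{0,e'} = 0
    for _ in range(n_iter):
        # odd layer: variable-to-check messages (self LLR plus other incoming edges)
        x = llr[variables] + v2c_mask.astype(float) @ x
        # even layer: check-to-variable messages (tanh product over other edges)
        t = np.tanh(x / 2.0)
        prod = np.where(c2v_mask, t[None, :], 1.0).prod(axis=1)
        x = 2.0 * np.arctanh(np.clip(prod, -clip, clip))
    # output layer: final marginalization
    return llr + out_mask.astype(float) @ x

# Hamming(7,4) parity-check matrix (illustrative; every check has degree 4).
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
# Zero codeword sent; the first bit is received with the wrong sign.
llr = np.array([1.0, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0])
soft = bp_decode(H, llr, n_iter=5)
decoded = (soft > 0).astype(int)                 # bit 1 iff the marginal is positive
```

In this toy case the two checks containing the first bit pull its marginal back to a negative value, so the hard decision recovers the zero codeword.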

III A neural belief propagation decoder

We suggest the following parameterized deep neural network decoder that generalizes the BP decoder of the previous section. We use the same trellis representation for the decoder as in the previous section. The difference is that now we assign weights to the edges in the Tanner graph. These weights will be trained using stochastic gradient descent which is the standard method for training neural networks. More precisely, our decoder has the same trellis architecture as the one defined in the previous section. However, Equations (1), (2) and (3) are replaced by

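$$x_{i,e=(v,c)}=\tanh\left(\frac{1}{2}\left(w_{i,v}l_{v}+\sum_{e^{\prime}=(v,c^{\prime}),\,c^{\prime}\neq c}w_{i,e,e^{\prime}}x_{i-1,e^{\prime}}\right)\right) \tag{4}$$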

for odd $i$ ,

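$$x_{i,e=(v,c)}=2\tanh^{-1}\left(\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}x_{i-1,e^{\prime}}\right) \tag{5}$$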

for even $i$ , and

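$$o_{v}=\sigma\left(w_{2L+1,v}l_{v}+\sum_{e^{\prime}=(v,c^{\prime})}w_{2L+1,v,e^{\prime}}x_{2L,e^{\prime}}\right) \tag{6}$$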

where $\sigma(x)\equiv\left(1+e^{-x}\right)^{-1}$ is a sigmoid function that converts the LLR representation of the message to plain probability. It is easy to verify that the proposed message passing decoding algorithm (4)-(6) satisfies the message passing symmetry conditions [21, Definition 4.81]. Hence, by [21, Lemma 4.90], when transmitting over a binary memoryless symmetric (BMS) channel, the error rate is independent of the transmitted codeword. Therefore, to train the network, it is sufficient to use a database which is constructed by using noisy versions of a single codeword. For convenience we use the zero codeword, which must belong to any linear code. The database reflects various channel output realizations when the zero codeword has been transmitted. The goal is to train the parameters $\left\{w_{i,v},w_{i,e,e^{\prime}},w_{i,v,e^{\prime}}\right\}$ to achieve an $N$ dimensional output word which is as close as possible to the zero codeword. More precisely, we would like to minimize a cross entropy loss function at the last time step,

$$L(o,y)=-\frac{1}{N}\sum_{v=1}^{N}y_{v}\log(o_{v})+(1-y_{v})\log(1-o_{v}) \tag{7}$$

Here $o_{v}$ and $y_{v}=0$ are the final deep neural network output and the actual $v$ th component of the transmitted codeword (which is always the zero codeword during the training).

The network architecture is a non-fully connected neural network. We use stochastic gradient descent to train the parameters. The motivation behind the new proposed parameterized decoder is that by setting the weights properly, one can compensate for small cycles in the Tanner graph that represents the code. That is, messages sent by parity check nodes to variable nodes can be weighted, such that if a message is less reliable since it is produced by a parity check node with a large number of small cycles in its local neighborhood, then this message will be attenuated properly.

The time complexity of the deep neural network is roughly the same as the plain BP algorithm, requiring an extra multiplication for each input message. Both have the same number of layers and the same number of non-zero weights in the Tanner graph. The deep neural network architecture is illustrated in Figure 1 for a BCH(15,11) code.
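A forward pass of the weighted decoder of this section and its cross entropy loss can be sketched as follows. This is an illustrative NumPy version only (the paper trains with TensorFlow, and no gradient computation is shown here). The dense weight arrays `w_v2c` and `w_out`, the function names, the example matrix and the clipping constants are assumptions of the sketch; the channel weights $w_{i,v}$ are fixed to 1. The sign convention of $l_{v}$ follows the text, and the example matrix has even check degrees.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_bp_forward(H, llr, w_v2c, w_out, n_iter, clip=1.0 - 1e-12):
    """Forward pass of the weighted BP decoder with channel weights fixed to 1.

    w_v2c : array (n_iter, E, E); entry [i, e, e'] plays the role of w_{i,e,e'}
    w_out : array (N, E); entry [v, e'] plays the role of w_{2L+1,v,e'}
    Entries outside the trellis connectivity are masked out.
    """
    H = np.asarray(H)
    checks, variables = np.nonzero(H)
    E = len(checks)
    not_self = ~np.eye(E, dtype=bool)
    v2c_mask = (variables[:, None] == variables[None, :]) & not_self
    c2v_mask = (checks[:, None] == checks[None, :]) & not_self
    out_mask = np.arange(H.shape[1])[:, None] == variables[None, :]
    llr = np.asarray(llr, dtype=float)
    x = np.zeros(E)
    for i in range(n_iter):
        # odd layer: weighted sum of incoming messages, then tanh(./2)
        x = np.tanh(0.5 * (llr[variables] + (w_v2c[i] * v2c_mask) @ x))
        # even layer: product of the already tanh-ed messages, then 2*atanh
        prod = np.where(c2v_mask, x[None, :], 1.0).prod(axis=1)
        x = 2.0 * np.arctanh(np.clip(prod, -clip, clip))
    # output layer: weighted marginalization squashed to a probability
    return sigmoid(llr + (w_out * out_mask) @ x)

def cross_entropy(o, y, eps=1e-12):
    """Cross entropy between outputs o and transmitted bits y, averaged over bits."""
    o = np.clip(o, eps, 1.0 - eps)
    return -np.mean(y * np.log(o) + (1 - y) * np.log(1 - o))

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])            # illustrative Hamming(7,4) matrix
E, N, L = int(H.sum()), H.shape[1], 5
llr = np.array([1.0, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0])
o_ones = neural_bp_forward(H, llr, np.ones((L, E, E)), np.ones((N, E)), L)
o_chan = neural_bp_forward(H, llr, np.ones((L, E, E)), np.zeros((N, E)), L)
loss = cross_entropy(o_ones, np.zeros(N))        # zero codeword as training target
```

With all weights equal to one the network reduces to plain BP followed by a sigmoid, and with the output weights set to zero the output is just the sigmoid of the channel LLRs, which gives two simple sanity checks for an implementation.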

IV Neural min-sum decoding

The standard version of BP described above can be expensive to implement due to the repeated multiplications and hyperbolic functions used to compute the check node function. For this reason, the “min-sum” approximation is often used in practical decoder implementations. In min-sum decoding, Equations (1) and (3) are unchanged, and Equation (2) is replaced with Equation (8):

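$$x_{i,e=(v,c)}=\min_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right) \tag{8}$$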

The min-sum approximation tends to produce messages with large magnitudes, which makes the propagated information seem more reliable than it actually is, causing a BER degradation as a result. To compensate for this effect, the normalized min-sum (NMS) algorithm computes a message using the min-sum approximation, then shrinks the message magnitude using a small weight $w\in(0,1]$ , yielding Equation (9):

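$$x_{i,e=(v,c)}=w\cdot\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right)\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{9}$$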

Similar to the neural BP decoder described above, we propose to assign a learnable weight to each edge and train the decoder as a neural network, yielding a neural normalized min-sum (NNMS) decoder which generalizes the NMS decoder. The check-to-variable messages in NNMS are computed using

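$$x_{i,e=(v,c)}=w_{i,e=(v,c)}\cdot\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right)\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{10}$$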

where $w_{i,e=(v,c)}$ is the learnable weight for edge $(v,c)$ in layer $i$ . The weights serve a dual purpose: they correct for the min-sum approximation, and, like the weights in the neural BP decoder, they combat the effect of cycles in the Tanner graph.

Both the NNMS decoder and the neural BP decoder require many multiplications, which are generally expensive operations and avoided if possible in a hardware implementation. It was shown in [12] that decoders can learn to improve without using any multiplications at all by adapting the offset min-sum (OMS) algorithm. OMS decoding, like NMS decoding, shrinks a message before sending it, but by subtracting an offset from the message magnitude rather than multiplying by a weight:

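$$x_{i,e=(v,c)}=\max\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|-\beta,\,0\right)\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{11}$$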

where $\beta$ is the subtracted offset and $\max\left(\dots,0\right)$ prevents the subtraction from flipping the sign of the message. In the same way that we can generalize NMS to yield NNMS decoding, we can generalize OMS to yield neural offset min-sum (NOMS) decoding, in which check nodes compute the following message:

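$$x_{i,e=(v,c)}=\max\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|-\beta_{i,e=(v,c)},\,0\right)\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{12}$$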

where $\beta_{i,e=(v,c)}$ is the learnable offset for edge $(v,c)$ in layer $i$ .

Note that the functions computed by check nodes in the min-sum decoders are not everywhere differentiable. As a result, the gradient is not defined everywhere. Nevertheless, the functions are non-differentiable only on lower dimensional curves in the space, and are differentiable in the rest of the space. Hence we are applying standard stochastic gradient descent, as is commonly done for neural networks which use activation functions with kinks (see e.g. [22]), such as rectified linear units (ReLU) which are widely used.
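The min-sum check node rules of this section differ only in how the minimum magnitude is shrunk, which the following sketch makes explicit for a single check node. Folding the weight and the offset into one function is a convenience of this sketch (the text treats the normalized and offset variants separately), and `check_node_min_sum` is a hypothetical name. Passing one weight or offset per edge instead of a scalar gives the neural (NNMS, NOMS) variants.

```python
import numpy as np

def check_node_min_sum(x_in, weight=1.0, offset=0.0):
    """Check-to-variable messages for one check node.

    x_in   : incoming variable-to-check messages on the edges of this check
    weight : multiplicative factor w (NMS / NNMS); scalar or one value per edge
    offset : subtracted offset beta (OMS / NOMS); scalar or one value per edge
    With weight=1 and offset=0 this is the plain min-sum rule.
    """
    x_in = np.asarray(x_in, dtype=float)
    others = ~np.eye(len(x_in), dtype=bool)      # exclude the target edge itself
    # minimum magnitude and sign product over all other edges of the check
    mag = np.where(others, np.abs(x_in)[None, :], np.inf).min(axis=1)
    sgn = np.where(others, np.sign(x_in)[None, :], 1.0).prod(axis=1)
    # max(., 0) keeps the offset from flipping the sign of the message
    return weight * np.maximum(mag - offset, 0.0) * sgn

x = np.array([2.0, -0.5, 1.5])                   # messages arriving at a degree-3 check
ms = check_node_min_sum(x)                       # plain min-sum
nms = check_node_min_sum(x, weight=0.8)          # normalized min-sum
oms = check_node_min_sum(x, offset=0.7)          # offset min-sum
```

For this input the plain rule gives $(-0.5, 1.5, -0.5)$, the weight $0.8$ scales these to $(-0.4, 1.2, -0.4)$, and the offset $0.7$ erases the two weak messages and leaves $0.8$ on the second edge.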

V RNN decoding

We suggest the following parameterized deep neural network decoder, which is a constrained version of the neural BP decoder described above. We use the same trellis representation that was described above for the decoder. The difference is that now the weights of the edges in the Tanner graph are tied, i.e., they are set to be equal in each iteration. This tying transforms the feed-forward architecture that was described above into a recurrent neural network architecture which we term BP-RNN. More precisely, the equations of the proposed architecture are

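$$x_{t,e=(v,c)}=\tanh\left(\frac{1}{2}\left(w_{v}l_{v}+\sum_{e^{\prime}=(c^{\prime},v),\,c^{\prime}\neq c}w_{e,e^{\prime}}x_{t-1,e^{\prime}}\right)\right) \tag{13}$$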

where $t=1,2,\ldots$ is the iteration number, $x_{t,e=(v,c)}$ ( $x_{t,e=(c,v)}$ , respectively) denotes the BP-RNN message from variable node $v$ (check node $c$ ) to check node $c$ (variable node $v$ ) at iteration $t$ , and

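$$x_{t,e=(c,v)}=2\tanh^{-1}\left(\prod_{e^{\prime}=(v^{\prime},c),\,v^{\prime}\neq v}x_{t,e^{\prime}}\right) \tag{14}$$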

For iteration $t$ we also have

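$$o_{v,t}=\sigma\left(\tilde{w}_{v}l_{v}+\sum_{e^{\prime}=(c^{\prime},v)}\tilde{w}_{v,e^{\prime}}x_{t,e^{\prime}}\right) \tag{15}$$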

We initialize the algorithm by setting $x_{0,e}=0$ for all $e=(c,v)$ . The proposed architecture also preserves the symmetry conditions. As a result the network can be trained by using noisy versions of a single codeword. The training is done as before with a cross entropy loss function, defined in (7), at the last time step. The proposed recurrent neural network architecture has the property that after every time step we can add final marginalization and compute the loss of these terms using cross entropy. Using multiloss terms can increase the gradient update of the backpropagation through time algorithm and allow learning the earliest layers. Hence, we suggest the following multiloss variant of (7):

$$L(o,y)=-\frac{1}{N}\sum_{t=1}^{T}\sum_{v=1}^{N}y_{v}\log(o_{v,t})+(1-y_{v})\log(1-o_{v,t}) \tag{16}$$

where $o_{v,t}$ and $y_{v}=0$ are the deep neural network output at the time step $t$ and the actual $v$ th component of the transmitted codeword. This network architecture is illustrated in Figure 2. Nodes in the variable layer implement (13), while nodes in the parity layer implement (14). Nodes in the marginalization layer implement (15). The training goal is to minimize (16).
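The multiloss described above sums the per-iteration cross entropy terms, so that every unrolled time step contributes to the gradient. A minimal NumPy sketch of the loss value follows; the array layout (`o[t, v]` holding $o_{v,t}$), the function name and the clipping constant `eps` are assumptions of the sketch.

```python
import numpy as np

def multiloss(o, y, eps=1e-12):
    """Cross entropy summed over iterations t and averaged over the N bits.

    o : array (T, N) with o[t, v] = o_{v,t}, the output after iteration t
    y : array (N,) with the transmitted bits (all zero during training)
    """
    o = np.clip(np.asarray(o, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    per_bit = y * np.log(o) + (1.0 - y) * np.log(1.0 - o)   # y broadcasts over t
    return -per_bit.sum() / o.shape[1]

T, N = 5, 7
o = np.full((T, N), 0.5)             # an uninformative decoder at every iteration
y = np.zeros(N)                      # zero codeword
value = multiloss(o, y)              # equals T * log(2) for this input
last_only = multiloss(o[-1:], y)     # single-loss variant: last time step only
```

Restricting `o` to its last row recovers the loss evaluated at the final time step only.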

Similarly, the neural min-sum decoders can be constrained to use the same weights or offsets for each time step, in which case we drop the iteration index from the learnable parameters, so that Equation (10) becomes:

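$$x_{i,e=(v,c)}=w_{e=(v,c)}\cdot\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right)\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{17}$$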

and Equation (12) becomes:

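$$x_{i,e=(v,c)}=\max\left(\min_{e^{\prime}}\left|x_{i-1,e^{\prime}}\right|-\beta_{e=(v,c)},\,0\right)\prod_{e^{\prime}}\mathrm{sign}\left(x_{i-1,e^{\prime}}\right),\quad e^{\prime}=(v^{\prime},c),\ v^{\prime}\neq v \tag{18}$$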

VI Learning to relax

Another technique which can be used to improve the performance of belief propagation is the method of successive relaxation (or simply “relaxation”), as described in [23]. In relaxation, an exponentially weighted moving average is applied to combine the message sent in iteration $t-1$ with the raw message computed in iteration $t$ to yield a filtered message, $m_{t}^{\prime}$ :

$$m_{t}^{\prime}=\gamma m_{t-1}^{\prime}+(1-\gamma)m_{t} \tag{19}$$

where $\gamma$ is the relaxation factor. As $\gamma\rightarrow 0$ , the decoder becomes less relaxed, and as $\gamma\rightarrow 1$ , the decoder becomes more relaxed. When $\gamma=0$ , the decoder reverts to being a normal, non-relaxed decoder.

As it is difficult to predict the behaviour of relaxed decoders analytically, the relaxation factor is chosen through trial-and-error, that is, by simulating the decoder with several possible values of $\gamma$ and choosing the value which leads to the best performance. Rather than choosing the relaxation parameter through brute force simulation, we propose learning this parameter using stochastic gradient descent, as the relaxation operation is differentiable with respect to $\gamma$ . Moreover, it is possible to use a separate relaxation parameter for each edge of the Tanner graph, similar to the other decoder architectures described in this work, although we have found that using per-edge relaxation parameters does not improve much upon using a single parameter. To the best of the authors’ knowledge, our deep learning methodology constitutes the first technique for optimizing decoder relaxation factors which does not rely on simple trial-and-error.

The structure of one possible version of a neural decoder with relaxation is illustrated in Figure 3. In this figure, $\times$ indicates elementwise multiplication, $+$ indicates elementwise addition, and $\gamma$ indicates the learnable relaxation parameter(s). Like the decoder shown in Figure 2, the relaxed decoder is effectively an RNN, with an additional “shortcut connection” (as is found in architectures such as those described in [24] and [1]). Since the relaxation operation can be considered an IIR filter, we require that $\gamma$ be in the range [0,1), as values outside of this range would result in filters with instability or ringing in the impulse response. During training, we use the sigmoid function to squash $\gamma$ into the correct range; during inference, the squashed values can be stored in the decoder so that the sigmoid need not be computed.

Relaxation is relatively expensive compared to the other neural decoder techniques, as it requires not only multiplications and additions but also additional memory to store the previous iteration’s messages. However, as was shown in [25], it is sometimes possible to set relaxation factors to a power of two so that a multiplier-free hardware implementation results, in which case the only additional overhead is memory and additions.
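The relaxation update and the sigmoid parameterization of $\gamma$ can be sketched in a few lines. The names `relax` and `raw_param` and the numeric values are assumptions of this sketch; only the update rule $m_{t}^{\prime}=\gamma m_{t-1}^{\prime}+(1-\gamma)m_{t}$ and the use of a sigmoid during training come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relax(m_prev_filtered, m_raw, gamma):
    """Exponentially weighted moving average of the messages."""
    return gamma * m_prev_filtered + (1.0 - gamma) * m_raw

raw_param = 0.3                      # hypothetical unconstrained trainable value
gamma = sigmoid(raw_param)           # squashed into (0, 1) during training

m_prev = np.array([1.0, -2.0, 0.5])  # filtered messages from iteration t-1
m_raw = np.array([3.0, -1.0, -0.5])  # raw messages computed in iteration t
m_filtered = relax(m_prev, m_raw, gamma)

# With gamma = 0.875 = 1 - 2**-3 the update needs only a shift and additions:
m_pow2 = m_prev + 0.125 * (m_raw - m_prev)
```

Setting `gamma` to zero returns the raw message unchanged, which matches the non-relaxed decoder, and the power-of-two case shows why the multiplier can be replaced by a bit shift in hardware.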

VII An mRRD algorithm with a neural BP decoder

Dimnik and Be’ery [7] proposed a modified random redundant iterative algorithm (mRRD) for decoding HDPC codes based on the RRD [26] and the MBBP [27] algorithms. The mRRD algorithm is a close to ML, low complexity decoder for short length ($N<100$) algebraic codes such as BCH codes. This algorithm uses $m$ parallel decoder branches, each comprising $c$ applications of several (e.g., 2) BP decoding iterations, each followed by a random permutation from the automorphism group of the code, as shown in Figure 4. The decoding process in each branch stops if the decoded word is a valid codeword. The final decoded word is selected with a least metric selector (LMS) as the one for which the channel output has the highest likelihood. More details can be found in [7].
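The final selection step of mRRD can be illustrated in isolation. The sketch below assumes an AWGN channel with BPSK mapping (bit 0 to $+1$, bit 1 to $-1$), under which the candidate with the highest likelihood is the one at the smallest Euclidean distance from the channel output; the function name and the example values are hypothetical, and the branch decoders and permutations are not modeled.

```python
import numpy as np

def least_metric_select(candidates, y):
    """Return the candidate codeword whose BPSK image is closest to y.

    candidates : array (m, N) of bits, one decoded word per branch
    y          : array (N,) of channel outputs
    Assumption of this sketch: AWGN channel, mapping 0 -> +1 and 1 -> -1.
    """
    candidates = np.asarray(candidates)
    symbols = 1.0 - 2.0 * candidates                      # bits to +/-1 symbols
    dist = ((np.asarray(y, dtype=float)[None, :] - symbols) ** 2).sum(axis=1)
    return candidates[np.argmin(dist)]

branch_outputs = np.array([[0, 0, 0],
                           [1, 1, 0]])                    # toy outputs of m = 2 branches
y = np.array([0.9, -0.1, 1.1])
best = least_metric_select(branch_outputs, y)
```

Here the all-zero word is at squared distance $1.23$ from `y` and the second candidate at $4.43$, so the first branch output is selected.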

We propose to combine the BP-RNN decoding algorithm with the mRRD algorithm by replacing the BP blocks in the mRRD algorithm with our BP-RNN decoding scheme. The proposed mRRD-RNN decoder should achieve near ML performance with lower computational complexity.

VIII Experiments And Results

The training of all neural networks was performed using TensorFlow [28].

VIII-A Neural BP decoders

We first present results for the neural versions of the standard BP decoder. We apply our method to different linear codes: BCH(63,45), BCH(63,36), BCH(127,64) and BCH(127,99). In all experiments the results on the training, validation and test sets were identical. That is, we did not observe overfitting in our experiments. Details about our experiments and results are as follows. It should be noted that the parameters $w_{i,v}$ in (4) and (6), $w_{v}$ in (13) and $\tilde{w}_{v}$ in (15) were all set to $1$, since training these parameters did not yield additional improvements. Note that these weights multiply the self message from the channel, which is a reliable message, unlike the messages received from check nodes, which may be unreliable due to the presence of short cycles in the Tanner graph. Hence, this self message from the channel can be taken as is.

Training was conducted using stochastic gradient descent with mini-batches. The training data is created by transmitting the zero codeword through a binary input additive white Gaussian noise channel (BIAWGNC) with varying SNRs ranging from $1{\rm dB}$ to $8{\rm dB}$. We applied the RMSPROP [29] rule, using its implementation in [28], with a learning rate and mini-batch size that depended on the code used and on its parity check matrix. For all BCH codes with $N=63$ the learning rate was $0.001$ and the mini-batch size was $120$. For BCH(127,99) with a right-regular parity check matrix, the learning rate was $0.0003$ and the mini-batch size was $80$. For BCH(127,99) with a cycle-reduced parity check matrix and for BCH(127,64), the learning rate was $0.003$ and the mini-batch size was $40$. All other parameters were set to their default values in the TensorFlow implementation of the RMSPROP optimizer. As is well known, the training of neural networks requires extensive trial and error tuning of hyperparameters [30, Appendix A]. This was also the case in our experiments, which required proper selection of learning rates and mini-batch sizes, taking into account the memory limitations of the GPU card as well. All neural networks had $2$ hidden layers at each time step and were unfolded $5$ times, which corresponds to $5$ full iterations of the BP algorithm. At test time, we feed the network with noisy codewords obtained by transmission through a BIAWGNC and measure the bit error rate (BER) in the decoded codeword at the network output. The input $x_{t-1,e^{\prime}}$ to (13) is clipped such that the absolute value of the input is always smaller than some positive constant $A<10$. Similar clipping is typically applied in practical implementations of the BP algorithm.
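The construction of the training data can be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: the SNR is interpreted as $E_b/N_0$, one SNR value is drawn uniformly per example, bits are mapped by BPSK (0 to $+1$, 1 to $-1$), and `zero_codeword_batch` is a hypothetical helper. The LLR sign follows the definition of $l_{v}$ given earlier.

```python
import numpy as np

def zero_codeword_batch(n, rate, batch_size, snr_db_range=(1.0, 8.0), rng=None):
    """Mini-batch of channel LLRs for the transmitted all-zero codeword."""
    rng = np.random.default_rng() if rng is None else rng
    # one SNR value per example, drawn uniformly from the training range (assumption)
    snr_db = rng.uniform(*snr_db_range, size=(batch_size, 1))
    # noise standard deviation for unit-energy BPSK, with SNR read as Eb/N0
    sigma = np.sqrt(1.0 / (2.0 * rate * 10.0 ** (snr_db / 10.0)))
    # the zero codeword maps to the all +1 symbol vector
    y = 1.0 + sigma * rng.standard_normal((batch_size, n))
    llr = -2.0 * y / sigma**2            # log Pr(1|y) / Pr(0|y) for this mapping
    targets = np.zeros((batch_size, n))  # transmitted bits: always zero
    return llr, targets

rng = np.random.default_rng(0)
llr_batch, target_batch = zero_codeword_batch(n=63, rate=45 / 63,
                                              batch_size=120, rng=rng)
```

The call mirrors the BCH(63,45) setting with a mini-batch size of 120 mentioned above; the decoder would then be trained to map `llr_batch` to outputs close to `target_batch`.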

VIII-A1 BER For BCH With $N=63$

In Figures 5 and 6, we provide the bit error rate for BCH codes with $N=63$ using a right-regular parity check matrix based on [31]. As can be seen from the figures, the BP-RNN decoder outperforms the BP feed-forward (BP-FF) decoder by $0.2{\rm dB}$. Besides improving the BER, the network has fewer parameters. Moreover, the BP-RNN decoder obtains results comparable to those of the BP-FF decoder when training with multiloss. Furthermore, for the BCH(63,45) and BCH(63,36) codes there is an improvement of up to $1.3{\rm dB}$ and $1.5{\rm dB}$, respectively, over the plain BP algorithm.

In Figures 7 and 8, we provide the bit error rate for BCH codes with $N=63$ for a cycle-reduced parity check matrix [26]. For BCH(63,45) and BCH(63,36) we get an improvement of up to $0.6{\rm dB}$ and $1.0{\rm dB}$, respectively. This observation shows that the neural BP decoder is capable of improving the performance of standard BP even for cycle-reduced parity check matrices, thus answering in the affirmative the question left open in [11] regarding the performance of the neural decoder on a cycle-reduced parity check matrix. The importance of this finding is that it enables a further improvement in the decoding performance, as BP (both standard BP and the new parameterized BP algorithm) yields a lower error rate for sparser parity check matrices. However, as expected, the performance gain of the neural decoder compared to plain BP is lower for a sparser parity check matrix.

VIII-A2 BER For BCH With $N=127$

In Figure 9, we provide the bit error rate for a BCH code with $N=127$ using a right-regular parity check matrix based on [31]. As can be seen from the figure, the BP-RNN and BP-FF decoders obtain an improvement of up to $1.0{\rm dB}$ over plain BP, but the BP-RNN decoder uses significantly fewer parameters than BP-FF.

In Figures 10 and 11, we provide the bit error rate for BCH codes with $N=127$ using a cycle reduced parity check matrix based on [26]. For BCH(127,64) and BCH(127,99) we get an improvement of up to $0.9{\rm dB}$ and $1.0{\rm dB}$, respectively.

We assume that the channel is known. Plain BP also assumes a known channel in order to compute the LLRs from the channel observations. In addition, to train the neural decoder we use a range of SNRs for creating the noisy codewords that are the input to the training algorithm. To assess the robustness with respect to the training SNR range, we trained the BP-RNN decoder for the BCH(127,64) code with a cycle reduced parity check matrix, using the following three SNR ranges: $1$ to $4{\rm dB}$, $5$ to $8{\rm dB}$ and the full range $1$ to $8{\rm dB}$ (as in Figure 10). The result is shown in Figure 12. As can be seen, the performance can be further improved by properly choosing the training SNR range so that it matches the region of interest in actual test conditions.

VIII-B Neural min-sum decoders

Next, we present results for decoders which use the min-sum approximation. We trained neural min-sum decoders using the Adam optimizer [32], with multiloss and a learning rate of 0.1 for the NOMS decoders and 0.01 for the NNMS decoders. All other parameters were set to their default values in the Tensorflow implementation of the Adam optimizer.

Figure 13 and Figure 14 show the BER performance for the BCH(63,36) and BCH(63,45) codes, respectively, with non-sparsified parity check matrices. It can be seen from these plots that the decoders with multiplicative weights (BP-FF, BP-RNN, NNMS-FF, NNMS-RNN) achieve similar performance, implying that the min-sum approximation has little impact. It is also evident that decoders with multiplicative weights outperform decoders with additive offsets (NOMS-FF, NOMS-RNN), although the NOMS decoders still substantially outperform the non-neural decoders.
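To make the distinction between multiplicative weights and additive offsets concrete, the sketch below implements a textbook check node update in the min-sum approximation with an optional weight and offset. It is a simplified illustration only: a single scalar `weight` and `offset` per call are assumptions, whereas the neural decoders defined in (17) and (18) learn such parameters per edge, and the function name `min_sum_check_update` is hypothetical.

```python
import math

def min_sum_check_update(incoming, weight=1.0, offset=0.0):
    """Extrinsic check-to-variable messages for one check node.

    incoming: variable-to-check messages on the edges of this check node.
    weight:   multiplicative factor (normalized min-sum style), assumed scalar.
    offset:   additive offset subtracted from the magnitude (offset min-sum style).
    """
    out = []
    for j in range(len(incoming)):
        # Exclude the message that arrived on the edge we are sending on.
        others = incoming[:j] + incoming[j + 1:]
        sign = math.prod(1.0 if x >= 0 else -1.0 for x in others)
        magnitude = min(abs(x) for x in others)
        # The offset is clamped at zero so that it never flips the sign.
        out.append(sign * max(weight * magnitude - offset, 0.0))
    return out

msgs = [2.0, -0.5, 1.5]
plain = min_sum_check_update(msgs)                  # plain min-sum
weighted = min_sum_check_update(msgs, weight=0.5)   # multiplicative weight
shifted = min_sum_check_update(msgs, offset=0.25)   # additive offset
```

The offset variant needs only a subtraction and a comparison per message, which is the reason additive offsets are attractive in hardware despite the somewhat weaker error rate reported above.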

We also present results for an experiment with relaxed decoders. Fig. 15 compares the performance of a simple min-sum decoder with that of three relaxed min-sum decoders trained using the Adam optimizer with a learning rate of 0.01. All relaxed decoders outperform the simple min-sum decoder, achieving a coding gain similar to some of the other decoders presented here.

The evolution of the relaxation parameter $\gamma$ as training proceeds is plotted in Fig. 16. It can be seen in this figure that the first relaxed decoder learns a relaxation factor of roughly 0.863. However, note that 0.863 is close to 0.875, and by constraining the second relaxed decoder to use $\gamma=0.875$ instead of 0.863, we can rewrite (19) as follows:

$$m_{t}^{\prime}=m_{t-1}^{\prime}+0.125\left(m_{t}-m_{t-1}^{\prime}\right)$$

which can be implemented very efficiently in hardware, since 0.125 is a power of two ( $2^{-3}$ ) and requires no multiplier. As Fig. 15 shows, constraining the learned relaxation factor in this case causes a nearly imperceptible increase in BER. The third decoder is a relaxed NOMS decoder; it achieves even better performance than the other two relaxed decoders, showing that relaxation can be successfully combined with the learnable decoder building blocks described in [11] and [12].
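The multiplier-free claim can be checked with a short sketch. It assumes that the relaxation update (19) has the usual successive relaxation form, in which the relaxed message is $\gamma$ times the previous relaxed message plus $(1-\gamma)$ times the newly computed message; the function names and the fixed-point representation are hypothetical.

```python
def relax(prev, new, gamma):
    """Successive relaxation (assumed form): blend previous and new message."""
    return gamma * prev + (1.0 - gamma) * new

def relax_shift(prev_fx, new_fx):
    """Same update for gamma = 0.875 on fixed-point integers.

    Since 1 - 0.875 = 2**-3, the product becomes an arithmetic right shift by 3,
    leaving one subtraction, one shift and one addition.
    """
    return prev_fx + ((new_fx - prev_fx) >> 3)

# Floating-point reference and shift-based version agree on this example.
reference = relax(16.0, -8.0, 0.875)
shifted = relax_shift(16, -8)
```

Note that the right shift rounds toward negative infinity, so in general the fixed-point result can differ from the exact value by less than one least significant bit.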

Fig. 16 shows that the parameter $\gamma$ attains its final value after 400 minibatches of 120 frames each have been processed. Since the number of operations performed during the forward pass of training (belief propagation) is roughly equal to the number of operations performed during the backward pass of training (backpropagation), training the decoder in this case requires processing the equivalent of $400\times 120\times 2=$ 96,000 frames. In contrast, simulating the decoder until 100 frame errors have occurred at an SNR of $8{\rm dB}$ requires processing 1,064,040 frames. Naturally, testing the decoder with multiple different candidate values for $\gamma$ in search of the optimal value will require processing even more frames. In deep learning, gradient descent is usually applied to neural networks with a very large number of parameters. Our results for the relaxed min-sum decoder show that gradient-based learning can be a very efficient way of optimizing decoder parameters even when the number of parameters is small.

VIII-C mRRD-RNN and mRRD-NOMS

Finally, we provide the bit error rate results when using our proposed mRRD-RNN (BP version) decoder and when using a similar mRRD-NOMS decoder applied to a BCH(63,36) code represented by a cycle reduced parity check matrix based on [26]. In the experiments we use the BP-RNN with multiloss architecture and an unfold of 5, which corresponds to 5 BP iterations. The parameters of the mRRD-RNN are as follows. We use 2 iterations for each ${\rm BP}_{i,j}$ block in Figure 4, a value of $m=1,3,5,50$ , denoted in the following by mRRD-RNN( $m$ ), and a value of $c=30$ ( $c=50$ , respectively) when $m=1,3,5$ ( $m=50$ ). We also experimented with a similar mRRD-NOMS( $m$ ) algorithm, using the NOMS-RNN (Equation (18)), and the same setting of parameters as in the mRRD-RNN algorithm.

In Figure 17 we present the bit error rates for mRRD-RNN(1), mRRD-RNN(3), mRRD-RNN(5) and mRRD-RNN(50), and compare them to hard decision decoding (HDD) and to plain mRRD with the same parameters. As can be seen, we achieve improvements of $0.6$ dB, $0.3$ dB and $0.2$ dB compared to plain mRRD for $m=1,3,5$, respectively. Hence, the mRRD-RNN decoder can improve on the plain mRRD decoder. The performance of the ML decoder was estimated using the implementation of [33] based on the ordered statistics decoder (OSD) algorithm [34] (see also [35]). Note that by increasing the value of $m$, the gap to ML performance decreases towards zero.

Figure 18 compares the average number of BP iterations for the various decoders using plain mRRD and mRRD-RNN. As can be seen, there is a small increase in the complexity of up to 8% when using the RNN decoder. However, overall, with the RNN decoder one can achieve the same error rate with a significantly smaller computational complexity due to the reduction in the required value of $m$ . This improvement decreases when $m$ increases.

Figures 19 and 20 present a similar comparison between the mRRD and the mRRD-NOMS decoders. The NOMS decoder used is described by a modified version of Equation (18), which is now multiplied by a fixed weight of $1/2$ as in (9). The motivation for using this fixed attenuation is the same as in the NMS algorithm (see the motivating argument for (9) above). As can be seen, the mRRD-NOMS decoder improves the corresponding mRRD decoder with the same parameters both with respect to error rate and with respect to decoding time, throughout the SNR region that was examined.

IX Complexity and Comparison with other methods

When the block length is sufficiently large, LDPC codes under BP decoding can be used for efficient reliable communication, very close to channel capacity. However, as the block length decreases, other approaches can perform better. In particular, when transmitting over the BIAWGNC, BCH codes are very close to the best possible error rate for the given channel, block length and rate [33, 35].

In order to decode a BCH or any other linear code, one can use the OSD algorithm [34]. This algorithm requires a Gaussian elimination stage, followed by an exhaustive search over $\sum_{i=0}^{d}{K\choose i}$ possible codewords whose distances to the received channel observation vector need to be computed. Here, $K$ is the number of information bits and $d$ is a parameter of the algorithm. The OSD algorithm is computationally efficient for some channels, such as the BIAWGNC, and less efficient for other channels, such as the binary symmetric channel (BSC), which is important for applications such as coding for memories. This is due to the fact that computationally efficient decoding using OSD requires soft channel information. Another difficulty with the OSD is that the Gaussian elimination is difficult to implement efficiently in hardware for low latency communications since it is an inherently serial algorithm.
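To give a feel for the size of the OSD search, the following sketch evaluates $\sum_{i=0}^{d}{K\choose i}$ for a few values of $K$ and $d$. The function name `osd_candidates` and the chosen example values are illustrative only.

```python
from math import comb

def osd_candidates(k, order):
    """Number of candidate codewords examined by order-`order` OSD with k information bits."""
    return sum(comb(k, i) for i in range(order + 1))

# Candidate counts for the information lengths of BCH(63,36), BCH(63,45), BCH(127,64).
for k in (36, 45, 64):
    print(k, [osd_candidates(k, d) for d in range(4)])
```

The count grows roughly as $K^{d}/d!$, so each increase of $d$ multiplies the search effort by a factor on the order of $K/d$, in addition to the Gaussian elimination stage.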

The RRD [26], MBBP [27] and mRRD [7] algorithms have been suggested as alternative low complexity, close to ML algorithms for HDPC codes such as BCH codes. In this work we have shown improvements compared to the mRRD algorithm. Both the plain mRRD and our neural mRRD decoders can be easily implemented in hardware, since the basic operation that needs to be performed is either (neural) BP or (neural) min-sum decoding. Consider the NOMS decoder with parameter tying described by Equations (1), (12) and (3). As is well known, in order to implement (1) efficiently, one first computes

[TABLE]

for each variable node, $v$ , in the Tanner graph. Then, for each edge $e=(v,c)$ , $x_{i,e}$ is obtained using

[TABLE]

Thus, in each iteration, efficient computation of (1) requires about $E$ summations. A similar idea can be applied in order to implement (12) efficiently, so that $O(E)$ operations are required in each iteration (with no multiplications).
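The two-step computation can be sketched as follows, assuming that (1) is the standard variable node update in which the outgoing message on edge $e=(v,c)$ is the channel LLR of $v$ plus all incoming check-to-variable messages except the one on $e$ itself. The data layout (a list of `(v, c)` edges and a dictionary of messages) and the function names are hypothetical.

```python
def variable_update_naive(llr, edges, msgs):
    """Direct evaluation: for each edge, sum over all other edges of the same variable node."""
    out = {}
    for (v, c) in edges:
        out[(v, c)] = llr[v] + sum(
            msgs[(v2, c2)] for (v2, c2) in edges if v2 == v and c2 != c
        )
    return out

def variable_update_fast(llr, edges, msgs):
    """Two-step evaluation: one total per variable node, then one subtraction per edge."""
    total = list(llr)
    for (v, c) in edges:
        total[v] += msgs[(v, c)]   # about E additions in total
    return {(v, c): total[v] - msgs[(v, c)] for (v, c) in edges}

# Tiny example graph with 3 variable nodes and 2 check nodes.
llr = [0.5, -1.0, 2.0]
edges = [(0, 0), (0, 1), (1, 0), (2, 0), (2, 1)]
msgs = {(0, 0): 1.0, (0, 1): -0.5, (1, 0): 0.25, (2, 0): -2.0, (2, 1): 0.75}
fast = variable_update_fast(llr, edges, msgs)
naive = variable_update_naive(llr, edges, msgs)
```

Both functions return the same messages; the second touches each edge a constant number of times, which is what gives the $O(E)$ cost per iteration.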

Note that in the definition of the BP-RNN decoder, Equation (13), we used the weights $w_{e,e^{\prime}}$ , while in the definitions of the neural min-sum decoders, (17) and (18), we used $w_{e}$ and $\beta_{e}$ . In fact, it is possible to use $w_{e^{\prime}}$ instead of $w_{e,e^{\prime}}$ also in (13), and our experience shows that the error rate obtained is about the same. In this case, the same efficient computation method indicated above for the NOMS decoder can also be used for the BP-RNN decoder, and the computational complexity remains $O(E)$ . However, using the NOMS decoder is advantageous since it does not require multiplications.

Finally, we note that in order to scale our decoders to longer block length codes, one can use the polar-concatenated scheme in [36, 37, 38, 39]. According to this approach, one can construct powerful longer block length codes from shorter constituent codes (e.g., BCH codes with block length 64), and decode them efficiently using computationally efficient close to ML decoders for the shorter constituent codes. A similar scaling approach for the decoding of polar codes was used in [15].

X Conclusion

We introduced neural architectures for decoding linear block codes. These architectures yield significant improvements over the standard BP and min-sum decoders. Furthermore, we showed that the neural network decoder improves on standard BP even for cycle reduced parity check matrices, with improvements of up to $1.5{\rm dB}$ in the SNR. We also showed performance improvement of the mRRD algorithm with the new RNN architecture. We regard this work as a first step towards the design of deep neural network-based decoding algorithms.

The decoders we introduce in this work offer a trade-off between error-correction performance and implementation complexity. For instance, while adders are more hardware-friendly than multipliers, decoders with additive offsets may not perform quite as well as those with multiplicative weights. Relaxed decoders outperform non-relaxed decoders, but relaxation requires additional memory to store a previous iteration’s messages. Which decoder one chooses depends on the needs of the application. In every instance, the use of machine learning improves the performance of the decoder.

Our future work includes possible improvements in performance by exploring new neural network architectures. Moreover, we will investigate end-to-end learning of the mRRD algorithm (i.e., learning the graph with permutation), and fine-tune the parameters of the mRRD-RNN algorithm. Another direction is the consideration of an RNN architecture with quantized weights in order to reduce the number of free parameters. It has been shown in the past [40, 41] that in various applications the loss in performance incurred by weight quantization can be small if this quantization is performed properly. Finally, in some communication systems, e.g. [16], it is not possible to accurately model the channel. We believe that our proposed methods may be useful in these scenarios. It would also be interesting to consider the case where the channel used in training may deviate from the actual channel in test conditions.

ACKNOWLEDGMENT

We thank Jacob Goldberger for his comments on our work, Johannes Van Wonterghem and Joseph J. Boutros for making their OSD software available to us. We also thank Ilan Dimnik, Ofir Liba, Ethan Shiloh and Ilia Shulman for helpful discussion and their support to this research, and Gianluigi Liva for sharing with us his OSD Matlab package.

This research was supported by the Israel Science Foundation, grant no. 1082/13. The Tesla K40c used for this research was donated by the NVIDIA Corporation.

Bibliography


[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[2] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 2015, pp. 1412–1421.
[3] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6645–6649.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[5] R. G. Gallager, Low Density Parity Check Codes. Cambridge, Massachusetts: M.I.T. Press, 1963.
[6] J. Jiang and K. R. Narayanan, “Iterative soft-input soft-output decoding of Reed-Solomon codes by adapting the parity-check matrix,” IEEE Transactions on Information Theory, vol. 52, no. 8, pp. 3746–3756, 2006.
[7] I. Dimnik and Y. Be’ery, “Improved random redundant iterative HDPC decoding,” IEEE Transactions on Communications, vol. 57, no. 7, pp. 1982–1985, 2009.
[8] A. Yufit, A. Lifshitz, and Y. Be’ery, “Efficient linear programming decoding of HDPC codes,” IEEE Transactions on Communications, vol. 59, no. 3, pp. 758–766, 2011.
