Reinforcement Learning for Channel Coding: Learned Bit-Flipping Decoding

Fabrizio Carpi; Christian Häger; Marco Martalò; Riccardo Raheli; Henry D. Pfister

arXiv:1906.04448·cs.IT·December 10, 2019


TL;DR

This paper introduces a reinforcement learning approach to optimize bit-flipping decoding strategies for binary linear codes, achieving improved performance and efficiency over traditional heuristics.

Contribution

It presents a novel method of applying reinforcement learning to decode binary linear codes, replacing heuristic decisions with data-driven strategies.

Findings

01 Learned decoders outperform traditional heuristics.

02 Learned decoders achieve near-optimal decoding performance in some cases.

03 Biasing the learning process towards correct decisions speeds up convergence.


Tables

Table I: Neural network parameters

layer	input	hidden	output
number of neurons	$M$	$500$ / $1500$	$N$
activation function	-	ReLU	linear

Equations

$$\pi^{*}(s)=\operatorname*{arg\,max}_{a\in\mathcal{A}}Q(s,a). \tag{1}$$

$$a=\begin{cases}\text{unif. random over }\mathcal{A} & \text{w.p. }\varepsilon\\ \operatorname*{arg\,max}_{a}Q(s,a) & \text{w.p. }1-\varepsilon.\end{cases} \tag{2}$$

$$Q(s,a)=\sum_{s^{\prime}}P(s^{\prime}|s,a)\left(R(s,a,s^{\prime})+\gamma\max_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right). \tag{3}$$

$$\mathcal{L}_{\mathcal{D}}(\theta)=\sum_{(s,a,r,s^{\prime})\in\mathcal{D}}\left(r+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{\theta}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2}. \tag{4}$$

$$\operatorname*{arg\,max}_{\bm{c}\in\mathcal{C}}\prod_{n=1}^{N}P_{Y_{n}|C_{n}}(y_{n}|c_{n})=\operatorname*{arg\,max}_{\bm{c}\in\mathcal{C}}\sum_{n=1}^{N}(-1)^{c_{n}}\lambda_{n}, \tag{5}$$

$$\lambda_{n}\triangleq\ln\frac{P_{Y_{n}|C_{n}}(y_{n}|0)}{P_{Y_{n}|C_{n}}(y_{n}|1)} \tag{6}$$

$$\operatorname*{arg\,max}_{\bm{e}:\,\bm{z}+\bm{e}\in\mathcal{C}}\sum_{n=1}^{N}(-1)^{z_{n}}(-1)^{e_{n}}\lambda_{n} \tag{7}$$

$$=\operatorname*{arg\,max}_{\bm{e}:\,\bm{z}+\bm{e}\in\mathcal{C}}\sum_{n=1}^{N}(-1)^{e_{n}}|\lambda_{n}| \tag{8}$$

$$=\operatorname*{arg\,max}_{\bm{e}:\,\mathbf{H}\bm{e}=\bm{s}}\sum_{n=1}^{N}(-1)^{e_{n}}|\lambda_{n}| \tag{9}$$

$$=\operatorname*{arg\,max}_{\bm{e}:\,\mathbf{H}\bm{e}=\bm{s}}\sum_{n=1}^{N}-e_{n}|\lambda_{n}| \tag{10}$$

$$\operatorname*{arg\,max}_{\tau,a_{1},\dots,a_{\tau}:\,\sum_{t=1}^{\tau}\bm{h}_{a_{t}}=\bm{s}}\sum_{t=1}^{\tau}-|\lambda_{a_{t}}|, \tag{11}$$

$$R(s,a,s^{\prime})=\begin{cases}-c|\lambda_{a}|+1 & \text{if }\bm{s}^{\prime}=\bm{0}\\ -c|\lambda_{a}| & \text{otherwise,}\end{cases} \tag{12}$$

$$R(s,a,s^{\prime})=\begin{cases}-\frac{1}{T}+1 & \text{if }\bm{s}^{\prime}=\bm{0}\\ -\frac{1}{T} & \text{otherwise.}\end{cases} \tag{13}$$

$$a=\begin{cases}\text{unif. random over }\mathcal{A} & \text{w.p. }\varepsilon\\ \text{unif. random over }\operatorname{supp}(\bm{e}) & \text{w.p. }\varepsilon_{\mathrm{g}}\\ \operatorname*{arg\,max}_{a}Q(s,a) & \text{w.p. }1-\varepsilon-\varepsilon_{\mathrm{g}}.\end{cases} \tag{14}$$

$$T(\bm{v})=\mathbf{A}\bm{v}+\bm{b}, \tag{15}$$

$$|\lambda_{n}|=\log\frac{1-p_{n}}{p_{n}}, \tag{16}$$


Code & Models

Repositories

fabriziocarpi/RLdecoding (TensorFlow, official)


Full text

Reinforcement Learning for Channel Coding:

Learned Bit-Flipping Decoding

Fabrizio Carpi$^{1}$, Christian Häger$^{2}$, Marco Martalò$^{3}$, Riccardo Raheli$^{3}$, and Henry D. Pfister$^{4}$

$^{1}$Department of Electrical and Computer Engineering, New York University, Brooklyn, New York, USA

$^{2}$Department of Electrical Engineering, Chalmers University of Technology, Gothenburg, Sweden

$^{3}$Department of Engineering and Architecture, University of Parma, Parma, Italy

$^{4}$Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina, USA

This work was done while F. Carpi was a student at University of Parma and was visiting Duke University. Preliminary results appeared in the thesis [1]. The work of C. Häger was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant No. 749798. The work of H. D. Pfister was supported in part by the National Science Foundation (NSF) under Grant No. 1718494. Any opinions, findings, conclusions, and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these sponsors. Please send correspondence to [email protected].

Abstract

In this paper, we use reinforcement learning to find effective decoding strategies for binary linear codes. We start by reviewing several iterative decoding algorithms that involve a decision-making process at each step, including bit-flipping (BF) decoding, residual belief propagation, and anchor decoding. We then illustrate how such algorithms can be mapped to Markov decision processes allowing for data-driven learning of optimal decision strategies, rather than basing decisions on heuristics or intuition. As a case study, we consider BF decoding for both the binary symmetric and additive white Gaussian noise channel. Our results show that learned BF decoders can offer a range of performance–complexity trade-offs for the considered Reed–Muller and BCH codes, and achieve near-optimal performance in some cases. We also demonstrate learning convergence speed-ups when biasing the learning process towards correct decoding decisions, as opposed to relying only on random explorations and past knowledge.

I Introduction

The decoding of error-correcting codes can be cast as a classification problem and solved using supervised machine learning. The general idea is to regard the decoder as a parameterized function (e.g., a neural network) and learn good parameter configurations with data-driven optimization [2, 3, 4, 5, 6, 7]. Without further restrictions on the code, this only works well for short codes and typically becomes ineffective for unstructured codes with more than a few hundred codewords. For linear codes, the problem simplifies considerably because one has to learn only a single decision region instead of one region per codeword. One can take advantage of linearity by using message-passing [4] or syndromes [5, 6]. Still, the problem remains challenging because good codes typically have complicated decision regions due to the large number of neighboring codewords. Near-optimal performance of learned decoders in practical regimes has been demonstrated, e.g., for convolutional codes [7], which possess even more structure.

In this paper, we study the decoding of binary linear block codes from a machine-learning perspective. Rather than learning a direct mapping from observations to estimated codewords (or bits) in a supervised fashion, the decoding is done in steps based on individual bit-flipping (BF) decisions. This allows us to map the problem to a Markov decision process (MDP) and apply reinforcement learning (RL) to find good decision strategies. Following [5, 6], our approach is syndrome-based and the state space of the MDP is formed by all possible binary syndromes, where bit-wise reliability information can be included for general memoryless channels. This effectively decouples the decoding problem from the transmitted codeword.

BF decoding has been studied extensively in the literature and is covered in many textbooks on modern coding theory, see, e.g., [8, 9, 10, 11, 12, 13], [14, Ch. 10.7]. Despite its ubiquitous use, and to the best of our knowledge, the learning approach to BF decoding presented in this paper is novel. In fact, with the exception of the recent work in [15], we were unable to find references that discuss RL for channel coding. Thus, we briefly review some other iterative decoding algorithms, based on sequential decision-making steps, for which RL is applicable. For a comprehensive survey of RL in the general context of communications, see [16].

II Channel Coding Background

Let $\mathcal{C}$ be an $(N,K)$ binary linear code defined by an $M\times N$ parity-check (PC) matrix $\mathbf{H}$ , where $N$ is the code length, $K$ is the code dimension, and $M\geq N-K$ . The code is used to encode messages into codewords $\bm{c}=\left(c_{1},...,c_{N}\right)^{\intercal}$ , which are then transmitted over the additive white Gaussian noise (AWGN) channel according to $y_{n}=(-1)^{c_{n}}+w_{n}$ , where $y_{n}$ is the $n$ -th component in the received vector $\bm{y}=\left(y_{1},...,y_{N}\right)^{\intercal}$ , $w_{n}\sim\mathcal{N}(0,(2RE_{\mathrm{b}}/N_{0})^{-1})$ , $R\triangleq K/N$ is the code rate, and we refer to $E_{\mathrm{b}}/N_{0}$ as the signal-to-noise ratio (SNR). The vector of hard-decisions is denoted by $\bm{z}=\left(z_{1},...,z_{N}\right)^{\intercal}$ , i.e., $z_{n}$ is obtained by mapping the sign of $y_{n}$ according to $+1\to 0$ , $-1\to 1$ . If the decoding is based only on the hard-decisions $\bm{z}$ , this scenario is equivalent to transmission over the binary symmetric channel (BSC).
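The channel model above can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the function names and the small $(7,4)$ Hamming parity-check matrix are assumptions chosen only to keep the example short.

```python
import math
import random

def awgn_channel(codeword, ebn0_db, rate, rng):
    """Transmit bits according to y_n = (-1)^{c_n} + w_n."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    # Noise variance is (2 R Eb/N0)^{-1}, as in the text.
    sigma = math.sqrt(1.0 / (2.0 * rate * ebn0))
    return [(-1.0) ** c + rng.gauss(0.0, sigma) for c in codeword]

def hard_decisions(y):
    """Map the sign of y_n to a bit: +1 -> 0, -1 -> 1."""
    return [0 if yn >= 0 else 1 for yn in y]

def syndrome(H, z):
    """Compute s = H z over F_2, with H given as a list of rows."""
    return [sum(h * b for h, b in zip(row, z)) % 2 for row in H]

# Hypothetical example code: (7,4) Hamming code (not one of the paper's codes).
H_HAMMING = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

rng = random.Random(0)
c = [0] * 7                                   # all-zero codeword
y = awgn_channel(c, ebn0_db=4.0, rate=4 / 7, rng=rng)
z = hard_decisions(y)                         # BSC-equivalent observation
s = syndrome(H_HAMMING, z)                    # depends only on the error pattern
```

Because the code is linear, the syndrome of $\bm{z}$ depends only on the channel errors, which is why simulating the all-zero codeword is sufficient in this sketch.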

II-A Decision Making in Iterative Decoding Algorithms

In the following, we briefly review several iterative decoding algorithms that involve a decision-making process at each step.

II-A1 Bit-Flipping Decoding

The general idea behind BF decoding is to construct a suitable metric that allows the decoder to rank the bits based on their reliability given the code constraints [14, Ch. 10.7]. In its simplest form, BF uses the hard-decision output $\bm{z}$ and iteratively looks for the bit that, after flipping it, would maximally reduce the number of currently violated PC equations. Pseudocode for standard BF decoding is provided in Alg. 1, where $\bm{e}_{n}\in\mathbb{F}_{2}^{N}$ is a standard basis vector whose $n$-th component is $1$ and all other components are $0$, $\mathbb{F}_{2}\triangleq\{0,1\}$, and $[N]\triangleq\{1,2,\dots,N\}$. BF can be extended to general memoryless channels by including weights and thresholds to decide which bits to flip at each step. This is referred to as weighted BF (WBF) decoding, see, e.g., [8, 9, 10, 11, 12, 13], [14, Ch. 10.8] and references therein.
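Since Alg. 1 is not reproduced here, the following is a hedged sketch of the greedy rule described above: at each step, flip the bit that leaves the fewest violated PC equations. The function names, the tie-breaking rule (lowest index wins), and the $(7,4)$ Hamming example are assumptions for illustration.

```python
def syndrome(H, z):
    """Compute s = H z over F_2, with H given as a list of rows."""
    return [sum(h * b for h, b in zip(row, z)) % 2 for row in H]

def bit_flipping_decode(H, z, max_iters=10):
    """Greedy hard-decision bit flipping on the word z."""
    z = list(z)
    for _ in range(max_iters):
        s = syndrome(H, z)
        if not any(s):
            break  # all parity checks satisfied

        def violated_after_flip(n):
            # Flipping bit n adds column n of H to the syndrome.
            return sum(s_m ^ row[n] for s_m, row in zip(s, H))

        # Ties are broken in favor of the lowest bit index (an assumption).
        best = min(range(len(z)), key=violated_after_flip)
        z[best] ^= 1
    return z

# Hypothetical example code: (7,4) Hamming code.
H_HAMMING = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

received = [0, 0, 0, 0, 1, 0, 0]        # all-zero codeword with one bit error
decoded = bit_flipping_decode(H_HAMMING, received)
```

For this small code every single error is corrected in one flip, because all columns of the parity-check matrix are distinct and nonzero.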

II-A2 Residual Belief Propagation

Belief propagation (BP) is an iterative algorithm where messages are passed along the edges of the Tanner graph representation of the code. In general, it is known that sequential message-passing schedules can lead to faster convergence than standard flooding schedules where multiple messages are updated in parallel. Residual BP (RBP) [17] is a particular instance of a sequential updating approach without a predetermined schedule. Instead, the message order is decided dynamically, where the decisions are based on the residual—defined as the norm of the difference between the current message and the message in the previous iteration. The residual is a measure of importance or “expected progress” associated with sending the message. In the context of decoding, various extensions of this idea have been investigated under the name of informed dynamic scheduling [18].

II-A3 Anchor Decoding

Consider the iterative decoding of product codes$^{1}$ over the BSC, where the component codes are iteratively decoded in some fixed order. For this algorithm, undetected errors in the component codes, so-called miscorrections, significantly affect the performance by introducing additional errors into the iterative decoding process. To address this problem, anchor decoding (AD) was recently proposed in [19]. The AD algorithm exploits conflicts due to miscorrections where two component codes disagree on the value of a bit. After each component decoding, a decision is made based on the number of conflicts whether the decoding outcome is indeed reliable. This can lead to backtracking previous component decoding outcomes and to the designation of reliable component codes as anchors.

$^{1}$ Given a linear code $\mathcal{C}$ of length $n$, the product code of $\mathcal{C}$ is the set of all $n\times n$ arrays such that each row and column is a codeword in $\mathcal{C}$.

II-B Decision Making Through Data-Driven Learning

While the above decoding algorithms appear in seemingly different contexts, the sequential decision-making strategies in the underlying iterative processes are quite similar. Decisions are typically made in a greedy fashion based on some heuristic metric that assesses the quality of each possible action. As concrete examples for this metric, we have

• the decrease in the number of violated PC equations in BF decoding, measuring the reliability of bits;

• the residual in RBP, measuring expected progress and the importance of sending messages;

• the number of conflicts in AD, measuring the likelihood of being miscorrected.

In the next section, we review MDPs which provide a mathematical framework for modeling decision-making in deterministic or random environments. MDPs can be used to obtain optimal decision-making strategies, effectively replacing heuristics with data-driven learning of optimal metrics.

III Markov Decision Processes

A time-invariant MDP is a Markov random process $S_{0}$ , $S_{1}$ , $\dots$ whose state transition probability $P(s^{\prime}|s,a)\triangleq\mathbb{P}(S_{t+1}=s^{\prime}|S_{t}=s,A_{t}=a)$ is affected by the action $A_{t}$ taken by an agent based only on knowledge of past events. Here, $s,s^{\prime}\in\mathcal{S}$ and $a\in\mathcal{A}$ , where $\mathcal{S}$ and $\mathcal{A}$ are finite sets containing all possible states and actions. The agent also receives a reward $R_{t}=R(S_{t},A_{t},S_{t+1})$ which depends only on the states $S_{t}$ , $S_{t+1}$ and the action $A_{t}$ . The agent’s decision-making process is formally described by a policy $\pi:\mathcal{S}\to\mathcal{A}$ , mapping observed states to actions. The goal is to find an optimal policy $\pi^{*}$ that returns the best action for each possible state in terms of the total expected discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\right]$ , where $0<\gamma<1$ is the discount factor for future rewards.

If the transition and reward probabilities are known, dynamic programming can be used to compute optimal policies. If this is not the case, optimal policies can still be discovered through repeated interactions with the environment, assuming that the states and rewards are observable. This is known as RL. In the following, we describe two RL algorithms which will be used in the next sections.

III-A Q-learning

The most straightforward instance of RL is called Q-learning [20], where the optimal policy is defined in terms of the Q-function $Q:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ according to

$$\pi^{*}(s)=\operatorname*{arg\,max}_{a\in\mathcal{A}}Q(s,a). \tag{1}$$

The Q-function measures the quality of actions and is formally defined as the expected discounted future reward when being in state $s$ , taking action $a$ , and then acting optimally. The key advantage of the Q-function is that it can be iteratively estimated from observations of any “sufficiently-random” agent. Pseudocode for Q-learning is given in Alg. 2, where a popular choice for generating the actions in line 5 is

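$$a=\begin{cases}\text{unif. random over }\mathcal{A} & \text{w.p. }\varepsilon\\ \operatorname*{arg\,max}_{a}Q(s,a) & \text{w.p. }1-\varepsilon.\end{cases} \tag{2}$$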

This is referred to as $\varepsilon$ -greedy exploration. For any $0<\varepsilon<1$ , this strategy is sufficient to allow Q-learning to eventually explore the entire state/action space. In the next section, we also describe an alternative exploration strategy for our application that can converge faster than $\varepsilon$ -greedy exploration.

To motivate the update equation in line 7 of Alg. 2, we note that the Q-function can be recursively expressed as

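$$Q(s,a)=\sum_{s^{\prime}}P(s^{\prime}|s,a)\left(R(s,a,s^{\prime})+\gamma\max_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right). \tag{3}$$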

This expression forms the theoretical basis for Q-learning, which converges to the true Q-function under certain conditions.$^{2}$ For more details, we refer the reader to [20, 21].

$^{2}$ For example, if $R(s,a,s^{\prime})$ depends non-trivially on $s^{\prime}$, then $\alpha$ must decay to zero at a sufficiently slow rate.
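A minimal sketch of tabular Q-learning with $\varepsilon$-greedy exploration follows. It is not a transcription of Alg. 2 (which is not reproduced here): it uses the standard update $Q(s,a)\leftarrow Q(s,a)+\alpha\,(r+\gamma\max_{a^{\prime}}Q(s^{\prime},a^{\prime})-Q(s,a))$, and the function names and the convention that terminal states have zero future value are assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps, rng):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal):
    """One sample-based update towards the recursive Q-function expression."""
    if terminal:
        target = r  # assumption: no future reward after a terminal state
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Tiny usage example with hypothetical state labels.
Q = defaultdict(float)
actions = [0, 1]
q_update(Q, "s0", 1, 1.0, "s1", actions, alpha=0.1, gamma=0.99, terminal=True)
greedy_action = epsilon_greedy(Q, "s0", actions, eps=0.0, rng=random.Random(0))
```

Storing the table in a `defaultdict` means unseen state/action pairs start at zero, which avoids allocating all $|\mathcal{S}|\times|\mathcal{A}|$ entries up front.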

III-B Fitted Q-learning with Function Approximators

For standard Q-learning, one must store a table of $|\mathcal{S}|\times|\mathcal{A}|$ real values. This will be infeasible if either set is prohibitively large. The idea of fitted Q-learning is to learn a low-complexity approximation of $Q(s,a)$ [21]. Let $Q_{\theta}(s,a)$ be an approximation of the Q-function, parameterized by $\theta$ . Fitted Q-learning alternates between simulating the MDP and updating the current parameters to obtain a better estimate of the Q-function. In particular, assume that we have simulated and stored $B$ transition tuples $(s,a,r,s^{\prime})$ in a set $\mathcal{D}$ . Then, updating the parameters $\theta$ is based on reducing the empirical loss

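$$\mathcal{L}_{\mathcal{D}}(\theta)=\sum_{(s,a,r,s^{\prime})\in\mathcal{D}}\left(r+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{\theta}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2}. \tag{4}$$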

Pseudocode for fitted Q-learning is provided in Alg. 3, where gradient descent is used to update the parameters $\theta$ based on the loss (4). It is now common to choose $Q_{\theta}(s,a)$ to be a (deep) neural network (NN), in which case $\theta$ are the network weights and fitted Q-learning is called deep Q-learning.
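The empirical loss over a batch of stored transitions can be sketched as follows. This is an illustration under assumptions: `q_fn` stands for any approximator $Q_{\theta}(s,a)$, the `is_terminal` callback and the zero bootstrap value at terminal states are conventions added here, and no gradient step is shown.

```python
def fitted_q_loss(q_fn, batch, actions, gamma, is_terminal):
    """Sum of squared temporal-difference errors over transitions (s, a, r, s')."""
    loss = 0.0
    for s, a, r, s_next in batch:
        target = r
        if not is_terminal(s_next):
            # Bootstrap with the best action value in the next state.
            target += gamma * max(q_fn(s_next, b) for b in actions)
        loss += (target - q_fn(s, a)) ** 2
    return loss

# Hypothetical usage: states are syndrome tuples, the all-zero one is terminal.
def zero_q(state, action):
    return 0.0

batch = [((1, 0), 0, 1.0, (0, 0)), ((1, 1), 1, 0.0, (1, 0))]
loss = fitted_q_loss(zero_q, batch, actions=[0, 1], gamma=0.99,
                     is_terminal=lambda state: not any(state))
```

In deep Q-learning this scalar would be differentiated with respect to the network weights, typically with the target term held fixed.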

IV Case Study: Bit-Flipping Decoding

In this section, we describe how BF decoding can be mapped to an MDP. In general, this mapping involves multiple design choices that affect the results. We therefore also comment on alternative choices and highlight some potential pitfalls that we encountered during this process.

IV-A Theoretical Background

We start by reviewing the standard maximum-likelihood (ML) decoding problem for a binary linear code $\mathcal{C}\subseteq\mathbb{F}_{2}^{N}$ over general discrete memoryless channels. The resulting optimization problem forms the basis for the reward function that is used in the MDP. To that end, consider a collection of $N$ discrete memoryless channels described by conditional probability density functions $\{P_{Y_{n}|C_{n}}(y_{n}|c_{n})\}_{n\in[N]}$ , where $c_{n}\in\mathbb{F}_{2}$ is the $n$ -th code bit and $y_{n}$ is the $n$ -th channel observation. The ML decoding problem can be written as

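$$\operatorname*{arg\,max}_{\bm{c}\in\mathcal{C}}\prod_{n=1}^{N}P_{Y_{n}|C_{n}}(y_{n}|c_{n})=\operatorname*{arg\,max}_{\bm{c}\in\mathcal{C}}\sum_{n=1}^{N}(-1)^{c_{n}}\lambda_{n}, \tag{5}$$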

where

$$\lambda_{n}\triangleq\ln\frac{P_{Y_{n}|C_{n}}(y_{n}|0)}{P_{Y_{n}|C_{n}}(y_{n}|1)} \tag{6}$$

is the channel log-likelihood ratio (LLR). Equivalently, one can rewrite the maximization over all possible codewords in terms of error patterns as

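$$\begin{aligned}
&\operatorname*{arg\,max}_{\bm{e}:\,\bm{z}+\bm{e}\in\mathcal{C}}\sum_{n=1}^{N}(-1)^{z_{n}}(-1)^{e_{n}}\lambda_{n} && (7)\\
&\quad=\operatorname*{arg\,max}_{\bm{e}:\,\bm{z}+\bm{e}\in\mathcal{C}}\sum_{n=1}^{N}(-1)^{e_{n}}|\lambda_{n}| && (8)\\
&\quad=\operatorname*{arg\,max}_{\bm{e}:\,\mathbf{H}\bm{e}=\bm{s}}\sum_{n=1}^{N}(-1)^{e_{n}}|\lambda_{n}| && (9)\\
&\quad=\operatorname*{arg\,max}_{\bm{e}:\,\mathbf{H}\bm{e}=\bm{s}}\sum_{n=1}^{N}-e_{n}|\lambda_{n}|, && (10)
\end{aligned}$$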

where $\bm{s}=\mathbf{H}\bm{z}$ is the observed syndrome.
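For very short codes, the syndrome-domain form of the ML problem can be solved by exhaustive search, which makes the objective concrete: among all error patterns with the observed syndrome, pick the one with the smallest total $|\lambda_{n}|$ over the flipped positions. The sketch below is only an illustration (its cost grows as $2^{N}$); the function names and the $(7,4)$ Hamming example are assumptions.

```python
from itertools import product

def syndrome(H, e):
    """Compute H e over F_2, with H given as a list of rows."""
    return [sum(h * b for h, b in zip(row, e)) % 2 for row in H]

def ml_error_pattern(H, s, abs_llr):
    """Brute-force search for the minimum-weight pattern e with H e = s."""
    best, best_cost = None, float("inf")
    for e in product((0, 1), repeat=len(abs_llr)):
        if syndrome(H, e) != list(s):
            continue
        # Maximizing sum(-e_n |lambda_n|) equals minimizing this cost.
        cost = sum(en * l for en, l in zip(e, abs_llr))
        if cost < best_cost:
            best, best_cost = list(e), cost
    return best, best_cost

# Hypothetical example code: (7,4) Hamming code.
H_HAMMING = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

# Equal reliabilities (BSC-like): the single-bit pattern wins.
e_bsc, cost_bsc = ml_error_pattern(H_HAMMING, [1, 1, 0], [1.0] * 7)
# Third bit very reliable, first two unreliable: a two-bit pattern wins.
e_soft, cost_soft = ml_error_pattern(
    H_HAMMING, [1, 1, 0], [0.1, 0.1, 10.0, 1.0, 1.0, 1.0, 1.0])
```

The second call shows why reliabilities matter: two cheap flips can be more likely than one expensive flip, even though both match the syndrome.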

Now, consider a multi-stage process where bit $a_{t}$ is flipped during the $t$ -th stage until the syndrome of the bit-flip pattern matches $\bm{s}$ . In this case, the optimization becomes

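$$\operatorname*{arg\,max}_{\tau,a_{1},\dots,a_{\tau}:\,\sum_{t=1}^{\tau}\bm{h}_{a_{t}}=\bm{s}}\sum_{t=1}^{\tau}-|\lambda_{a_{t}}|, \tag{11}$$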

where $\bm{h}_{n}$ is the $n$ -th column of the parity-check matrix $\mathbf{H}$ . By interpreting $-|\lambda_{a_{t}}|$ as a reward, one can see that the objective function in (11) has the same form as the cumulative reward (without discount) in an MDP. The following points are worth mentioning:

• For the BSC, all LLRs have the same magnitude and (11) returns the shortest flip pattern that matches the observed syndrome.

• For general channels, (11) returns the shortest weighted flip pattern that matches the syndrome, where the weighting is done according to the channel LLRs. In other words, the incurred penalty for flipping bit $a_{t}$ is directly proportional to the reliability of the corresponding received bit.

• If a bit is flipped multiple times, then there must be a shorter bit-flip sequence with lower cost and the same syndrome. Therefore, it is sufficient to only consider flip patterns that contain distinct bits.

IV-B Modeling the Markov Decision Process

IV-B1 Choosing Action and State Spaces

We assume that the action $A_{t}$ encodes which bit is flipped in the received word at time $t$. Since there are $N$ possible choices, we simply use $\mathcal{A}=\{1,2,\dots,N\}\triangleq[N]$. The state space $\mathcal{S}$ is formed by all possible binary syndromes of length $M$. The initial state $S_{0}$ is the syndrome $\mathbf{H}\bm{z}$ and the next state is formed by adding the $A_{t}$-th column of $\mathbf{H}$ to the current state. The transition probabilities $P(s^{\prime}|s,a)$ therefore take values in $\{0,1\}$, i.e., the MDP is deterministic. The all-zero syndrome corresponds to a terminal state. We also enforce a limit of at most $T$ bit-flips per codeword. After this, we exit the current iteration and a new codeword will be decoded.$^{3}$

$^{3}$ Strictly speaking, the resulting process is not an MDP unless the time $t$ is included in the state space.
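The deterministic state transitions described above can be sketched as follows. The function names and the $(7,4)$ Hamming example are assumptions; actions are 1-indexed to match $\mathcal{A}=[N]$, and states are stored as tuples so that they can serve as dictionary keys in a Q-table.

```python
def initial_state(H, z):
    """S_0 is the syndrome H z of the hard-decision vector."""
    return tuple(sum(h * b for h, b in zip(row, z)) % 2 for row in H)

def next_state(H, s, a):
    """Flipping bit a (1-indexed) adds the a-th column of H to the syndrome."""
    return tuple(s_m ^ row[a - 1] for s_m, row in zip(s, H))

def episode_done(s, t, T):
    """Stop at the all-zero syndrome or after T bit flips."""
    return not any(s) or t >= T

# Hypothetical example code: (7,4) Hamming code.
H_HAMMING = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

s0 = initial_state(H_HAMMING, [0, 0, 1, 0, 0, 0, 0])  # error in position 3
s1 = next_state(H_HAMMING, s0, 3)                      # flip position 3
```

Note that the transmitted codeword never appears: the state evolves purely in the syndrome domain.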

*Remark 1.*

For the BSC, we also tried (unsuccessfully) to learn BF decoding with fitted Q-learning directly from the channel observations using the state space $\mathbb{F}_{2}^{N}$ .

*Remark 2.*

For the AWGN channel, the state space can be extended by including the reliability vector $\bm{r}=|\bm{y}|$ , similar to the setup in [6]. In this case, each state would correspond to a tuple $(\bm{s},\bm{r})$ , where $\bm{s}\in\mathbb{F}_{2}^{M}$ and $\bm{r}$ remains constant during decoding. In this paper, we follow a different strategy for BF decoding over the AWGN channel which relies on permuting the bit positions based on their reliability and subsequently discarding the channel LLRs prior to decoding. This approach is described in Sec. V and does not require any modifications to the state space.

IV-B2 Choosing the Reward Strategy

A natural reward function for decoding is to return $1$ if the codeword is decoded correctly and $0$ otherwise. This would imply that an optimal policy minimizes the codeword error rate. However, the reward is only allowed to depend on the current/next state and the action, whereas the transmitted codeword and its estimate are defined outside the context of the MDP. Based on (11) and the discussion in the previous subsection, we instead use the reward function

$$R(s,a,s^{\prime})=\begin{cases}-c|\lambda_{a}|+1 & \text{if }\bm{s}^{\prime}=\bm{0}\\ -c|\lambda_{a}| & \text{otherwise,}\end{cases} \tag{12}$$

where $c>0$ is a scaling factor. The additional reward for matching the syndrome is required to prevent the decoder from just flipping the bits where $|\lambda_{a}|$ is minimal. For example, it could happen that a single error in position $a$ with large $|\lambda_{a}|$ matches the syndrome, but instead one chooses to flip $T$ bits with small absolute LLRs. The scaling factor $c$ is chosen such that the syndrome-matching reward $+1$ always dominates the expected cumulative term $-\sum_{t=1}^{T}{c|\lambda_{a_{t}}|}$. As an example, for the BSC, $c$ is chosen such that the reward function becomes

$$R(s,a,s^{\prime})=\begin{cases}-\frac{1}{T}+1 & \text{if }\bm{s}^{\prime}=\bm{0}\\ -\frac{1}{T} & \text{otherwise.}\end{cases} \tag{13}$$

This reward function allows us to interpret optimal BF decoding as a “maze-playing game” in the syndrome domain where the goal is to find the shortest path to the all-zero syndrome. Applying a small negative penalty for each step is a standard technique to encourage short paths. Another alternative in this case is to choose a small discount factor $\gamma<1$ .
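The reward described in this subsection (a per-flip penalty proportional to $|\lambda_{a}|$ plus a bonus of $1$ on reaching the all-zero syndrome) can be written as a short function. The names are assumptions, and the BSC variant uses the per-step penalty $1/T$, which corresponds to $c=1/(T|\lambda|)$ when all LLR magnitudes equal $|\lambda|$.

```python
def reward(s_next, a, abs_llr, c):
    """Penalty -c|lambda_a| for flipping bit a (1-indexed), +1 if s' = 0."""
    r = -c * abs_llr[a - 1]
    if not any(s_next):
        r += 1.0
    return r

def bsc_reward(s_next, T):
    """BSC special case: every flip costs 1/T."""
    return (1.0 if not any(s_next) else 0.0) - 1.0 / T

# With T = 10, a path of k flips ending in the zero syndrome earns 1 - k/10,
# so shorter paths to the terminal state always score higher.
r_success = bsc_reward((0, 0, 0), T=10)
r_step = bsc_reward((1, 0, 1), T=10)
```

Since at most $T$ flips are allowed, the accumulated penalty is at most $1$ in magnitude, so a successful episode never scores below a failed one.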

IV-B3 Choosing the Exploration Strategy

Compared to (2), we propose another exploration strategy as follows. Let $\bm{e}$ be the current error pattern, i.e., the channel error pattern plus any bit-flips that have been applied so far. Then, with probability $\varepsilon_{\mathrm{g}}$ , we choose the action randomly from $\operatorname{supp}(\bm{e})\triangleq\{i\in[N]\,|\,e_{i}=1\}$ , i.e., we flip one of the incorrect bits. When combined with $\varepsilon$ -greedy exploration, we refer to this as $(\varepsilon,\varepsilon_{\mathrm{g}})$ -goal exploration, where $\varepsilon,\varepsilon_{\mathrm{g}}>0$ and $0<\varepsilon+\varepsilon_{\mathrm{g}}<1$ :

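$$a=\begin{cases}\text{unif. random over }\mathcal{A} & \text{w.p. }\varepsilon\\ \text{unif. random over }\operatorname{supp}(\bm{e}) & \text{w.p. }\varepsilon_{\mathrm{g}}\\ \operatorname*{arg\,max}_{a}Q(s,a) & \text{w.p. }1-\varepsilon-\varepsilon_{\mathrm{g}}.\end{cases} \tag{14}$$

*Remark 3.*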

It may seem that biasing actions towards flipping erroneous bits leads to a form of supervised learning where the learned decisions merely imitate ground-truth decisions. To see that this is not exactly true, consider transmission over the BSC where the error pattern has weight $d_{\mathrm{min}}-1$ (where $d_{\mathrm{min}}$ is the minimum distance of the code) and the observation is at distance $1$ from a codeword $\tilde{\bm{c}}$ . Then, the optimal decision is to flip the bit that leads to $\tilde{\bm{c}}$ , whereas flipping an erroneous bit is suboptimal in terms of expected future reward, even though it moves us closer to the transmitted codeword $\bm{c}\neq\tilde{\bm{c}}$ .
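The $(\varepsilon,\varepsilon_{\mathrm{g}})$-goal exploration rule can be sketched as a three-way random choice. The function name and argument layout are assumptions, and the fallback to the greedy action when the error support is empty is a convention added here, not something stated in the text.

```python
import random

def goal_exploration(q_values, error_support, eps, eps_g, rng):
    """Pick an action in {1, ..., N}.

    q_values[a - 1] holds Q(s, a); error_support lists the 1-indexed
    positions where the current error pattern e is nonzero.
    """
    n = len(q_values)
    u = rng.random()
    if u < eps:
        return rng.randrange(1, n + 1)        # uniform over all actions
    if u < eps + eps_g and error_support:
        return rng.choice(error_support)      # flip a currently incorrect bit
    return max(range(1, n + 1), key=lambda a: q_values[a - 1])  # greedy

rng = random.Random(0)
q = [0.1, 0.5, 0.2]
a_greedy = goal_exploration(q, [3], eps=0.0, eps_g=0.0, rng=rng)
a_goal = goal_exploration(q, [3], eps=0.0, eps_g=1.0, rng=rng)
```

The error support is only available during training, where the transmitted codeword is known; the learned greedy policy does not need it at decoding time.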

IV-B4 Choosing the Function Approximator

We use fully-connected NNs with one hidden layer to represent $Q_{\theta}(s,a)$ in fitted Q-learning. In particular, the NN $\bm{f}_{\theta}$ maps syndromes to length- $N$ vectors $\bm{f}_{\theta}(\bm{s})\in\mathbb{R}^{N}$ and the Q-function is given by $Q_{\theta}(s,a)=[\bm{f}_{\theta}(\bm{s})]_{a}$ , where $[\cdot]_{n}$ returns the $n$ -th component of a vector and $\bm{s}$ is the syndrome for state $s$ . The NN parameters are summarized in Tab. I. In future work, we plan to explore other network architectures, e.g., multi-layer NNs or graph NNs based on the code’s Tanner graph.
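A forward pass through such a network (one ReLU hidden layer, linear output of length $N$) is sketched below in plain Python. The initialization scheme, function names, and the tiny layer sizes in the usage example are assumptions; the source implementation uses TensorFlow and the sizes in Tab. I.

```python
import random

def init_network(num_inputs, num_hidden, num_outputs, rng, scale=0.1):
    """Random weights and zero biases (the initialization is an assumption)."""
    W1 = [[rng.gauss(0.0, scale) for _ in range(num_inputs)]
          for _ in range(num_hidden)]
    b1 = [0.0] * num_hidden
    W2 = [[rng.gauss(0.0, scale) for _ in range(num_hidden)]
          for _ in range(num_outputs)]
    b2 = [0.0] * num_outputs
    return W1, b1, W2, b2

def forward(params, s):
    """Map a length-M syndrome to a length-N vector f_theta(s)."""
    W1, b1, W2, b2 = params
    hidden = [max(0.0, sum(w * x for w, x in zip(row, s)) + b)   # ReLU
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b          # linear
            for row, b in zip(W2, b2)]

def q_value(params, s, a):
    """Q_theta(s, a) is the a-th output component (a is 1-indexed)."""
    return forward(params, s)[a - 1]

params = init_network(num_inputs=3, num_hidden=5, num_outputs=7,
                      rng=random.Random(0))
outputs = forward(params, [1, 0, 1])
```

One forward pass yields the Q-values of all $N$ actions at once, so selecting the greedy action requires a single network evaluation per decoding step.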

V Learned Bit-Flipping with Code Automorphisms

Let $\mathcal{S}_{N}$ be the symmetric group on $N$ elements so that $\pi\in\mathcal{S}_{N}$ is a bijective mapping (or permutation) from $[N]$ to itself.$^{4}$ The permutation automorphism group of a code $\mathcal{C}$ is defined as $\operatorname*{PAut}(\mathcal{C})\triangleq\{\pi\in\mathcal{S}_{N}\,|\,\bm{x}^{\pi}\in\mathcal{C},\forall\bm{x}\in\mathcal{C}\}$, where $\bm{x}^{\pi}$ denotes a permuted vector, i.e., $x_{i}^{\pi}=x_{\pi(i)}$. The permutation automorphism group can be exploited in various ways to improve the performance of practical decoding algorithms, see, e.g., [22], [23]. In the context of learned decoders, the authors in [6] propose to permute the bit positions prior to decoding (and unpermute after) such that the channel reliabilities are approximately sorted. If the applied permutations are from $\operatorname*{PAut}(\mathcal{C})$, the decoder simply decodes a permuted codeword, rather than the transmitted one. The advantage is that certain bit positions are now more reliable than others due to the (approximate) sorting. This can be advantageous in terms of optimizing parameterized decoders because of the additional structure that the decoder can rely on [6].

$^{4}$ For a group $(G,\circ)$, we also informally refer to the set $G$ as the group. In our context, the group operation $\circ$ represents function composition defined by $(\pi\circ\sigma)(i)=\pi(\sigma(i))$.

V-A A Permutation Strategy for Reed–Muller Codes

In [6], the permutation preprocessing approach is applied for Bose–Chaudhuri–Hocquenghem (BCH) codes and permutations are selected from $\operatorname*{PAut}(\mathcal{C})$ such that the total reliabilities of the first $K$ permuted bit positions are maximized, see [6, App. II] for details. In the following, we propose a variation of this idea for Reed–Muller (RM) codes. In particular, our goal is to find a permutation that sends as many as possible of the least reliable bits to positions $\{0,1,2,4,\dots,2^{m-1}\}\triangleq\mathcal{B}$. Recall that the automorphism group of RM $(r,m)$ is the general affine group of order $m$ over the binary field, denoted by AGL $(m,2)$ [24, Th. 24]. The group AGL $(m,2)$ is the set of all operators of the form

$$T(\bm{v})=\mathbf{A}\bm{v}+\bm{b}, \tag{15}$$

where $\mathbf{A}\in\mathbb{F}_{2}^{m\times m}$ is an invertible binary matrix and $\bm{b},\bm{v}\in\mathbb{F}_{2}^{m}$ . By interpreting the vector $\bm{v}$ as the binary representation of a bit position index, (15) defines a permutation on the index set $\{0,1,\dots,N-1\}$ and thus on $[N]$ .

A set of vectors $\{\bm{v}_{0},\bm{v}_{1},\ldots,\bm{v}_{m}\}$ is called affinely independent if and only if the set $\{\bm{v}_{1}-\bm{v}_{0},\ldots,\bm{v}_{m}-\bm{v}_{0}\}$ is linearly independent. The binary representations of the indices in $\mathcal{B}$ correspond to the all-zero vector and all unit vectors of length $m$ . One can verify that they are affinely independent. The proposed strategy relies on the fact that, for any given set of $m+1$ affinely independent bit positions (in the sense that their binary representation vectors are affinely independent), there always exists a permutation in AGL $(m,2)$ such that the bit positions are mapped to $\mathcal{B}$ in any desired order. In particular, we perform the following steps to select the permutation prior to decoding:

1. Let $\pi$ be the permutation that sorts the reliability vector $\bm{r}=|\bm{y}|$, i.e., $\bm{r}^{\pi}$ satisfies $r_{i}^{\pi}<r_{j}^{\pi}$ $\iff$ $i<j$.

2. Find the first $m+1$ affinely independent indices for $\pi$ (e.g., using Gaussian elimination) and denote their binary representations by $\bm{v}_{0},\bm{v}_{1},\dots,\bm{v}_{m}$.

3. The permutation is then defined by (15), where $\bm{b}=\bm{v}_{0}$ and the columns of $\mathbf{A}$ are $\bm{v}_{1}-\bm{v}_{0},\dots,\bm{v}_{m}-\bm{v}_{0}$.
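The three steps can be sketched with bit positions stored as integers, so that vector addition over $\mathbb{F}_{2}$ becomes bitwise XOR. This is a hedged illustration: the function names are assumptions, bit $k$ of an index is taken as the $(k+1)$-th vector component (so unit vectors correspond to positions $1,2,4,\dots$), and the Gaussian elimination is realized with an incremental XOR basis.

```python
def select_affinely_independent(reliabilities, m):
    """Steps 1-2: scan positions from least to most reliable and keep the
    first m+1 whose binary representations are affinely independent."""
    order = sorted(range(len(reliabilities)), key=lambda i: reliabilities[i])
    v0 = order[0]
    selected, basis = [v0], []
    for idx in order[1:]:
        x = idx ^ v0                 # difference v - v0 over F_2
        for b in basis:
            x = min(x, x ^ b)        # reduce against the current XOR basis
        if x:                        # nonzero remainder: independent
            basis.append(x)
            selected.append(idx)
            if len(selected) == m + 1:
                break
    return selected

def affine_permutation(selected, m):
    """Step 3: T(v) = A v + b with b = v0 and columns v_i - v0."""
    v0 = selected[0]
    cols = [v ^ v0 for v in selected[1:]]
    perm = []
    for i in range(2 ** m):
        t = v0
        for k in range(m):
            if (i >> k) & 1:
                t ^= cols[k]
        perm.append(t)
    return perm

# Hypothetical reliabilities for N = 8 (m = 3).
r = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
selected = select_affinely_independent(r, m=3)
perm = affine_permutation(selected, m=3)
r_perm = [r[perm[i]] for i in range(len(r))]   # x^pi_i = x_{pi(i)}
```

In this example position 7 (reliability 0.4) is skipped because it is affinely dependent on the three positions already chosen, so the next candidate in sorted order is used instead.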

V-B (Approximate) Sort and Discard

For the learned BF decoders over the AWGN channel, our approach is to first apply the permutation strategy described in the previous section and subsequently discard the channel LLRs. From the perspective of the decoder, this scenario can be modeled as $N$ parallel BSCs, where the crossover probabilities for the bit positions in $\mathcal{B}$ satisfy $p_{0}>p_{1}>p_{2}>p_{4}>\dots>p_{2^{m-1}}$ . This is related to approaches where channel reliabilities are used to mark highly reliable and/or unreliable bit positions, while the actual decoding is performed without knowledge of the reliability values using hard-decision decoding, see, e.g., [25].

The absolute values of the channel LLRs for the parallel BSCs used in the reward function (12) are given by

$$|\lambda_{n}|=\log\frac{1-p_{n}}{p_{n}}, \tag{16}$$

where $p_{n}$ is the crossover probability of the $n$-th BSC. The individual crossover probabilities can be determined via Monte Carlo estimation before the RL starts. For example, Fig. 1 shows the expected crossover probabilities after applying the proposed permutation strategy for RM $(32,16)$ assuming transmission at $E_{\mathrm{b}}/N_{0}=4$ dB.
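A minimal sketch of how per-position LLR magnitudes could be obtained from Monte Carlo samples follows. The function names are assumptions, as is the clipping floor that guards against estimated probabilities of exactly $0$ or $1$; the formula assumes $p_{n}<1/2$ so that the result is nonnegative.

```python
import math

def estimate_crossover(error_patterns):
    """Per-position error frequency over a list of 0/1 error patterns."""
    n_samples = len(error_patterns)
    n_bits = len(error_patterns[0])
    return [sum(e[n] for e in error_patterns) / n_samples
            for n in range(n_bits)]

def abs_llr(p, floor=1e-12):
    """|lambda_n| = log((1 - p_n) / p_n) for a BSC with crossover p_n."""
    p = min(max(p, floor), 1.0 - floor)   # clipping is an added safeguard
    return math.log((1.0 - p) / p)

# Hypothetical error patterns (post-permutation hard-decision errors).
samples = [[1, 0], [0, 0], [1, 0], [1, 1]]
p_hat = estimate_crossover(samples)
llr_mags = [abs_llr(0.1), abs_llr(0.5)]
```

A position with $p_{n}=1/2$ carries no information and gets zero weight, so flipping it costs nothing in the reward.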

*Remark 4.*

One can estimate the capacity of strategies that permute the received bits using the reliabilities and then discard them. Fig. 2 shows the estimated information rates for the proposed strategy obtained via Monte Carlo averaging. Our results show that a significant fraction of the achievable information rate is preserved, especially for high-rate codes. For permutations restricted to AGL $(m,2)$, this is less effective as the blocklength increases because the fraction of sorted channels, $(m+1)/N=(\log_{2}(N)+1)/N$, vanishes as $N$ grows.

VI Results

In this section, numerical results are presented for learned BF (LBF) decoders$^{5}$ for the following RM and BCH codes:

• RM $(32,16)$ with the standard $16\times 32$ PC matrix $\mathbf{H}_{\mathrm{std}}$ and overcomplete $620\times 32$ PC matrix $\mathbf{H}_{\mathrm{oc}}$ whose rows are all minimum-weight dual codewords, see [8, 26]

• RM $(64,42)$ with the standard $22\times 64$ PC matrix $\mathbf{H}_{\mathrm{std}}$ and overcomplete $2604\times 64$ PC matrix $\mathbf{H}_{\mathrm{oc}}$

• BCH $(63,45)$ with the standard $18\times 63$ circulant PC matrix $\mathbf{H}_{\mathrm{std}}$ and overcomplete $189\times 63$ PC matrix $\mathbf{H}_{\mathrm{oc}}$

• RM $(128,99)$ with the standard $29\times 128$ PC matrix $\mathbf{H}_{\mathrm{std}}$ and overcomplete $10668\times 128$ PC matrix $\mathbf{H}_{\mathrm{oc}}$

$^{5}$ $\mathbf{H}$-matrices and source code for the simulations are available online at https://github.com/fabriziocarpi/RLdecoding. We first used our own TensorFlow RL implementation and later switched to RLlib [27] in order to use multi-core parallelism for training rollouts.

For some of the considered codes, standard table Q-learning is feasible. For example, RM $(32,16)$ has $|\mathcal{S}|=2^{16}=65536$ and $|\mathcal{A}|=32$ so the Q-table has $|\mathcal{S}||\mathcal{A}|\approx 2\cdot 10^{6}$ entries.
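The table sizes for all four codes can be checked with a short calculation. This sketch assumes that states are syndromes with respect to the standard PC matrix, so that $|\mathcal{S}|=2^{N-K}$, and that there is one action per bit position, so that $|\mathcal{A}|=N$. The helper name is hypothetical.

```python
def q_table_entries(n, k):
    """Entries of a syndrome-indexed Q-table: 2**(n - k) states times n actions."""
    return (2 ** (n - k)) * n

codes = {
    "RM(32,16)": (32, 16),
    "RM(64,42)": (64, 42),
    "BCH(63,45)": (63, 45),
    "RM(128,99)": (128, 99),
}
sizes = {name: q_table_entries(n, k) for name, (n, k) in codes.items()}
```

Under these assumptions, the table grows from roughly $2\cdot 10^{6}$ entries for RM $(32,16)$ to roughly $7\cdot 10^{10}$ entries for RM $(128,99)$.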

VI-A Training Hyperparameters

In the following, we set the maximum number of decoding iterations to $T=10$ and the discount factor to $\gamma=0.99$ . For standard table Q-learning, the $(\varepsilon,\varepsilon_{\mathrm{g}})$ -goal exploration strategy is adopted with fixed $\varepsilon=0.6$ , $\varepsilon_{\mathrm{g}}=0.3$ , and learning rate $\alpha=0.1$ . For fitted Q-learning based on NNs, we use $\varepsilon$ -greedy exploration where $\varepsilon$ is linearly decreased from $0.9$ to [math] over the course of $0.9K$ learning episodes (i.e., number of decoded codewords), where the total number of episodes $K$ depends on the scenario. For the gradient optimization, the Adam optimizer is used with a batch size of $B=100$ and learning rate $\alpha=3\cdot 10^{-5}$ . The training SNR for both standard Q-learning and fitted Q-learning is fixed at $E_{\mathrm{b}}/N_{0}=5\,$ dB for RM $(128,99)$ and $E_{\mathrm{b}}/N_{0}=4\,$ dB for all other codes. In general, better performance may be obtained by re-optimizing parameters for each SNR or by adopting parameter adapter networks that dynamically adapt the network parameters to the SNR [28].
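The linear $\varepsilon$ schedule for fitted Q-learning can be sketched as follows. The final value of $\varepsilon$ is not recoverable from the text above, so it is a required argument here rather than a default; the function name and argument names are hypothetical.

```python
def linear_epsilon(episode, total_episodes, eps_end, eps_start=0.9, decay_fraction=0.9):
    """Linearly interpolate epsilon from eps_start to eps_end over the first
    decay_fraction * total_episodes episodes, then hold it at eps_end."""
    decay_episodes = decay_fraction * total_episodes
    if episode >= decay_episodes:
        return eps_end
    return eps_start + (eps_end - eps_start) * episode / decay_episodes
```

With $K$ total episodes, the schedule reaches its final value after $0.9K$ episodes and stays constant for the remaining $0.1K$.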

VI-B Learning Convergence in Q-Learning

We start by comparing the learning convergence of the proposed exploration strategy (14) to $\varepsilon$-greedy exploration for standard Q-learning assuming RM $(32,16)$ over the BSC. In Fig. 3, the obtained performance in terms of codeword error rate (CER) is shown as a function of the Q-learning iteration. The learning curves are generated as follows. During Q-learning, we always first decode the new channel observations (line 3 of Alg. 2) with the current Q-function without exploration and save the binary outcome (success/failure). We then plot a moving average (window size $5000$) of the outcomes to approximate the CER. It can be seen that the proposed strategy converges significantly faster than $\varepsilon$-greedy exploration. We also show a learning curve for training when a reward of $1$ is given only for finding the transmitted codeword; in this case, however, the process is not an MDP (see Sec. IV) and the performance can become worse during training.
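The moving-average approximation of the CER can be sketched as below, assuming the binary outcomes are stored as $1$ for success and $0$ for failure. The function name is hypothetical and the snippet is not taken from the authors' code.

```python
import numpy as np

def moving_cer(successes, window=5000):
    """Moving average of decoding failures over a sliding window."""
    failures = 1.0 - np.asarray(successes, dtype=float)
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the window fully overlaps the data.
    return np.convolve(failures, kernel, mode="valid")

# Tiny example with a window of 2 instead of 5000.
curve = moving_cer([1, 1, 0, 0], window=2)
```

Each point of the curve is the failure rate over the most recent `window` episodes, so a curve can only be drawn once at least `window` outcomes have been recorded.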

VI-C Binary Symmetric Channel

Fig. 4 shows the CER performance for all considered scenarios as a function of $E_{\mathrm{b}}/N_{0}$ . We start by focusing on the “hard-decision” decoding cases, which are equivalent to assuming transmission over the BSC. Supplementary bit error rate (BER) results for the same scenarios are shown in Fig. 5.

VI-C1 Baseline Algorithms

As a baseline for the LBF decoders over the BSC, we use BF decoding according to Alg. 1 (see also [8, Alg. II] and [14, Alg. 10.2]) applied to both the standard and overcomplete PC matrices $\mathbf{H}_{\mathrm{std}}$ and $\mathbf{H}_{\mathrm{oc}}$ , respectively. We also implemented optimal syndrome decoding for RM $(32,16)$ and BCH $(63,45)$ . In general, BF decoding shows relatively poor performance when applied to $\mathbf{H}_{\mathrm{std}}$ , whereas the performance increases drastically for $\mathbf{H}_{\mathrm{oc}}$ (see also [8, 26]). In fact, for RM $(32,16)$ , standard BF for $\mathbf{H}_{\mathrm{oc}}$ gives virtually the same performance as optimal decoding and the latter performance curves are omitted from the figure. This performance increase comes at a significant increase in complexity, e.g., for RM $(32,16)$ , the overcomplete PC matrix has $620$ rows compared to the standard PC matrix with only $16$ rows. For the BCH code, there still exists a visible performance gap between optimal decoding and BF decoding based on $\mathbf{H}_{\mathrm{oc}}$ .

VI-C2 Q-learning

From Figs. 4(a) and (b), it can be seen that the LBF decoders based on table Q-learning for RM $(32,16)$ and BCH $(63,45)$ converge essentially to the optimal performance. For RM $(64,42)$ in Fig. 4(c), the performance of LBF decoding is virtually the same as for standard BF decoding using $\mathbf{H}_{\mathrm{oc}}$ , which leads us to believe that both schemes are optimal in this case. These results show that the proposed RL approach is able to learn close-to-optimal flipping patterns given the received syndromes. Note that for RM $(128,99)$ , Q-learning would require a table with $|\mathcal{S}||\mathcal{A}|\approx 7\cdot 10^{10}$ entries which is not feasible to implement on our system.
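Once a Q-table has been learned, decoding reduces to a greedy table lookup per iteration. The sketch below assumes that the state is the current syndrome and the action is the index of the bit to flip, which is consistent with $|\mathcal{S}|=2^{N-K}$ and $|\mathcal{A}|=N$. The function name, the syndrome-to-index mapping, and the toy Q-table for a $(7,4)$ Hamming code are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def lbf_decode(H, z, q_table, max_iters=10):
    """Greedy bit flipping: look up the syndrome, flip the highest-valued bit."""
    z = np.array(z, dtype=int) % 2
    weights = 1 << np.arange(H.shape[0])   # maps syndrome bits to a table row
    for _ in range(max_iters):
        s = H.dot(z) % 2
        if not s.any():                    # zero syndrome: valid codeword found
            break
        a = int(np.argmax(q_table[int(s.dot(weights))]))
        z[a] ^= 1
    return z

# Toy example: (7,4) Hamming code with a hand-built table (not a learned one).
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
q_toy = np.zeros((8, 7))
for j in range(7):
    # Reward flipping bit j whenever the syndrome equals column j of H.
    q_toy[int(H[:, j].dot(1 << np.arange(3))), j] = 1.0
```

In this toy case the table encodes syndrome decoding of single errors; a learned table plays the same role for the longer codes, with `max_iters` corresponding to $T=10$.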

VI-C3 Fitted Q-learning

The main disadvantage of the standard Q-learning approach is the large storage requirement of the Q-table. Indeed, the requirements are comparable to optimal syndrome decoding, so this approach is only feasible for short or very-high-rate codes. We therefore also investigate to what extent the Q-tables can be approximated with NNs and fitted Q-learning. The number of neurons in the hidden layer of the NNs is chosen to be $1500$ for RM $(128,99)$ and $500$ for all other cases. The achieved performance is shown in Fig. 4, labeled as “LBF-NN”. For the RM codes, it was found that fitted Q-learning achieves good performance with the standard PC matrix $\mathbf{H}_{\mathrm{std}}$. The performance loss compared to table Q-learning is almost negligible for RM $(32,16)$ and increases slightly for the longer RM codes. For the BCH code, we found that fitted Q-learning works better using $\mathbf{H}_{\mathrm{oc}}$ compared to $\mathbf{H}_{\mathrm{std}}$. For this case, the gap compared to optimal decoding is less than $0.1\,$dB at a CER of $10^{-3}$.
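The network shape in Table I (an input layer with $M$ neurons, one ReLU hidden layer, and a linear output layer with $N$ neurons) can be sketched as a plain NumPy forward pass. The initialization scale, the function names, and the parameter dictionary layout are assumptions made for illustration; training is omitted.

```python
import numpy as np

def init_q_network(m, n, hidden=500, seed=0):
    """Random parameters for an M -> hidden -> N network (illustrative init)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.01, size=(m, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.01, size=(hidden, n)), "b2": np.zeros(n),
    }

def q_values(params, syndrome):
    """Map a length-M syndrome to N Q-values, one per bit position."""
    h = np.maximum(0.0, syndrome @ params["W1"] + params["b1"])  # ReLU hidden layer
    return h @ params["W2"] + params["b2"]                       # linear output layer

# RM(32,16) with the standard PC matrix: M = 16 inputs, N = 32 outputs.
params = init_q_network(16, 32)
q = q_values(params, np.zeros(16))
num_params = sum(v.size for v in params.values())
```

For RM $(32,16)$ with $\mathbf{H}_{\mathrm{std}}$, this network has $24532$ parameters, compared with roughly $2\cdot 10^{6}$ entries in the corresponding Q-table.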

VI-D AWGN Channel

Next, we consider the AWGN channel assuming that the reliability information is exploited for decoding.

VI-D1 Baseline Algorithms

Ordered statistics decoding (OSD), whose performance is close to ML [29], is used as a benchmark. In this paper, we use order-$\ell$ processing where $\ell=3$ in all cases. Furthermore, we employ WBF decoding according to [14, Alg. 10.3] using $\mathbf{H}_{\mathrm{oc}}$. Similar to BF decoding over the BSC, the performance of WBF is significantly better for overcomplete PC matrices compared to the standard ones (results for WBF on $\mathbf{H}_{\mathrm{std}}$ are omitted). From Fig. 4, WBF decoding is within $0.6$–$1.1\,$dB of OSD for the considered codes. We remark that there also exist a number of improved WBF algorithms which may reduce this gap at the expense of additional decoding complexity and the necessity to tune various weight and threshold parameters, see [8, 9, 10, 11, 12, 13]. For RM codes of moderate length, ML performance can also be approached using other techniques [30].

VI-D2 Q-Learning

As explained in Sec. V, our approach to LBF decoding over the AWGN channel in this paper consists of permuting the bit positions based on $\bm{r}$ and subsequently discarding the reliability values. For the RM codes, the particular permutation strategy is described in Sec. V. The performance results for standard Q-learning shown in Figs. 4(a) and (c) (denoted as “s+d LBF”) demonstrate that this strategy performs quite close to WBF decoding and closes a significant fraction of the gap to OSD, even though reliability information is only used to select the permutation and not for the actual decoding. For the BCH code, we use the same permutation strategy as described in [6]. In this case, however, the performance improvements due to applying the permutations are relatively limited.

VI-D3 Fitted Q-Learning

For the NN-based approximations of the Q-tables for the sort-and-discard approach, we use the NN sizes from the previous section for the BSC. In this case, fitted Q-learning obtains performance close to the standard Q-learning approach for RM codes. Similar to the BSC, the performance gap is almost negligible for RM $(32,16)$ and increases for the longer RM codes. For RM $(128,99)$ , sort-and-discard LBF decoding with NNs closes roughly half the gap between soft-decision ML (approximated via OSD) and hard-decision ML (approximated via BF on $\mathbf{H}_{\mathrm{oc}}$ ).

VII Conclusion

In this paper, we have proposed a novel RL framework for BF decoding of binary linear codes. It was shown how BF decoding can be mapped to a Markov decision process by properly choosing the state and action spaces, whereas the reward function can be based on a reformulation of the ML decoding problem. In principle, this allows for data-driven learning of optimal BF decision strategies. Both standard (table-based) and fitted Q-learning with NN function approximators were then used to learn good decision strategies from data. Our results show that the learned BF decoders can offer a range of performance–complexity trade-offs.

Bibliography

[1] F. Carpi, “Exploring machine learning algorithms for decoding linear block codes,” Master Thesis, University of Parma, Italy, Oct. 2018.
[2] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” in Proc. Annual Conf. Information Sciences and Systems (CISS), Baltimore, MD, 2017.
[3] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
[4] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in Proc. Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, 2016.
[5] L. G. Tallini and P. Cull, “Neural nets for decoding error-correcting codes,” in Proc. IEEE Technical Applications Conf. and Workshops, Portland, USA, 1995.
[6] A. Bennatan, Y. Choukroun, and P. Kisilev, “Deep learning for decoding of linear codes - a syndrome-based approach,” in Proc. IEEE Int. Symp. Information Theory (ISIT), Vail, CO, 2018.
[7] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, “Communication algorithms via deep learning,” in Proc. Int. Conf. Learning Representations (ICLR), Vancouver, Canada, 2018.
[8] M. Bossert and F. Hergert, “Hard- and soft-decision decoding beyond the half minimum distance—an algorithm for linear codes (corresp.),” IEEE Trans. Inf. Theory, vol. 32, no. 5, pp. 709–714, Sept. 1986.
