MIND: Model Independent Neural Decoder
Yihan Jiang, Hyeji Kim, Himanshu Asnani, Sreeram Kannan

TL;DR
MIND is a neural decoder that uses meta-learning to quickly adapt to different channels with minimal data, outperforming static decoders and approaching the performance of channel-specific neural decoders.
Contribution
The paper introduces MIND, a neural decoder utilizing MAML for rapid adaptation to varying channels, reducing retraining data needs and improving robustness.
Findings
MIND outperforms static neural decoders significantly.
MIND approaches the performance of channel-specific neural decoders.
MIND generalizes well to unseen channels.
MIND: Model Independent Neural Decoder
Yihan Jiang
ECE Department
University of Washington
Seattle, United States
Hyeji Kim
Samsung AI Center Cambridge
Cambridge, United Kingdom
Himanshu Asnani
ECE Department
University of Washington
Seattle, United States
Sreeram Kannan
ECE Department
University of Washington
Seattle, United States
Abstract
Standard decoding approaches rely on model-based channel estimation methods to compensate for varying channel effects, and their performance degrades whenever there is a model mismatch. Recently proposed deep-learning-based neural decoders address this problem with a model-free approach via gradient-based training. However, they require large amounts of data to retrain and achieve the desired adaptivity, which becomes intractable in practical systems.
In this paper, we propose a new decoder: Model Independent Neural Decoder (MIND), which builds on top of neural decoders and equips them with a fast adaptation capability to varying channels. This feature is achieved via the methodology of Model-Agnostic Meta-Learning (MAML). Here the decoder: (a) learns a "good" parameter initialization in the meta-training stage, where the model is exposed to a set of archetypal channels, and (b) updates the parameters with respect to the observed channel in the meta-testing phase using minimal adaptation data and pilot bits. Building on top of existing state-of-the-art neural Convolutional and Turbo decoders, MIND outperforms the static benchmarks by a large margin and shows a minimal performance gap when compared to the neural (Convolutional or Turbo) decoders designed for that particular channel. In addition, MIND also shows strong learning capability for channels not exposed during the meta-training phase.
I Introduction
I-A Motivation
Ever since the ground-breaking work in [16], capacity-approaching codes for the Additive White Gaussian Noise (AWGN) channel, such as Turbo codes [18], Low Density Parity Check (LDPC) codes [19] and Polar codes [17], have been proposed and extensively studied, and have been widely used in the Long Term Evolution (LTE) and 5G standards. Efficient decoding methods are known for these capacity-approaching codes, and they exhibit near-optimal performance on the AWGN channel. However, their performance on non-AWGN channels is not uniformly optimal. Designing decoders that deal with non-Gaussianity is hard, primarily owing to a two-fold deficit: (a) model deficit, which is the inability to accurately express the observed data by a clean mathematical model, and (b) algorithm deficit, which means that even under a clean abstraction, the optimal decoding algorithm is not known [34]. Thus, while resorting to the optimal codes designed under simplified models such as the AWGN channel, designing a decoder that can adapt to non-AWGN channel effects faces challenges on both fronts: there is a model mismatch and, furthermore, most non-AWGN channels do not permit closed-form optimal decoders.
A tremendous amount of effort has been invested in developing a suite of handcrafted algorithms to circumvent these deficits. These comprise model-based methods for channel estimation, signal preprocessing, and robust decoding under unexpected channel effects [1], so as to make the AWGN-designed capacity-approaching decoders operate with minimal degradation [20]. A few pilot bits known to both the transmitter and the receiver are used to estimate the channel effects and compensate for their varying nature, while handcrafted decoding algorithms have been applied to improve the decoder's robustness [20]. However, these methods fall short in two respects: (1) Channel estimation and channel-effect equalization algorithms are model-based, so when the underlying mathematical abstraction suffers from model deficit, performance is suboptimal. (2) AWGN-designed decoders are not robust to unexpected and uncompensated noise.
I-B Prior Art: Neural Decoding
In the past decade, data-driven deep learning based methods have changed the landscape of several engineering fields such as computer vision and natural language processing, with revolutionary performance benchmarks [11] [12]. Applying general purpose deep learning models to channel coding design has received intensive attention recently [27] [28]. Designing such neural decoders naturally fits well with the data-driven supervised learning approaches, since both the received signals and the target messages can be simulated from the underlying encoder and channel models. In this way, both the model-deficit and algorithm-deficit are navigated by directly training a neural decoder on the sampled data.
Designing neural decoders for several classes of codes, such as LDPC codes, Polar codes and Turbo codes, with versatile deep neural networks has seen growing interest within the channel coding community. Imitating the Belief Propagation (BP) algorithm via learnable neural networks shows promising performance for High-Density Parity-Check (HDPC) codes and LDPC codes [29] [30] and Polar codes [31] [32]. Near-optimal performance of the Convolutional Code and the Turbo Code under the AWGN channel is achieved via Recurrent Neural Networks (RNN) for arbitrary block lengths [33], which also show robust and adaptive performance under non-AWGN setups. A further extension of RNN encoders (and decoders) reveals state-of-the-art performance for feedback channels [37] and low latency schemes [34]. Thus neural decoders show the promise of alleviating model and algorithm deficits. However, compared to the traditional decoding methods, which use a limited number of pilot bits to adapt, neural decoders require a huge amount of data (information complexity) and long computation time (computational complexity) to adapt to a new channel. This serious drawback renders them quite intractable and far from practical deployment. The relevant question we ask here is the following: Can we design neural decoders that strengthen their adaptive property, so that only minimal re-training is necessary? In what follows, this question is investigated and answered in the affirmative.
I-C Our Contribution
We introduce meta learning to address the data-hungry nature of the neural decoder. Meta learning operates in two steps: (a) it first performs a meta training phase by learning on a wide range of archetypal tasks, and then (b) during the meta testing phase it learns new tasks faster, while consuming less adaptation data than learning from scratch [4]. Supervised meta learning has a natural connection to adaptive decoder design, as we can consider different channels as different tasks in our meta learning framework.
RNN-based meta learning considers the whole meta learning approach as a large-scale RNN with tasks as inputs [13]. However, this requires complex modeling and thus scales poorly. Model Agnostic Meta Learning (MAML) [6] is a gradient-based meta-learning algorithm that learns a sensitive initialization for fast adaptation. A MAML-trained model performs well on new tasks with a limited number of gradient update steps and few-shot adaptation data. Compared to other meta learning methods, MAML has much lower complexity. Moreover, MAML is theoretically shown to be able to approximate any meta learning algorithm [7], and when faced with out-of-domain tasks, MAML adapts quickly even though the out-of-domain tasks may not be close to the meta-trained tasks [6].
In this work, we present a MAML-based neural decoder: Model Independent Neural Decoder (MIND), which admits fast adaptation with few-shot adaptation data using gradient-based training. Compared to adaptive neural decoders, which require a large number of gradient training steps and large amounts of data to adapt to new channel settings, MIND can adapt to a new channel with a small number of pilot bits and a few gradient descent steps. Compared to the traditional adaptive decoding methods, MIND offers a model-free gradient-based meta-learning approach built on top of neural decoders, resolving both the model and the algorithm deficit. Thus, MIND enhances the advantages of neural decoders with data and computational efficiency.
The paper is organized as follows: Section II discusses the details of MAML, which builds on top of neural decoders to result in our proposed decoder, MIND. Section III analyzes the performance of MIND, which is near-optimal with few-shot adaptation data on both trained and untrained channels. Section IV concludes with the scope and limitations of MIND and a discussion of future directions.
II Model Independent Neural Decoder
We consider the two neural decoders for the Convolutional Code and the Turbo Code, respectively [33], to develop MIND (for details refer to the Appendix). Both neural decoders have a larger number of parameters than the traditional algorithms in order to deal with model deficit and algorithm deficit. However, training neural decoders until convergence requires large amounts of data, which leads to slow adaptation with costly computations. In what follows, we propose a remedy through MAML, described below along with the choice of loss function and hyper-parameters:
Loss Function:
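For neural decoders, the loss function is Binary Cross-Entropy (BCE), since decoding is a classification task. Let $f_\theta$ be the neural decoder with parameter $\theta$. Formally speaking, we are given a collection of training channels $\mathcal{C}$. For a specific channel $c \in \mathcal{C}$ with sampled received signal $y_c$ and target message $u_c$, the loss function associated with channel $c$ can be represented as:
$$\mathcal{L}_c(f_\theta) = -\mathbb{E}_{(u_c, y_c)}\left[ u_c \log f_\theta(y_c) + (1 - u_c) \log\left(1 - f_\theta(y_c)\right) \right] \tag{1}$$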
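As a concrete reading of the BCE loss, the following minimal sketch computes the mean binary cross-entropy between a decoder's output bit probabilities and the target message bits. The function name `bce_loss` and the clipping constant `eps` are illustrative assumptions and do not come from the paper.

```python
import math

def bce_loss(probs, bits, eps=1e-12):
    """Mean binary cross-entropy between predicted bit probabilities and target bits."""
    total = 0.0
    for p, b in zip(probs, bits):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() never sees 0
        total -= b * math.log(p) + (1 - b) * math.log(1.0 - p)
    return total / len(bits)

# An uninformative decoder (probability 0.5 for every bit) pays log(2) per bit,
# while a confident correct decision pays less than a confident wrong one.
uninformed = bce_loss([0.5, 0.5], [0, 1])
confident_right = bce_loss([0.9], [1])
confident_wrong = bce_loss([0.1], [1])
```

In a sampled setting the expectation is replaced by this average over the bits of the blocks in a batch.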
Meta Training Phase:
The meta training objective is to learn a sensitive initial weight $\theta$ for all the training channels. This operates as per the following two sub-steps:
- Task Update: For each channel $c$, MIND updates the model weights $\theta$ to $\theta'_c = \theta - \alpha \nabla_\theta \mathcal{L}_c(f_\theta)$ with adaptation learning rate $\alpha$, where $\mathcal{L}_c(f_\theta)$ denotes the BCE loss of the decoder $f_\theta$ on channel $c$. This is called the task update because the parameter update is done for each task, here each channel. The updated weights $\theta'_c$ should move close to the optimal decoder for each channel $c$.
- Meta Update: Here, the goal is to minimize the following loss over the set of training channels $\mathcal{C}$ with respect to $\theta$:
$$\min_\theta \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \mathcal{L}_c(f_{\theta'_c}) \tag{2}$$
which, via gradient descent with meta learning rate $\beta$, is equivalent to the following update:
$$\theta \leftarrow \theta - \beta \nabla_\theta \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \mathcal{L}_c(f_{\theta'_c}) \tag{3}$$
Computing the above gradient requires differentiating through the task update, i.e., computing the gradient of a gradient of the BCE loss. Second-order gradients as in Eq. (3) are expensive. In this paper, we use First-Order MAML (FOMAML) [41], which treats higher-order gradients as constant, thus ignoring the second-order terms. Note that it is this step that distinguishes such a training phase from vanilla average learning, known as Multi-task Learning (MTL) [15], where instead of Eq. (3) the following assignment via the average of gradients on all the channels is used:
$$\theta \leftarrow \theta - \beta \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \nabla_\theta \mathcal{L}_c(f_\theta) \tag{4}$$
Meta Testing Phase:
During the meta testing phase, pilot bits from the new channel are first collected. Then $\theta$ is updated via gradient descent on the BCE loss of the new channel, $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{c_{\text{new}}}(f_\theta)$. MIND's meta training and testing phases are depicted in the appendix.
Note: During the meta training phase, the data used to compute the task update and the data used to compute the meta update are different. Using the same data for both the task update and the meta update leads to meta-overfitting [6]. For this reason, each channel must be sampled twice per meta training step, whereas each step of the meta testing phase requires sampling only once.
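The two phases can be sketched on a toy problem. The code below is a minimal illustration under loud assumptions, not the authors' implementation: the "decoder" is a one-parameter logistic model `sigmoid(theta * y)` on BPSK symbols, each "channel" is Gaussian noise with a different standard deviation, and all names and hyper-parameter values (`fomaml_step`, `mtl_step`, `adapt`, the learning rates, the batch size) are hypothetical. It shows the first-order meta update with two independently sampled batches per channel, the multi-task baseline that averages plain gradients, and adaptation at meta testing time.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_batch(sigma, n, rng):
    """Draw n (received value, bit) pairs: BPSK symbol plus Gaussian noise."""
    batch = []
    for _ in range(n):
        bit = rng.randint(0, 1)
        batch.append(((2 * bit - 1) + rng.gauss(0.0, sigma), bit))
    return batch

def bce_grad(theta, batch):
    """Gradient of the mean BCE loss w.r.t. theta for p = sigmoid(theta * y)."""
    return sum((sigmoid(theta * y) - bit) * y for y, bit in batch) / len(batch)

def fomaml_step(theta, sigmas, alpha, beta, n, rng):
    """One first-order meta update over a meta batch of channels."""
    meta_grad = 0.0
    for sigma in sigmas:
        # Task update on one batch from this channel.
        theta_c = theta - alpha * bce_grad(theta, sample_batch(sigma, n, rng))
        # Meta gradient on a second, freshly sampled batch, evaluated at
        # theta_c; the first-order variant ignores d(theta_c)/d(theta).
        meta_grad += bce_grad(theta_c, sample_batch(sigma, n, rng))
    return theta - beta * meta_grad / len(sigmas)

def mtl_step(theta, sigmas, beta, n, rng):
    """Multi-task baseline: average the plain gradients over channels."""
    grad = sum(bce_grad(theta, sample_batch(s, n, rng)) for s in sigmas)
    return theta - beta * grad / len(sigmas)

def adapt(theta, sigma, alpha, steps, n, rng):
    """Meta testing: `steps` gradient steps, one fresh pilot batch per step."""
    for _ in range(steps):
        theta = theta - alpha * bce_grad(theta, sample_batch(sigma, n, rng))
    return theta

rng = random.Random(0)
theta_mind = theta_mtl = 0.0
train_sigmas = [0.5, 1.0]  # hypothetical training channels
for _ in range(200):
    theta_mind = fomaml_step(theta_mind, train_sigmas, 0.1, 0.05, 64, rng)
    theta_mtl = mtl_step(theta_mtl, train_sigmas, 0.05, 64, rng)
theta_new = adapt(theta_mind, 0.8, 0.1, 1, 64, rng)  # one step on a new channel
```

The only difference between `fomaml_step` and `mtl_step` is where the gradient is evaluated: at the task-adapted weights on fresh data, or at the shared weights.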
Hyperparameters:
The MIND-trained neural decoders for the Convolutional and Turbo Codes use the hyper-parameters shown in Figure 1. Batch size refers to the number of blocks sampled from one specific channel for training (also referred to as mini-batch size), and is the same for both the meta training and meta testing phases. Meta batch size refers to the number of random channels used for each meta training update step. Meta training is expensive: it uses 50000 training steps to conduct the meta update. The adaptation learning rate in the task update of the meta training phase is larger than the meta learning rate of the meta update, which allows MIND to adapt faster. We use a smaller adaptation learning rate for the neural Turbo decoder, due to its sensitive iterative decoding structure with shared model weights [33].
The data and computation cost for the meta testing phase is shown in Figure 2. The task update step refers to the number of gradient steps required before testing on the new channel. Here we use the trained batch size . Fine-tuning a neural decoder without MIND to adapt to a new channel requires steps (each step needs blocks) to converge. Compared to fine-tuning, MIND only requires or gradient steps to conduct fast adaptation, with far less pilot data during the meta testing phase. In what follows, for the evaluation of MIND's performance, MIND-$K$ refers to MIND with $K$ gradient update steps in the meta testing phase.
III MIND Performance
In this section, we investigate the performance of MIND-$K$ for the convolutional code and the Turbo code against several benchmarks.
III-A Channel Settings and Benchmarks
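The channels used in this paper are:
- AWGN channel: $y = x + z$, $z \sim \mathcal{N}(0, \sigma^2)$.
- Additive T-distribution Noise (ATN) channel: $y = x + z$, where $z \sim T(\nu, \sigma^2)$.
- Radar channel: $y = x + z + w$, where $z \sim \mathcal{N}(0, \sigma_1^2)$ is a background AWGN noise, and $w \sim \mathcal{N}(0, \sigma_2^2)$ with probability $p$ is the radar noise with high variance and low probability. $\sigma_1 \leq \sigma_2$.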
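The three additive-noise channels listed above can be simulated in a few lines. This is a minimal sketch under stated assumptions: the function names and parameter names (`sigma`, `nu`, `sigma1`, `sigma2`, `p`) are illustrative, the t-distributed noise is generated as a standard Student-t variable scaled by `sigma` (the paper's exact scaling convention is not recoverable here), and the radar burst is taken to be zero when it does not fire.

```python
import math
import random

def awgn(x, sigma, rng):
    """Add i.i.d. Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def atn(x, nu, sigma, rng):
    """Add i.i.d. Student-t noise with nu degrees of freedom, scaled by sigma."""
    out = []
    for xi in x:
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
        out.append(xi + sigma * z / math.sqrt(chi2 / nu))
    return out

def radar(x, sigma1, sigma2, p, rng):
    """Background Gaussian noise plus a rare high-variance burst with probability p."""
    out = []
    for xi in x:
        burst = rng.gauss(0.0, sigma2) if rng.random() < p else 0.0
        out.append(xi + rng.gauss(0.0, sigma1) + burst)
    return out

rng = random.Random(1)
x = [1.0, -1.0] * 50                      # BPSK-modulated toy codeword
y_awgn = awgn(x, 1.0, rng)
y_atn = atn(x, 3.0, 1.0, rng)             # hypothetical nu = 3
y_radar = radar(x, 1.0, 5.0, 0.05, rng)   # hypothetical burst parameters
```

Treating each parameter setting as a separate task is what lets the same sampling code serve both meta training and meta testing.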
III-B Benchmarks
For both the convolutional code and the Turbo code, we compare the MIND-$K$ decoder against the following benchmark decoders:
- Canonical Optimal Decoders for AWGN Channel: For the convolutional code, the Viterbi algorithm is the maximum-likelihood sequence decoder on AWGN channels [38]. For the Turbo code, the iterative Turbo decoder based on BCJR shows capacity-approaching performance. When decoding on AWGN channels, these two decoders serve as useful benchmarks to compare against.
- Adaptive Neural Decoders: Under non-AWGN channels, there generally does not exist a closed-form optimal decoding scheme. On the other hand, in these cases, neural decoders outperform most state-of-the-art heuristic decoders [33]. Adaptive Neural Decoders are trained with nearly infinite data and computing resources on a particular channel and thus provide another useful benchmark, especially for the non-AWGN channels.
- Multi-task Learning (MTL) based Decoders: This is a benchmark for naive adaptation, termed MTL-$K$, which updates weights via $K$-step gradient descent directly from the MTL-trained weights (Eq. 4), with the same adaptation data batch size and learning rate as MIND-$K$.
III-C MIND-$K$ for Convolutional Code
We evaluate the fast adaptation ability under 4 different channels shown in Figure 3: (1) AWGN channel, (2) Radar Channel ( and ), (3) ATN (), and (4) untrained Radar (). The first three channels test the fast adaptation ability on meta-trained channels, whereas the fourth channel tests the learning ability on an unexpected channel with dramatically different parameters.
On trained channels, MIND with the Convolutional Code shows the following:
- Among static methods without adaptation ability, MIND-0 and MTL-0 show similar performance. MIND without adaptation still performs well.
- MIND-1 performs better than MTL-1, MIND-0, and MTL-0. MTL-1 shows a degradation, indicating that naive learning via average performance on all channels is not stable.
To show the continued learning property on the untrained channel, we also include MIND-10 in the comparison. Here we observe:
- MIND-1 outperforms MTL-1, MTL-0, and MIND-0. On the untrained channel, MIND still shows improvement with a single gradient step.
- MIND-10 outperforms MTL-1. On the untrained channel, applying more gradient steps can further improve performance.
III-D MIND-$K$ for Turbo Code
Because MTL-1 performs poorly, we omit it in this section. For the Turbo code, the channels tested, shown in Figure 4, are: (1) trained Radar channel (), and (2) untrained Radar channel ().
The performance of MIND with the neural Turbo decoder shows the same trend as with the Convolutional Code. The performance of MIND is consistent for both neural decoders as follows:
- Without adaptation, MIND-0 shows robust performance, comparable to a neural decoder trained on multiple channels.
- With limited data and computation, MIND-1 outperforms static methods and shows performance close to optimal or adaptive algorithms.
- On untrained channels, applying MIND with more gradient steps continually improves accuracy.
Compared to deploying MTL-trained neural decoders, MIND shows comparable performance without adaptation, and can conduct fast adaptation with minimal re-training on both trained and untrained channels. For further detailed discussion as well as experiments on other channels, please refer to the appendix.
IV Discussion
While we have designed MIND particularly for convolutional and Turbo codes, the methodology is not limited to these codes. In fact, the overall methodology is independent of the code structure and the neural network architecture, and thus can be adapted equally well to other neural decoding problems. We note that MIND is not expected to be a universal decoder for all channels; rather, the learnt initialization is good for a class of channels that are related to the archetypal channels. A precise characterization of this class is an interesting direction for future research. Furthermore, MIND still requires more samples than may be available in a typical training channel. We expect neural methods for joint channel estimation and data detection to perform better; this is left for future work.
Among future directions, it is worth combining other neural decoders with MIND, such as neural LDPC [29] [30] and Polar [32] decoders. Beyond neural decoder design, MAML can also be applied to Channel Autoencoder [27] design, which deals with designing an adaptive encoder and decoder. MAML is a growing area of interest as standalone research [10] [8] [9] [41] [5], with promising directions in combination with online learning [42]. These can usher in new directions of remarkable improvement in decoder design.
-A Deep Learning Based Neural Decoders
In this section, we discuss the neural decoders for Convolutional Codes and Turbo Codes. We start with a small primer on Recurrent Neural Network (RNN) and Gated Recurrent Unit (GRU).
-A1 RNN and its variants
Near-optimal neural decoders for both Convolutional Codes and Turbo Codes are based on the Recurrent Neural Network (RNN) [33]. An RNN takes the previous hidden state $h_{t-1}$ and the current input $x_t$ as its input, and outputs the current output $y_t$ and the current hidden state $h_t$ for the next time slot, defined as $(y_t, h_t) = f(x_t, h_{t-1})$. To use information from both the past and the future, a bidirectional RNN (bi-RNN) combines forward and backward RNNs, defined as $(y_t, h_t^f, h_t^b) = f(x_t, h_{t-1}^f, h_{t+1}^b)$. This is illustrated in Figure 5 [24].
Since the vanilla RNN is hard to train due to exploding and diminishing gradients, the Gated Recurrent Unit (GRU) is used as the primary network structure for neural decoders [40]. The bi-GRU uses the gating scheme shown in the bottom panel of Figure 5, which relieves both the exploding and the diminishing gradients. In this paper, we use the Bidirectional GRU (bi-GRU), the GRU version of the bi-RNN, as our primary neural structure.
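To make the bidirectional structure concrete, the sketch below runs a scalar vanilla tanh RNN forward and backward over a sequence and pairs the two hidden states at each position. It is deliberately simplified and is an assumption-laden illustration only: it uses a plain RNN rather than a GRU, a scalar hidden state, hypothetical weights `w_x` and `w_h`, and the same weights in both directions, whereas practical bi-RNNs learn separate weights per direction.

```python
import math

def rnn_pass(xs, w_x, w_h):
    """Run a scalar tanh RNN over xs and return the hidden state at each step."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bi_rnn(xs, w_x=0.8, w_h=0.5):
    """Pair each position's forward state (past) with its backward state (future)."""
    forward = rnn_pass(xs, w_x, w_h)
    backward = rnn_pass(xs[::-1], w_x, w_h)[::-1]
    return list(zip(forward, backward))

states = bi_rnn([1.0, -1.0, 0.5])
palindrome = bi_rnn([1.0, 2.0, 1.0])
```

Each position therefore sees a summary of everything before it and everything after it, which mirrors the forward-backward recursion of the BCJR algorithm.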
-A2 Neural Decoder for Convolutional Codes
Convolutional Codes have theoretically optimal Viterbi and Bahl-Cocke-Jelinek-Raviv (BCJR) decoders under the AWGN channel setting [38] [39]. Inspired by the forward-backward structure of the BCJR decoder, the bi-GRU based neural decoder [33] matches the optimal Bit Error Rate (BER) (BCJR) and Block Error Rate (BLER) (Viterbi) performance under the AWGN channel and outperforms existing heuristic algorithms under non-AWGN channels. More details about the bi-GRU are given in the appendix. The structure and hyper-parameter settings are shown in Figure 6 right two graphs. The convolutional encoder used in [33] is a Recursive Systematic Convolutional (RSC) Code with generator sequence and , thus RSC encoder is represented by .
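For readers unfamiliar with RSC codes, the following minimal sketch encodes a bit sequence with a rate-1/2 recursive systematic convolutional encoder. The function name `rsc_encode` and the default tap vectors are hypothetical and chosen only for illustration; they are not asserted to be the generator sequences used in the paper.

```python
def rsc_encode(bits, feedback=(1, 1, 1), feedforward=(1, 0, 1)):
    """Rate-1/2 RSC encoder: returns (systematic bits, parity bits).

    The tap vectors are illustrative assumptions. The first tap of each
    vector applies to the current feedback bit, the rest to the shift register.
    """
    memory = len(feedback) - 1
    state = [0] * memory
    systematic, parity = [], []
    for u in bits:
        a = u
        for tap, s in zip(feedback[1:], state):
            a ^= tap & s          # recursive part: feed the register back into the input
        p = feedforward[0] & a
        for tap, s in zip(feedforward[1:], state):
            p ^= tap & s          # parity from the feedforward taps
        systematic.append(u)
        parity.append(p)
        state = ([a] + state)[:memory]  # shift the register
    return systematic, parity

sys_bits, par_bits = rsc_encode([1, 0, 0, 0])
```

Because of the feedback, a single 1 at the input keeps producing parity bits after the input returns to 0, which is the recursive property that Turbo codes rely on.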
-A3 Neural Decoder for Turbo Codes
The capacity-approaching Turbo code, as an extension of the convolutional code, can be near-optimally decoded by bi-GRU based neural decoders [33]. The neural Turbo decoder structure is shown in Figure 7, where the N-BCJR blocks are pre-trained neural BCJR decoders with shared weights. The neural Turbo decoder matches the performance of the state-of-the-art Turbo decoder under the AWGN channel, while showing better performance under non-AWGN channels when compared to the widely used heuristic algorithms [33]. The Turbo code uses the RSC encoder , same as that for the Convolutional Code. The number of decoding iterations is 6. The neural N-BCJR decoder has the same design as shown in Figure 6 left third, except that the input shape changes to due to the inclusion of the likelihood bits. Further details regarding the implementation of the RNN-based neural decoder can be found in [33].
-B MIND Algorithm
-C MIND Performance
-C1 MTL is hard and unstable to adapt
In Section III, we conducted MTL with the same adaptation learning rate as MIND. In this section we test MTL with different adaptation learning rates , with and . As shown in Figure 8, MIND-1 outperforms both MTL-1 and MTL-10 with different learning rates. Since MTL is not trained to conduct fast adaptation, MTL-$K$ is very unstable. A naive application of gradient descent to MTL leads to unstable and degrading performance. MIND, in contrast, learns to adapt with a large learning rate.
-C2 MIND on Fading Channels
A fading channel can be represented as $y = hx + z$, where $h$ is the fading component and $z$ is the additive noise component as described in Section III. $h$ is taken to be normalized i.i.d. fast Rayleigh fading, i.e. $h = \sqrt{a^2 + b^2}$, where $a$ and $b$ are independent standard Gaussian random variables. Further, normalizing with $\mathbb{E}[h^2] = 1$ gives $h = \sqrt{(a^2 + b^2)/2}$. We decode under a coherent detection scheme, in which both $y$ and $h$ are fed to the decoder (inputs shape becomes instead of ). When testing the adaptation ability to additive noise channels on fading channels, the fading component is fixed. We test MIND under the same two channels: (1) trained Radar channel (), and (2) untrained Radar Channel and , shown below with shown in Figure 9.
The result on the fading channel shows the same trend as in the non-fading scenario. On both channels, MIND-1 outperforms MIND-0 and MTL-0, and on the untrained channel, MIND-10 outperforms MIND-1.
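A fast Rayleigh fading channel with coherent detection can be simulated as sketched below. The sketch assumes the multiplicative-gain-plus-additive-noise form $y = hx + z$, a gain built from two independent standard Gaussians and normalized to unit mean square, and Gaussian additive noise as a stand-in for whichever additive channel is under test. The function names are hypothetical.

```python
import math
import random

def rayleigh_gain(rng):
    """Rayleigh fading gain normalized so that E[h^2] = 1 (assumed normalization)."""
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return math.sqrt((a * a + b * b) / 2.0)

def fading_channel(x, sigma, rng):
    """Return (y, h) pairs; under coherent detection the decoder sees both."""
    out = []
    for xi in x:
        h = rayleigh_gain(rng)  # fast fading: a fresh gain for every symbol
        out.append((h * xi + rng.gauss(0.0, sigma), h))
    return out

rng = random.Random(0)
gains = [rayleigh_gain(rng) for _ in range(20000)]
mean_power = sum(g * g for g in gains) / len(gains)
noiseless = fading_channel([1.0] * 10, 0.0, rng)
```

Feeding the pair `(y, h)` rather than `y` alone is what changes the decoder's input shape in the coherent setting.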
-C3 MIND on Diversified Training Channel Set
In Section III, we meta-train the neural decoders with the following set of training channels:
- AWGN channel.
- ATN with and
- Radar with , .
This is a somewhat less diversified training channel set, which contains closely distributed channels, which makes it possible to learn a neural decoder that works well on all the training channels. We want to test the performance on a more diversified training channel set. Towards this end, we use the following, which has channel parameters spanning a larger variation scale:
- AWGN channel.
- ATN with and
- Radar with , .
To test the learning ability of MIND under untrained channel, we use a testing channel set with both the trained channels mentioned above, and the following channels not in the training channel set, with more diversified channel parameters:
- ATN with
- Radar with , .
The performance of MIND trained with the more diversified training channel set is shown in Figure 10. MIND still outperforms MTL, which shows that MIND can handle training channel sets with a larger scale of diversity.
-D Discussion on MIND Hyperparameters
We need to design MIND with proper hyper-parameters to control the trade-offs among data-efficiency, computations, and adapting stability. Three major hyper-parameters affect the performance of MIND: adaption batch size , the test adaption steps , and adaption learning rate . We empirically examine the effects of the above three hyper-parameters on neural Convolutional Code decoder as follows:
-D1 Adaption batch size
The adaptation batch size depends on the amount of available pilot data sampled from the new channel, which determines the data efficiency of MIND adaptation. The performance for different adaptation batch sizes is shown in Figure 11, on the trained Radar channel () and ATN channel (), and the untrained Radar channel ().
On trained channels, different adaptation batch sizes show similar performance, close to optimal/adaptive methods. However on untrained ATN channel(), shows significant improvement compared to . With a small adaptation batch size , the MIND-trained model tends to learn only a model that works well for all trained channels, without adaptation ability. Only when the adaptation batch size is large enough does MIND start to use the data sampled from the new channel. Different adaptation batch sizes reveal the trade-off between data efficiency and adaptation ability.
-D2 Adaption steps
The number of test adaptation steps depends on the limits of computation resources, which determine the computation efficiency of MIND. The effect of the adaptation steps is also shown in Figure 11. Note that on the trained channel, adapting with steps and adapting with step show similar performance. However, on the untrained channel, adapting with more steps improves the performance. The experiment shows that it is beneficial to conduct more adaptation steps with MIND when testing on untrained channels.
-D3 Adaption Learning rate
The adaptation learning rate controls the trade-off between stability and adaptation speed. The performance for different adaptation learning rates is shown in Figure 12, on the trained AWGN channel and the untrained Radar channel ().
A high adaptation learning rate shows worse performance on the AWGN channel, as shown in Figure 12 left, while it performs better on the untrained Radar channel, as shown in Figure 12 right. The experiment shows that the adaptation learning rate controls how aggressively MIND adapts. When trained with a higher adaptation learning rate, MIND learns to adapt aggressively to data from a new channel, which improves its adaptation ability on a new channel at the cost of performance on trained channels. On the other hand, with a small adaptation learning rate, MIND learns to conduct a more conservative adaptation.
The optimal adaptation learning rate depends on the use case. When the testing channel is similar to the training channel set, a smaller adaptation learning rate is more favorable. When the testing channel is very different from the training channels, a higher adaptation learning rate is preferred.
References
- [1] Tse, David, and Pramod Viswanath. Fundamentals of wireless communication. Cambridge University Press, 2005.
- [2] Proakis, John G. Communication systems engineering. Vol. 2. New Jersey: Prentice Hall, 1994.
- [3] Sesia, Stefania, Matthew Baker, and Issam Toufik. LTE-the UMTS long term evolution: from theory to practice. John Wiley & Sons, 2011.
- [4] Vanschoren, Joaquin. "Meta-learning: A survey." arXiv preprint arXiv:1810.03548 (2018).
- [5] Riemer M, Cases I, Ajemian R, Liu M, Rish I, Tu Y, Tesauro G. Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference. arXiv preprint arXiv:1810.11910. 2018 Oct 29.
- [6] Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." International Conference on Machine Learning (ICML 2017).
- [7] Finn, Chelsea, and Sergey Levine. "Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm." International Conference on Learning Representations (ICLR 2018).
- [8] Finn C, Xu K, Levine S. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems 2018 (pp. 9537-9548).
