Deep Unfolding for Communications Systems: A Survey and Some New Directions

Alexios Balatsoukas-Stimming; Christoph Studer

arXiv:1906.05774·eess.SP·October 9, 2019

TL;DR

This paper surveys the deep unfolding technique, which combines iterative algorithms with neural networks, highlighting its applications in communication systems like MIMO detection, precoding, and decoding, and discusses future research directions.

Contribution

It provides a comprehensive overview of deep unfolding in communications and introduces new directions for applying this method to various communication tasks.

Findings

01. Deep unfolding improves detection and decoding in MIMO systems.
02. The method demonstrates versatility across multiple communication tasks.
03. Open research problems are identified for future exploration.

Abstract

Deep unfolding is a method of growing popularity that fuses iterative optimization algorithms with tools from neural networks to efficiently solve a range of tasks in machine learning, signal and image processing, and communication systems. This survey summarizes the principle of deep unfolding and discusses its recent use for communication systems with focus on detection and precoding in multi-antenna (MIMO) wireless systems and belief propagation decoding of error-correcting codes. To showcase the efficacy and generality of deep unfolding, we describe a range of other tasks relevant to communication systems that can be solved using this emerging paradigm. We conclude the survey by outlining a list of open research problems and future research directions.

Table I: Summary of unfolded learned algorithms for MIMO detection and MIMO precoding.

Reference	[9]	[10]	[11]	[12]	[13] and [14]	[15]
Task	Detection	Detection	Detection	Detection	Detection	Precoding
Algorithm	PGD	PGD	PGD	OAMP	PGD	PGD
Parameters	$T (12 M \times 2 K + 2 K + 6 M)$	$T (12 M \times 2 K + 2 K + 4 M)$	$T (8 M \times 2 K + 2 K + 4 M)$	$2 T$	$2 T + 1$	$2 T$

Alexios Balatsoukas-Stimming (1,2) and Christoph Studer (3)

(1) Telecommunications Circuits Laboratory, École polytechnique fédérale de Lausanne, Lausanne, Switzerland

(2) Electronic Systems Group, Eindhoven University of Technology, Eindhoven, The Netherlands ([email protected])

(3) School of Electrical and Computer Engineering, Cornell University, Ithaca, USA ([email protected])

The work of ABS was supported by the Swiss NSF project PZ00P2_179686. The work of CS was supported in part by Xilinx Inc. and the US NSF under grants ECCS-1408006, CCF-1535897, CCF-1652065, CNS-1717559, and ECCS-1824379. The authors would like to thank O. Castañeda and T. Goldstein for discussions on deep unfolding.

Index Terms:

Machine learning, channel coding, massive MIMO, deep unfolding.

I Introduction

A large number of signal processing tasks in communications systems, such as detection and decoding, can be formulated as optimization problems. These optimization problems are typically solved using numerical algorithms that iteratively refine the solution. Most practical communications applications require solving these problems at high throughput and low latency, which implies that one can afford only a very small number of algorithm iterations (e.g., ten or fewer). In order to find accurate solutions with a small number of iterations, numerical solvers require careful parameter tuning (e.g., step-size selection). While the numerical optimization literature has focused extensively on analyzing convergence rates and stability given step-size conditions [1], very little is known about optimal parameter tuning under stringent iteration constraints. In practice, the algorithm parameters are typically set using heuristics (e.g., tuned by hand using simulations) or pessimistic bounds (e.g., given by the Lipschitz constant). However, such conventional approaches are prone to result in suboptimal performance and may cause stability issues if the system conditions change (and the parameters would need to be adapted in real time).

I-A Model-Driven Neural Networks via Deep Unfolding

In recent years, neural networks (NNs) have been proposed to replace a range of signal processing tasks in communications systems [2, 3, 4, 5, 6]. While the performance of such NN-assisted methods is promising in many applications, they suffer from the following drawbacks: (i) high computational complexity and memory requirements, and (ii) virtually no performance guarantees are available. As an alternative to such black-box methods, model-driven NNs are becoming increasingly popular in communications systems [7]. The idea of model-driven NNs is to fuse principled algorithms that have performance guarantees with tools from NNs, with the goal of combining the best of both worlds. Deep unfolding [8] is a powerful instance of such model-driven NNs and is also rapidly gaining popularity in the communications community.

In the words of the authors of [8], deep unfolding can be summarized as follows: “[…] given a model-based approach that requires an iterative inference method, we unfold the iterations into a layer-wise structure analogous to a neural network.” Put simply, deep unfolding takes an iterative algorithm with a fixed number of iterations $T$ , unfolds its structure, and introduces a number of trainable parameters. These parameters are then learned using techniques from deep learning (with suitable loss functions, stochastic gradient descent, and back-propagation). The resulting unfolded algorithm with learned parameters can then be used to solve a range of tasks in communications systems. Deep unfolding has several practical advantages: (i) Existing performance guarantees for the original iterative algorithms may apply verbatim to learned unfolded networks and appropriate constraints can be imposed on the learned parameters. (ii) Most unfolded communications algorithms have a relatively small number of trainable parameters, which simplifies training. (iii) Unfolded algorithms are typically based on well-known methods for which efficient hardware implementations are readily available, which reduces design time. (iv) The resulting unfolded algorithms are often intuitive, interpretable, and have low complexity and memory requirements, which is in stark contrast to black-box NNs.
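The unfolding principle can be illustrated on a toy problem. The following is a minimal sketch, not taken from any of the surveyed papers: it unfolds $T=2$ gradient-descent iterations on a diagonal quadratic and treats the per-layer step sizes as the trainable parameters. The curvatures in `a_diag`, the training vectors, the candidate grid, and the use of an exhaustive grid search as a stand-in for stochastic gradient descent with back-propagation are all assumptions made for illustration.

```python
from itertools import product


def unfolded_gd(a_diag, b, steps):
    """Run len(steps) gradient steps on f(x) = 0.5*sum(a*x^2) - sum(b*x).

    Each entry of `steps` plays the role of one trainable per-layer parameter.
    """
    x = [0.0] * len(b)  # x_0 = 0
    for gamma in steps:  # one loop pass corresponds to one unfolded layer
        x = [xi - gamma * (ai * xi - bi) for xi, ai, bi in zip(x, a_diag, b)]
    return x


def squared_error(a_diag, b, steps):
    """Squared distance between the last layer's output and the minimizer b/a."""
    x = unfolded_gd(a_diag, b, steps)
    return sum((xi - bi / ai) ** 2 for xi, ai, bi in zip(x, a_diag, b))


def train_steps(a_diag, training_set, grid, num_layers):
    """Stand-in for SGD: pick the step sizes with the lowest total training loss."""
    def loss(steps):
        return sum(squared_error(a_diag, b, steps) for b in training_set)
    return min(product(grid, repeat=num_layers), key=loss)


a_diag = [1.0, 10.0]  # hypothetical curvatures; the Lipschitz constant is 10
training_set = [[1.0, 2.0], [-3.0, 0.5], [0.7, -1.2]]
grid = [0.05, 0.1, 0.2, 0.5, 1.0]

learned = train_steps(a_diag, training_set, grid, num_layers=2)
fixed = (0.1, 0.1)  # conservative step size 1/L in both layers
test_b = [2.0, -4.0]
err_learned = squared_error(a_diag, test_b, learned)
err_fixed = squared_error(a_diag, test_b, fixed)
print(learned, err_learned, err_fixed)
```

In this toy setting, the learned pair of step sizes solves the problem in two layers, whereas the pessimistic Lipschitz-based step size leaves a large residual error after the same number of iterations.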

In this survey, we discuss several applications of deep unfolding to communications systems, with a particular focus on MIMO systems and belief-propagation-based decoding of error-correcting codes. (For uniformity of notation, our formulations of various algorithms may differ slightly from those used in the original papers.) We also outline other applications and discuss a number of interesting open research directions.

II Deep Unfolding for MIMO Systems

We now describe applications of deep unfolding to signal processing tasks in multiple-input multiple-output (MIMO) wireless systems. More specifically, we discuss recent results on MIMO detection [9, 10, 11, 12, 13, 14, 16, 17] and MIMO precoding [15], which are also summarized in Table I.

II-A MIMO Data Detection

The common baseband input-output relation of data transmission over a frequency-flat MIMO channel is as follows:

$$\tilde{\mathbf{y}}=\tilde{\mathbf{H}}\tilde{\mathbf{x}}+\tilde{\mathbf{n}}.\tag{1}$$

Here, $\tilde{\mathbf{y}}\in\mathbb{C}^{N}$ is the complex-valued receive vector, $\tilde{\mathbf{H}}\in\mathbb{C}^{N\times M}$ is the $N\times M$ complex-valued MIMO channel matrix, $\tilde{\mathbf{x}}\in\mathcal{L}^{M}$ is the vector of transmit symbols, where $\mathcal{L}$ is the transmit constellation set, and $\tilde{\mathbf{n}}\in\mathbb{C}^{N}$ is a complex Gaussian noise vector distributed according to $\mathcal{CN}\!\left(0,\sigma^{2}\mathbf{I}\right)$.

The goal of MIMO data detection is to compute an estimate $\hat{\mathbf{x}}$ of the transmitted data vector $\tilde{\mathbf{x}}$, given the receive vector $\tilde{\mathbf{y}}$, the channel matrix $\tilde{\mathbf{H}}$, and knowledge of the statistics of $\tilde{\mathbf{n}}$. Maximum likelihood (ML) data detection for this model amounts to solving the following optimization problem:

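$$\hat{\mathbf{x}}=\underset{\tilde{\mathbf{x}}\in\mathcal{L}^{M}}{\arg\min}\;\left\|\tilde{\mathbf{y}}-\tilde{\mathbf{H}}\tilde{\mathbf{x}}\right\|_{2}^{2}.\tag{2}$$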
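To make the cost of exact ML detection concrete, the following minimal sketch performs the exhaustive search over all $|\mathcal{L}|^{M}$ candidate vectors for the Gaussian-noise model. The QPSK constellation, the $2\times 2$ channel, and the helper name `ml_detect` are hypothetical choices for illustration.

```python
from itertools import product


def ml_detect(H, y, constellation):
    """Exhaustive-search ML detection: argmin over x in L^M of ||y - H x||^2."""
    num_tx = len(H[0])
    best_x, best_cost = None, float("inf")
    for x in product(constellation, repeat=num_tx):  # |L|^M candidates
        residual = [yi - sum(h * xj for h, xj in zip(row, x))
                    for row, yi in zip(H, y)]
        cost = sum(abs(r) ** 2 for r in residual)
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x


qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]  # hypothetical constellation
H = [[1.0 + 0.2j, 0.3 - 0.1j],
     [0.1 + 0.4j, 0.9 - 0.3j]]  # hypothetical 2x2 channel
x_true = (1 - 1j, -1 + 1j)
# Noiseless receive vector, so that the true vector has zero cost.
y = [sum(h * xj for h, xj in zip(row, x_true)) for row in H]
x_hat = ml_detect(H, y, qpsk)
print(x_hat)
```

The loop visits $4^{2}=16$ candidates here; the count grows exponentially in $M$, which is why approximate detectors are used in practice.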

Since the ML data detection problem is NP-hard, computationally efficient approximate methods are used in practice. To use NN-based methods for detection, one often operates on real-valued data using the real-valued decomposition of (1):

$$\mathbf{y}=\mathbf{H}\mathbf{x}+\mathbf{n},\tag{3}$$

where the quantities $\mathbf{y}\in\mathbb{R}^{2N}$, $\mathbf{H}\in\mathbb{R}^{2N\times 2M}$, $\mathbf{x}\in\mathbb{R}^{2M}$, and $\mathbf{n}\in\mathbb{R}^{2N}$ are defined as follows:

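$$\mathbf{y}=\begin{bmatrix}\Re\{\tilde{\mathbf{y}}\}\\ \Im\{\tilde{\mathbf{y}}\}\end{bmatrix},\quad\mathbf{H}=\begin{bmatrix}\Re\{\tilde{\mathbf{H}}\}&-\Im\{\tilde{\mathbf{H}}\}\\ \Im\{\tilde{\mathbf{H}}\}&\Re\{\tilde{\mathbf{H}}\}\end{bmatrix},\tag{4}$$

$$\mathbf{x}=\begin{bmatrix}\Re\{\tilde{\mathbf{x}}\}\\ \Im\{\tilde{\mathbf{x}}\}\end{bmatrix},\quad\mathbf{n}=\begin{bmatrix}\Re\{\tilde{\mathbf{n}}\}\\ \Im\{\tilde{\mathbf{n}}\}\end{bmatrix}.\tag{5}$$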
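The real-valued decomposition can be checked numerically. The sketch below assumes the standard stacking of real and imaginary parts, i.e., vectors become $[\Re;\Im]$ and the channel becomes the block matrix $[\Re,-\Im;\Im,\Re]$; the helper names and the $3\times 2$ example channel are hypothetical.

```python
def to_real(H_c, y_c):
    """Stack real and imaginary parts: y = [Re; Im], H = [[Re, -Im], [Im, Re]]."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H_c]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H_c]
    y_r = [v.real for v in y_c] + [v.imag for v in y_c]
    return top + bottom, y_r


def matvec(A, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


H_c = [[1 + 2j, 0.5 - 1j],
       [-0.3 + 0.1j, 2 + 0j],
       [0.7j, 1 - 1j]]  # hypothetical channel with N = 3, M = 2
x_c = [1 - 1j, -1 + 1j]
y_c = matvec(H_c, x_c)  # noiseless, so both models must agree exactly

H_r, y_r = to_real(H_c, y_c)
x_r = [v.real for v in x_c] + [v.imag for v in x_c]
y_check = matvec(H_r, x_r)  # real-valued model evaluated directly
print(y_r, y_check)
```

The real-valued channel has dimension $2N\times 2M$, and multiplying it with the stacked transmit vector reproduces the stacked receive vector.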

The NN-based MIMO data detection methods in [9, 10] are obtained by unfolding the iterations of projected gradient descent data detection algorithms (e.g., the method in [18]) and enriching these algorithms with additional dimensions and trainable parameters. Specifically, in [9] the following updates are used for $t=0,\ldots,T{-}1$:

[TABLE]

where $\text{ReLU}(x)=\max\{x,0\}$ , $\hat{\mathbf{x}}_{0}=\mathbf{0}$ , $\psi_{\mathbf{k}_{t}}(\cdot)$ is a soft sign operator parameterized by the trainable parameter vector $\mathbf{k}_{t}$ , and $\mathbf{z}_{t}$ is of dimension $2K>2M$ . The quantity $\mathbf{H}^{T}\mathbf{y}$ is included in the input vector of (6) because it is a sufficient detection statistic, while $\mathbf{H}^{T}\mathbf{H}\hat{\mathbf{x}}_{t}$ is an estimate of the (noiseless) sufficient detection statistic at iteration $t$ [9]. The set of trainable parameters is $\left\{\mathbf{W}_{1t},\mathbf{W}_{2t},\mathbf{W}_{3t},\mathbf{b}_{1t},\mathbf{b}_{2t},\mathbf{b}_{3t},\mathbf{k}_{t}:t=0,\ldots,T{-}1\right\}$ ; these parameters can be learned by specifying a suitable loss function and using tools from deep neural networks (such as stochastic gradient descent and back-propagation). Finally, hard decisions are obtained by computing $\hat{\mathbf{x}}=\text{sign}\left(\hat{\mathbf{x}}_{T}\right)$ .

The work of [10] mimics a projected gradient algorithm more closely, thus simplifying the algorithm iteration (6)–(8):

[TABLE]

where $\sigma(\cdot)$ is a logistic sigmoid. The number of trainable parameters $\left\{\mathbf{W}_{1t},\mathbf{W}_{2t},\mathbf{W}_{3t},\mathbf{b}_{1t},\mathbf{b}_{2t},\mathbf{b}_{3t},\delta_{1t},\delta_{2t}:t=0,\ldots,T{-}1\right\}$ is reduced because the $\mathbf{W}_{1t}$ matrices are of lower dimension. To aid parameter learning, the cost function in [9, 10] is based on the outputs of all layers. In addition, a residual feature is used, in which the output of each layer is a weighted average with the output of the previous layer. Details on how to obtain soft outputs and to extend the method to high-order constellations are provided in [10]. The unfolded data detectors are compared with a wide range of existing detectors in the literature as well as with a standard NN-based detector over different channel models and are shown to provide competitive performance.

To enable support for higher-order constellations, the work of [11] proposes to use a sum of shifted sigmoid functions:

[TABLE]

Here, the shifts $\tau_{i}$ are pre-defined based on the constellation set $\mathcal{L}$ and $A$ is a fixed offset. Moreover, two distinct NN-based detectors are trained with different initialization strategies and the best output is kept. It is also argued that only ML-detectable training samples should be used for training. Simulation results show that close-to-ML performance is achieved for a constellation with $L=5$ levels over a fixed (during training and data detection) MIMO channel with $N=M=8$.

The work of [12] unfolds the iterations of an orthogonal approximate message passing (OAMP) detector [18], where only trainable scalars $\left\{\gamma_{t},\theta_{t}:t=0,\ldots,T{-}1\right\}$ are introduced. The following updates are used for $t=0,\ldots,T{-}1$:

[TABLE]

where $\mathbf{W}_{t}$ is a function of the channel matrix $\mathbf{H}$, $v_{t}^{2}$, and $\sigma^{2}$, and $\mathbf{C}_{t}$ is a function of $\mathbf{H}$ and $\mathbf{W}_{t}$, as defined in [12]. Simulation results for Rayleigh and correlated channels demonstrate that data-driven tuning of $\gamma_{t}$ and $\theta_{t}$ can lead to significant performance improvements compared to standard OAMP.

The algorithms in [13, 14] target detection for massive overloaded MIMO channels, i.e., channels with fewer receive antennas than transmitted streams ($N<M$). The proposed data detection algorithm is based on projected gradient descent for a total of $T$ iterations and with the introduced trainable scalars $\left\{\alpha,\gamma_{t},\theta_{t}:t=0,\ldots,T{-}1\right\}$:

[TABLE]

where $\hat{\mathbf{x}}_{0}=\mathbf{0}$ and $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T}+\alpha\mathbf{I})^{{-}1}$. The authors use incremental training to avoid vanishing gradient problems. Simulation results show that the trained projected gradient detector provides similar performance to other detectors but at a significantly lower complexity. Moreover, the trained projected gradient detector is also shown to perform well in traditional, i.e., non-overloaded, MIMO systems.
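The structure of an unfolded projected gradient detector with per-layer scalars can be sketched as follows. This is a simplified illustration rather than the method of [13, 14]: it assumes $\pm 1$ symbols, uses a $\tanh$ soft projection scaled by $|\theta_{t}|$, and replaces the matrix $\mathbf{W}$ by the plain matched filter $\mathbf{H}^{T}$ to avoid a matrix inversion. The values of `gammas` and `thetas` are set by hand here, whereas they would be learned in an unfolded detector.

```python
import math


def matvec(A, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def unfolded_pg_detect(H, y, gammas, thetas):
    """T = len(gammas) unfolded projected-gradient layers for +/-1 symbols."""
    Ht = transpose(H)  # assumption: W = H^T instead of a regularized inverse
    x = [0.0] * len(H[0])  # x_0 = 0
    for gamma, theta in zip(gammas, thetas):
        residual = [yi - hi for yi, hi in zip(y, matvec(H, x))]
        grad_step = matvec(Ht, residual)
        r = [xi + gamma * gi for xi, gi in zip(x, grad_step)]  # gradient step
        x = [math.tanh(ri / abs(theta)) for ri in r]  # soft projection
    return [1 if xi >= 0 else -1 for xi in x]  # hard decision


H = [[1.0, 0.2], [0.1, 0.9], [0.3, -0.2]]  # hypothetical real-valued channel
x_true = [1, -1]
y = matvec(H, x_true)  # noiseless receive vector
x_hat = unfolded_pg_detect(H, y, gammas=[0.5, 0.5, 0.5], thetas=[1.0, 0.7, 0.5])
print(x_hat)
```

Each loop pass corresponds to one layer of the unfolded network, so the detector has only two scalars per layer in this sketch.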

Very recently, the papers [16] and [17] used deep unfolding for MIMO data detection based on conjugate gradients and projected gradient descent, respectively; both methods achieve near-ML performance at low complexity.

II-B Multi-User (MU) MIMO Precoding

MU-MIMO precoding consists of multiplying the transmit symbol vector $\tilde{\mathbf{x}}$ with a precoding matrix $\tilde{\mathbf{P}}$ so that a suitably defined performance metric (e.g., the SNR at the receiver) is maximized. With precoding, the complex-valued system model in (1) becomes

$$\tilde{\mathbf{y}}=\tilde{\mathbf{H}}\tilde{\mathbf{v}}+\tilde{\mathbf{n}},\tag{19}$$

where $\tilde{\mathbf{v}}=\tilde{\mathbf{P}}\tilde{\mathbf{x}}$ and the corresponding real-valued model of (3) using $\mathbf{y}$, $\mathbf{H}$, $\mathbf{v}$, and $\mathbf{n}$ can be derived accordingly.

The results in [15] describe a projected gradient descent algorithm for precoding in massive MU-MIMO systems with $1$-bit quantization at the transmitter. In this scenario, each element of the vector $\mathbf{v}$ is constrained to the binary set $\{{-}\upsilon,{+}\upsilon\}$, where $\upsilon^{2}={\frac{P}{2M}}$ is selected to satisfy a transmit power constraint $P$. The following updates with trainable scalars $\left\{\tau_{t},\rho_{t}:t=0,\ldots,T{-}1\right\}$ are used for a total of $T$ iterations:

[TABLE]

where $\mathrm{prox}_{g}(\mathbf{z};\rho_{t},\xi)=\mathrm{clip}\!\left(\rho_{t}\Re\{\mathbf{z}\},\xi\right)+j\mathrm{clip}\!\left(\rho_{t}\Im\{\mathbf{z}\},\xi\right)$ is the proximal operator, $\mathbf{A}=\left(\mathbf{I}-\mathbf{x}\mathbf{x}^{T}/\|\mathbf{x}\|_{2}^{2}\right)\mathbf{H}$, and $\mathbf{v}_{T}$ is quantized to $\{{-}\upsilon,{+}\upsilon\}$. Simulation results for a range of channel models show that learning suitable parameters $\tau_{t},\rho_{t}$ allows one to decrease the number of iterations $T$ by a factor of two for the same error-rate performance. The computational graph corresponding to the unfolded version of (20) and (21) along with the final quantization $Q(\cdot)$ and transmission over $\mathbf{H}$ is shown in Fig. 1.
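The proximal operator and the final quantization are simple element-wise operations. The sketch below assumes that $\mathrm{clip}(\cdot,\xi)$ clamps each real value to the interval $[-\xi,+\xi]$ and that the final quantizer acts separately on real and imaginary parts; the function names and the example values are hypothetical.

```python
def clip(value, limit):
    """Clamp a real value to the interval [-limit, +limit] (assumed meaning)."""
    return max(-limit, min(limit, value))


def prox_g(z, rho, xi):
    """Apply clip(rho*Re{z}, xi) + j*clip(rho*Im{z}, xi) to every entry of z."""
    return [complex(clip(rho * v.real, xi), clip(rho * v.imag, xi)) for v in z]


def quantize_one_bit(v, upsilon):
    """Map real and imaginary parts of every entry to {-upsilon, +upsilon}."""
    def q(t):
        return upsilon if t >= 0 else -upsilon
    return [complex(q(e.real), q(e.imag)) for e in v]


z = [0.2 - 0.9j, -1.5 + 0.1j]
v = prox_g(z, rho=2.0, xi=1.0)  # rho would be a learned per-layer scalar
v_q = quantize_one_bit(v, upsilon=0.5)
print(v, v_q)
```

Scaling by a value of `rho` larger than one pushes entries toward the clipping limits, which is one way to see how a learned $\rho_{t}$ steers the iterates toward the binary set before the final quantization.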

III Deep Unfolding for Belief-Propagation-Based Channel Decoding

Belief propagation is an iterative message-passing algorithm that is commonly used to decode error-correcting codes. The message-passing strategy is typically described by a bipartite Tanner graph that represents the parity-check matrix of the code. These Tanner graphs consist of two types of nodes, namely variable nodes and check nodes. Each variable node is associated with a codeword bit and each check node is associated with a parity-check equation. Let $\mathcal{V}$ denote the set of variable nodes, $\mathcal{C}$ denote the set of check nodes, and $\mathcal{N}(x)$ denote the set of (one-hop) neighbors of a node $x$ . Then, for each $v\in\mathcal{V},c\in\mathcal{N}(v)$ , the variable-to-check messages $m_{t}^{v\rightarrow{c}}$ at iteration $t\in\{1,\ldots,T\}$ are:

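$$m_{t}^{v\rightarrow c}=l^{v}+\sum_{c^{\prime}\in\mathcal{N}(v)\setminus c}m_{t-1}^{c^{\prime}\rightarrow v},\tag{22}$$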

where $l^{v}$ denotes the channel log-likelihood ratio (LLR) for variable node $v$ and $m_{0}^{c\rightarrow{v}}=0$ by convention. Moreover, for each $c\in\mathcal{C},v\in\mathcal{N}(c)$, the check-to-variable messages $m_{t}^{c\rightarrow{v}}$ at iteration $t$ are given by

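$$m_{t}^{c\rightarrow v}=2\tanh^{-1}\!\left(\prod_{v^{\prime}\in\mathcal{N}(c)\setminus v}\tanh\!\left(\frac{m_{t}^{v^{\prime}\rightarrow c}}{2}\right)\right).\tag{23}$$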

For each $v\in\mathcal{V}$, the bit-decision metric is calculated as

$$m_{t}^{v}=l^{v}+\sum_{c^{\prime}\in\mathcal{N}(v)}m_{t}^{c^{\prime}\rightarrow v},\tag{24}$$

and final bit-decisions are generated as follows:

[TABLE]
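The message-passing schedule described above can be sketched in a few lines. The code below implements standard sum-product BP on the Tanner graph of a parity-check matrix and assumes the common convention that a non-negative LLR maps to bit 0; the function name `bp_decode` and the length-3 repetition code used as an example are hypothetical choices.

```python
import math


def bp_decode(H, llr, num_iters):
    """Sum-product BP on the Tanner graph of the parity-check matrix H."""
    checks = [[v for v, h in enumerate(row) if h] for row in H]  # N(c)
    # Check-to-variable messages, initialized to zero by convention.
    c2v = {(c, v): 0.0 for c, vs in enumerate(checks) for v in vs}
    for _ in range(num_iters):
        # Variable-to-check: channel LLR plus all other incoming check messages.
        v2c = {}
        for (c, v) in c2v:
            v2c[(v, c)] = llr[v] + sum(m for (c2, v2), m in c2v.items()
                                       if v2 == v and c2 != c)
        # Check-to-variable: tanh rule over all other variable nodes.
        for c, vs in enumerate(checks):
            for v in vs:
                prod = 1.0
                for v2 in vs:
                    if v2 != v:
                        prod *= math.tanh(v2c[(v2, c)] / 2.0)
                prod = max(min(prod, 1 - 1e-12), -1 + 1e-12)  # keep atanh finite
                c2v[(c, v)] = 2.0 * math.atanh(prod)
    # Bit-decision metric, then hard decision (assumed: LLR >= 0 means bit 0).
    metric = [llr[v] + sum(m for (c, v2), m in c2v.items() if v2 == v)
              for v in range(len(llr))]
    return [0 if m >= 0 else 1 for m in metric]


H = [[1, 1, 0],
     [0, 1, 1]]  # parity checks of a length-3 repetition code
llr = [1.5, -0.4, 2.0]  # the middle bit is received weakly in error
hard_in = [0 if l >= 0 else 1 for l in llr]
decoded = bp_decode(H, llr, num_iters=2)
print(hard_in, decoded)
```

Unfolding this decoder amounts to fixing `num_iters` to $T$ and attaching trainable weights to the messages of each iteration.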

In [19], the variable-to-check BP equation in (22) is modified by adding trainable weights $w_{t}^{v}$ and $w_{t}^{c^{\prime}}$, which yields:

[TABLE]

Moreover, bit-decisions are generated using the following soft (i.e., differentiable) version of (25):

[TABLE]

The authors use a binary cross-entropy loss function that uses the $\hat{u}_{t}^{v}$ values from all iterations $t$ in order to aid learning and avoid vanishing gradient problems. The unfolded BP decoder is trained using synthetically-generated training data for a range of different signal-to-noise-ratio (SNR) values. Simulation results for a variety of BCH codes show that the unfolded BP decoder with learned weights significantly outperforms traditional BP decoders.

The methods in [20, 21] improve upon [19] by using a recurrent neural network (RNN) structure so that the weights $w_{t}^{v}$ and $w_{t}^{c^{\prime}}$ do not change over the iterations. Moreover, the authors use a technique called relaxation, where consecutive messages are combined using learned weights, which improves the convergence behavior of the BP decoder. In addition, [22, 21] simplify [19] by using normalized min-sum (MS) decoding for the check nodes with a learned parameter $w$:

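$$m_{t}^{c\rightarrow v}=w\left(\prod_{v^{\prime}\in\mathcal{N}(c)\setminus v}\operatorname{sign}\!\left(m_{t}^{v^{\prime}\rightarrow c}\right)\right)\times\min_{v^{\prime}\in\mathcal{N}(c)\setminus v}\left|m_{t}^{v^{\prime}\rightarrow c}\right|.\tag{28}$$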
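The effect of the normalization parameter can be seen by comparing the min-sum check-node output with the exact tanh-rule output for the same incoming messages. This is a minimal sketch; the message values and the value $w=0.8$ are hypothetical, and in an unfolded decoder $w$ would be learned.

```python
import math


def check_update_bp(incoming):
    """Exact tanh-rule check-node output for the other nodes' messages."""
    prod = 1.0
    for m in incoming:
        prod *= math.tanh(m / 2.0)
    return 2.0 * math.atanh(prod)


def check_update_nms(incoming, w):
    """Normalized min-sum: w * (product of signs) * (minimum magnitude)."""
    sign = 1.0
    for m in incoming:
        if m < 0:
            sign = -sign
    return w * sign * min(abs(m) for m in incoming)


incoming = [2.0, -1.5, 0.5]  # hypothetical variable-to-check messages
exact = check_update_bp(incoming)
plain_ms = check_update_nms(incoming, w=1.0)
scaled_ms = check_update_nms(incoming, w=0.8)
print(exact, plain_ms, scaled_ms)
```

Plain min-sum ($w=1$) has the correct sign but overestimates the message magnitude relative to the tanh rule; a scaling factor below one moves the output closer to the exact value.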

The method in [23] uses an unfolded normalized MS algorithm for the decoding of polar codes. The main difference with [22, 21], apart from the slightly different message scheduling required to decode polar codes, is that the normalization parameter $w$ is allowed to differ for every message and for every iteration. Simulation results for polar codes of various block-lengths and rate $R=1/2$ show that the unfolded MS decoder with per-message learned normalization parameters outperforms the standard normalized MS decoder by approximately $0.5$ dB. The authors also provide a high-level discussion of hardware implementation considerations.

The authors of [24] propose a hybrid BP-NN decoder for polar codes, where a fraction of the messages is calculated using standard BP message-passing rules, while the remaining messages are calculated using trained NNs. This approach enables the scaling of NN-assisted decoders to large block-lengths, while simulation results show very competitive performance with respect to conventional decoders for polar codes.

The method in [25] unfolds the MS algorithm to decode polar codes. The authors first use a method to convert the message-passing graph of polar codes into a conventional sparse Tanner graph so that the standard BP message-passing rules of (22) and (23) can be used verbatim, thus avoiding the different message schedule used in [23, 24]. Moreover, a single weight $w^{\prime}$ is used for all variable-to-check messages at all iterations so that (26) is simplified to:

[TABLE]

The non-normalized MS update rule is used for the check-to-variable messages, i.e., (28) with $w=1$. Simulation results show that the use of a single weight $w^{\prime}$ has a negligible effect on the error rate of the decoder, while significantly reducing the complexity of both learning and decoding.

The papers [26, 27] propose to unfold the normalized-offset MS algorithm to decode LDPC codes and polar codes, respectively. The minimum-finding part in (28) is replaced by:

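$$\alpha_{t}^{v\rightarrow c}\cdot\min_{v^{\prime}\in\mathcal{N}(c)\setminus v}\max\!\left(\left|m_{t}^{v^{\prime}\rightarrow c}\right|-\beta_{t}^{v\rightarrow c},0\right),\tag{30}$$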

where $\alpha_{t}^{v\rightarrow{c}}$ and $\beta_{t}^{v\rightarrow{c}}$ are per-message and per-iteration trainable parameters. Simulation results show that these additional parameters can improve the performance of unfolded MS decoding with respect to standard MS and BP decoding as well as previous works on unfolded MS decoding.
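The magnitude computation of normalized-offset min-sum can be sketched as follows. For brevity, the sketch uses a single scalar pair `alpha`, `beta` per call, whereas the unfolded decoders in [26, 27] learn such parameters per message and per iteration; the numerical values are hypothetical.

```python
def offset_ms_magnitude(incoming, alpha, beta):
    """Scaled minimum of the offset-corrected magnitudes max(|m| - beta, 0)."""
    return alpha * min(max(abs(m) - beta, 0.0) for m in incoming)


incoming = [2.0, -1.5, 0.5]  # hypothetical variable-to-check messages
mag_small_offset = offset_ms_magnitude(incoming, alpha=0.9, beta=0.2)
mag_large_offset = offset_ms_magnitude(incoming, alpha=0.9, beta=0.6)
print(mag_small_offset, mag_large_offset)
```

An offset larger than the smallest incoming magnitude erases the outgoing message entirely, which shows why the offset has to be tuned together with the scaling factor.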

In [28], a joint CRC-polar MS decoding algorithm is proposed, which exploits the concatenated factor graph of a polar code and a CRC. Similarly to previously described works, trainable weights are assigned to the edges of the unfolded factor graph. Moreover, a multi-loss cost function is used to improve the training process. In general, multi-loss functions are a sum of multiple loss functions from different parts of the NN to be trained. In this case, the multi-loss function takes the outputs of both the MS part and the CRC part of the factor graph into account. Simulation results show improved performance with respect to [21, 23].

The method in [29] uses an unfolded structure that resembles that of [21], with the main difference that a single weight is used for all messages and all iterations. The authors also argue that the binary cross-entropy function that is commonly minimized to train unfolded decoders does not necessarily minimize the bit error rate (BER). Instead, they propose a new cost function that is based on the so-called soft bit error concept. For a single bit, if the actual bit-value is $a\in\{0,1\}$ and the estimated (soft) bit-value at the output of the unfolded decoder is $b\in[0,1]$, the soft bit error $L_{\text{sbe}}(a,b)$ is given by:

[TABLE]

whereas the standard binary cross-entropy $L_{\text{bce}}(a,b)$ would be:

$$L_{\text{bce}}(a,b)=-a\log(b)-(1-a)\log(1-b).\tag{32}$$

Instead of training for a single SNR or a set of SNR points, the authors of [29] use an auxiliary NN that learns parameter values given the SNR as an input.
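The difference between the two loss functions discussed above can be illustrated numerically. The binary cross-entropy below is the standard definition. The soft bit error is written in an assumed form, namely the probability mass that the soft estimate places on the wrong bit value, which equals $|a-b|$ for $a\in\{0,1\}$; the exact expression used in [29] may differ. The sketch also assumes that $b$ is the estimated probability that the bit equals 1.

```python
import math


def bce(a, b, eps=1e-12):
    """Binary cross-entropy for a true bit a in {0, 1} and soft estimate b."""
    b = min(max(b, eps), 1.0 - eps)  # avoid log(0)
    return -(a * math.log(b) + (1 - a) * math.log(1.0 - b))


def soft_bit_error(a, b):
    """ASSUMED form: mass placed on the wrong bit value, i.e., |a - b|."""
    return (1 - a) * b + a * (1.0 - b)


for b in (0.9, 0.6, 0.1):  # soft outputs for a transmitted bit a = 1
    print(b, round(soft_bit_error(1, b), 4), round(bce(1, b), 4))
```

Under these assumptions, the soft bit error grows linearly as the estimate moves toward the wrong value, while the cross-entropy grows without bound, so the two losses weight confident errors very differently.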

Finally, the work of [30] proposes the idea of unfolding in order to learn finite-alphabet (FA) decoding of LDPC codes. In FA decoding, messages are quantized using a very small number of quantization bits and it is thus crucial that the quantization thresholds and levels are designed very carefully. The authors show that by unfolding and learning FA decoders, gains of up to $0.25$ dB can be achieved for a $(1296,972)$ QC-LDPC code when using $3$ quantization bits.

IV Deep Unfolding for Other Communications Applications

There exists a plethora of other communications applications in which the idea of unfolding has been used; we now briefly summarize some of these applications. For channel decoding that is not based on BP decoding, references [31, 32, 33] study unfolding of Turbo decoding, whereas [34] discusses successive cancellation decoding of polar codes. The work in [35] proposes to replace the channel-dependent parts of the Viterbi detection algorithm by a DNN. NNs have also been used extensively for non-linear signal processing tasks. In this case, unfolding does not refer to the iterations of some algorithm, but rather to the non-linear equations themselves (e.g., parallel Hammerstein model [36, 37] or Schrödinger wave equation). This approach has been recently applied to optical communications (e.g., [38, 39]) and to full-duplex communications (e.g., [40]). Finally, unfolding has been extensively applied to the iterative shrinkage-thresholding algorithm (ISTA) to solve sparse linear inverse problems [41, 42, 43, 44, 45, 46], which is a general tool that finds use in communications systems (e.g., for sparse channel estimation).

V Future Research Directions

Even though the idea of deep unfolding is relatively novel and many open research questions remain, it has already found wide applicability in communication systems and is likely to transform a range of other signal processing tasks. For example, proximal algorithms [47, 48] solve a wide range of optimization problems in communication systems and they are generally well-suited for unfolding. We conclude this brief survey by outlining some interesting future research directions.

V-A Unfolded Structures with Acceleration Methods

Several techniques that have been used in the optimization literature to accelerate the convergence of optimization algorithms can be incorporated into unfolded architectures with trainable parameters. Some examples of these techniques include preconditioning, momentum and Onsager terms, restart, and adaptive step-size rules. Such methods are particularly interesting for severely iteration-constrained applications where obtaining the fastest possible convergence is of the utmost importance. It may also be beneficial, from both a complexity and performance perspective, to derive and optimize unfolded structures that directly operate on complex-valued signals.

V-B Loss Functions

Novel application-tailored cost functions can improve the convergence of the training process. This can not only lead to better results for the same computational effort, but it may also enable real-time and online training of unfolded structures. Some examples of customized loss functions already exist (e.g., [29] uses a soft bit error function, [49] proposes a syndrome-based cost function, and [50] uses a cost function that is tailored to the quantum error-correction scenario), but they are mostly limited to channel decoding. All works on MIMO detection/precoding that we have described [9, 10, 11, 12, 13, 14, 15] use the standard mean-squared error (MSE) cost function. The MSE cost function has the advantage that closed-form solutions or very accurate iterative approximations can be derived in many cases. However, the MSE is not necessarily a good proxy for the error rate performance of the system. For example, in a multi-user MIMO setting, the system error rate will most likely be dominated by the user with the largest MSE. When optimizing an unfolded algorithm, even when the algorithm itself has been derived assuming an MSE cost function, it is typically easy to learn the optimal set of parameters for a different cost function. In the multi-user MIMO example mentioned previously, this could be the maximum MSE over all users, or even the maximum MSE over all users and all channel realizations in the training dataset.
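The difference between an average-MSE loss and a worst-user loss can be sketched as follows. The data layout (a list of samples, each with one entry per user), the function names, and the numbers are hypothetical.

```python
def per_user_mse(x_true, x_hat):
    """MSE of each user, averaged over the samples in the batch."""
    num_users = len(x_true[0])
    return [sum(abs(t[u] - h[u]) ** 2 for t, h in zip(x_true, x_hat)) / len(x_true)
            for u in range(num_users)]


def mean_mse_loss(x_true, x_hat):
    """Standard MSE loss: average over users and samples."""
    mse = per_user_mse(x_true, x_hat)
    return sum(mse) / len(mse)


def max_mse_loss(x_true, x_hat):
    """Worst-user loss: the largest per-user MSE."""
    return max(per_user_mse(x_true, x_hat))


x_true = [[1, -1, 1], [-1, -1, 1]]  # two samples, three users
x_hat = [[0.9, -1.0, 0.2], [-1.1, -0.9, 0.4]]  # the third user is poorly served
print(per_user_mse(x_true, x_hat))
print(mean_mse_loss(x_true, x_hat), max_mse_loss(x_true, x_hat))
```

In this example the average hides the fact that one user has an MSE about fifty times larger than the others, whereas the maximum exposes it directly.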

V-C Training

Even though in many applications training can be carried out offline, it is still a task that requires considerable effort and thus deserves attention. In applications where it is sufficient to unroll a small number of iterations, training is mostly straightforward. However, when more iterations are considered, numerous problems arise. A common problem is that of vanishing gradients, where it becomes increasingly difficult to find suitable parameters in early iterations. This problem can be addressed by using multi-loss functions (like most of the works presented in this survey), incremental training [13], or by simply using a set of known good initial values to reduce the training effort [15]. Another solution would be to perform windowed training, where unfolded iterations are trained only over a moving window of fixed size. This approach can also significantly reduce the memory required for training, which may become a limiting factor when considering algorithms with high-dimensional inputs (e.g., in massive MIMO) and a large number of iterations, since all intermediate output values of each mini-batch need to be stored for back-propagation. Developing online training methods that adapt to, e.g., changing channel or SNR conditions, is another important problem. Finally, it is often unclear what the best dataset for training is. We note that some preliminary works already focused specifically on this direction [51, 52], but more research is required.
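The memory argument for windowed training can be made concrete with a rough estimate. The sketch below only counts the intermediate layer outputs that must be kept for back-propagation, assuming one stored vector per layer and 4-byte values; the layer counts, batch size, and dimension are hypothetical.

```python
def activation_memory_bytes(num_layers, batch_size, dim, bytes_per_value=4):
    """Rough storage for the intermediate outputs kept for back-propagation."""
    return num_layers * batch_size * dim * bytes_per_value


# Hypothetical setting: real-valued dimension 2 * 128 and mini-batches of 1000.
full = activation_memory_bytes(num_layers=50, batch_size=1000, dim=2 * 128)
windowed = activation_memory_bytes(num_layers=5, batch_size=1000, dim=2 * 128)
print(full, windowed, full // windowed)
```

Under these assumptions, training over a window of 5 layers instead of all 50 reduces the stored intermediate values by a factor of ten.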

V-D Hardware Implementation

Unfolded learned algorithms are particularly attractive from a hardware implementation perspective, as they strongly resemble known algorithms for which efficient hardware architectures already exist. However, the hardware implementation complexity aspect is typically not considered in the literature, with some notable exceptions being [23], where high-level hardware considerations for unfolded MS decoding are discussed, and [53], where FPGA and ASIC implementation results of the method in [40] are presented. As such, it remains largely unclear how the additional trainable parameters required by unfolded algorithms affect the hardware implementation complexity and the achieved throughput. Moreover, efficient hardware implementations of the training step are necessary for situations that require online learning.

Bibliography

[1] D. P. Bertsekas, Convex Optimization Algorithms. Athena Scientific, Belmont, Massachusetts, 2015.
[2] T. O'Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec. 2017.
[3] T. Wang, C. Wen, H. Wang, F. Gao, T. Jiang, and S. Jin, “Deep learning for wireless physical layer: Opportunities and challenges,” China Communications, vol. 14, no. 11, pp. 92–111, Nov. 2017.
[4] Q. Mao, F. Hu, and Q. Hao, “Deep learning for intelligent wireless networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 20, no. 4, pp. 2595–2621, Fourth Quarter 2018.
[5] D. Gunduz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. Murthy, and M. van der Schaar, “Machine learning in the air,” Apr. 2019. [Online]. Available: https://arxiv.org/abs/1904.12385
[6] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep learning in physical layer communications,” Feb. 2019. [Online]. Available: https://arxiv.org/abs/1807.11713
[7] H. He, S. Jin, C.-K. Wen, F. Gao, G. Y. Li, and Z. Xu, “Model-driven deep learning for physical layer communications,” Feb. 2019. [Online]. Available: https://arxiv.org/abs/1809.06059
[8] J. R. Hershey, J. Le Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” Nov. 2014. [Online]. Available: https://arxiv.org/abs/1409.2574