Towards Hardware Implementation of Neural Network-based Communication Algorithms

Fayçal Ait Aoudia; Jakob Hoydis

arXiv:1902.06939·cs.IT·February 20, 2019


TL;DR

This paper explores the practical implementation of neural network-based communication algorithms on hardware like FPGAs and ASICs, focusing on fixed-point quantization to enable real-time, efficient inference with minimal performance loss.

Contribution

It demonstrates that neural network algorithms can be effectively quantized and implemented in fixed-point arithmetic on hardware, bridging the gap between simulation and practical deployment.

Findings

1. Fixed-point neural network inference achieves negligible performance loss.

2. Hardware implementation compatible with FPGAs and ASICs.

3. Quantization reduces complexity while maintaining accuracy.


Tables

Table I: Number of additions required by the quantized NN-based receiver and the ML receiver

Receiver	For any $K$	$K = 14$	Complexity relative to ML ($K = 14$)
ML receiver	$2048(K - 1) + 3840$	$30464$	$100\%$
NN-based receiver	$10496$	$10496$	$34.5\%$

Equations

(1) $z=\operatorname{sign}(z)\,s\,2^{e}$

(2) $z=\operatorname{sign}(z)\left(\sum_{i=0}^{K_{I}-1}B(z)_{i}2^{i}+\sum_{i=1}^{K_{F}}B(z)_{-i}2^{-i}\right)$

(3) $\arg\min_{\boldsymbol{\psi}}\left(L(\boldsymbol{\psi})+\frac{\mu}{2}\left\lVert\boldsymbol{\psi}-\widehat{\boldsymbol{\psi}}-\frac{1}{\mu}\boldsymbol{\lambda}\right\rVert_{2}^{2}\right)$

(4) $\arg\min_{\widehat{\boldsymbol{\psi}}\in{\cal C}^{P}}\left(\left\lVert\boldsymbol{\psi}-\frac{1}{\mu}\boldsymbol{\lambda}-\widehat{\boldsymbol{\psi}}\right\rVert_{2}^{2}\right)$

(5) ${\cal C}_{W}=\left\{0,\pm 2^{q}\mid q\in\mathbb{Z},\ |q|<K-1\right\}$

(6) $\text{SNR}:=\frac{{\mathbb{E}}\left\{\frac{1}{N}\left\lVert\mathbf{x}\right\rVert_{2}^{2}\right\}}{\sigma^{2}}=\frac{e_{s}}{\sigma^{2}}$


Full text

Towards Hardware Implementation of Neural Network-based Communication Algorithms

Fayçal Ait Aoudia and Jakob Hoydis

Nokia Bell Labs

{faycal.ait_aoudia, jakob.hoydis}@nokia-bell-labs.com

Abstract

There is a recent interest in neural network (NN)-based communication algorithms which have shown to achieve (beyond) state-of-the-art performance for a variety of problems or lead to reduced implementation complexity. However, most work on this topic is simulation based and implementation on specialized hardware for fast inference, such as field-programmable gate arrays (FPGAs), is widely ignored. In particular for practical uses, NN weights should be quantized and inference carried out by a fixed-point instead of floating-point system, widely used in consumer class computers and graphics processing units (GPUs). Moving to such representations enables higher inference rates and complexity reductions, at the cost of precision loss. We demonstrate that it is possible to implement NN-based algorithms in fixed-point arithmetic with quantized weights at negligible performance loss and with hardware complexity compatible with practical systems, such as FPGAs and application-specific integrated circuits (ASICs).

I Introduction

Inspired by the success of deep learning (DL) in various fields such as computer vision and natural language processing (NLP), NN-based communication systems have gained a lot of attention recently. Approaches leveraging DL have lately emerged for channel coding [1], multiple-input multiple-output (MIMO) systems [2], orthogonal frequency-division multiplexing (OFDM) [3], and so forth. However, most of these contributions are simulation-based, and only a few have considered hardware implementation of DL-based approaches as well as the issues it raises. Indeed, one of the biggest obstacles to the use of NNs is their high memory requirement and computational complexity. Hardware acceleration is needed to achieve reasonable inference times, and most previous contributions leverage graphics processing units (GPUs), which come at a high monetary and energy cost that is not viable for communication systems. Moreover, inference times on the physical layer are on the order of microseconds to nanoseconds, which is at least one order of magnitude faster than in other applications, e.g., autonomous cars.

Recently, multiple methods were proposed to compress NNs to reduce their complexity, such as weight quantization [4], weight pruning [5], or more efficient implementations of the conventional floating-point operators [6]. Nevertheless, such methods were mostly considered for flagship machine learning tasks, such as image classification and speech recognition. In this paper, we consider the efficient implementation of NN-based communication algorithms in fixed-point arithmetic. Our aim is not to reduce the memory footprint of the NN by learning a compressed representation of the architecture, but to reduce its computational complexity, as we assume the NN is implemented in hardware. Fixed-point compute units are faster and consume fewer hardware resources and less energy than conventional floating-point units [4, 6]. We consider the implementation of an NN-based receiver as shown in Fig. 1. However, the approach used in this paper can be applied to a wide variety of NN-based communication algorithms, e.g., fully learned transceiver implementations as done in [7]. The NN-based receiver, trained with no constraints and uncompressed, enables block error rates (BLERs) close to those of maximum likelihood (ML) detection, as shown in Fig. 2. In this figure, we assume a transmitter implementing the Agrell [8] scheme with eight-bit blocks and four channel uses transmitting over an additive white Gaussian noise (AWGN) channel. We aim to quantize the weights of the NN-based receiver so that they take values in a finite codebook, enabling a more efficient implementation. A straightforward approach to quantize the NN-based receiver is to first train it with no constraints, and then to quantize its weights. This naive approach usually leads to significant performance loss as shown in Fig. 2, which motivates the use of quantization-aware algorithms. In this paper, we leverage a two-stage approach to achieve efficient implementation without significant performance loss:

1. First, the weights are quantized using the learning-compression (LC) algorithm [9] so that they take values in a predefined finite codebook. Training using LC is done on regular high precision floating-point arithmetic.

2. Next, the quantized NN is implemented on a fixed-point arithmetic system, less precise than the floating-point system used to train it, but with a reduced complexity.

We show that the quantized NN-based receiver, trained on a 32-bit floating-point arithmetic system but implemented on a 14-bit fixed-point arithmetic system, enables BLERs close to those of ML detection, while being more than $60\%$ less complex.

To the best of our knowledge, only a few papers have considered implementations of NN-based communication algorithms. NN-based transceivers [7] were implemented on an FPGA in [10]. The implementation of a DL-based modulation classifier on an FPGA was described in [11]. However, in neither of these contributions did the authors attempt to reduce the model complexity to make inference more efficient. In [12], a recurrent NN was considered to decode polar codes. After each training epoch, the weights were quantized in a two-step process: the weights were first rounded to the nearest fixed-point value that can be represented by a predefined number of bits, before being assigned values from a codebook based on the frequency with which rounded weights appear. The authors showed that weight quantization enables a significant decrease of the memory footprint and computational complexity, without significant performance loss. However, evaluation of the NN in fixed-point arithmetic was not performed.

Notations: Boldface upper- and lower-case letters denote matrices and column vectors, respectively. $\mathbb{R}$ and $\mathbb{C}$ respectively denote the sets of real and complex numbers. $\mathbb{Z}$ denotes the set of integers.

II Background on NN compression

II-A Fixed-point arithmetic

Conventional hardware used in machine learning, such as GPUs, relies on floating-point arithmetic. With this scheme, real numbers that can be represented exactly are of the form

$$z=\operatorname{sign}(z)\,s\,2^{e}\tag{1}$$

where $s$ and $e$ are integers called the significand and the exponent, respectively, and $\operatorname{sign}$ is the sign function. The numbers of bits used for the significand and the exponent control the range and precision of the representation. A floating-point number is stored as the sign bit, the exponent field, and the significand field. A key feature of floating-point representation is that it does not form a uniformly-spaced grid, as the spacing between consecutive numbers grows with the exponent. This enables the representation of numbers of widely different orders of magnitude. The main drawback of floating-point arithmetic is the complexity of its compute units, which require a lot of hardware resources, energy, and time compared to other schemes, such as fixed-point arithmetic. Indeed, performing an operation (such as an addition or multiplication) usually requires preprocessing of the operands and post-processing of the result if their exponents are different.

Regarding fixed-point arithmetic, only real numbers that can be written as

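$$z=\operatorname{sign}(z)\left(\sum_{i=0}^{K_{I}-1}B(z)_{i}2^{i}+\sum_{i=1}^{K_{F}}B(z)_{-i}2^{-i}\right)\tag{2}$$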

can be represented. $K_{I}$ and $K_{F}$ are non-negative integers that correspond to the number of bits of the integer and fractional parts, respectively. Notice that (2) corresponds to writing $z$ in the binary numeral system and constraining the number of bits allowed for the integer and fractional parts to finite values. One can see that, contrary to floating-point representation, representable numbers form a uniformly-spaced grid whose range and precision are controlled by $K_{I}$ and $K_{F}$, respectively. Two consecutive numbers are spaced by $2^{-K_{F}}$. A fixed-point number is typically stored using $K=K_{I}+K_{F}+1$ bits (an additional bit is required to handle negative numbers), as shown in Fig. 3. The number is represented as a $K$-bit integer, with an implicit factor $2^{-K_{F}}$ that does not need to be stored as it is fixed. With regard to complexity, fixed-point operators are typically of low complexity compared to floating-point operators as no additional processing steps need to be taken, which motivates their use to reduce the memory footprint and computational requirements of NNs, e.g., [13]. We adopt the same approach in this paper, by implementing an NN-based receiver in fixed-point arithmetic.
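As an illustration of the representation in (2), the following minimal Python sketch maps a real number to the integer code of its nearest fixed-point value and back. The function names `to_fixed` and `from_fixed` are hypothetical, and round-to-nearest with saturation at the largest representable magnitude is an assumption made here, not a choice stated in the paper.

```python
def to_fixed(z, k_i, k_f):
    """Return the signed integer code of the fixed-point number nearest to z.

    The represented value is code * 2**(-k_f). Rounding to nearest and
    saturating on overflow are assumptions of this sketch.
    """
    scale = 1 << k_f                     # 2**k_f, inverse of the implicit factor
    max_code = (1 << (k_i + k_f)) - 1    # all k_i + k_f magnitude bits set
    code = round(abs(z) * scale)         # nearest grid point (magnitude only)
    code = min(code, max_code)           # saturate instead of overflowing
    return -code if z < 0 else code


def from_fixed(code, k_f):
    """Return the real value represented by an integer code."""
    return code / (1 << k_f)


if __name__ == "__main__":
    code = to_fixed(1.3, k_i=5, k_f=8)
    print(code, from_fixed(code, k_f=8))
```

With $K_{I}=5$ and $K_{F}=8$, the value $1.3$ is mapped to the code $333$, which represents $333\cdot 2^{-8}=1.30078125$; the error is below half the grid spacing $2^{-K_{F}}$.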

II-B The LC algorithm

Moving to fixed-point arithmetic is not the only way to reduce the resources required by an NN. Another approach used in this paper conjointly with implementation in fixed-point arithmetic is quantization of the weights. Quantizing the weights of an NN means forcing them to take values in a discrete codebook.

It is well-known that multiplication is significantly more computationally demanding than addition in fixed-point arithmetic. Indeed, a fixed-point multiplication of $K$-bit operands requires up to $K$ bit shifts and $K-1$ additions. By forcing the weights to take values in a well-chosen codebook, the cost of multiplications can be drastically reduced. For example, using the codebook $\{-1,0,1\}$ reduces multiplications to zeroing or sign changes. Also, choosing the codebook to be a set of powers of two reduces multiplication to bit shifts in fixed-point arithmetic, as multiplication by $2^{q}$ is equivalent to moving the radix point $q$ digits to the left or right depending on the sign of $q$, as illustrated in Fig. 3.
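The cost argument above can be made concrete with a small Python sketch. It assumes a plain shift-and-add multiplier on non-negative integers, which is one common way to realize fixed-point multiplication and not necessarily the one the authors have in mind; the function names are hypothetical.

```python
def shift_add_multiply(a, b, k):
    """Multiply non-negative k-bit integers a and b by shift-and-add.

    Returns (product, number of additions used). One shifted copy of a is
    formed per set bit of b, so at most k shifts and k - 1 additions occur.
    """
    partials = [a << i for i in range(k) if (b >> i) & 1]
    product = sum(partials)
    additions = max(len(partials) - 1, 0)
    return product, additions


def multiply_by_power_of_two(a, q):
    """Multiply a non-negative integer a by 2**q using a single shift.

    For q < 0 the shifted-out low-order bits are discarded.
    """
    return a << q if q >= 0 else a >> -q


if __name__ == "__main__":
    print(shift_add_multiply(13, 0b1111, 4))   # worst case for 4-bit operands
    print(multiply_by_power_of_two(13, 3))     # single shift, no addition
```

For 4-bit operands, multiplying by $15$ (all bits set) uses three additions, whereas multiplying by $8=2^{3}$ uses none.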

A key question is how to train an NN while forcing its weights to take values in a given codebook. Let us denote by $f_{\boldsymbol{\psi}}$ the mapping implemented by an NN with parameters $\boldsymbol{\psi}\in\mathbb{R}^{P}$ , $L(\boldsymbol{\psi})$ the loss function, ${\cal C}$ the quantization codebook, and $\widehat{\boldsymbol{\psi}}\in{\cal C}^{P}$ the quantized weights. A naive approach is to first train the NN with no constraints on its weights by solving $\arg\min_{\boldsymbol{\psi}}\left(L(\boldsymbol{\psi})\right)$ and then to quantize the model by choosing for each weight its closest value in the codebook by solving $\arg\min_{\widehat{\boldsymbol{\psi}}\in{\cal C}^{P}}\left(\left\lVert\widehat{\boldsymbol{\psi}}-\boldsymbol{\psi}\right\rVert^{2}_{2}\right)$ . This simple approach, referred to as direct compression (DC), typically does not lead to satisfactory results. To circumvent this issue, compression-aware algorithms are typically used. One such algorithm is LC [9], in which compression of an NN is considered as a constrained optimization problem, solved by applying alternating optimization to the augmented Lagrangian. LC is guaranteed to converge to a local optimum in some cases [9, Section 3.2]. Each iteration of LC performs two steps, a learning step and a quantization step. The learning step updates the unquantized weights $\boldsymbol{\psi}$ by solving

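$$\arg\min_{\boldsymbol{\psi}}\left(L(\boldsymbol{\psi})+\frac{\mu}{2}\left\lVert\boldsymbol{\psi}-\widehat{\boldsymbol{\psi}}-\frac{1}{\mu}\boldsymbol{\lambda}\right\rVert_{2}^{2}\right)\tag{3}$$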

where $\mu$ is a parameter of the algorithm, which is increased at each iteration following a predefined schedule, and $\boldsymbol{\lambda}$ is the Lagrange multiplier estimate. One can see that a regularization term is added to the loss, which ensures that the unquantized weights stay close to $\widehat{\boldsymbol{\psi}}+\frac{1}{\mu}\boldsymbol{\lambda}$ . The compression step updates the quantized weights $\widehat{\boldsymbol{\psi}}$ by solving

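$$\arg\min_{\widehat{\boldsymbol{\psi}}\in{\cal C}^{P}}\left(\left\lVert\boldsymbol{\psi}-\frac{1}{\mu}\boldsymbol{\lambda}-\widehat{\boldsymbol{\psi}}\right\rVert_{2}^{2}\right)\tag{4}$$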

which corresponds to the quantization of $\boldsymbol{\psi}-\frac{1}{\mu}\boldsymbol{\lambda}$ to the codebook ${\cal C}$. The LC algorithm is depicted in Algorithm 1. The parameter $\mu$ is typically increased following a multiplicative schedule $\mu^{(k)}=a\mu^{(k-1)}$, where $\mu^{(k)}$ is the value of $\mu$ at the $k$th iteration and $a$ is a parameter of the algorithm larger than one. $\boldsymbol{\psi}$ and $\widehat{\boldsymbol{\psi}}$ are initialized by training the NN without any constraints and then quantizing the weights (lines 1 and 2). In practice, the learning step (line 5) is approximately solved using stochastic gradient descent (SGD) or a variant.
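The alternation between the learning step and the compression step can be sketched in a few lines of NumPy. This is only a toy illustration under several assumptions that are not from the paper: the loss is the quadratic $L(\boldsymbol{\psi})=\frac{1}{2}\lVert\boldsymbol{\psi}-\mathbf{t}\rVert_{2}^{2}$, so that the learning step has a closed form instead of requiring SGD; the multiplier update $\boldsymbol{\lambda}\leftarrow\boldsymbol{\lambda}-\mu(\boldsymbol{\psi}-\widehat{\boldsymbol{\psi}})$ is the usual augmented-Lagrangian rule and is not spelled out in the text above; and the names `quantize` and `lc_quadratic` as well as the schedule values are hypothetical.

```python
import numpy as np


def quantize(v, codebook):
    """Compression step: map each entry of v to its nearest codebook value."""
    idx = np.argmin(np.abs(v[:, None] - codebook[None, :]), axis=1)
    return codebook[idx]


def lc_quadratic(target, codebook, mu0=1e-2, a=1.5, iterations=40):
    """Toy LC loop for the loss L(psi) = 0.5 * ||psi - target||^2."""
    psi = target.copy()                  # unconstrained minimizer of the toy loss
    psi_hat = quantize(psi, codebook)    # direct compression as initialization
    lam = np.zeros_like(psi)             # Lagrange multiplier estimate
    mu = mu0
    for _ in range(iterations):
        # Learning step: closed-form minimizer of
        # L(psi) + mu/2 * ||psi - psi_hat - lam/mu||^2 for the quadratic loss.
        psi = (target + mu * psi_hat + lam) / (1.0 + mu)
        # Compression step: quantize psi - lam/mu to the codebook.
        psi_hat = quantize(psi - lam / mu, codebook)
        # Multiplier update (assumed standard augmented-Lagrangian rule).
        lam = lam - mu * (psi - psi_hat)
        mu *= a                          # multiplicative schedule for mu
    return psi, psi_hat


if __name__ == "__main__":
    codebook = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
    psi, psi_hat = lc_quadratic(np.array([0.3, -0.9, 2.6]), codebook)
    print(psi_hat)
```

For an actual NN, the closed-form learning step would be replaced by a few epochs of SGD on the regularized loss, as described above.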

III Quantization of NN-based receiver

In this section, efficient implementation of an NN-based receiver is achieved using a two-stage approach. First, quantization of the NN-based receiver is performed by training it with LC on a usual floating-point arithmetic system. Next, the quantized NN is implemented and evaluated on a fixed-point system. While this paper focuses on an NN-based receiver, this approach can be applied to a wide variety of NN-based communication algorithms.

III-A NN-based receiver

In a point-to-point communication system, two nodes aim to reliably exchange information over a stochastic channel as shown in Fig. 1. The output of the channel $\mathbf{y}$ follows a probability distribution conditional on its input $\mathbf{x}$, i.e., $\mathbf{y}\sim P(\mathbf{y}|\mathbf{x})$. The transmitter aims to communicate messages $m$ drawn from a finite set $\mathbb{M}=\{1,\dots,M\}$, while the task of the receiver is to detect the sent messages $m$ from the received signal $\mathbf{y}$. The receiver is implemented as an NN (see Fig. 1) $f_{\boldsymbol{\theta}_{R}}^{(R)}:\mathbb{C}^{N}\mapsto\left\{\mathbf{p}\in\mathbb{R}_{+}^{M}|\sum_{i=1}^{M}p_{i}=1\right\}$, where $\boldsymbol{\theta}_{R}$ is the set of parameters and $N$ the number of channel uses. Its purpose is to estimate the conditional probability $P(m|\mathbf{y})$, which corresponds to a supervised learning task. Once trained, the receiver can be deployed for practical use.

A communication system operating over an AWGN channel is considered in this work, with $M=256$ and $N=4$. The transmitter implements the Agrell scheme [8], a subset of the E8 lattice designed by numerical optimization to approximately solve the sphere packing problem for $M=256$ in eight dimensions (corresponding to four channel uses). Normalization is performed to ensure that ${\mathbb{E}}\left\{\frac{1}{N}\left\lVert\mathbf{x}\right\rVert_{2}^{2}\right\}=e_{s}$, where $e_{s}$ is the energy per complex symbol.

III-B Architecture of the NN-based receiver

The receiver is implemented by a $\mathbb{C}2\mathbb{R}$ layer, mapping the $N$ received complex symbols to $2N$ real numbers, followed by a dense layer of 64 units with ReLu activation, a dense layer of 32 units with ReLu activation, and finally a dense layer of $M$ units with softmax activation, as shown in Fig. 4. All the dense layers but the last use biases. Hard decoding is performed by taking the message with highest probability.
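The architecture described above can be sketched as a NumPy forward pass. This is a minimal sketch and not the authors' implementation: the weights are random and untrained, the ordering of real and imaginary parts in the $\mathbb{C}2\mathbb{R}$ layer is an assumption, and the names `random_params`, `receiver_logits`, and `hard_decode` are hypothetical.

```python
import numpy as np

M, N = 256, 4   # number of messages and number of channel uses


def c2r(y):
    """Map N complex symbols to 2N reals (assumed order: real parts, then imaginary)."""
    return np.concatenate([y.real, y.imag])


def relu(x):
    return np.maximum(x, 0.0)


def random_params(rng):
    """Untrained placeholder weights with the layer shapes given in the text."""
    return {
        "W1": rng.standard_normal((64, 2 * N)), "b1": rng.standard_normal(64),
        "W2": rng.standard_normal((32, 64)), "b2": rng.standard_normal(32),
        "W3": rng.standard_normal((M, 32)),   # the last dense layer has no bias
    }


def receiver_logits(y, params):
    """Return the pre-activations of the output layer (softmax not applied)."""
    h = c2r(y)
    h = relu(params["W1"] @ h + params["b1"])   # dense layer, 64 units
    h = relu(params["W2"] @ h + params["b2"])   # dense layer, 32 units
    return params["W3"] @ h                      # dense layer, M units


def hard_decode(y, params):
    """Pick the most likely message in {1, ..., M} from the pre-activations."""
    return int(np.argmax(receiver_logits(y, params))) + 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    print(hard_decode(y, random_params(rng)))
```

Because softmax is monotonically increasing in each pre-activation, the index of the largest pre-activation is also the index of the largest probability, so hard decoding does not need the softmax.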

Regarding implementation considerations, the ReLu activation was chosen as it requires minimal overhead. Indeed, its implementation requires neither approximation using a look-up table nor arithmetic operations. Therefore, it incurs neither computational overhead nor arithmetic errors due to approximation. Moreover, implementation of the output layer softmax activation is not required at deployment, as hard decoding can be performed based on the pre-activations.

III-C Weight quantization

Quantization of the NN-based receiver was done by training the NN using the LC algorithm presented in Section II-B, on usual GPUs with floating-point arithmetic. Different codebooks were used for the weights and biases. Regarding the weights, the codebook was

$${\cal C}_{W}=\left\{0,\pm 2^{q}\mid q\in\mathbb{Z},\ |q|<K-1\right\}\tag{5}$$

where $K=K_{I}+K_{F}+1$. The choice of this codebook was motivated by the much higher complexity of multiplications compared to additions. Accordingly, with this codebook, all multiplications are reduced to either zeroing or bit shifting. Moreover, multiplications by $2^{q}$ with $|q|\geq K-1$ lead to zeroing on a $K$-bit fixed-point system. Therefore, the codebook was restricted to powers of two with exponent less than $K-1$ in absolute value. Biases were constrained to take values in the codebook defined by the set of fixed-point numbers with $K_{I}$ bits for the integer part and $K_{F}$ bits for the fractional part, i.e., the set of real numbers that can be represented as in (2).
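The weight codebook, i.e., zero together with all signed powers of two whose exponent is smaller than $K-1$ in absolute value, can be built and applied as in the following NumPy sketch. The function names are hypothetical, and quantizing to the nearest codebook entry in absolute difference is the Euclidean projection used in the compression step of LC.

```python
import numpy as np


def weight_codebook(k):
    """Return the sorted codebook {0, +/- 2**q : |q| < k - 1} as a float array."""
    powers = 2.0 ** np.arange(-(k - 2), k - 1)    # q = -(k-2), ..., k-2
    return np.concatenate([-powers[::-1], [0.0], powers])


def quantize_weights(w, codebook):
    """Map every weight to its nearest codebook entry."""
    w = np.asarray(w, dtype=float)
    idx = np.argmin(np.abs(w[..., None] - codebook), axis=-1)
    return codebook[idx]


if __name__ == "__main__":
    cb = weight_codebook(14)
    print(len(cb), cb.max())
    print(quantize_weights([0.3, -0.7, 0.0, 3.1], cb))
```

For $K=14$, the exponents range from $-12$ to $12$, which gives $2\cdot 25+1=51$ codebook entries with largest magnitude $2^{12}=4096$.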

To evaluate the impact of receiver quantization on the BLER, the ML receiver was compared with the unquantized, LC-quantized, and DC-quantized NN-based receivers. When quantization was performed, $K_{I}$ was set to 5 as it was experimentally found to be the smallest value large enough to avoid overflows on a fixed-point system. Training and evaluation were performed for $K_{F}$ values of 2, 4, 8, and 12 bits. The signal-to-noise ratio (SNR) is defined as

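$$\text{SNR}:=\frac{{\mathbb{E}}\left\{\frac{1}{N}\left\lVert\mathbf{x}\right\rVert_{2}^{2}\right\}}{\sigma^{2}}=\frac{e_{s}}{\sigma^{2}}\tag{6}$$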

where $\sigma^{2}$ is the noise variance per complex symbol, and the equality results from the energy constraint ensured by the transmitter normalization layer. $\sigma^{2}$ was set to $-80$ dB, and the SNR was controlled by setting $e_{s}$. Evaluations were done using the TensorFlow [14] framework, and training used the Adam [15] variant of SGD. Fig. 5 shows the BLER achieved by the compared schemes for SNR values ranging from $-2$ dB to $11$ dB. Evaluations were performed on a floating-point system. Only results with $K_{F}=8$ bits are shown for readability, as other values of $K_{F}$ lead to almost identical BLERs. One can see that quantization using the naive DC approach leads to higher error rates than quantization with LC. Moreover, quantizing the NN-based receiver using LC leads to BLERs close to those achieved by the unquantized NN-based receiver and ML detection.
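Since the SNR is the ratio $e_{s}/\sigma^{2}$ and $\sigma^{2}$ is fixed, a target SNR in dB translates directly into a symbol energy. The sketch below shows this conversion together with a complex AWGN channel; the function names are hypothetical, and splitting the noise variance equally between the real and imaginary parts is the usual convention for circularly symmetric noise, assumed here.

```python
import numpy as np

SIGMA2_DB = -80.0   # noise variance per complex symbol, as stated in the text


def db_to_linear(x_db):
    return 10.0 ** (x_db / 10.0)


def symbol_energy(snr_db, sigma2_db=SIGMA2_DB):
    """Return e_s such that e_s / sigma^2 equals the requested SNR."""
    return db_to_linear(snr_db) * db_to_linear(sigma2_db)


def awgn(x, sigma2, rng):
    """Add complex Gaussian noise of total variance sigma2 per complex symbol."""
    noise = rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)
    return x + np.sqrt(sigma2 / 2.0) * noise   # variance sigma2/2 per real dimension


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e_s = symbol_energy(5.0)
    x = np.sqrt(e_s) * np.ones(4, dtype=complex)   # placeholder symbols of energy e_s
    print(e_s, awgn(x, db_to_linear(SIGMA2_DB), rng))
```

For example, an SNR of $0$ dB corresponds to $e_{s}=\sigma^{2}=10^{-8}$.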

III-D Impact of fixed-point arithmetic

This section investigates the impact on the BLER of implementing the quantized NN-based receiver on a fixed-point arithmetic system. The quantized NNs trained on GPUs with LC for $K_{I}=5$ bits and different values of $K_{F}$ were implemented on a fixed-point system with the corresponding number of bits allocated to the integer and fractional parts. Fixed-point arithmetic was simulated in Python. As all the weights are powers of two, all multiplications were reduced to bit shifts. Implementation of the ReLu activation function is straightforward. The softmax activation of the output layer was not implemented, as hard decoding can be performed based on the pre-activations. Fig. 6 shows the BLER achieved by the receiver for different values of $K_{F}$. It can be seen that using only 2 or 4 bits for the fractional part leads to a significant increase of the error rate, while using 8 bits (or more) leads to no BLER degradation. These results show that it is possible to implement the NN-based receiver in 14-bit fixed-point arithmetic with no BLER degradation, despite the fact that it was trained on a 32-bit floating-point arithmetic system.
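A dense layer with power-of-two weights can be simulated on integer codes as in the following Python sketch. It is not the authors' simulator: the representation of each weight as a sign in $\{-1,0,1\}$ and an exponent $q$, the truncation of bits shifted out on the right, and the saturation applied after accumulation are all assumptions of this sketch, and the function names are hypothetical.

```python
def shift(x, q):
    """Multiply the integer code x by 2**q with one shift on its magnitude."""
    mag = abs(x)
    mag = mag << q if q >= 0 else mag >> -q    # low-order bits are truncated
    return -mag if x < 0 else mag


def dense_relu_fixed(x_codes, signs, exps, bias_codes, k_i, k_f):
    """Dense layer followed by ReLU, using only shifts and additions.

    All codes share the implicit factor 2**(-k_f). Unit j has weights
    signs[j][i] * 2**exps[j][i] and bias code bias_codes[j].
    """
    max_code = (1 << (k_i + k_f)) - 1          # largest representable magnitude
    outputs = []
    for sign_row, exp_row, bias in zip(signs, exps, bias_codes):
        acc = bias
        for x, s, q in zip(x_codes, sign_row, exp_row):
            if s != 0:                         # a zero weight contributes nothing
                acc += s * shift(x, q)
        acc = max(-max_code, min(max_code, acc))   # saturate (assumption)
        outputs.append(max(acc, 0))            # ReLU is a sign check
    return outputs


if __name__ == "__main__":
    # Inputs 1.0 and -0.5 with k_f = 8; weights [[0.5, 2], [2, -1]]; biases [0.25, 0].
    out = dense_relu_fixed([256, -128], [[1, 1], [1, -1]], [[-1, 1], [1, 0]],
                           [64, 0], k_i=5, k_f=8)
    print(out, [c / 256 for c in out])
```

In this example the first unit evaluates to $0.5\cdot 1-2\cdot 0.5+0.25=-0.25$, which the ReLu maps to zero, and the second to $2\cdot 1+0.5=2.5$, i.e., the code $640$ for $K_{F}=8$.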

III-E Complexity evaluation

It was shown in the previous section that an NN-based receiver whose weights are powers of two, implemented on a fixed-point arithmetic system, can achieve BLERs close to those of ML detection. In this section, we compare the computational complexity of the previously evaluated ML receiver and quantized NN-based receiver.

Regarding the NN-based receiver, only multiplications and additions are required. Moreover, multiplications only involve layer inputs and weights. Because the weights are quantized to powers of two, all the multiplications required by the NN correspond to bit shifts. As the weights are assumed to be fixed after deployment, the bit shifts can be “hardwired” in the hardware implementation, removing the need for storing the weights in memory, as well as for programmable bit shifters.

The ML receiver is assumed to be implemented by measuring the squared Euclidean distance between the received signal and each of the $M$ possibly sent signals, and taking the closest. Therefore, it requires squaring operations, i.e., multiplications. On a fixed-point system, each multiplication requires $K-1$ additions as well as $K$ bit shifts, the latter being assumed to have negligible complexity compared to additions. Accordingly, the complexities of the considered schemes are evaluated by comparing the number of required additions. Table I shows the number of additions required by the quantized NN-based receiver and the ML receiver, for which each multiplication was counted as $K-1$ additions. Notice that the complexity of an addition depends on how it is implemented and on the number of bits $K$ used in the fixed-point system. As one can see, the quantized NN-based receiver requires approximately $60\%$ fewer additions than the ML receiver with $K=14$ bits ($10496$ versus $30464$ in Table I), without incurring significant BLER degradation as seen in the previous section. This encouraging result illustrates how NN-based approaches have the potential to significantly reduce the complexity of communication systems, without significant loss of performance.
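The totals of Table I can be reproduced with a short Python count. The per-operation breakdown below is a reconstruction that is consistent with the table, not one spelled out in the text: each NN unit sums its shifted inputs and, where present, its bias; each ML distance over $2N=8$ real dimensions uses eight subtractions, eight squarings counted as $K-1$ additions each, and seven additions to sum the squares. Comparisons for the final decision are not counted in either scheme, and the function names are hypothetical.

```python
def nn_additions(layer_sizes, use_bias):
    """Count additions of dense layers whose multiplications are free bit shifts."""
    total = 0
    for n_in, n_out, bias in zip(layer_sizes[:-1], layer_sizes[1:], use_bias):
        per_unit = (n_in - 1) + (1 if bias else 0)   # sum n_in terms, add the bias
        total += n_out * per_unit
    return total


def ml_additions(k, m=256, dims=8):
    """Count additions of m squared Euclidean distances in dims real dimensions."""
    subtractions = dims               # received minus candidate, per dimension
    squarings = dims * (k - 1)        # each multiplication counted as k - 1 additions
    accumulation = dims - 1           # summing the squared differences
    return m * (subtractions + squarings + accumulation)


if __name__ == "__main__":
    nn = nn_additions([8, 64, 32, 256], [True, True, False])
    ml = ml_additions(14)
    print(nn, ml, round(100 * nn / ml, 1))
```

Under this count, the ML receiver needs $256\,(8(K-1)+15)=2048(K-1)+3840$ additions, and the NN-based receiver needs $512+2048+7936=10496$ additions regardless of $K$.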

IV Conclusion

We presented in this paper an approach to reduce the implementation complexity of NN-based communication algorithms. Considering an NN-based receiver as an example, complexity reduction was achieved by quantizing the weights to powers of two, reducing all multiplications to bit shifts in fixed-point arithmetic. Compared to naive direct compression, this approach incurs almost no BLER increase, while enabling a significant gain in computational complexity. Our results show that the compressed NN-based receiver achieves BLERs close to those of ML detection, while enabling $60\%$ gains in computational complexity when implemented on a 14-bit fixed-point arithmetic system.

We believe that future work on quantization, compression, and more broadly, the efficient hardware implementation of NNs for physical layer tasks is required by our community before machine learning-based solutions can make it into commercial products.

Acknowledgment

The authors thank Luc Dartois for comments that greatly improved the manuscript.

Bibliography


[1] F. Liang, C. Shen, and F. Wu, “An Iterative BP-CNN Architecture for Channel Decoding,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 144–159, Feb. 2018.
[2] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in IEEE 18th Int. Workshop on Signal Process. Advances in Wireless Commun. (SPAWC), July 2017.
[3] H. Ye, G. Y. Li, and B. Juang, “Power of Deep Learning for Channel Estimation and Signal Detection in OFDM Systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
[4] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein, “Training Quantized Nets: A Deeper Understanding,” in Advances in Neural Inform. Process. Syst., Dec. 2017.
[5] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, “Morphnet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks,” in IEEE Conf. on Comput. Vision and Pattern Recognition (CVPR), June 2018.
[6] J. Johnson, “Rethinking Floating Point for Deep Learning,” arXiv preprint arXiv:1811.01721, Nov. 2018.
[7] T. O’Shea and J. Hoydis, “An Introduction to Deep Learning for the Physical Layer,” IEEE Trans. on Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
[8] E. Agrell, “Database of Sphere Packing,” https://codes.se/packings/8.htm, accessed: 2018-07-27.