TL;DR
This paper introduces a deep learning-based quantization method for L-values in Gray-coded modulation, achieving significant memory reduction with minimal performance loss and demonstrating universality across different channel models.
Contribution
A novel autoencoder-based quantization scheme for L-values that reduces memory footprint and is adaptable across various channel conditions without retraining.
Findings
Reduces memory by up to 50% compared to state-of-the-art methods.
Maintains performance loss below 0.1 dB with less than two bits per L-value.
Demonstrates universal applicability across different channel models without retraining.
Abstract
In this work, a deep learning-based quantization scheme for log-likelihood ratio (L-value) storage is introduced. We analyze the dependency between the average magnitude of different L-values from the same quadrature amplitude modulation (QAM) symbol and show they follow a consistent ordering. Based on this, we design a deep autoencoder that jointly compresses and separately reconstructs each L-value, allowing the use of a weighted loss function that aims to more accurately reconstruct low magnitude inputs. Our method is shown to be competitive with state-of-the-art maximum mutual information quantization schemes, reducing the required memory footprint by a ratio of up to two and a loss of performance smaller than 0.1 dB with less than two effective bits per L-value or smaller than 0.04 dB with 2.25 effective bits. We experimentally show that our proposed method is a universal…
Deep Learning-Based Quantization of L-Values for Gray-Coded Modulation
This work has been accepted for presentation at Globecom 2019. Supported by grants ONR N00014-19-1-2590, ARO W911NF-18-1-0247, NSF CNS 1731384, and a gift from NXP Semiconductors Inc., Austin, Texas.
Marius Arvinte, Sriram Vishwanath, and Ahmed H. Tewfik
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, Texas 78712
Email: [email protected]
Abstract
In this work, a deep learning-based quantization scheme for log-likelihood ratio (L-value) storage in general fading scenarios affected by interference is introduced. We analyze the dependency between the average magnitudes of different L-values and show they follow a consistent ordering, regardless of the channel coefficient or interference distribution. Based on this, we design a deep autoencoder that jointly compresses and separately reconstructs each L-value, allowing the use of a weighted loss function that aims to more accurately reconstruct low magnitude inputs. Our method is shown to be competitive with state-of-the-art maximum mutual information quantization schemes, reducing the required memory footprint by a ratio of up to two and achieving a loss of performance smaller than 0.1 dB with less than two effective bits per L-value or smaller than 0.04 dB with 2.25 effective bits. We experimentally show that our proposed method is a universal compression scheme in the sense that after training on an LDPC-coded Rayleigh fading scenario we can reuse the same network without further training on other channel models and codes while preserving the same performance benefits.
I Introduction
Deep learning has recently gained a foothold in wireless communications and signal processing, with experimental results showing that it can be used for various tasks ranging from channel code design [1] to black-box communication schemes [2]. At the same time, quantization of information in communication systems is critical for applications such as feedback, relay, or hybrid automatic repeat request schemes, which require long-term storage of information. Motivated by finding a quantization scheme that is as greedy as possible with minimal impact on the end-to-end performance, we introduce a deep learning-based L-value vector quantization scheme that leverages a statistical ordering of the average magnitudes of different bit positions to weigh its loss function during training.
Log-likelihood ratio quantization is a well studied topic, where it has been identified that the optimal formulation aims to maximize mutual information [3]. Prior work in [4] presents a data-driven approach for quantizing L-values even in cases when the channel has an arbitrary or intractable distribution. The work in [5] introduces a maximum mutual information vector quantization method for L-values by using the Lloyd algorithm with a KL divergence metric. In both cases, a sharp drop in performance is exhibited when the effective storage size approaches or goes below two bits per L-value.
More recently, deep networks have been used for quantization and codebook learning, where the main challenge is back-propagating gradients through the quantization function. Solutions to this include soft-to-hard approximations of the nearest neighbor function that are parameterized by an increasing attraction coefficient over time or replacing the null gradient with other approximations. The work in [6] uses such a schedule during the training phase to learn a compression scheme for high resolution images that outperforms state-of-the-art schemes by almost an order of magnitude. With the same goal, the work in [7] introduces the architecture of vector quantized variational autoencoder (VQ-VAE) to learn a discrete, compressed representation of image and speech signals by approximating the null gradient with a linear function.
Even though these solutions are shown to be successful for images, in the case of communications the signal statistics are fundamentally different and any application requires careful analysis and design of the network architecture and loss function. Previous work in [8] uses an autoencoder structure to compress the channel state information (CSI) matrix by exploiting its time and spatial domain correlation. In this work, we propose to exploit the use of binary reflected Gray coding (BRGC), which is almost universally adopted in practical systems and known to be optimal [9], and show that it induces an ordering on the average magnitude of the L-values that holds for any channel coefficient distribution. We leverage this to design a deep autoencoder network that uses a branched decoder architecture to individually reconstruct each L-value, allowing us to weigh the loss function towards smaller magnitude L-values.
Our work is closest to [10], where a deep autoencoder is used to jointly compress and reconstruct the set of L-values corresponding to a channel use by leveraging the fact that three sufficient statistics will recover them. However, their work exhibits a performance gap even when the compressed signal (latent representation) is not quantized. By leveraging the statistics of L-values, we manage to virtually close this performance gap, reducing it from dB to as low as dB and consequently improving the quantized results as well. We compare our work with the results in [10] and state-of-the-art maximum mutual information schemes and show that we exhibit a compression gain of up to two, allowing us to use an effective storage size of less than two bits per L-value for high order modulation schemes. Furthermore, we experimentally show that the same network weights have generalization properties and can be used for a wide range of scenarios with different channel models and codes, without requiring any further training or adjustments.
II System Model
Consider the digital baseband model of a binary reflected Gray coded (BRGC) $M$-QAM modulation scheme, where the transmitted symbol $x$ is obtained by mapping the bits $b_1, \dots, b_K$ to a complex symbol belonging to the constellation $\mathcal{C}$, where $|\mathcal{C}| = M = 2^K$. We assume there are $N$ interfering symbols denoted by $s_1, \dots, s_N$, each independently drawn from a constellation $\mathcal{C}_i$. Under a flat fading complex channel model, the complex received symbol $y$ has the expression
$$y = h x + \sum_{i=1}^{N} g_i s_i + n, \tag{1}$$
where $n$ is drawn from a circular complex Gaussian distribution with zero mean and variance equal to $\sigma_n^2$, and $h$ and $g_i$ represent the complex channel coefficients of the desired signal and each interferer, respectively. Given complete channel state information, the exact log-likelihood ratio (L-value) at the receiver for bit $b_k$ is given by
[TABLE]
Assuming that the prior probability of the interference vector is uniform, we can simplify the expressions and expand each conditional probability as
[TABLE]
Carrying an analysis similar to that in [10] and factoring out the term from the exponents we derive the sufficient statistics required to exactly reconstruct the set of L-values as
[TABLE]
Counting each complex value as two real ones, it follows that the number of real-valued sufficient statistics is equal to $2N + 3$ for a scenario with $N$ interferers. The special case $N = 0$ corresponds to the case where there is no interference or, more interestingly, where there is no information about the interference channel gains $g_i$. In this case, the unknown interference is effectively treated as noise.
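The interference-free case can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a 16-QAM constellation with unnormalized amplitudes, the sign convention $L_k = \log P(b_k = 1 \mid y, h) - \log P(b_k = 0 \mid y, h)$, and hypothetical helper names (`gray_pam`, `qam_constellation`, `exact_llrs`). It illustrates that the exact L-values computed from $(y, h, \sigma_n^2)$ coincide with those computed from the three real quantities $\mathrm{Re}(y/h)$, $\mathrm{Im}(y/h)$ and $|h|^2/\sigma_n^2$.

```python
import math


def gray_pam(bits_per_dim):
    """Return {bit tuple: amplitude} for binary reflected Gray-coded PAM."""
    levels = 2 ** bits_per_dim
    table = {}
    for idx in range(levels):
        gray = idx ^ (idx >> 1)  # binary reflected Gray code of the level index
        bits = tuple((gray >> (bits_per_dim - 1 - j)) & 1 for j in range(bits_per_dim))
        table[bits] = 2 * idx - (levels - 1)  # e.g. -3, -1, 1, 3 for 4-PAM
    return table


def qam_constellation(k):
    """Map K-bit tuples to complex symbols; the first K/2 bits set the real part."""
    pam = gray_pam(k // 2)
    return {bi + bq: complex(ai, aq) for bi, ai in pam.items() for bq, aq in pam.items()}


def exact_llrs(y, h, noise_var, constellation, k):
    """Exact L-values with uniform priors (sign convention is an assumption)."""
    llrs = []
    for pos in range(k):
        sums = [0.0, 0.0]
        for bits, x in constellation.items():
            sums[bits[pos]] += math.exp(-abs(y - h * x) ** 2 / noise_var)
        llrs.append(math.log(sums[1]) - math.log(sums[0]))
    return llrs


K = 4
const = qam_constellation(K)
y, h, noise_var = complex(0.7, -1.9), complex(0.8, 0.5), 0.6
direct = exact_llrs(y, h, noise_var, const, K)
# Same L-values from Re(y/h), Im(y/h) and |h|^2 / noise_var only.
equalized = exact_llrs(y / h, 1.0, noise_var / abs(h) ** 2, const, K)
```

The two lists agree up to floating-point error, which is consistent with three real numbers being enough to recover all $K$ L-values when $N = 0$.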
We further consider the conditional distributions of the L-values under the max-log approximation for different bit positions . The work in [11] shows that in a nonfading (i.e., affected only by noise) scenario the probability density function (PDF) of the L-values for each bit level can be written as a sum of truncated Gaussian PDFs. According to [11], the approximate PDF of the -th L-value conditioned on the transmitted bit has the expression
[TABLE]
where represents the Gaussian PDF and is the uniform probability that is a Gaussian with mean and variance given by and respectively, in concordance with the Zero-Crossing Model approximation in [11]. As the authors of [11] show, this leads to a key observation about the L-values corresponding to different bit positions, namely that after splitting the bits in two groups (due to the real/imaginary symmetry induced by BRGC) the bits occupying the first positions are more robust to channel conditions than later ones. Formally, this can be expressed by the inequality
[TABLE]
where the expectation is taken across the noise, and a similar ordering holds for the second half of the L-values, associated with the imaginary part. Finally, let $\lambda_k$ be the soft bit associated with the L-value $L_k$, given by [3]
$$\lambda_k = \tanh\left(\frac{L_k}{2}\right). \tag{7}$$
Since $\tanh$ is a monotonically increasing function, it follows that the ordering in (6) holds for $\lambda_k$ as well. We let $\mathbf{L}$ and $\boldsymbol{\lambda}$ denote the $K$-dimensional real vectors of L-values and soft bits corresponding to a single channel use, ordered according to (6), with $L_k$ and $\lambda_k$ their $k$-th elements, respectively. Our goal is to compress $\boldsymbol{\lambda}$ to a three-dimensional latent representation, quantize it to a finite, small number of bits, and reconstruct the original input.
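The ordering of average magnitudes can be illustrated with a small Monte Carlo experiment. This sketch makes several assumptions that are not taken from the paper: a single real dimension (Gray-coded 4-PAM with amplitudes $\pm 1, \pm 3$), a nonfading channel with unit noise variance, max-log L-values scaled by the noise variance, and the soft bit $\tanh(L/2)$.

```python
import math
import random

# Gray-coded 4-PAM: neighbouring amplitudes differ in exactly one bit.
AMPLITUDES = {(0, 0): -3.0, (0, 1): -1.0, (1, 1): 1.0, (1, 0): 3.0}


def maxlog_llrs(y, noise_var):
    """Max-log L-values for both bit positions of one real observation."""
    llrs = []
    for pos in range(2):
        d0 = min((y - a) ** 2 for b, a in AMPLITUDES.items() if b[pos] == 0)
        d1 = min((y - a) ** 2 for b, a in AMPLITUDES.items() if b[pos] == 1)
        llrs.append((d0 - d1) / noise_var)
    return llrs


def soft_bit(llr):
    """Soft bit associated with an L-value."""
    return math.tanh(llr / 2.0)


rng = random.Random(0)
noise_var = 1.0
num_samples = 20000
total_abs_llr = [0.0, 0.0]
total_abs_soft = [0.0, 0.0]
for _ in range(num_samples):
    x = rng.choice(list(AMPLITUDES.values()))
    y = x + rng.gauss(0.0, math.sqrt(noise_var))
    for pos, llr in enumerate(maxlog_llrs(y, noise_var)):
        total_abs_llr[pos] += abs(llr)
        total_abs_soft[pos] += abs(soft_bit(llr))

mean_abs_llr = [t / num_samples for t in total_abs_llr]
mean_abs_soft = [t / num_samples for t in total_abs_soft]
```

In this toy setting the first bit position has the larger average magnitude, both for the L-values and for the soft bits, so the later position is the one whose small values dominate reconstruction difficulty.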
The basic machine learning structure we use for compressing the L-values is a deep neural network. From a high-level perspective, a deep neural network can be viewed as a parameterized function $f(\mathbf{x}; \boldsymbol{\theta})$, where $\mathbf{x}$ is the real-valued input vector and $\boldsymbol{\theta}$ denotes the weight vector, containing the serialized weights of all layers. For the rest of this work, we only refer to feedforward, fully-connected neural networks, in which a layer that takes as input the vector $\mathbf{u}$ and outputs $\mathbf{v}$ implements the operation
$$\mathbf{v} = \sigma(\mathbf{W}\mathbf{u} + \mathbf{b}), \tag{8}$$
where $\mathbf{W}$ and $\mathbf{b}$ represent the weights and biases associated with the layer, $\sigma$ is the element-wise activation function, and all dimensions are consistent with matrix-vector multiplication. Typical activation functions include the rectified linear unit (ReLU), given by $\max(u, 0)$, and the hyperbolic tangent $\tanh(u)$. Finally, an autoencoder is a deep neural network that is trained to reconstruct its own input $\mathbf{x}$. This is commonly achieved by performing gradient updates on the weights $\boldsymbol{\theta}$ in order to minimize the empirical risk function
$$\frac{1}{T} \sum_{t=1}^{T} d\big(\mathbf{x}_t, f(\mathbf{x}_t; \boldsymbol{\theta})\big), \tag{9}$$
where $\mathbf{x}_t$ represents the $t$-th training sample and $d$ can be chosen as any distance or quasi-distance function. Importantly, we note that gradient-based approaches are incompatible with the quantization of hidden activations in deep networks, since the gradient becomes null almost everywhere after quantization, preventing preceding weights from being updated.
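The layer operation and the null-gradient issue can be made concrete in a few lines. This is an illustrative sketch only: the weights are arbitrary, `uniform_quantize` is a hypothetical stand-in for any quantizer, and the gradient is approximated by a central finite difference rather than by back-propagation.

```python
import math


def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b), applied element-wise."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(v):
    return max(v, 0.0)


def uniform_quantize(v, step=0.25):
    """Round to the nearest multiple of `step` (stand-in for any quantizer)."""
    return step * round(v / step)


def finite_difference(fn, v, delta=1e-4):
    """Central finite-difference estimate of the derivative of fn at v."""
    return (fn(v + delta) - fn(v - delta)) / (2 * delta)


hidden = dense([0.5, -1.0], [[1.0, 2.0], [0.5, -0.5]], [0.1, 0.0], relu)
latent = dense(hidden, [[1.0, -1.0]], [0.0], math.tanh)

# The tanh activation has a usable slope, its quantized version does not.
slope_tanh = finite_difference(math.tanh, 0.3)
slope_quantized = finite_difference(lambda v: uniform_quantize(math.tanh(v)), 0.3)
```

Away from the decision thresholds the quantized activation is piecewise constant, so its slope is exactly zero and no error signal would reach the preceding weights.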
III The Proposed Scheme
While the original derivation is performed for a nonfading scenario, additional conditioning on the channel realization and averaging ensures the ordering of L-values holds even for an arbitrary fading distribution. We formulate and prove the following proposition.
Proposition 1
The ordering in (6) holds for an arbitrary distribution of the channel coefficient $h$.
Proof:
We prove this for the interference-free case, but the proof can be extended for interference, by taking the double integral into account. First we note that by (4) knowledge of is sufficient to characterize the distribution of L-values, since the phase can always be corrected assuming full CSI is available. By letting and expanding the PDF of the -th L-value we obtain
[TABLE]
where is the PDF of . Considering , it follows that (5) holds for any fixed , thus the ordering (6) holds pointwise, thus it holds for any . ∎
The previous result is proven for the max-log approximation of the L-values but empirically holds for the complete expression as well. In the case of an interference-free Rayleigh fading channel, Figure 1 illustrates this property by plotting four of the eight distributions of the L-values for a 256-QAM scenario, where it can be observed that the latter bits have a lower average absolute reliability than the earlier ones. This asymmetry of the different bit locations affects the performance of forward error correction, especially in the mid-high signal-to-noise ratio regime, where accurate reconstruction of low magnitude L-values is shown to be critical for correct decoding [12].
Equations (4) and (6) motivate the architecture of our compression and reconstruction network. We use an autoencoder with a latent representation of dimension , a joint encoder and a branched decoder (i.e., one, smaller, deep neural network is used to separately reconstruct each soft bit). The architecture of our solution is shown in Figure 2. The network takes as input the vector of soft bits and feeds them to the encoder, where the compressed latent representation is output. Letting be the encoder part of the network and each of the bit decoders, the -th reconstructed soft bit can be expressed as
[TABLE]
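The data flow of a joint encoder followed by one decoder branch per soft bit can be sketched as follows. This is not the authors' implementation: the weights are random rather than trained, the hidden width `HIDDEN` is an arbitrary illustrative value, and the helper names are hypothetical. The layer counts and activations follow the description in Section IV-A (three hidden ReLU layers, $\tanh$ on the latent representation and on the outputs).

```python
import math
import random


def dense(x, layer, activation):
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(v):
    return max(v, 0.0)


def random_layer(rng, n_in, n_out):
    """Randomly initialized (untrained) weights and zero biases."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_mlp(rng, sizes):
    return [random_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]


def run_mlp(x, layers):
    """ReLU on the hidden layers, tanh on the last layer."""
    for layer in layers[:-1]:
        x = dense(x, layer, relu)
    return dense(x, layers[-1], math.tanh)


K, LATENT_DIM, HIDDEN = 8, 3, 16  # HIDDEN is an assumption, not from the paper
rng = random.Random(0)
encoder = make_mlp(rng, [K, HIDDEN, HIDDEN, HIDDEN, LATENT_DIM])
decoders = [make_mlp(rng, [LATENT_DIM, HIDDEN, HIDDEN, HIDDEN, 1]) for _ in range(K)]

soft_bits = [math.tanh(rng.gauss(0.0, 2.0)) for _ in range(K)]
latent = run_mlp(soft_bits, encoder)                            # joint compression
reconstruction = [run_mlp(latent, dec)[0] for dec in decoders]  # one branch per bit
```

Because every branch reads the same latent vector but owns its weights, any single decoder can be retrained or swapped without touching the encoder.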
To account for accurate reconstruction of low magnitude soft bits, we adopt two measures:
1. We use a sample-wise weighted mean squared error function as in [10]. The loss function between the -th soft bit and its reconstruction has the expression
[TABLE]
where is used for numerical stability. This formulation ensures that more importance is given to soft bits with low values inside the same bit position.
2. We use a set of real weights to weigh the contributions of each soft bit to the total loss function leading to the expression
[TABLE]
where the weights are normalized to satisfy , thus at least one weight must be strictly positive.
Note that the weights are not applied per sample, but rather per soft bit and ensure that more importance is placed on reconstructing certain bit positions. By using Proposition 1 and the ordering in (6), we order the weights in decreasing order of reliability by
[TABLE]
Once the weights are set, the architecture is jointly trained for a number of epochs. Since we are using a single encoder, all gradient updates are averaged (with the weights factored in) when is updated, while are individually updated for each of the constituent decoders. We note that a different training regime is also possible where only the weights of specific decoders are updated if the performance after joint training (as measured by the loss function applied component-wise) is not good enough on specific soft bits. In fact, the scheme is completely modular in terms of the decoders, meaning that we can replace any of them with other reconstruction methods once the encoder is fixed.
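A possible form of the two-level weighting is sketched below. The exact expressions are not recoverable from this copy of the text, so everything here is an assumption: the sample-wise term divides the squared error by the input magnitude plus a small constant `eps`, and the per-bit weights are taken to be non-negative and to sum to one.

```python
def samplewise_weighted_mse(soft_bit, reconstruction, eps=1e-6):
    """Squared error divided by the input magnitude (assumed form)."""
    return (soft_bit - reconstruction) ** 2 / (abs(soft_bit) + eps)


def total_loss(soft_bits, reconstructions, bit_weights, eps=1e-6):
    """Per-bit weighted sum of the sample-wise losses for one channel use."""
    assert all(w >= 0.0 for w in bit_weights)
    assert abs(sum(bit_weights) - 1.0) < 1e-9, "weights are assumed to sum to one"
    return sum(w * samplewise_weighted_mse(s, r, eps)
               for w, s, r in zip(bit_weights, soft_bits, reconstructions))


# Later bit positions (less reliable on average) receive larger weights.
weights = [0.1, 0.2, 0.3, 0.4]
soft_bits = [0.99, 0.90, 0.40, 0.05]

# The same absolute error of 0.1, placed on a reliable or on a weak soft bit.
error_on_reliable_bit = total_loss(soft_bits, [0.89, 0.90, 0.40, 0.05], weights)
error_on_weak_bit = total_loss(soft_bits, [0.99, 0.90, 0.40, 0.15], weights)
```

Under these assumptions an identical absolute error is penalized far more heavily when it falls on the low-magnitude soft bit, which is the behavior the text asks of the loss.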
Once training is complete, we obtain a universal, compressed representation of the soft bits in the form of , which needs to be quantized, stored and reconstructed during inference. Letting be the -bit codebook in the latent space the quantization function is given by
[TABLE]
The use of a minimum distortion quantization in the latent space is justified by prior work showing that over-parameterized deep neural networks are naturally robust when their hidden activations are quantized [13]. Finally, reconstruction of the original L-value vector is performed by applying the decoders for each soft bit on the quantized latent representation with .
IV Performance Results
IV-A Architectural and Training Details
We use three hidden layers for the encoder and each of the decoders with a universal intermediate output size of , except for the latent representation, which has a dimension of three as discussed in Section III. Storing weights in 32-bit floating precision leads to a total memory footprint (considering the quantization codebook as negligible) lower than KB for $K = 8$ (256-QAM), and scaling on the order of , thus rapidly decreasing for lower order modulations.
All hidden activations are ReLU, except for the latent representation and the outputs, which come from $\tanh$ activations. During training, we also add a small amount of additive white Gaussian noise with zero mean and to the latent representation before it is decoded to encourage generalization and robustness to numerical quantization. Since the latent representation comes from the $\tanh$ activation, each of its elements is bounded to the interval $[-1, 1]$, which prevents the architecture from learning trivial solutions such as boosting the power of the latent representation to overcome the added noise.
We use the AMSGrad version of the Adam algorithm for minimizing the empirical risk function in (13) with a batch size of samples, learning rate of and recommended parameters [14] and split the training into two phases. The entire procedure is summarized in Algorithm 1. In the first phase the encoder and all decoders are jointly trained starting with equal weights . After a number of epochs the average reconstruction error is computed for all soft bits in the training data and we update the weights as
[TABLE]
This process is repeated for a number of rounds , placing a higher importance on soft bits that have larger average reconstruction errors. We empirically observe that this procedure always respects the order induced by (14) and also converges to a steady state where the weights no longer need an update. Alternatively, if a closed form expression of the distribution of is available and can be numerically evaluated one can use a fixed set of weights inversely proportional to .
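One plausible reading of the weight update is that each weight becomes proportional to the average reconstruction error of its bit position. The rule below is an assumption made for illustration (the update expression is missing from this copy), and `update_weights` is a hypothetical helper, not a function from the released code.

```python
def update_weights(avg_errors):
    """Assumed rule: new weights proportional to the average reconstruction error."""
    total = sum(avg_errors)
    return [e / total for e in avg_errors]


# Round 0 starts from equal weights; errors are made-up illustrative values
# in which later bit positions are reconstructed less accurately.
weights = [0.25, 0.25, 0.25, 0.25]
avg_errors = [0.01, 0.02, 0.03, 0.04]
new_weights = update_weights(avg_errors)
```

With errors that grow along the bit positions, the updated weights grow in the same order, so the update does not disturb an ordering of this kind.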
In the second phase we freeze the weights of the encoder and continue individual training for each decoder for a number of epochs. Here we leverage the branched architecture to improve the performance of individual decoders without affecting the learned representation. We explored options where is equal or proportional to the stationary , but both cases lead to roughly the same performance. Since each decoder only updates its own weights, this process is fully parallelizable among them.
Once training is completed, we apply a mini-batch version of the k-Means algorithm [15] to independently obtain a non-uniform scalar quantizer for each of the three components of the latent representation. Note that this has the advantage of drastically minimizing the storage requirements of the codebook versus vector quantization of the latent space and is also empirically observed to offer similar or better performance. The training data consists of L-values computed using (2) and generated from the coded bits of a number of LDPC codewords with a length of and rate transmitted over a Rayleigh fading channel with . The complete source code, pretrained networks and all performance results are available online at https://github.com/mariusarvinte/deep-llr-quantization.
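A per-component scalar quantizer of this kind can be sketched with a plain 1-D k-means. Note the substitutions: this uses the full-batch Lloyd iteration instead of the mini-batch variant mentioned above, synthetic $\tanh$-bounded samples instead of a trained latent representation, and hypothetical helper names.

```python
import math
import random


def quantize_index(value, codebook):
    """Index of the nearest codeword (minimum squared distortion)."""
    return min(range(len(codebook)), key=lambda i: (value - codebook[i]) ** 2)


def lloyd_1d(samples, num_levels, iterations=50):
    """Plain (full-batch) 1-D k-means; returns the sorted codebook."""
    ordered = sorted(samples)
    # Initialise the centroids at evenly spaced quantiles of the data.
    codebook = [ordered[(2 * i + 1) * len(ordered) // (2 * num_levels)]
                for i in range(num_levels)]
    for _ in range(iterations):
        sums = [0.0] * num_levels
        counts = [0] * num_levels
        for s in samples:
            idx = quantize_index(s, codebook)
            sums[idx] += s
            counts[idx] += 1
        # Empty cells keep their previous centroid.
        codebook = [sums[i] / counts[i] if counts[i] else codebook[i]
                    for i in range(num_levels)]
    return sorted(codebook)


rng = random.Random(0)
# Stand-in for one latent component: tanh-bounded values in (-1, 1).
latent_component = [math.tanh(rng.gauss(0.0, 1.5)) for _ in range(2000)]
bits = 3
codebook = lloyd_1d(latent_component, 2 ** bits)
stored_index = quantize_index(0.42, codebook)  # what would be written to memory
restored_value = codebook[stored_index]        # what the decoders would read back
```

Only the `bits`-wide index is stored per component, and three such codebooks of $2^{\text{bits}}$ scalars each replace a single vector codebook over the whole latent space.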
IV-B Impact of Quantization on Block Error Rate
Throughout this section, we show the results obtained for $K = 8$ (256-QAM), but the architecture can be readily used for any modulation scheme. Since the number of quantized values is a constant w.r.t. $K$, the scheme offers more efficient compression for high-order modulation mappings. The first experiment involves investigating the performance of our scheme in terms of block error rate (BLER) in the same conditions in which the training data is generated. We generate LDPC codewords for each signal-to-noise ratio in the set dB, corresponding to a high-mid noise power regime, concatenate the data points and shuffle them. Once training is completed, we design the k-Means quantizers using the latent representation of the same dataset. The empirical marginal distributions of the three components of are shown in Figure 3 together with their quantization codebooks. For validation, we generate a set of unseen LDPC codewords across a slightly wider signal-to-noise ratio range, encode them, quantize the latent representation with the trained codebooks and reconstruct them, followed by LDPC decoding using belief propagation with iterations.
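The effect of a fixed number of quantized values on the storage cost per L-value is simple arithmetic. The sketch below assumes three latent components and, purely for illustration, 5 or 6 bits per component; these per-component resolutions are not stated in this copy of the text, although they happen to reproduce the "less than two" and 2.25 effective-bit figures quoted in the abstract.

```python
from fractions import Fraction


def effective_bits_per_llr(bits_per_component, num_llrs):
    """Total stored bits per channel use divided by the number of L-values."""
    return Fraction(sum(bits_per_component), num_llrs)


K = 8  # 256-QAM: eight L-values per channel use
five_bits_each = effective_bits_per_llr([5, 5, 5], K)  # 15 / 8
six_bits_each = effective_bits_per_llr([6, 6, 6], K)   # 18 / 8
qam16 = effective_bits_per_llr([5, 5, 5], 4)           # same latent, 16-QAM
```

The same 15 stored bits cost 1.875 bits per L-value for 256-QAM but 3.75 for 16-QAM, which is why the gain grows with the modulation order.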
We compare the performance of our method with the maximum mutual information quantization in [4], where we train a reference quantization for each separate bit position to account for large order modulation schemes, as well as use an initial codebook constrained in the to further boost performance. Additionally, we include the full precision (unquantized latent representation) results from [10] to show that we successfully cover the performance gap coming from the autoencoder reconstruction. Figure 4 shows the obtained block error rate curves for all the schemes, as well as the unquantized performance. Comparing at BLER we notice that we achieve the same performance with bits instead of , leading to a compression ratio of times and a loss of performance smaller than dB when compared to full precision storage, while using an equivalent of bits per L-value. As a contrast, using the scheme in [4] with bits per L-value leads to a performance loss of up to dB.
IV-C Generalization Performance
We investigate the performance of the proposed quantization scheme when applied to different testing configurations than the one we trained the network with. Figure 5 shows the validation performance of the network trained in Section IV-B when the channel is a frequency domain representation of a multitap fading extended typical urban (ETU) channel, corresponding to an OFDM scenario with a carrier frequency of GHz, bandwidth MHz and subcarriers (one for each QAM symbol). Each LDPC codeword experiences a different realization of the ETU channel and we use a random, but fixed interleaver for extra robustness. The relative gain of our method versus the maximum mutual information quantizer is fully preserved both for and bits, even though the network and codebooks are not further trained or adjusted in any way for this particular scenario.
Additionally, we investigate the performance of the same network when applied to a Polar-coded Rayleigh fading scenario. We simulate a Polar code of length and rate used for the New Radio (NR) control channels, decoded with the successive cancellation list algorithm with a list size of . Figure 6 plots the BLER performance on codewords obtained with the same network, where we notice that our method retains its advantages. This leads us to the claim that the learned quantization scheme is universal, in the sense that it exhibits the same performance regardless of the type of channel or channel code used and does not require any further training whatsoever.
IV-D Discussion
The performance of our scheme can be further enhanced if we take into account that it is not required to have the bit resolutions of all latent space components equal when performing quantization. Indeed, judging by Figure 3 it appears that the second component is more sensitive to scalar quantization since its distribution has a higher entropy. For example, given a budget of bits, we expect that allocating them to as instead of, say, will lead to less overall degradation of the end-to-end performance. Because of space constraints and to not make the results too crowded we omit this result, but this is indeed the case. This reasoning can be further extended to subsets of latent components if vector quantization is applied in the latent space, but we leave this for future research.
V Conclusions
In this work we have introduced a universal log-likelihood ratio (L-value) compression and quantization method that uses a deep autoencoder with a branched decoder to compress the set of L-values corresponding to a channel use, quantize the latent representation, and reconstruct the L-values from it. The branched decoder architecture allows us to more accurately reconstruct low-magnitude L-values, which are critical for successful decoding under greedy quantization.
Our results show that we can afford quantization with less than two effective bits per L-value for 256-QAM modulation, regardless of the type of channel model or code used, with a loss smaller than 0.1 dB in terms of BLER. In the high performance regime, we achieve losses smaller than 0.04 dB with 2.25 effective bits per L-value. The algorithm can be used for any modulation scheme (with better gains achieved for higher order schemes) and is competitive with state-of-the-art maximum mutual information quantization algorithms, achieving a compression factor of up to two times for the same accuracy.
References
[1] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath, “Deepcode: Feedback codes via deep learning,” in Advances in Neural Information Processing Systems, 2018, pp. 9436–9446.
[2] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
[3] W. Rave, “Quantization of log-likelihood ratios to maximize mutual information,” IEEE Signal Processing Letters, vol. 16, no. 4, pp. 283–286, 2009.
[4] A. Winkelbauer and G. Matz, “On quantization of log-likelihood ratios for maximum mutual information,” in 2015 IEEE 16th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 2015, pp. 316–320.
[5] M. Danieli, S. Forchhammer, J. D. Andersen, L. P. Christensen, and S. S. Christensen, “Maximum mutual information vector quantization of log-likelihood ratios for memory efficient HARQ implementations,” in 2010 Data Compression Conference. IEEE, 2010, pp. 30–39.
[6] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
[7] A. van den Oord, O. Vinyals et al., “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
[8] C. Lu, W. Xu, H. Shen, J. Zhu, and K. Wang, “MIMO channel information feedback using deep recurrent network,” IEEE Communications Letters, vol. 23, no. 1, pp. 188–191, 2018.
