Learned Image Compression with Soft Bit-based Rate-Distortion Optimization

David Alexandre; Chih-Peng Chang; Wen-Hsiao Peng; Hsueh-Ming Hang

arXiv:1905.00190·eess.IV·May 2, 2019

TL;DR

This paper proposes a soft-bit-based method for learned image compression. Representing quantization indices as differentiable soft bits makes both the quantization and the rate estimate differentiable for rate-distortion optimization, leading to state-of-the-art MS-SSIM and PSNR results among learning-based schemes.

Contribution

Introduction of soft bits for differentiable quantization, enhancing rate-distortion optimization in learning-based image compression.

Findings

01 Achieves state-of-the-art MS-SSIM and PSNR performance among learning-based schemes.

02 Effectively couples rate estimation with context-adaptive binary arithmetic coding.

03 Provides a differentiable distortion objective function.

Abstract

This paper introduces the notion of soft bits to address the rate-distortion optimization for learning-based image compression. Recent methods for such compression train an autoencoder end-to-end with an objective to strike a balance between distortion and rate. They are faced with the zero gradient issue due to quantization and the difficulty of estimating the rate accurately. Inspired by soft quantization, we represent quantization indices of feature maps with differentiable soft bits. This allows us to couple tightly the rate estimation with context-adaptive binary arithmetic coding. It also provides a differentiable distortion objective function. Experimental results show that our approach achieves the state-of-the-art compression performance among the learning-based schemes in terms of MS-SSIM and PSNR.

Equations

$$\lambda\times L_{R}(\bm{q})+L_{D}(\bm{x},\hat{\bm{x}}) \tag{1}$$

$$u(f)\coloneqq\left\{\begin{array}{ll}1&\mbox{if }f\geq 0\\ 0&\mbox{if }f<0\end{array}\right.\approx\sigma_{\alpha}(f)\coloneqq\frac{1}{1+e^{-\alpha f}} \tag{2}$$

$$\begin{aligned}q_{1}(f)\coloneqq{}&\sigma_{\alpha}(f-0.25)-\sigma_{\alpha}(f-0.5)\\&+\sigma_{\alpha}(f-0.75)-\sigma_{\alpha}(f-1)\end{aligned} \tag{3}$$

$$\nabla_{\theta_{e}}\left(-\log p(\tilde{q}_{i}|ctx)\right)=-\frac{1}{p(\tilde{q}_{i}|ctx)}\frac{\partial p(\tilde{q}_{i}|ctx)}{\partial\tilde{q}_{i}}\frac{d\tilde{q}_{i}}{df}\nabla_{\theta_{e}}f \tag{4}$$

Full text

**Index Terms:** Autoencoder, Deep Learning, Image Compression, Soft Bits

1 Introduction

Learning-based image compression has recently attracted considerable attention due to the renaissance of deep learning. Unlike traditional methods, learning-based schemes can be adapted to any differentiable objective, opening up many optimization possibilities. For example, Li et al. [1] propose a content-weighted image compression model that performs region-adaptive compression via a learnable importance map.

Most learning-based methods [1, 2, 3, 4, 5, 6, 7, 8] rely on training an autoencoder end-to-end with the aim of striking a good balance between distortion and rate losses. Two challenges arise. First, the quantization process for lossy feature map compression causes zero gradients during back-propagation. Second, the rate loss is often difficult to estimate accurately, as it is tightly coupled with entropy coding, the operation of which is generally not differentiable.

Several prior works address these issues. Li et al. [1] overcome the zero gradients by a straight-through mechanism, which simply treats the quantizer as an identity function during back-propagation. Agustsson et al. [5] and Mentzer et al. [3] introduce a non-uniform soft quantizer with a smooth mapping function as a surrogate for the hard quantizer. Ballé et al. [6, 7] and Theis et al. [8] adopt an additive noise model for the quantizer.

Compared with the quantization issue, rate estimation is even more challenging. Li et al. [1] use the sum of importance map features as a rough estimate of the rate. Theis et al. [8] estimate the rate from an upper bound on the non-differentiable number of bits. For better estimation, Ballé et al. [6, 7] and Minnen et al. [4] compute the differential entropy of the quantizer output based on the additive noise model. To bind the rate estimation tightly to the actual entropy coding, Mentzer et al. [3] use the context probability model implemented by PixelRNN [9] to compute the self-information of each coding symbol. Their scheme is, however, complicated due to the use of PixelRNN [9] and non-binary arithmetic coding.

In this paper, we propose a learned image compression system with soft-bit-based rate-distortion optimization. It has the striking feature of combining effective coding tools from modern image codecs (e.g., uniform quantization, binary bitplane coding with on-the-fly probability updating, and simple context models) with the strong suit of deep learning (e.g., non-linear autoencoder). Moreover, we introduce the notion of soft bits to represent quantization indices of feature samples so that both rate and distortion losses can be estimated accurately in a differentiable manner. Experimental results show that our method achieves the state-of-the-art rate-distortion performance among the learning-based schemes.

The remainder of this paper is organized as follows: Section 2 describes the proposed method. Section 3 details the training procedure. Section 4 presents the experimental results. Section 5 concludes this work.

2 Proposed Method

This section details the framework of our image compression system, including the overall architecture, the operation of each component, and the modeling of compression rate and distortion for end-to-end training. Notation-wise, we use a bold letter (e.g., $\bm{x}$) to refer collectively to a high-dimensional tensor and a Roman letter (e.g., $x$) to denote its element in some order.

2.1 Overall Architecture

Fig. 1 illustrates our proposed framework. There are two data paths, one for operating the model in the test mode (that is, for putting it into use in practice) and the other for its training (i.e., training mode).

The data path in the test mode, as indicated by the solid arrow lines, begins with encoding an image $\bm{x}\in\mathbb{R}^{W\times H\times 3}$ of size $W\times H$ in 4:4:4 YUV format through a convolutional encoder $E(\bm{x};\theta_{e})$ into a compact set of feature maps $\bm{f}\in\mathbb{R}^{W/8\times H/8\times C}$, of which each feature sample $f\in(0,1)$ is a real number. For lossy compression, $f$ is uniformly quantized by a $b$-bit, power-of-two quantizer $Q$, leading to a fixed-point binary representation $q=\lfloor f/2^{-b}\rfloor$, where $2^{-b}$ is the quantization step size. That is, the quantization (output) index $q$ is the first $b$ significant bits of $f$ in its binary representation (e.g., $q=1100$ for $f=0.81,b=4$). As in most image compression systems, whether learning-based or conventional, the quantization indices $\bm{q}$ are compacted further by lossless arithmetic coding. Motivated by JPEG2000 [10], we arrange $\bm{q}$ as bitplanes and perform context-adaptive bit-plane encoding/decoding (CABIC/CABID), which we discuss further in the following sections. To reconstruct the input $\bm{x}$ approximately, the feature sample is first recovered via inverse quantization (IQ) $\hat{f}=q/2^{b}$, followed by convolutional decoding $\hat{\bm{x}}=D(\hat{\bm{f}};\theta_{d})$. Currently, our encoder and decoder come from an autoencoder proposed in [3]; their parameters $\theta_{e},\theta_{d}$ are, however, learned by our training framework, which aims to strike a good trade-off between rate $L_{R}(\bm{q})$ and distortion $L_{D}(\bm{x},\hat{\bm{x}})$ by minimizing the following objective function with respect to $\theta_{e},\theta_{d}$:

$$\lambda\times L_{R}(\bm{q})+L_{D}(\bm{x},\hat{\bm{x}}), \tag{1}$$

where $L_{D}(\bm{x},\hat{\bm{x}})$ is defined to be a weighted sum of mean-square errors between the YUV components of $\bm{x}$ and $\hat{\bm{x}}$, with the error of the Y component weighted 4 times that of the U and V components.
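The test-mode quantizer and its inverse can be sketched as follows. This is a minimal NumPy illustration of $q=\lfloor f/2^{-b}\rfloor$ and $\hat{f}=q/2^{b}$, not the authors' implementation; the function names `quantize`, `dequantize`, and `index_to_bits` are hypothetical.

```python
import numpy as np

def quantize(f, b):
    """Uniform b-bit quantizer: q = floor(f / 2^-b) for f in (0, 1)."""
    return np.floor(np.asarray(f, dtype=float) * (1 << b)).astype(np.int64)

def dequantize(q, b):
    """Inverse quantization (IQ): f_hat = q / 2^b."""
    return np.asarray(q, dtype=float) / (1 << b)

def index_to_bits(q, b):
    """Binary digits of the index q, most significant bit first."""
    return [(int(q) >> (b - 1 - i)) & 1 for i in range(b)]

# Worked example from the text: f = 0.81, b = 4.
q = quantize(0.81, 4)
print(int(q), index_to_bits(q, 4), float(dequantize(q, 4)))
```

For $f=0.81$ and $b=4$ this gives the index 12, the hard bits 1, 1, 0, 0, and the reconstruction $\hat{f}=0.75$, in agreement with the example above.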

The data path in training mode, as outlined by the dashed arrow lines, is designed for end-to-end model training. Training a learning-based compression system often faces two issues: (1) the quantization effect, which describes the stair-like mapping from $\bm{f}$ to $\hat{\bm{f}}$, gives rise to zero gradients almost everywhere, and (2) the rate cost needed to achieve a rate-distortion optimized design is difficult to estimate accurately. To address these issues, we introduce the notion of soft bits $\tilde{\bm{q}}$ as an alternative to the hard bit representation of the quantization indices $\bm{q}$. As an example, instead of rendering $q$ into "1", "1", "0", "0" for $f=0.81,b=4$ as done previously, we express these binary hard bits as real-valued soft bits, e.g., "0.91", "0.95", "0.1", "0.07", by the soft bit conversion (SB Conv.) module. In doing so, each of these soft bits is formulated as a differentiable function of $f$. Not only can they be used together with a differentiable rate estimator, implemented by a learnable neural network with parameter $\theta_{r}$ in Fig. 1, to give an accurate estimate of the coding cost, but they can also be used to approximate $\hat{\bm{f}}$ in a differentiable manner (by the Inv. SB Conv. module).

To sum up, our framework has three networks to be learned end-to-end: the encoder, the decoder, and the rate estimator. Among these, only the encoder and the decoder will actually operate in the test mode, while the rate estimator is activated for training only.

2.2 Soft Bit Conversion

The soft bit conversion plays a central role in making our compression system end-to-end trainable. It converts the binary, hard-bit representation of the quantization index $q$ of a feature sample $f$ into a differentiable function of $f$, namely the soft-bit representation. In the previous example, the binary fixed-point representation of $q$ for a feature sample $f=0.81$ is "1100" when $f$ is quantized uniformly with a step size of $2^{-4}$. We observe that each of these hard bits $q_{0}=1$, $q_{1}=1$, $q_{2}=0$, $q_{3}=0$ is in fact a function of $f$. For instance, the first bit $q_{0}$ equals 1 when $f$ is in the interval $[0.5,1)$ and 0 when it is in the interval $[0,0.5)$. The mappings for the first two bits $q_{0},q_{1}$ are visualized in Fig. 2 (see the hard-bit curves). Evidently, due to their rectangular waveforms, the derivative with respect to $f$ is zero almost everywhere, making training with back-propagation impossible.

To circumvent this difficulty, we approximate these hard-bit mappings by a superposition of sigmoid functions (see the soft-bit curves in Fig. 2). This is motivated by the fact that any rectangular waveform can be expressed as a superposition of step functions, which in turn can be approximated by sigmoid functions with an adequately chosen hyper-parameter $\alpha$:

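$$u(f)\coloneqq\left\{\begin{array}{ll}1&\mbox{if }f\geq 0\\ 0&\mbox{if }f<0\end{array}\right.\approx\sigma_{\alpha}(f)\coloneqq\frac{1}{1+e^{-\alpha f}}. \tag{2}$$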

As an example, it is seen that:

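$$\begin{aligned}q_{1}(f)\coloneqq{}&\sigma_{\alpha}(f-0.25)-\sigma_{\alpha}(f-0.5)\\&+\sigma_{\alpha}(f-0.75)-\sigma_{\alpha}(f-1).\end{aligned} \tag{3}$$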

With this approximation, $\hat{f}$ is modeled by the soft bits using $\tilde{q}_{0}\times 2^{-1}+\tilde{q}_{1}\times 2^{-2}+\tilde{q}_{2}\times 2^{-3}+\tilde{q}_{3}\times 2^{-4}$ in the back-propagation process. Note that one may as well use the soft quantization technique in [3] to model the mapping from $\bm{f}$ to $\hat{\bm{f}}$ directly.

Although our current model implements a power-of-two uniform quantizer, the soft-bit representation for quantization indices can readily be applied to non-uniform quantizers.
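The soft-bit conversion of this section can be sketched for an arbitrary bit position as below. The sketch generalizes the pattern of Eq. (3): bit $i$ (with $i=0$ the most significant bit) toggles at every multiple of $2^{-(i+1)}$, so it is written as an alternating sum of sigmoids at those thresholds. The function names and the value $\alpha=200$ are assumptions chosen for illustration; the paper does not state its $\alpha$ here.

```python
import numpy as np

def sigmoid(z, alpha):
    """Sharpened sigmoid sigma_alpha(z) = 1 / (1 + exp(-alpha * z))."""
    return 1.0 / (1.0 + np.exp(-alpha * z))

def soft_bit(f, i, alpha):
    """Soft version of bit i (i = 0 is the MSB) of a sample f in (0, 1)."""
    f = np.asarray(f, dtype=float)
    step = 2.0 ** -(i + 1)                    # spacing between toggling points
    k = np.arange(1, 2 ** (i + 1) + 1)        # thresholds at k * step, up to 1.0
    signs = np.where(k % 2 == 1, 1.0, -1.0)   # rising edge, falling edge, ...
    return np.sum(signs * sigmoid(f[..., None] - k * step, alpha), axis=-1)

def soft_dequantize(f, b, alpha):
    """Differentiable surrogate of f_hat: sum_i soft_bit_i * 2^-(i+1)."""
    return sum(soft_bit(f, i, alpha) * 2.0 ** -(i + 1) for i in range(b))

# Assumed alpha; larger values track the hard bits more closely.
soft = [float(soft_bit(0.81, i, alpha=200.0)) for i in range(4)]
print(soft, float(soft_dequantize(0.81, 4, alpha=200.0)))
```

With `i = 1` the thresholds are 0.25, 0.5, 0.75, and 1 with alternating signs, which is exactly Eq. (3). Rounding the four soft bits of $f=0.81$ recovers the hard bits 1, 1, 0, 0.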

2.3 Context-adaptive Bit-plane Coding (CABIC)

Before describing our soft-bit-based rate estimation, we present briefly how the quantization indices $\bm{q}$ of feature maps $\bm{f}$ are coded in the test mode. We first organize $\bm{q}$ into bitplanes. A bitplane is formed collectively by the same binary digits of quantization indices. For example, the most significant bitplane consists of all the $q_{0}$ of feature samples. Bits are then coded starting from the most significant bitplane to the least significant one, with different feature maps processed in the same manner yet separately.
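The bitplane arrangement described above can be illustrated with a short NumPy sketch. It assumes the quantization indices of one feature map are stored as non-negative integers of $b$ bits; `to_bitplanes` and `from_bitplanes` are hypothetical helper names.

```python
import numpy as np

def to_bitplanes(q, b):
    """Split integer indices into b bitplanes; plane 0 is the most significant."""
    q = np.asarray(q, dtype=np.int64)
    return np.stack([(q >> (b - 1 - i)) & 1 for i in range(b)], axis=0)

def from_bitplanes(planes):
    """Reassemble the integer indices from their bitplanes."""
    b = planes.shape[0]
    weights = 1 << np.arange(b - 1, -1, -1)   # 2^(b-1), ..., 2, 1
    return np.tensordot(weights, planes, axes=1)

q = np.array([[12, 3], [0, 15]])   # toy 2x2 feature map with b = 4
planes = to_bitplanes(q, 4)
print(planes[0])                   # most significant bitplane (all q_0)
```

Coding would then proceed over `planes[0]`, `planes[1]`, and so on, one feature map at a time.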

To encode a bitplane, we adopt the context-adaptive binary arithmetic coding technique. Inspired by JPEG2000, we classify every bit as either a significant bit or a refinement bit. Using Fig. 3 for illustration, for coding a significant bit of the quantization index at $X$, we refer to the binary significance status of the surrounding indices at $B$, $D$, $E$ and $F$. This yields a total of 16 context patterns (or ctx values for short), each corresponding to a binary probability model that is updated on-the-fly. For coding a refinement bit, the ctx value is computed based on the bit values of quantization indices at $B,D,E,F$ in the previous bitplane along with those of $A,B,C,D$ in the current bitplane. Since refinement bits are less predictable, we reduce the number of their ctx values to only 9.

Note that we adopt the traditional hand-crafted design for arithmetic coding because (1) it allows simple adaptation of the context probability model to learn local image statistics and (2) it avoids the need to perform neural network inference at bit level, which introduces extra processing latency in the highly sequential arithmetic decoding process.
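A context pattern built from four binary neighbors, together with a probability model updated on-the-fly, might look like the sketch below. Fig. 3 is not reproduced here, so the neighbor positions (left, top-left, top, top-right) are an assumption rather than the paper's $B$, $D$, $E$, $F$ layout; the pseudo-count initialization and all names are likewise hypothetical.

```python
import numpy as np

class AdaptiveBinaryModel:
    """One pair of frequency counts per context, updated after every coded bit."""

    def __init__(self, num_ctx=16):
        # Assumed initialization: one pseudo-count per symbol, so p starts at 0.5.
        self.counts = np.ones((num_ctx, 2), dtype=np.int64)

    def prob(self, ctx, bit):
        """Relative frequency of `bit` under context `ctx`."""
        return self.counts[ctx, bit] / self.counts[ctx].sum()

    def update(self, ctx, bit):
        self.counts[ctx, bit] += 1

def context_index(sig, r, c):
    """Pack the significance flags of four causal neighbors into a value 0..15."""
    h, w = sig.shape
    # Assumed layout: left, top-left, top, top-right of position (r, c).
    neighbors = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
    ctx = 0
    for k, (i, j) in enumerate(neighbors):
        inside = 0 <= i < h and 0 <= j < w
        ctx |= (int(sig[i, j]) if inside else 0) << k   # out-of-range counts as 0
    return ctx

sig = np.array([[1, 0, 1], [1, 0, 0]])   # toy significance map
model = AdaptiveBinaryModel()
ctx = context_index(sig, 1, 1)
p_before = model.prob(ctx, 1)
model.update(ctx, 1)                     # adapt after coding a bit equal to 1
print(ctx, p_before, model.prob(ctx, 1))
```

Four binary flags give $2^{4}=16$ patterns, matching the 16 ctx values stated for significant bits; the reduced 9-value mapping for refinement bits is not specified in the text and is not attempted here.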

2.4 Rate Estimator

To estimate the code length needed to represent an input bit at training time, we refer to its self-information. The self-information of a probabilistic event $\mathcal{E}$ is defined to be the negative logarithm $-\log p(\mathcal{E})$ of its probability $p(\mathcal{E})$. In our case, the probability of a coding bit $q_{i}$ is maintained in a context probability model, which keeps track of $p(q_{i}|ctx)$, where $ctx$ denotes its context pattern/value. Note, however, that $p(q_{i}|ctx)$ is approximated by the relative frequency of $q_{i}$ given the $ctx$, e.g., how many times the event $q_{i}=1$ occurs given the present $ctx$, which is a statistical quantity that is not differentiable with respect to $q_{i}$.
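The self-information under a frequency-based model can be accumulated as in the sketch below, which handles a single context for brevity. The pseudo-count start and the function name are assumptions. Because the probabilities come from integer counts, the result is a lookup rather than a smooth function of the bit, which is the non-differentiability noted above.

```python
import math

def code_length_bits(bits):
    """Ideal adaptive code length: sum of -log2 p(bit), with p from running counts."""
    counts = [1, 1]            # assumed pseudo-counts, so the first bit costs 1 bit
    total = 0.0
    for bit in bits:
        p = counts[bit] / (counts[0] + counts[1])   # relative frequency so far
        total += -math.log2(p)                      # self-information in bits
        counts[bit] += 1                            # on-the-fly update
    return total

print(code_length_bits([1] * 8), code_length_bits([1, 0] * 4))
```

A heavily skewed sequence costs fewer bits than a balanced one of the same length, since the model assigns growing probability to the dominant symbol.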

To overcome this problem, we train a rate estimator that includes a neural network as a probability regressor to fit $p(q_{i}|ctx)$ collected from the training data, as illustrated in Fig. 4. In particular, the probability regressor takes as input the soft-bit version $\tilde{q}_{i}$ of $q_{i}$ so that it generates a non-zero gradient of the estimated rate (computed to be $-\log p(\tilde{q}_{i}|ctx)$) with respect to the encoder parameter $\theta_{e}$:

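$$\nabla_{\theta_{e}}\left(-\log p(\tilde{q}_{i}|ctx)\right)=-\frac{1}{p(\tilde{q}_{i}|ctx)}\frac{\partial p(\tilde{q}_{i}|ctx)}{\partial\tilde{q}_{i}}\frac{d\tilde{q}_{i}}{df}\nabla_{\theta_{e}}f. \tag{4}$$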

It can be seen that if the hard bit mapping is used, the term $d\tilde{q}_{i}/df$ would be replaced with $dq_{i}/df$, which vanishes.

Eq. (4) additionally gives us some important insights into how the estimated rate cost of an input bit $q_{i}$ influences the update of the encoder parameter $\theta_{e}$. Its contribution to the change of $\theta_{e}$ in a gradient update step will be more significant if $q_{i}$ is in its less probable state, i.e., $p(\tilde{q}_{i}|ctx)\leq 0.5$, or if its conditional probability distribution $p(\tilde{q}_{i}|ctx)$ is more biased, i.e., $\partial p(\tilde{q}_{i}|ctx)/\partial\tilde{q}_{i}$ is larger. The latter occurs when $p(\tilde{q}_{i}=1|ctx)\gg p(\tilde{q}_{i}=0|ctx)$ or vice versa.
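The chain rule in Eq. (4) can be checked numerically on a toy setup. In the sketch below, the probability regressor is replaced by a one-parameter logistic function (a stand-in, not the paper's network), $\nabla_{\theta_{e}}f$ is set to 1 so that the derivative is taken with respect to $f$ directly, and the constants `ALPHA`, `W`, and `C` are arbitrary assumed values.

```python
import math

ALPHA, W, C = 20.0, 3.0, -1.0   # assumed sharpness and toy regressor weights

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_bit0(f):
    """Soft most-significant bit: sigma_alpha(f - 0.5) - sigma_alpha(f - 1)."""
    return sig(ALPHA * (f - 0.5)) - sig(ALPHA * (f - 1.0))

def d_soft_bit0(f):
    """Analytic derivative of the soft bit with respect to f."""
    a, b = sig(ALPHA * (f - 0.5)), sig(ALPHA * (f - 1.0))
    return ALPHA * (a * (1.0 - a) - b * (1.0 - b))

def prob(q):
    """Toy stand-in for the probability regressor p(q_tilde | ctx)."""
    return sig(W * q + C)

def d_prob(q):
    p = prob(q)
    return W * p * (1.0 - p)

def rate(f):
    return -math.log(prob(soft_bit0(f)))

def rate_grad(f):
    """Right-hand side of Eq. (4) with the last factor set to 1."""
    q = soft_bit0(f)
    return -(1.0 / prob(q)) * d_prob(q) * d_soft_bit0(f)

f, eps = 0.55, 1e-6
numeric = (rate(f + eps) - rate(f - eps)) / (2.0 * eps)   # central difference
print(rate_grad(f), numeric)
```

The analytic and finite-difference values agree closely, and the gradient is non-zero because $d\tilde{q}_{i}/df$ does not vanish, unlike its hard-bit counterpart.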

3 Training

The encoder, decoder, and rate estimator are trained in two alternating phases. In the first phase, we collect the statistics of the context probabilities $p(q_{i}|ctx)$ from the feature maps, and update the rate estimator $\theta_{r}$ by minimizing the regression error between $p(q_{i}|ctx)$ and $p(\tilde{q}_{i}|ctx)$. In the second phase, we incorporate the rate estimator to give an estimate of the rate cost $L_{R}(\bm{q})$ and update both the encoder and decoder by minimizing $\lambda\times L_{R}(\bm{q})+L_{D}(\bm{x},\hat{\bm{x}})$ with respect to their network parameters $\theta_{e},\theta_{d}$. During training, we set the batch size to 8 and the learning rate to $10^{-4}$.
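The second-phase objective $\lambda\times L_{R}(\bm{q})+L_{D}(\bm{x},\hat{\bm{x}})$ can be written out as a plain NumPy sketch. The 4:1:1 channel weighting follows the definition of $L_{D}$ given earlier; leaving the weights unnormalized, using natural logarithms for the rate, and the function names are assumptions made for illustration.

```python
import numpy as np

def weighted_yuv_mse(x, x_hat, weights=(4.0, 1.0, 1.0)):
    """Weighted sum of per-channel MSEs; Y counts 4 times as much as U or V."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    per_channel = ((x - x_hat) ** 2).mean(axis=(0, 1))   # shape (3,): Y, U, V
    return float(np.dot(weights, per_channel))

def estimated_rate(probs):
    """Sum of self-information -log p over all coded bits (in nats)."""
    return float(-np.log(np.asarray(probs, float)).sum())

def rd_objective(x, x_hat, probs, lam):
    """lambda * L_R + L_D for one H x W x 3 image and its bit probabilities."""
    return lam * estimated_rate(probs) + weighted_yuv_mse(x, x_hat)

x = np.zeros((2, 2, 3))
x_hat = x.copy()
x_hat[..., 0] += 1.0            # unit error in the Y channel only
print(rd_objective(x, x_hat, probs=[0.5, 0.5], lam=0.1))
```

In actual training, `probs` would come from the rate estimator evaluated on soft bits so that the whole expression stays differentiable; here they are fixed numbers.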

The training dataset contains 1,672 images provided by the Challenge on Learned Image Compression (CLIC) 2018 [11]. They are randomly cropped into $128\times 128$ patches, and horizontal and vertical flipping is performed for data augmentation.
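The cropping and flipping step might be implemented along these lines. The flip probability of 0.5 and the function name `random_patch` are assumptions; the text only states that random crops and both kinds of flips are used.

```python
import numpy as np

def random_patch(image, size=128, rng=None):
    """Random size x size crop followed by random horizontal/vertical flips."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)     # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    patch = image[top:top + size, left:left + size]
    if rng.random() < 0.5:                  # assumed flip probability
        patch = patch[:, ::-1]              # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1, :]              # vertical flip
    return patch

image = np.zeros((200, 300, 3), dtype=np.uint8)   # placeholder image
print(random_patch(image, rng=np.random.default_rng(0)).shape)
```

The sketch assumes every training image is at least 128 pixels along both dimensions.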

4 Experimental results

This section compares the rate-distortion performance of the proposed method with that of other codecs. The comparison is conducted on the Kodak dataset [12] by compressing test images at several rates with a varying number of feature maps. Specifically, our encoder is configured to produce 4 feature maps for bits-per-pixel (bpp) lower than 0.25, 8 for bpp’s between 0.25 and 0.5, and 16 for bpp’s higher than 0.5. For every test image, we first calculate the average PSNR and MS-SSIM over its three color components. We then present the average values over the entire dataset as a single quality indicator.
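The PSNR part of this evaluation protocol (average over the three color components, then over the dataset) can be sketched as follows. A peak value of 255 for 8-bit samples is an assumption, as are the function names; MS-SSIM would be averaged the same way but is omitted here.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """PSNR in dB between two single-channel arrays."""
    mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def image_psnr(x, x_hat):
    """Average PSNR over the three color components of one image."""
    return float(np.mean([psnr(x[..., c], x_hat[..., c]) for c in range(3)]))

def dataset_psnr(pairs):
    """Average of the per-image values over (original, reconstruction) pairs."""
    return float(np.mean([image_psnr(x, x_hat) for x, x_hat in pairs]))

x = np.zeros((4, 4, 3))
x_hat = x + 1.0                 # unit error everywhere, so MSE = 1 per channel
print(dataset_psnr([(x, x_hat)]))
```

Averaging per-component PSNR values differs from computing a single PSNR on the pooled MSE, so the order of averaging matters when comparing against other reported numbers.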

From Fig. 5, we see that, in terms of MS-SSIM, our method performs comparably to BPG and Mentzer et al.’s method [3] while outperforming JPEG and JPEG2000 by a large margin across a wide range of bpp’s. On the other hand, in terms of PSNR, it is much inferior to BPG but is superior to the other baselines. These observations are in line with the findings of other researchers that learning-based methods often show much better MS-SSIM performance, especially at low rates. It is worth pointing out that our model is trained by minimizing the mean-squared error while Mentzer et al. [3] optimize theirs for MS-SSIM. This explains why their method has low PSNR. Fig. 6 further displays reconstructed images produced by these codecs for subjective quality evaluation.

Fig. 7 shows the bit allocation among feature maps due to our soft-bit-based rate-distortion optimization. Three observations can be made: (1) the dynamic range of feature samples is adjusted by the encoder depending on the compression rate, as evidenced by the zero bitplanes at lower bpp’s; (2) some feature maps are more important than the others in the rate-distortion sense, as evidenced by the uneven bit distribution across feature maps; and (3) the bit allocation is spatially varying, as indicated by the uneven bit distribution across different regions. These together produce a net effect similar in spirit to the importance map mechanism [1].

5 Conclusion

This paper introduces a learned image compression system with soft-bit-based rate-distortion optimization. The soft bit representation allows the rate estimation to be tightly coupled with entropy coding, giving an accurate rate estimate. We also show that learning-based compression methods can leverage well-designed coding tools from modern image codecs for a more cost-effective solution.

Bibliography

[1] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3214–3223.
[2] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in International Conference on Machine Learning, 2017, pp. 2922–2930.
[3] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, vol. 1, p. 3.
[4] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10794–10803.
[5] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
[6] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” arXiv preprint arXiv:1611.01704, 2016.
[7] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” arXiv preprint arXiv:1802.01436, 2018.
[8] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in International Conference on Learning Representations, 2017.