Convolutional Neural Networks Considering Local and Global features for Image Enhancement

Yuma Kinoshita; Hitoshi Kiya

arXiv:1905.02899·eess.IV·May 9, 2019

TL;DR

This paper introduces a CNN architecture that effectively combines local and global features for improved image enhancement, trained on HDR data to restore details lost in low-quality images.

Contribution

The novel CNN architecture integrates local and global feature extraction modules, enhancing image quality beyond existing CNN-based methods.

Findings

1. Outperforms conventional methods in objective quality metrics

2. Utilizes HDR images for superior training data

3. Produces higher-quality images with better detail restoration

Tables

Table 1: Average scores of objective quality metrics

Method	Input	HE (HE based)	CACHE [2] (HE based)	SRIE [4] (Retinex based)	LIME [3] (Retinex based)	DRHT [7] (CNN based)	Proposed (CNN based)
TMQI	0.7547	0.8834	0.8531	0.8109	0.8407	0.7800	0.8556
Entropy	3.7861	6.0027	6.3761	5.3209	6.3984	5.0554	6.4126
NIQE	5.2876	5.5537	5.0932	4.8992	5.0704	9.5137	4.7207
BRISQUE	43.3178	45.4301	45.4564	45.2134	45.3957	44.6974	45.7993

Equations

$$x_{i,j}=\tilde{f}(X_{i,j})=\min\left((1+\eta)\frac{X_{i,j}^{\gamma}}{X_{i,j}^{\gamma}+\eta},\,1\right) \tag{1}$$

Taxonomy

Topics: Image Enhancement Techniques · Image and Video Quality Assessment · Image and Signal Denoising Methods

Full text


Abstract

In this paper, we propose a novel convolutional neural network (CNN) architecture considering both local and global features for image enhancement. Most conventional image enhancement methods, including Retinex-based methods, cannot restore lost pixel values caused by clipping and quantizing. CNN-based methods have recently been proposed to solve the problem, but they still have a limited performance due to network architectures not handling global features. To handle both local and global features, the proposed architecture consists of three networks: a local encoder, a global encoder, and a decoder. In addition, high dynamic range (HDR) images are used for generating training data for our networks. The use of HDR images makes it possible to train CNNs with better-quality images than images directly captured with cameras. Experimental results show that the proposed method can produce higher-quality images than conventional image enhancement methods including CNN-based methods, in terms of various objective quality metrics: TMQI, entropy, NIQE, and BRISQUE.

Index Terms— Image enhancement, High dynamic range images, Deep learning, Convolutional neural networks

This work was supported by JSPS KAKENHI Grant Number JP18J20326.

1 Introduction

The low dynamic range (LDR) of modern digital cameras is a major factor that prevents cameras from capturing images as well as human vision. This is due to the limited dynamic range that imaging sensors have, resulting in low-contrast images. Enhancing such images reveals hidden details.

Various kinds of research on single-image enhancement have been reported [1, 2, 3, 4, 5]. Most image enhancement methods can be divided into two types: histogram equalization (HE)-based methods and Retinex-based methods. However, neither HE- nor Retinex-based methods can restore pixel values lost to quantizing and clipping. This problem leads to banding artifacts in enhanced images. For this reason, image enhancement methods based on convolutional neural networks (CNNs) are expected to solve the problem that traditional methods have.

CNNs have been successfully used in many image-to-image translation tasks such as single image super-resolution, image inpainting, and image segmentation. Recently, CNN-based image enhancement methods have also been developed [6, 7, 8, 9, 10, 11]. Most of these methods employ a U-Net [12]-based network architecture. However, U-Net cannot extract global features of images when large input images are given or when U-Net itself does not have a sufficient number of layers. In this case, output images are often distorted because both local and global features are needed for image enhancement, whereas only local ones are needed for other tasks such as super-resolution.

In this paper, we propose a novel CNN architecture that considers local and global features for image enhancement. The proposed architecture consists of three networks: a local encoder, a global encoder, and a decoder. The use of the local encoder and the global encoder enables us to extract both local and global features, even when large images are given as inputs. In the decoder, these features are combined to produce enhanced images. The proposed CNNs can be trained with small images, and images of various sizes can be enhanced by the CNNs without distortions. In addition, we utilize tone-mapped images from existing high dynamic range (HDR) images for the training [13, 14]. Target LDR images mapped from HDR ones have better quality than those directly captured from cameras because HDR images contain more information.

We evaluate the effectiveness of the proposed image enhancement network in terms of the quality of enhanced images in a number of simulations, where the tone mapped image quality index (TMQI), discrete entropy, naturalness image quality evaluator (NIQE), and blind/referenceless image spatial quality evaluator (BRISQUE) are utilized as quality metrics. Experimental results show that the proposed method outperforms state-of-the-art contrast enhancement methods in terms of those quality metrics. Furthermore, the proposed method does not cause distortions in output images, unlike CNNs that do not consider global features.

2 Related work

Many image enhancement methods have been studied [1, 2, 3, 4, 5, 15]. Among these methods, HE has received the most attention because of its intuitive implementation and high efficiency. It aims to derive a mapping function such that the entropy of the distribution of output luminance values is maximized. However, HE often causes over-enhancement. To avoid this, numerous improved methods based on HE have also been developed [1, 2]. Another way to enhance images is to use the Retinex theory [16]. Retinex-based methods [3, 4] decompose images into reflectance and illumination, and then enhance images by manipulating illumination. However, without the use of CNNs, these methods cannot restore pixel values lost to clipping and quantizing.

Recently, a few CNN-based methods were proposed for single image enhancement [6, 7, 8, 9, 10, 11]. Chen’s method [6] provides high-quality RGB images from single raw images taken under low-light conditions, but this method cannot be applied to RGB color images that are not raw images. Yang et al. [7] proposed a method for enhancing RGB images by using two CNNs. This method generates intermediate HDR images from input RGB images, and then produces high-quality LDR images. However, generating HDR images from single images is a well-known open problem [17, 18]. For this reason, the performance of Yang’s method is limited by the quality of the intermediate HDR images. Furthermore, network architectures of existing methods [6, 7, 8, 9] only handle local image information, so output images can be distorted by CNNs due to a lack of the global image information required for image enhancement, which is addressed later in this paper. Therefore, we aim to improve the performance of image enhancement by using a novel network architecture that considers both local and global image information.

3 Proposed method

Figure 1 shows an overview of our training and prediction procedures. In the training, all input LDR images $x$ and target LDR images $y$ are generated from HDR images by using a virtual camera [17] and tone mapping, respectively. This enables us to generate high-quality target images.

After the training, various LDR images are fed to the proposed CNNs as input images, and the CNNs then generate high-quality LDR images. Detailed training conditions are described in Section 3.2.

3.1 Network architecture

Figure 2 shows the overall network architecture of the proposed method. The architecture consists of three networks: a local encoder, a global encoder, and a decoder. The local encoder and the decoder in the proposed method are almost the same as those used in U-Net [12]. Concatenated skip connections between the local encoder and the decoder are also utilized, as in U-Net. Although U-Net works very well for various image-to-image translation problems, its use for image enhancement often causes distortions in output images [19]. This is because its network architecture cannot handle global image information (see Section 4). For this reason, we utilize the global encoder and combine features extracted by both encoders to prevent the distortions. The input for the local encoder is an $H\times W$-pixel 24-bit color LDR image. For the global encoder, the input image is resized to a fixed size ( $128\times 128$ ).

The proposed CNNs have five types of layers as shown in Fig. 2:

**$3\times 3$ Conv. + BN + ReLU**

which calculates a $3\times 3$ convolution with a stride of $1$ and a padding of $1$ . After convolution, batch normalization [20] (BN) and the rectified linear unit activation function [21] (ReLU) are applied. In the local encoder and the decoder, two adjacent $3\times 3$ Conv. + BN + ReLU layers have the same number $K$ of filters. From the first two layers to the last ones, the numbers of filters are $K=32$ , $64$ , $128$ , $256$ , $512$ , $256$ , $128$ , $64$ , and $32$ , respectively. In the global encoder, all layers have $64$ filters.

**$2\times 2$ Max pool**

which downsamples feature maps by max pooling with a kernel size of $2\times 2$ and a stride of $2$ .

**$4\times 4$ Transposed Conv. + BN + ReLU**

which calculates a $4\times 4$ convolution with a stride of $1/2$ and a padding of $1$ . After convolution, BN and ReLU are applied. From the first layer to the last one, the numbers of filters are $K=256$ , $128$ , $64$ , and $32$ , respectively.

**$3\times 3$ Conv. + ReLU**

which calculates a $3\times 3$ convolution with a stride of $1$ and a padding of $1$ . After convolution, ReLU is applied. The number of filters in the layer is $3$ .

**$4\times 4$ Conv. + BN + ReLU (w/o padding)**

which calculates a $4\times 4$ convolution without padding. The number of filters in the layer is $64$ .
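The effect of these layer types on feature-map size can be checked with a small shape calculation. The sketch below is illustrative only and is not the authors' code: it treats the stride of $1/2$ as a transposed convolution that upsamples by a factor of 2, and the helper names (`conv_out`, `transposed_conv_out`, `local_path`, `global_path`) are hypothetical. The number of pooling stages is not stated in the text, so four stages in the local encoder (matching the four transposed convolution layers) and five stages in the global encoder (so that the $128\times 128$ input reaches $4\times 4$ before the unpadded $4\times 4$ convolution) are assumptions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def transposed_conv_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


def local_path(size, num_pools=4):
    """Trace the spatial size through the local encoder and the decoder.

    num_pools=4 is an assumption inferred from the four transposed
    convolution layers (K = 256, 128, 64, 32) listed in the text.
    """
    sizes = [size]
    for _ in range(num_pools):
        size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps the size
        size = conv_out(size, kernel=2, stride=2)             # 2x2 max pool halves it
        sizes.append(size)
    for _ in range(num_pools):
        # 4x4 transposed conv, upsampling factor 2, padding 1: doubles the size
        size = transposed_conv_out(size, kernel=4, stride=2, padding=1)
        sizes.append(size)
    return sizes


def global_path(size=128, num_pools=5):
    """Trace the spatial size through the global encoder.

    num_pools=5 is an assumption: it reduces the fixed 128x128 input
    to 4x4, so the final unpadded 4x4 convolution yields a 1x1 map.
    """
    for _ in range(num_pools):
        size = conv_out(size, kernel=3, stride=1, padding=1)
        size = conv_out(size, kernel=2, stride=2)
    return conv_out(size, kernel=4)  # 4x4 conv without padding


print(local_path(256))  # [256, 128, 64, 32, 16, 32, 64, 128, 256]
print(global_path())    # 1
```

Under these assumptions the local path returns to the input resolution for any size divisible by 16, while the global path always yields a single spatial position because its input is resized to a fixed size.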

3.2 Training

Numerous LDR images taken under various conditions, $x$ , and corresponding high-quality images, $y$ , are needed to train the CNNs in the proposed method. However, collecting a sufficient number of such image pairs is difficult. We therefore utilize HDR images $E$ to generate both $x$ and $y$ by using a virtual camera [17] and a tone mapping operation with Mertens’s fusion method [22], respectively. The use of HDR images also makes it possible to generate better-quality images than those directly captured with cameras. For training, 831 HDR images were collected from databases available online [23, 24, 25, 26, 27, 28].

The training procedure of our CNNs is as follows.

i. Select 16 HDR images from the 831 HDR images at random.

ii. Generate 16 pairs of input and target LDR images ( $x$ , $y$ ), one pair from each selected HDR image. Each pair is generated in accordance with the following steps.

(a) Crop HDR image $E$ to an image patch $\tilde{E}$ of $N\times N$ pixels. The size $N$ is given as the product of a uniform random number in the range $[0.2,0.6]$ and the length of the short side of $E$ . The position of the patch in $E$ is also determined at random.

(b) Resize $\tilde{E}$ to $256\times 256$ pixels.

(c) Flip $\tilde{E}$ horizontally or vertically with a probability of 0.5.

(d) Calculate exposure $X$ from $\tilde{E}$ by $X_{i,j}=\Delta t(v)\cdot\tilde{E}_{i,j}$ , where $(i,j)$ denotes a pixel. Shutter speed $\Delta t$ is calculated as $\Delta t(v)=0.18\cdot 2^{v}/G(\tilde{E})$ as in [13] by using a uniform random number $v$ in the range $[-4,0]$ , where $G(\tilde{E})$ is the geometric mean of the luminance of $\tilde{E}$ .

(e) Generate an input LDR image $x$ from $X$ by using a virtual camera $\tilde{f}$ , as

$$x_{i,j}=\tilde{f}(X_{i,j})=\min\left((1+\eta)\frac{X_{i,j}^{\gamma}}{X_{i,j}^{\gamma}+\eta},\,1\right) \tag{1}$$

where $\eta$ and $\gamma$ are random numbers that follow normal distributions with a mean of $0.6$ and a variance of $0.1$ and with a mean of $0.9$ and a variance of $0.1$ , respectively. Note that eq. (1) is applied to the luminance of $X$ , and RGB pixel values of $x$ are then obtained so that the RGB value ratios of $x$ are equal to those of $X$ .

(f) Generate a target LDR image $y$ from $\tilde{E}$ by using a tone mapping operator $\hat{f}$ as $y=\hat{f}(\tilde{E})$ . Here, Mertens’s multi-exposure fusion method [22] is used in the tone mapping operator: $\hat{f}(\tilde{E})=\mathcal{F}(Y^{(-2)},Y^{(0)},Y^{(2)})$ , where, similarly to $X$ , exposure $Y^{(u)}_{i,j}$ is given as $Y^{(u)}_{i,j}=\Delta t(u)\tilde{E}_{i,j}$ , and $\mathcal{F}(Y^{(-2)},Y^{(0)},Y^{(2)})$ indicates a function for fusing the three exposures by Mertens’s method.

iii. Predict 16 LDR images $\hat{y}$ from 16 input LDR images $x$ by using the CNNs.

iv. Evaluate errors between predicted images $\hat{y}$ and target images $y$ by using the mean squared error.

v. Update filter weights $\omega$ and biases $b$ in the CNNs by back-propagation.
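Steps ii(a), ii(d), and ii(e) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: luminance is represented as a flat list of positive floats rather than an image array, the small `eps` guard in the geometric mean is an addition of this sketch, and all function names are hypothetical. Resizing, flipping, RGB ratio preservation, the sampling of $\eta$ and $\gamma$, and Mertens's fusion are omitted.

```python
import math
import random


def crop_size(short_side, rng):
    """Step ii(a): N = r * short side, with r uniform in [0.2, 0.6]."""
    return int(rng.uniform(0.2, 0.6) * short_side)


def geometric_mean(luminance, eps=1e-6):
    """Geometric mean G of luminance; eps (an assumption) avoids log(0)."""
    logs = [math.log(max(value, eps)) for value in luminance]
    return math.exp(sum(logs) / len(logs))


def exposure(luminance, v):
    """Step ii(d): X = dt(v) * E with dt(v) = 0.18 * 2**v / G(E)."""
    dt = 0.18 * 2.0 ** v / geometric_mean(luminance)
    return [dt * value for value in luminance]


def virtual_camera(X, eta, gamma):
    """Eq. (1): x = min((1 + eta) * X**gamma / (X**gamma + eta), 1)."""
    out = []
    for value in X:
        p = value ** gamma
        out.append(min((1.0 + eta) * p / (p + eta), 1.0))
    return out


rng = random.Random(0)
patch_luminance = [0.02, 0.5, 3.0, 40.0]  # toy HDR luminance values
v = rng.uniform(-4.0, 0.0)                # exposure parameter from [-4, 0]
X = exposure(patch_luminance, v)
x = virtual_camera(X, eta=0.6, gamma=0.9)  # means of the stated distributions
print(all(0.0 <= value <= 1.0 for value in x))  # True
```

With `v = 0`, the geometric mean of the exposure `X` equals 0.18; negative `v` darkens the input, which is how under-exposed training inputs are produced. The target exposures $Y^{(u)}$ in step ii(f) follow from the same `exposure` function with `v` set to $-2$, $0$, and $2$.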

In our experiments, the CNNs were trained for 500 epochs, where the above procedure was repeated 51 times in each epoch. In addition, each HDR image could be selected at most once in Step i in each epoch. He’s method [29] was used for initializing the CNNs. In addition, the Adam optimizer [30] was utilized for optimization, where the parameters in Adam were set as $\alpha=0.002,\beta_{1}=0.9$ , and $\beta_{2}=0.999$ .
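The schedule implies some simple arithmetic: 51 iterations of 16 images use 816 of the 831 HDR images per epoch, and 500 epochs give 25,500 weight updates. The sketch below shows one way to draw such disjoint batches by shuffling once per epoch. The function name `epoch_batches` and the shuffle-and-slice strategy are assumptions of this sketch; the text only states that each image can be selected once per epoch.

```python
import random


def epoch_batches(num_images=831, batch_size=16, iterations=51, seed=None):
    """Draw disjoint batches so each image is used at most once per epoch."""
    if batch_size * iterations > num_images:
        raise ValueError("not enough images for disjoint batches")
    rng = random.Random(seed)
    order = list(range(num_images))
    rng.shuffle(order)  # one random permutation per epoch
    return [order[k * batch_size:(k + 1) * batch_size] for k in range(iterations)]


batches = epoch_batches(seed=0)
used = {index for batch in batches for index in batch}
print(len(batches), len(used), 831 - len(used))  # 51 816 15
print(500 * len(batches))                        # 25500
```

Because a new permutation is drawn each epoch, the 15 images left out differ from epoch to epoch.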

4 Simulation

We evaluated the effectiveness of the proposed method by using four objective quality metrics.

4.1 Simulation conditions

In this experiment, test LDR images were generated from 283 HDR images that were not used for training, in accordance with Steps ii(a)–(e). In addition, test LDR images were resized to $512\times 512$ pixels in Step ii(b).

The quality of LDR images $\hat{y}$ generated by the proposed method was evaluated by four objective quality metrics: the tone mapped image quality index (TMQI) [31], discrete entropy, the naturalness image quality evaluator (NIQE) [32], and the blind/referenceless image spatial quality evaluator (BRISQUE) [33], where the original HDR image $\tilde{E}$ was utilized as a reference for TMQI, and a database [34, 35] was used to build a model in BRISQUE.
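Of the four metrics, discrete entropy is simple enough to sketch directly. The version below assumes the entropy is computed in bits over the histogram of 8-bit luminance levels, which bounds it at 8; the text does not specify the exact setup, so this is an illustrative assumption, and `discrete_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def discrete_entropy(levels):
    """Shannon entropy (in bits) of the empirical distribution of pixel levels."""
    n = len(levels)
    counts = Counter(levels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


print(discrete_entropy([0, 255]))         # 1.0 (two equally likely levels)
print(discrete_entropy(list(range(256)))) # 8.0 (uniform over 256 levels)
```

A flat, low-contrast image concentrates its histogram in few levels and scores low, whereas an image that spreads luminance across many levels scores close to 8.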

The proposed method was compared with five conventional methods: histogram equalization (HE), contrast-accumulated histogram equalization (CACHE) [2], simultaneous reflectance and illumination estimation (SRIE) [4], low-light image enhancement via illumination map estimation (LIME) [3], and deep reciprocating HDR transformation (DRHT) [7], where SRIE and LIME are Retinex-based methods and DRHT is a CNN-based one. For DRHT, an approximate implementation at https://github.com/ybsong00/DRHT was utilized.

4.2 Results

Table 1 shows the average scores of the objective assessment for the 283 images, in terms of TMQI, entropy, NIQE, and BRISQUE. In the case of TMQI, a larger value means a higher similarity between a target LDR image and an original HDR image. A larger value for entropy and BRISQUE indicates that a target LDR image has higher quality. By contrast, a smaller value for NIQE indicates that a target LDR image has fewer distortions such as noise or blur. As shown in Table 1, the proposed method provided the best average scores for entropy, NIQE, and BRISQUE among the six methods. HE provided the highest TMQI score (0.8834, against 0.8556 for the proposed method), but HE cannot restore lost pixel values and is well known to often cause over-enhancement in bright areas [9]. For these reasons, we conclude that the proposed method outperforms the conventional methods overall in terms of the four metrics.
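The ranking stated above can be reproduced from the Table 1 values. The sketch below hard-codes the six methods' scores from the table and applies the directions given in the text (larger is better for TMQI, entropy, and BRISQUE; smaller is better for NIQE). The dictionary layout and the name `best_method` are illustrative choices of this sketch.

```python
# Average scores copied from Table 1 (the "Input" column is omitted
# because it is not one of the six enhancement methods).
scores = {
    "TMQI": {"HE": 0.8834, "CACHE": 0.8531, "SRIE": 0.8109,
             "LIME": 0.8407, "DRHT": 0.7800, "Proposed": 0.8556},
    "Entropy": {"HE": 6.0027, "CACHE": 6.3761, "SRIE": 5.3209,
                "LIME": 6.3984, "DRHT": 5.0554, "Proposed": 6.4126},
    "NIQE": {"HE": 5.5537, "CACHE": 5.0932, "SRIE": 4.8992,
             "LIME": 5.0704, "DRHT": 9.5137, "Proposed": 4.7207},
    "BRISQUE": {"HE": 45.4301, "CACHE": 45.4564, "SRIE": 45.2134,
                "LIME": 45.3957, "DRHT": 44.6974, "Proposed": 45.7993},
}

# Direction of each metric as described in the text of Section 4.2.
larger_is_better = {"TMQI": True, "Entropy": True, "NIQE": False, "BRISQUE": True}


def best_method(metric):
    """Return the method with the best average score for the given metric."""
    pick = max if larger_is_better[metric] else min
    return pick(scores[metric], key=scores[metric].get)


for metric in scores:
    print(metric, best_method(metric))
# TMQI HE
# Entropy Proposed
# NIQE Proposed
# BRISQUE Proposed
```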

Figure 3 shows an example of the images enhanced by the six methods. From the figure, it is confirmed that the proposed method produced a higher-quality image that clearly represents dark areas in the image. The clarity of the image generated by DRHT is low due to the difficulty of generating an intermediate HDR image. In addition, the proposed method enhanced images without the banding artifacts that arise from quantized pixel values in the input images, whereas images enhanced by HE and CACHE include banding artifacts, as shown in Fig. 4. This is because CNNs enhance images by taking into account local information around each pixel in images, whereas HE-based methods perform pixel-wise operations. This result indicates that the proposed method can not only enhance images, but also restore pixel values lost to quantization.

Figure 5 illustrates the effectiveness of the global encoder in the proposed network architecture. When the global encoder was not employed, CNNs caused images to be distorted, creating block-like artifacts. In contrast, the distortions were prevented when the global encoder was employed. Hence, the global encoder is effective in preventing distortions in enhanced images.

An enhancement result for an image that was directly captured with a digital camera is shown in Fig. 6, where the input image was selected from [36]. From the figure, it is confirmed that the proposed method is also effective for ordinary images, even when only HDR images are used to train the CNNs.

These experimental results show that the proposed network architecture is effective for enhancing single images.

5 Conclusion

In this paper, we proposed a novel CNN architecture considering both local and global features for image enhancement. The proposed architecture consists of a local encoder, a global encoder, and a decoder. The global encoder can effectively prevent distortions due to the lack of global image information required for image enhancement. Furthermore, the use of HDR images for training enables us to obtain higher-quality target images than those captured with cameras. Experimental results showed that the proposed method outperformed state-of-the-art conventional image enhancement methods including Retinex- and CNN-based methods, in terms of TMQI, entropy, NIQE, and BRISQUE. In addition, visual comparison results demonstrated that the proposed method can effectively enhance images without distortions such as banding and block artifacts. The performance of the proposed method depends on that of tone mapping. Hence, developing high-performance tone mapping operators will also be useful for image enhancement.

Bibliography

[1] K. Zuiderveld, “Contrast Limited Adaptive Histogram Equalization,” in Graphics Gems IV, Elsevier, 1994, pp. 474–485.
[2] X. Wu, X. Liu, K. Hiramatsu, and K. Kashino, “Contrast-accumulated histogram equalization for image enhancement,” in Proceedings of IEEE International Conference on Image Processing, Sep 2017, pp. 3190–3194.
[3] X. Guo, Y. Li, and H. Ling, “LIME: Low-Light Image Enhancement via Illumination Map Estimation,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 982–993, Feb 2017.
[4] X. Fu, D. Zeng, Y. Huang, X.-P. Zhang, and X. Ding, “A Weighted Variational Model for Simultaneous Reflectance and Illumination Estimation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Jun 2016, pp. 2782–2790.
[5] Y. Kinoshita and H. Kiya, “Automatic exposure compensation using an image segmentation method for single-image-based multi-exposure fusion,” APSIPA Transactions on Signal and Information Processing, vol. 7, e22, Dec 2018.
[6] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to See in the Dark,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Jun 2018, pp. 3291–3300.
[7] X. Yang, K. Xu, Y. Song, Q. Zhang, X. Wei, and R. W. Lau, “Image Correction via Deep Reciprocating HDR Transformation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Jun 2018, pp. 1798–1807.
[8] J. Cai, S. Gu, and L. Zhang, “Learning a Deep Single Image Contrast Enhancer from Multi-Exposure Images,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 2049–2062, Apr 2018.