Deep Semantics-Aware Photo Adjustment
Seonghyeon Nam, Seon Joo Kim

TL;DR
This paper introduces a deep neural network for semantics-aware photo adjustment that models scene context and user preferences, outperforming previous methods in quality and customization capabilities.
Contribution
It proposes a novel deep learning approach using bilinear models and semantic adjustment maps for improved, user-customizable photo retouching.
Findings
Outperforms existing methods quantitatively and qualitatively.
Enables user customization of photo adjustments.
Effectively models scene context for photo retouching.
Table 1: Distance to the ground truth in the Lab color space for each effect (lower is better).
| Method | Foreground Pop-Out | Local Xpro | Watercolor |
| --- | --- | --- | --- |
| Input | 13.86 | 19.71 | 15.30 |
| Yan et al. [4] | 7.08 | 7.43 | 7.20 |
| SA-AdjustNet+MSE | 7.16 | 7.06 | 6.92 |
| SA-AdjustNet+Huber | 6.59 | 6.97 | 6.81 |
| SA-AdjustNet+Huber+MT | 5.92 | 6.66 | 6.75 |
| SA-AdjustNet+Huber+MT+S | 5.86 | 7.03 | 6.83 |
Deep Semantics-Aware Photo Adjustment
Seonghyeon Nam
Department of Computer Science
Yonsei University
Seon Joo Kim
Department of Computer Science
Yonsei University
Abstract
Automatic photo adjustment aims to mimic the photo retouching style of professional photographers and automatically adjust photos to the learned style. There have been many attempts to model the tone and the color adjustment globally with low-level color statistics. Also, spatially varying photo adjustment methods have been studied by exploiting high-level features and semantic label maps. Those methods are semantics-aware since the color mapping is dependent on the high-level semantic context. However, their performance is limited by the pre-computed hand-crafted features, and it is hard to reflect a user’s preference in the adjustment. In this paper, we propose a deep neural network that models the semantics-aware photo adjustment. The proposed network exploits bilinear models that are the multiplicative interaction of the color and the contextual features. As the contextual features, we propose the semantic adjustment map, which discovers the inherent photo retouching presets that are applied according to the scene context. The proposed method is trained using a robust loss with a scene parsing task. The experimental results show that the proposed method outperforms the existing method both quantitatively and qualitatively. The proposed method also provides users a way to retouch photos to their own liking by giving customized adjustment maps.
1 Introduction
With the growing number of digital cameras, especially in smartphones, photo retouching software has become popular among amateur photographers. As captured photos are usually flat, many people want to adjust the tone and the color of their photos to make them look more visually impressive or even stylized. However, photo retouching is a hard task for amateur users without expertise in photo editing. Additionally, retouching a large photo collection requires extensive human labor.
For this reason, many techniques for automatic photo adjustment have been widely studied. The automatic photo adjustment automatically enhances photos’ tone and color to be visually more pleasing without human actions. In the automatic photo retouching, the output styles mimic the photo styles of professional photographers. Several methods have been proposed to adjust the contrast/brightness and the color/saturation of photos [1, 2] based on low-level color histogram, the brightness, and the contrast of images. However, those methods adjust photos globally by applying the same color mapping to all pixels in an image. Note that most photographers prefer locally varying adjustments in their work.
Some works have focused on spatially varying photo adjustment that exploits high-level scene contexts based on the object features and the saliency [3, 4]. In [4], the authors use a feed-forward neural network to learn the semantics-aware photo adjustment styles of professional photographers. In the semantics-aware photo adjustment, the tone and the color mapping are dependent on the scene context, which is a local region of a given image. The authors proposed multi-scale pooling features of the semantic label map to model the context dependency. However, the work uses hand-designed features, and it is unclear whether hand-designed features based on an inaccurate semantic label map are optimal. In addition, the learned representation of the method is not separated, and therefore users cannot control the adjustment according to their own preference.
In this paper, we propose a deep neural network (DNN) that learns the representation of the semantics-aware photo adjustment in an end-to-end manner. While we make use of the dataset from [4], we approach the problem in a different way. First, the proposed network is trained in an end-to-end manner so that it fits the data better. Our network is a bilinear model in which the color and the contextual information interact in a multiplicative way. We exploit multi-scale convolutional neural network (CNN) features to characterize pixel-wise contextual features. Unlike [4], the contextual features are learned within the network in an end-to-end manner. To efficiently train the network, we make use of a robust loss function and the multi-task learning with a scene parsing task. Second, as another type of contextual features, we introduce a semantic adjustment map. The semantic adjustment map is a binary segmentation map that discovers the photo retouching presets which vary according to the semantic contexts. The network automatically disentangles different types of presets from the original in an unsupervised manner and adjusts images accordingly. By doing so, we can better understand the photo retouching styles and use the discovered presets to adjust the photos to each user’s preference. Note that our photo adjustment framework is different from the image style transfer [5] that stylizes photos to look like artworks. Instead of focusing on the global modification of shapes and textures, we focus on the tone and the color manipulation of images.
2 Related works
There have been a number of studies on automatic photo adjustment. Several methods focus on the global tonal adjustment [1], the color enhancement [6], and the personalized enhancement [2]. Those methods are global adjustment approaches based on hand-crafted low-level features such as the color histogram, the scene brightness, and the highlight clipping. In [2], Kapoor et al. proposed a method that discovers the clusters of users that have similar preferences of image enhancement for the personalized adjustment. While the concept of our method may be similar to those methods, the main difference is that we aim to discover the retouching presets that vary according to the local semantics.
Hwang et al. [3] presented a locally varying photo enhancement method that is based on both low- and high-level contexts. Their method finds an appropriate color mapping from external images using pixel-wise contextual features. The work of Yan et al. [4] is closely related to our work. The authors combine multiple hand-crafted features including a multi-scale pooling of a scene parsing map for semantics-aware color regression. While the multi-scale pooling features were effective in modelling the semantics-aware photo adjustment, the performance is limited to the quality of the scene parsing map since the features are not trained in an end-to-end manner.
Our method is also related to various deep learning based semantics-aware image processing methods. Tsai et al. [7] used a scene parsing deep network to localize a sky region and transfer a different style of sky from external images. In [8], the authors propose a DNN for image harmonization, which is an encoder-to-decoder network to exploit high-level contextual features. The DNN is jointly trained with a scene parsing task to improve the training. In contrast to [8], our method does not rely on the segmentation mask and rather finds the inherent segmentation masks from the data. Deep learning based colorization methods [9, 10] are also related to our work in that the methods make use of rich contextual features of CNNs to estimate the color of a pixel according to the scene context. Unlike those methods, we do not reconstruct missing color channels, and the color mapping of pixels is consistent in a semantic region.
3 Method
3.1 Overview
We define the semantics-aware photo adjustment problem as a regression problem. We want to find a regression model of the color mapping from the input color to the output color according to the semantic context that the input pixel belongs to. To this end, we propose a deep neural network that effectively learns the context dependent color mapping.
Figure 1 shows the overview of the proposed deep network. Our network is divided into two parts: a feature extraction network and a bilinear regression network. The feature extraction network is based on the ResNet-50 [11] as shown in Fig. 1 (a). The contextual features of the ResNet-50 are effective for modelling the semantics-aware color mapping, since we can exploit low- to high-level pixel-wise features that are pretrained on a large dataset. However, those convolutional features only describe the local context. For better context modelling, the global context and the relative compositional context between scene objects would be useful. Therefore, we add a spatial RNN to extract those global and relative contexts. We adopt the ReNet [12] that consists of 4 directional spatial RNN layers, followed by an additional 1×1 convolution. To avoid overfitting, we use GRU [13] as a spatial RNN cell with batch normalization [14].
The bilinear regression network shown in Figure 1 (b) estimates the output color given both the input color features and the contextual features. In the following, we describe the bilinear regression network in detail.
3.2 Bilinear model
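Bilinear models are the multiplicative interaction of all elements between two vectors [15, 16, 17]. Formally, a bilinear model is defined as
$$z = \mathbf{x}^\top \mathbf{W} \mathbf{y} = \sum_{i}\sum_{j} W_{ij}\, x_i\, y_j,$$
where $\mathbf{x}$, $\mathbf{y}$ are feature vectors, and $\mathbf{W}$ is the interaction between two vectors.
In the semantics-aware photo adjustment, it is natural to think that the color mapping is determined by two factors: one is the color of a pixel and the other is the scene context that the pixel belongs to. Therefore, we use the bilinear model to represent the interaction between both factors. Since $\mathbf{W}$ is usually high-dimensional, we follow the low-rank bilinear pooling method of Kim et al. [16] to reduce the parameters. Based on the method, the output color is represented as
$$\mathbf{f} = \mathbf{P}^\top \big( \sigma(\mathbf{U}^\top \mathbf{p} + \mathbf{b}_U) \circ \sigma(\mathbf{V}^\top \mathbf{c} + \mathbf{b}_V) \big) + \mathbf{b}_P,$$
where $\mathbf{p}$ is color features, $\mathbf{c}$ is context features, $\mathbf{U}$, $\mathbf{V}$, $\mathbf{P}$ are the decomposition of $\mathbf{W}$, and $\mathbf{b}_U$, $\mathbf{b}_V$, $\mathbf{b}_P$ are additional biases. $\circ$ is an element-wise multiplication and $\sigma$ is a nonlinear function. Note that $\mathbf{f}$ is actually a residual since we add a skip connection between the input color $\mathbf{l}$ and $\mathbf{f}$: $\hat{\mathbf{l}} = \mathbf{l} + \mathbf{f}$.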
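As a rough illustration of the low-rank bilinear regression described above, the following NumPy sketch computes the residual for a handful of pixels and adds it to the input color through the skip connection. The function name `low_rank_bilinear`, the feature dimensions, the random weights, and the choice of `tanh` as the nonlinearity are all assumptions made for this example; it is not the authors' implementation.

```python
import numpy as np

def low_rank_bilinear(color, context, U, V, P, b_u, b_v, b_p, act=np.tanh):
    """Residual color predicted from color and context features.

    color:   (N, Dc) color features of N pixels
    context: (N, Dx) contextual features of the same pixels
    U: (Dc, R), V: (Dx, R), P: (R, 3) low-rank factors (hypothetical shapes)
    act: nonlinearity; tanh is an assumption for this sketch
    """
    hc = act(color @ U + b_u)      # (N, R) projected color features
    hx = act(context @ V + b_v)    # (N, R) projected context features
    return (hc * hx) @ P + b_p     # element-wise product, then map to 3 channels

rng = np.random.default_rng(0)
N, Dc, Dx, R = 5, 3, 8, 4
lab_in = rng.normal(size=(N, 3))       # toy Lab colors
color = lab_in                         # simplest case: Lab color as color features
context = rng.normal(size=(N, Dx))     # toy contextual features
U, V, P = rng.normal(size=(Dc, R)), rng.normal(size=(Dx, R)), rng.normal(size=(R, 3))
b_u, b_v, b_p = np.zeros(R), np.zeros(R), np.zeros(3)

residual = low_rank_bilinear(color, context, U, V, P, b_u, b_v, b_p)
lab_out = lab_in + residual            # skip connection: output = input + residual
```

Because the two projected features are multiplied element-wise, the same input color can map to different outputs depending on the context vector, which is the behavior the section motivates.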
The method of Yan et al. [4] exploits an asymmetric form of bilinear model [15] by estimating affine transformation matrices to map quadratic color features to output colors. On the other hand, our method is more flexible and efficient in that our bilinear model learns the nonlinear interaction of two features as well as both feature representations. For both cases, it is clear that merging two features in a multiplicative manner is beneficial for the semantics-aware photo adjustment.
3.2.1 Color features
We use the CIELab color space for both the input and output images. We can use 3-channel Lab color as the color features. However, it generates color variations in smooth regions since each color is processed independently. To alleviate this issue, we add the local neighborhood information by concatenating the Lab color and the normalized first-layer convolutional feature maps of ResNet-50.
3.2.2 Contextual features
Convolutional features
We first take advantage of the multi-scale convolutional features. To generate pixel-wise features from the multi-scale feature maps, we adopt the sparse hypercolumn training method [18, 10], which requires far fewer parameters than the deconvolutional approaches [19, 20]. At training time, we generate many training signals by randomly sampling sparse pixels from the image for the backpropagation. Even when only a small dataset is available, this approach lets us exploit low- to high-level features efficiently.
We use the first 3 residual blocks for the hypercolumn, which have 256, 512, and 1024 channels, respectively. As mentioned, we additionally use spatial RNN features that have 1024 channels. We normalize each feature map by its norm, concatenate them, and squeeze the feature dimension to 512 by using a 1×1 convolution as shown in option 1 of Fig. 1 (b).
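The sparse hypercolumn step can be sketched as follows, under several assumptions: features are looked up by nearest neighbour in each lower-resolution map, the normalization is a per-pixel L2 norm of each block, and the toy maps use far fewer channels than the 256/512/1024/1024 reported in the text. The helper name `sparse_hypercolumn` and all shapes are hypothetical. A 1×1 convolution acts on each pixel independently, so on sampled pixels it reduces to a matrix product.

```python
import numpy as np

def sparse_hypercolumn(feature_maps, ys, xs, image_size):
    """Gather normalized per-pixel features from multi-scale maps.

    feature_maps: list of (H_i, W_i, C_i) arrays at different resolutions
    ys, xs: integer pixel coordinates in the full-resolution image
    image_size: (H, W) of the full-resolution image
    """
    H, W = image_size
    columns = []
    for fmap in feature_maps:
        h, w, _ = fmap.shape
        fy = np.minimum(ys * h // H, h - 1)   # nearest-neighbour row (assumption)
        fx = np.minimum(xs * w // W, w - 1)   # nearest-neighbour column
        feats = fmap[fy, fx]                  # (N, C_i)
        norm = np.linalg.norm(feats, axis=1, keepdims=True)
        columns.append(feats / np.maximum(norm, 1e-12))  # per-pixel L2 normalization
    return np.concatenate(columns, axis=1)

rng = np.random.default_rng(0)
H = W = 64
# Toy stand-ins for three residual-block outputs plus the spatial RNN output.
maps = [rng.normal(size=(32, 32, 8)), rng.normal(size=(16, 16, 16)),
        rng.normal(size=(8, 8, 32)), rng.normal(size=(8, 8, 32))]
ys = rng.integers(0, H, size=100)             # sparse sample of pixel rows
xs = rng.integers(0, W, size=100)             # sparse sample of pixel columns

hypercol = sparse_hypercolumn(maps, ys, xs, (H, W))   # (100, 88)
W_1x1 = rng.normal(size=(hypercol.shape[1], 16))      # 1x1 conv as a matrix
context = hypercol @ W_1x1                            # squeezed features, (100, 16)
```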
Semantic adjustment map
As the convolutional features are unconstrained and smooth, they can represent rich scene contexts. However, two real-valued bilinear features are highly correlated, and it is difficult to understand which factor contributes to a specific style of color mapping. It would be better if we can separate those factors not only to interpret the retouching styles according to the scene contexts, but to make use of those styles for our own taste.
To this end, we generate K-channel binary maps, in which each channel is a binary segmentation map marking the region that one of the retouching presets is applied to. For each pixel, a one-hot vector is a categorical random variable, which is defined as
$$\mathbf{s} \sim \mathrm{Cat}(\boldsymbol{\pi}), \qquad \boldsymbol{\pi} = (\pi_1, \dots, \pi_K), \qquad \sum_{k=1}^{K} \pi_k = 1,$$
where $\mathbf{s}$ is a one-hot vector sampled from a categorical probability density function $\mathrm{Cat}(\boldsymbol{\pi})$. $\pi_k$ is a probability of retouching a pixel using the k-th retouching preset. Similar to [21], we reformulate our regression loss using a variational lower bound technique, which is described as
$$\mathcal{L}_{reg} = \mathbb{E}_{\mathbf{s} \sim \mathrm{Cat}(\boldsymbol{\pi})}\big[\ell(\mathbf{s})\big] = \sum_{k=1}^{K} \pi_k\, \ell(\mathbf{e}_k),$$
where $\ell(\mathbf{s})$ is the regression loss of a pixel adjusted with the preset selected by $\mathbf{s}$, and $\mathbf{e}_k$ is the one-hot vector of the k-th preset.
In our task, K is typically small enough to compute the exact expectation if we assume that the pixels are independent of each other. In practice, however, it is likely that the problem converges to a local minimum in which all retouching styles are classified into one or two classes. This is because the number of training examples for each retouching style is imbalanced. In other words, the optimization is dominated by a few large classes such as the sky and the ground. In [22], the authors use a class reweighting trick for class-balanced classification. Similarly, we multiply each of the K loss terms by a different weight to alleviate the issue. In contrast to [22], we multiply the loss terms of low-frequency classes by small weights so that small classes are easily discovered in spite of relatively small training signals. The weight $w_k$ is defined as
[TABLE]
where $\alpha$ controls the contribution of the weight to the loss. $f_k$ is the moving average of the normalized soft frequencies of the K classes, computed from training batches and defined as
[TABLE]
where $\bar{\pi}_k^{(t)}$ is the average of $\pi_k$ for all pixels in the $t$-th batch. Our final regression loss is formulated as
$$\mathcal{L}_{reg} = \sum_{k=1}^{K} w_k\, \pi_k\, \ell(\mathbf{e}_k).$$
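The exact expectation over K presets can be sketched in a few lines of NumPy, assuming the per-pixel loss under each preset has already been computed. The function names, the toy numbers, and the averaging over pixels are assumptions for illustration. The class weights are passed in as a plain array because the weight formula itself is not reproduced here, and `batch_soft_frequency` only shows the per-batch average of the probabilities that feeds the moving average.

```python
import numpy as np

def expected_preset_loss(pi, per_preset_loss, weights=None):
    """Expected regression loss over K presets, averaged over pixels.

    pi:              (N, K) preset probabilities per pixel (rows sum to 1)
    per_preset_loss: (N, K) loss of pixel n if it is retouched with preset k
    weights:         optional (K,) class weights (supplied by the caller)
    """
    if weights is None:
        weights = np.ones(pi.shape[1])
    return float(np.mean(np.sum(weights * pi * per_preset_loss, axis=1)))

def batch_soft_frequency(pi):
    """Soft frequency of each preset in one batch: mean probability over pixels."""
    return pi.mean(axis=0)

pi = np.array([[0.9, 0.1],
               [0.2, 0.8]])            # two pixels, K = 2 presets
loss = np.array([[1.0, 3.0],
                 [2.0, 4.0]])          # toy per-preset losses

plain = expected_preset_loss(pi, loss)
weighted = expected_preset_loss(pi, loss, weights=np.array([1.0, 0.5]))
freq = batch_soft_frequency(pi)
```

Since K is small, summing over all presets is cheap and avoids sampling the one-hot vector during training.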
3.3 Huber loss
To generate the ground truth of adjusted photos, photographers use a segmentation tool to localize a region of a specific object to retouch. Although they thoroughly follow the procedure, some outliers may exist around object boundaries due to the incorrect segmentation. Also, the adjustment style of a photographer may not be consistent from an image to another image. Therefore, the optimization of our deep network should be robust to such outliers.
As a training objective, the $L_2$ loss is widely used in various color regression tasks [4, 10]. However, DNNs easily overfit to outliers since the gradient of the $L_2$ loss is large for those outlier samples and the optimization is dominated by them. As an alternative to $L_2$, the Huber loss [23] is more robust to outliers, which is defined as
$$L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le \delta, \\ \delta \left( |e| - \frac{1}{2}\delta \right) & \text{otherwise,} \end{cases}$$
where $e$ is error and $\delta$ is the changepoint between the two loss functions. The loss is quadratic for a small error $|e| \le \delta$, and linear for a large error $|e| > \delta$. As the gradient of the linear function is always $\pm\delta$, the contribution of outliers in the optimization is reduced.
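A minimal NumPy sketch of the standard Huber loss and its gradient makes the robustness argument concrete. The function names are illustrative, and the changepoint 0.04 is the value the paper reports in its hyperparameter settings.

```python
import numpy as np

def huber(error, delta):
    """Standard Huber loss: quadratic for |e| <= delta, linear beyond it."""
    abs_e = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_e - 0.5 * delta)
    return np.where(abs_e <= delta, quadratic, linear)

def huber_grad(error, delta):
    """Gradient of the Huber loss with respect to the error."""
    return np.clip(error, -delta, delta)

delta = 0.04
errors = np.array([0.01, 0.04, 1.0])    # the last value plays the role of an outlier
loss = huber(errors, delta)
grad = huber_grad(errors, delta)
l2_grad = errors                        # gradient of 0.5 * e^2, for comparison
```

For the outlier, the squared-error gradient grows with the error while the Huber gradient stays capped at the changepoint, so a few badly segmented boundary pixels cannot dominate an update.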
3.4 Multi-task learning
Unfortunately, getting a large labeled dataset for the photo adjustment is not easy, since photo editing requires tremendous human labor. When the proposed network is trained on such a small dataset, it is highly likely to overfit to a few specific scene contexts. Since pixel-wise semantic information is the key to our semantics-aware photo adjustment, the overfitting is very severe and results in inconsistent color mappings. To mitigate this problem, we simultaneously train a scene parsing task with our task as a regularization, thereby our deep network can be generalized to any scene contexts.
To train the scene parsing task, we use the SceneParse150 dataset [24], which consists of 150 semantic categories. As depicted in Fig. 1, we simply add a softmax layer on top of a contextual feature layer. Since our goal is not to make a good scene parsing network, our configuration is enough to regularize our main task. Also, our objective function changes to the following
$$\mathcal{L} = \mathcal{L}_{reg} + \lambda\, \mathcal{L}_{ce},$$
where $\mathcal{L}_{reg}$ is the regression loss, $\mathcal{L}_{ce}$ is a cross-entropy loss of the scene parsing task, and $\lambda$ is a regularization weight.
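The combined objective can be sketched as a regression loss plus a weighted softmax cross-entropy over the 150 scene categories. The function names and the toy logits are assumptions; the weight 0.01 is the value reported in the hyperparameter settings.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def multitask_loss(regression_loss, logits, labels, lam=0.01):
    """Regression loss regularized by the scene parsing cross-entropy."""
    return regression_loss + lam * softmax_cross_entropy(logits, labels)

logits = np.zeros((4, 150))            # toy logits: uniform over 150 categories
labels = np.array([0, 7, 42, 149])     # toy scene parsing labels
ce = softmax_cross_entropy(logits, labels)
total = multitask_loss(0.5, logits, labels)
```

With a small weight, the parsing term acts as a regularizer on the shared contextual features rather than as a second objective of equal importance.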
3.5 Implementation
We implemented the proposed method using TensorFlow running on a GeForce GTX 1080 GPU. With this setup, 500 epochs of training the network take only several hours.
Data augmentation
As the number of images in the dataset is small, the data augmentation is essential. To generate more training data, we randomly rotate the input images from -10 to 10 degrees and flip horizontally. We fill empty space by repeating pixel values of image boundaries to keep the dimension of the image as 512×512. As mentioned, we adopt the sparse training method [18, 10] that randomly samples a few pixels for the backpropagation. By doing this, we can generate many training examples from a small dataset. In our implementation, we randomly choose 2048 pixels from an image for the sparse training.
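The augmentation and sparse sampling steps might look like the NumPy sketch below. It makes several assumptions that the text does not state: rotation uses nearest-neighbour lookup, empty space is filled by clamping source coordinates to the image border, the same transform is applied to the input and its ground truth, the flip probability is 0.5, and pixels are sampled without replacement. All function names are hypothetical.

```python
import numpy as np

def rotate_edge_pad(image, degrees):
    """Rotate about the image centre, keeping the original size.

    Source coordinates that fall outside the image are clamped to the
    border, which repeats boundary pixel values in the empty corners.
    """
    h, w = image.shape[:2]
    theta = np.deg2rad(degrees)
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Inverse mapping: for each output pixel, find its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.clip(np.rint(src_x), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(src_y), 0, h - 1).astype(int)
    return image[src_y, src_x]

def augment(image, target, rng, max_degrees=10.0):
    """Apply one random rotation and an optional flip to an image pair."""
    degrees = rng.uniform(-max_degrees, max_degrees)
    image = rotate_edge_pad(image, degrees)
    target = rotate_edge_pad(target, degrees)
    if rng.random() < 0.5:                       # horizontal flip
        image, target = image[:, ::-1], target[:, ::-1]
    return image, target

def sample_pixels(h, w, n, rng):
    """Pick n distinct pixel locations for sparse backpropagation."""
    flat = rng.choice(h * w, size=n, replace=False)
    return flat // w, flat % w

rng = np.random.default_rng(0)
image = rng.random((512, 512, 3))                # toy input photo
target = image * 0.5                             # toy retouched ground truth
aug_in, aug_target = augment(image, target, rng)
ys, xs = sample_pixels(512, 512, 2048, rng)
train_in, train_target = aug_in[ys, xs], aug_target[ys, xs]   # (2048, 3) each
```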
Hyperparameters
We train the proposed network using the Adam [25] optimization method with a learning rate of 1e-4 and a batch size of 4. The ResNet-50 layers are finetuned with a 0.5× lower learning rate. After cross-validation, we set the parameter $\alpha$ that controls the class reweighting of the semantic adjustment map to 0.8, the changepoint $\delta$ of the Huber loss to 0.04, and the regularization weight $\lambda$ of the cross-entropy loss of the scene parsing task to 0.01. Determining the optimal number K is difficult as it is an unsupervised clustering problem. In our experiment, we found that K = 2, 4, and 2 for Foreground Pop-Out, Local Xpro, and Watercolor, respectively, are sufficient for both the quantitative and qualitative results.
4 Experiments
4.1 Dataset
As mentioned, we use the dataset from [4], which is the only publicly available dataset for the semantics-aware photo adjustment. It contains 115 images from Flickr, of which the larger dimension is 512 pixels. In [4], the authors select 70 images for training and the remaining 45 images for testing. We use the same training and testing sets for a fair comparison. However, we additionally hold out 10 images from the training set for validation. Therefore, our training set is actually smaller than that of [4].
In the dataset, there are 3 types of photo adjustment effects: Foreground Pop-Out, Local Xpro, and Watercolor. For the Foreground Pop-Out effect, the contrast and the color saturation of foreground salient objects are increased while those of background objects are decreased. The Local Xpro effect changes the brightness/contrast and the color of objects according to the predefined profiles for each semantic category. The adjustment of Watercolor is similar to that of Foreground Pop-Out except for an additional brush effect. In [4], the authors emulated the brush effect using superpixel segmentation [26]. As our objective is to model spatially varying color mapping, not texture, we follow the same procedure as in [4] for the brush effect.
4.2 Baselines
To show the effectiveness of the proposed method, we compare it with the method of Yan et al. [4]. As mentioned, we use the same training and the testing sets as described in [4] except for the validation set. We also compare various design choices of the proposed method. For the easy reading, we name the proposed deep network as Semantics-Aware Adjustment Network (SA-AdjustNet), and we compare several variations of the SA-AdjustNet: SA-AdjustNet+MSE, SA-AdjustNet+Huber, SA-AdjustNet+Huber+MT, and SA-AdjustNet+Huber+MT+S. Each suffix after the name is the variation applied. MSE and Huber refer to the type of regression loss function, MT is the multi-task learning, and S indicates the network uses the semantic adjustment map as the contextual features. The networks without S use the convolutional features instead of the semantic adjustment map.
4.3 Experimental results
Quantitative analysis
Table 1 shows the quantitative results of the proposed method. The values in the table are $L_2$ distances in the Lab color space. In most cases, the performance of the SA-AdjustNet is better than the method of [4] since both the color and the contextual features of our method are jointly trained with the bilinear regression network. As shown in the table, the Huber loss and the multi-task learning are both effective for the regularization of the training of the proposed network. For the SA-AdjustNet+Huber+MT+S, the performance is competitive with that of the SA-AdjustNet+Huber+MT for the Foreground Pop-Out and Watercolor since the foreground and the background are balanced. However, the classes in the Local Xpro effect are diverse and imbalanced, and the optimal clustering is more difficult even if we use the class reweighting.
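A plausible way to compute such an error is sketched below, assuming the metric is the Euclidean distance between predicted and ground-truth Lab colors averaged over all pixels. The averaging scheme and the function name `mean_lab_l2` are assumptions, not the documented evaluation protocol.

```python
import numpy as np

def mean_lab_l2(pred_lab, target_lab):
    """Mean per-pixel Euclidean distance between two (H, W, 3) Lab images."""
    return float(np.linalg.norm(pred_lab - target_lab, axis=-1).mean())

pred = np.zeros((2, 2, 3))                        # toy prediction
target = np.tile(np.array([3.0, 4.0, 0.0]), (2, 2, 1))   # every pixel off by (3, 4, 0)
error = mean_lab_l2(pred, target)
```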
Qualitative analysis
Figure 2 shows some of the qualitative results from the test set. Each row of the figure shows one of the 3 kinds of photo adjustment styles: Foreground Pop-Out, Local Xpro, and Watercolor. In most cases, the adjusted images using the proposed method are more visually pleasing and closer to the ground truth than those of Yan et al. [4]. As shown in the house of the 3rd row of Fig. 2, the inconsistent color variation due to the incorrect segmentation is clearly reduced. Figure 3 shows some examples of the semantic adjustment map. The proposed network effectively discovers the inherent photo retouching styles. However, the semantic adjustment maps are discrete, which results in an abrupt change of color around incorrect semantic boundaries, as shown in the head of the man in Fig. 3. This problem could be mitigated by considering neighborhood dependent models such as conditional random fields.
4.4 Application: personalization of semantics-aware photo adjustment
Although the proposed method provides the users with automatically adjusted photos, some users may want their photos to be retouched according to their own preference. In the first row of Fig. 2, for example, a user may want only the color of the people to be changed. For such situations, we provide a way for the users to give their own adjustment maps to the system. Figure 4 shows some examples of the personalization. When the input image is forwarded, we substitute the extracted semantic adjustment map with the new adjustment map from the user. As shown in the figure, the proposed method effectively creates the personalized images adjusted by the user’s own style.
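The substitution step can be sketched as replacing the predicted K-channel one-hot map with one built from user-painted labels. The convention that a label of -1 means "keep the network's prediction for this pixel" is a hypothetical extension for this example, as is the function name `personalize`; the text only describes substituting the whole map.

```python
import numpy as np

def personalize(predicted_map, user_labels):
    """Override a predicted adjustment map with user-painted presets.

    predicted_map: (H, W, K) one-hot map extracted by the network
    user_labels:   (H, W) integers in [0, K), or -1 to keep the prediction
    """
    k = predicted_map.shape[-1]
    one_hot = np.eye(k, dtype=predicted_map.dtype)[np.maximum(user_labels, 0)]
    keep = (user_labels < 0)[..., None]          # pixels the user left untouched
    return np.where(keep, predicted_map, one_hot)

K = 2
predicted = np.zeros((2, 2, K))
predicted[..., 0] = 1.0                          # network assigns preset 0 everywhere
user = np.array([[-1, 1],
                 [1, -1]])                       # user paints preset 1 on two pixels
custom_map = personalize(predicted, user)
chosen = custom_map.argmax(axis=-1)
```

The customized map then takes the place of the extracted one as the contextual input of the bilinear regression network.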
5 Conclusion
In this paper, we proposed a deep neural network for the semantics-aware photo adjustment. The proposed network learns the bilinear relationship between the color and the spatially varying scene context. With the semantic adjustment map, we can discover the inherent photo retouching presets within a style and apply it for the personalized photo adjustment. To effectively train the network, we use a robust loss function and the multi-task learning with the scene parsing task. The experimental results show that the proposed network outperforms an existing method both quantitatively and qualitatively.
- [1] V. Bychkovsky, S. Paris, E. Chan, and F. Durand, “Learning photographic global tonal adjustment with a database of input/output image pairs,” in IEEE Proc. of CVPR, pp. 97–104, IEEE, 2011.
- [2] A. Kapoor, J. C. Caicedo, D. Lischinski, and S. B. Kang, “Collaborative personalization of image enhancement,” IJCV, vol. 108, no. 1-2, pp. 148–164, 2014.
- [3] S. Hwang, A. Kapoor, and S. Kang, “Context-based automatic local image enhancement,” Proc. of ECCV, pp. 569–582, 2012.
- [4] Z. Yan, H. Zhang, B. Wang, S. Paris, and Y. Yu, “Automatic photo adjustment using deep neural networks,” ACM TOG, vol. 35, no. 2, p. 11, 2016.
- [5] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in IEEE Proc. of CVPR, June 2016.
- [6] J. Yan, S. Lin, S. Bing Kang, and X. Tang, “A learning-to-rank approach for image color enhancement,” in IEEE Proc. of CVPR, pp. 2987–2994, 2014.
- [7] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, and M.-H. Yang, “Sky is not the limit: Semantic-aware sky replacement,” Proc. of SIGGRAPH, vol. 35, no. 4, 2016.
- [8] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang, “Deep image harmonization,” CoRR, vol. abs/1703.00069, 2017.
