Color Constancy Convolutional Autoencoder
Firas Laakom, Jenni Raitoharju, Alexandros Iosifidis, Jarno Nikkanen, Moncef Gabbouj

TL;DR
This paper explores pre-training methods using convolutional autoencoders to improve color constancy, addressing data scarcity and overfitting, and achieving competitive results with fewer parameters.
Contribution
Introduces two novel pre-training approaches based on convolutional autoencoders for color constancy, including an unsupervised and a semi-supervised method with a new composite-loss function.
Findings
Achieves competitive results with fewer parameters.
Addresses data scarcity in color constancy datasets.
Studies overfitting on diverse camera datasets.
Abstract
In this paper, we study the importance of pre-training for the generalization capability in the color constancy problem. We propose two novel approaches based on convolutional autoencoders: an unsupervised pre-training algorithm using a fine-tuned encoder and a semi-supervised pre-training algorithm using a novel composite-loss function. This enables us to address the data scarcity problem and achieve results competitive with the state of the art while requiring far fewer parameters on the ColorChecker RECommended dataset. We further study the over-fitting phenomenon on the recently introduced version of the INTEL-TUT Dataset for Camera Invariant Color Constancy Research, which has both field and non-field scenes acquired by three different camera models.

Table I: Results on the ColorChecker RECommended dataset (recovery angular error, in degrees).

| Method | Statistic-based | Learning-based | Best 25% | Mean | Med. | Tri. | Worst 25% |
|---|---|---|---|---|---|---|---|
| Grey-World [14] | ✓ | – | 5.0 | 9.7 | 10 | 10 | 13.7 |
| White-Patch [13] | ✓ | – | 2.2 | 9.1 | 6.7 | 7.8 | 18.9 |
| Shades-of-Gray [30] | ✓ | – | 2.3 | 7.3 | 6.8 | 6.9 | 12.8 |
| General-gray world [14] | ✓ | – | 2.0 | 6.6 | 5.9 | 6.1 | 12.4 |
| Pixel-based Gamut [31] | ✓ | – | 1.7 | 6.0 | 4.4 | 4.9 | 12.9 |
| Top-down [32] | ✓ | – | 2.3 | 6.0 | 4.6 | 5.0 | 10.2 |
| Spatial Correlations [34] | ✓ | – | 1.9 | 5.7 | 4.8 | 5.1 | 10.9 |
| Bottom-up [32] | ✓ | – | 2.3 | 5.6 | 4.9 | 5.1 | 10.2 |
| Edge-based Gamut [31] | ✓ | – | 0.7 | 5.5 | 3.3 | 3.9 | 13.8 |
| CC-GANs (Pix2Pix) [5] | – | ✓ | 1.2 | 3.6 | 2.8 | 3.1 | 7.2 |
| CC-GANs (CycleGAN) [5] | – | ✓ | 0.7 | 3.4 | 2.6 | 2.8 | 7.3 |
| CC-GANs (StarGAN) [5] | – | ✓ | 1.7 | 5.7 | 4.9 | 5.2 | 10.5 |
| FFCC (model Q) [33] | – | ✓ | 0.3 | 2.0 | 1.1 | 1.4 | 5.1 |
| DS-Net [3] | – | ✓ | 0.3 | 1.9 | 1.1 | 1.4 | 4.8 |
| CCC [4] | – | ✓ | 0.3 | 2.0 | 1.2 | 1.4 | 4.8 |
| Bianco CNN [1] | – | ✓ | 0.8 | 2.6 | 2.0 | 2.1 | 4.0 |
| FC4(SqueezeNet) [2] | – | ✓ | 0.4 | 1.7 | 1.2 | 1.3 | 3.8 |
| C3AE, fine-tuned | – | ✓ | 0.8 | 2.1 | 1.9 | 2.0 | 4.0 |
| C3AE, composite-loss | – | ✓ | 0.8 | 2.3 | 2.0 | 2.0 | 3.9 |

Table III: Results on the INTEL-TUT2 dataset (recovery angular error, in degrees).

| Method | Set | Best 25% | Mean | Med. | Tri. | Worst 25% |
|---|---|---|---|---|---|---|
| Bianco [1] | training | 0.3 | 1.7 | 1.1 | 1.3 | 4.0 |
| Bianco [1] | field | 1.1 | 4.5 | 3.7 | 3.8 | 9.2 |
| Bianco [1] | non-field | 1.8 | 6.2 | 5.3 | 5.5 | 12.4 |
| FC4 (SqueezeNet) [2] | training | 0.6 | 1.6 | 1.7 | 2.1 | 4.5 |
| FC4 (SqueezeNet) [2] | field | 1.7 | 4.3 | 4.1 | 4.2 | 7.4 |
| FC4 (SqueezeNet) [2] | non-field | 1.5 | 4.8 | 4.2 | 4.3 | 9.0 |
| C3AE, fine-tuned | training | 0.8 | 3.0 | 2.4 | 2.6 | 6.2 |
| C3AE, fine-tuned | field | 1.6 | 4.4 | 4.0 | 4.2 | 7.9 |
| C3AE, fine-tuned | non-field | 1.6 | 5.2 | 4.6 | 4.7 | 10.1 |
| C3AE, composite-loss | training | 0.7 | 4.7 | 2.6 | 3.3 | 12.0 |
| C3AE, composite-loss | field | 2.0 | 6.1 | 5.3 | 5.4 | 10.7 |
| C3AE, composite-loss | non-field | 1.9 | 6.2 | 5.3 | 5.4 | 14.4 |
| C3AE, w/o pre-training | training | 0.5 | 1.6 | 1.6 | 1.9 | 10.6 |
| C3AE, w/o pre-training | field | 4.1 | 6.5 | 6.3 | 7.4 | 14.7 |
| C3AE, w/o pre-training | non-field | 4.9 | 7.3 | 7.3 | 8.3 | 20.4 |

Firas Laakom†, Jenni Raitoharju†, Alexandros Iosifidis⋆, Jarno Nikkanen††, Moncef Gabbouj†
† Tampere University, Faculty of Information Technology and Communication Sciences, Finland
⋆ Aarhus University, Department of Engineering, Denmark
†† Intel Corporation, Finland
This work was supported by the NSF-Business Finland Center for Visual and Decision Informatics project (CVDI) sponsored by Intel Finland.
Index Terms:
Color constancy, illumination estimation, pre-training, convolutional autoencoders
I Introduction
Objects exhibit different colors under various light sources. The goal of color constancy algorithms is to remove this effect. This can be done by first estimating the color of the light source and using this illuminant estimate to transform the image as if it was taken under a neutral white light source. The aim of this transformation is not to scale the brightness level of the image, as color constancy methods only correct for the chromaticity of the light source.
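Suppose we have the color of the unknown light source $e(x,y,\lambda)$, the surface reflectance at location $(x,y)$, $s(x,y,\lambda)$, and the camera sensitivity function $c_k(\lambda)$, where $k$ is the color channel, i.e., $k \in \{R, G, B\}$. Then the measured image color values $f_k(x,y)$ at every pixel can be expressed as

$$f_k(x,y) = \int e(x,y,\lambda)\, s(x,y,\lambda)\, c_k(\lambda)\, d\lambda, \qquad (1)$$

where $\lambda$ is the wavelength. Color constancy methods aim to estimate the color of the scene illuminant, i.e., the projection of $e(x,y,\lambda)$ on the sensor spectral sensitivities:

$$e_k(x,y) = \int e(x,y,\lambda)\, c_k(\lambda)\, d\lambda. \qquad (2)$$

This problem is usually simplified by assuming a uniform light source color across the scene, i.e., $e(x,y,\lambda) = e(\lambda)$.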
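As a rough numerical illustration of the image formation model above, the sketch below replaces the integral over wavelength with a sum over four equally spaced samples. The spectra, the sampling step, and the function names are made up for illustration and are not taken from the paper.

```python
# Discretised image formation: each channel value is the sum over sampled
# wavelengths of illuminant * reflectance * sensitivity * step.
# All spectra below are made-up illustrative numbers, not measured data.

def integrate(values, step):
    """Riemann-sum approximation of an integral over equally spaced samples."""
    return sum(values) * step

def pixel_response(illuminant, reflectance, sensitivities, step):
    """Measured colour (one value per channel) of a single pixel."""
    return [
        integrate([e * s * c for e, s, c in zip(illuminant, reflectance, c_k)], step)
        for c_k in sensitivities
    ]

def illuminant_color(illuminant, sensitivities, step):
    """Projection of the illuminant spectrum on the sensor sensitivities."""
    return [
        integrate([e * c for e, c in zip(illuminant, c_k)], step)
        for c_k in sensitivities
    ]

step = 100.0                          # nm between samples (400, 500, 600, 700 nm)
illuminant = [0.2, 0.5, 0.9, 1.0]     # a reddish light source
sensitivities = [
    [0.0, 0.1, 0.8, 0.6],             # R channel
    [0.1, 0.9, 0.3, 0.0],             # G channel
    [0.9, 0.2, 0.0, 0.0],             # B channel
]
white_surface = [1.0, 1.0, 1.0, 1.0]  # reflects every wavelength fully

pixel = pixel_response(illuminant, white_surface, sensitivities, step)
light = illuminant_color(illuminant, sensitivities, step)
```

In this toy setting a perfectly white surface returns exactly the projected illuminant color, which is why an estimate of that color is enough to correct the chromaticity of the image.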
Deep neural networks have recently been used extensively to estimate illumination and have often led to state-of-the-art performance [1, 2, 3, 4, 5] across multiple datasets [6, 7, 8]. However, most of these supervised approaches were evaluated on test data that is very similar to the training data, i.e., the training and test sets are usually acquired with the same camera models and contain similar types of scenes. In this paper, we highlight some limitations of the supervised approaches by taking the testing scenario to the extreme. We show that supervised models trained on images from a single camera and a single scene type usually end up learning the parameters of that camera and scene and are not able to generalize effectively to other cameras and scenes. Over-fitting is a general issue in deep learning and has been studied extensively. Erhan et al. [9, 10] suggested that unsupervised pre-training makes it possible to obtain solutions that are similar in terms of training error but substantially better in terms of test error. They suggested that unsupervised pre-training has a dual effect: it helps optimization start in better basins of attraction in parameter space, and it acts as a kind of regularizer for the network.
State-of-the-art color constancy approaches based on convolutional neural networks [2, 3, 5] usually use the first convolutional layers of a pre-trained model, e.g., SqueezeNet [11] or AlexNet [12]. While convolutional layers have proven effective in color constancy in general [1, 2, 4], these pre-trained networks were originally trained for a classification task. Classification tasks benefit from being agnostic to illumination color. This makes their use in color constancy counter-intuitive, as the first layers must preserve the illumination information for it to be detected. Autoencoders provide a promising paradigm for using unsupervised pre-training in the color constancy context. A trained convolutional autoencoder can uncover the underlying structure of image chromaticities by learning from large numbers of unlabeled images that can be collected, for example, from the Internet. This can help generalize to unseen scenes and cameras without the need for very deep networks.
In this paper, we propose two novel approaches based on unsupervised pre-training using autoencoders. In the first, we learn a common representation of images and then fine-tune the model to estimate the illumination. In the second approach, we combine the two steps into one using a composite objective function, which allows us to learn to reconstruct and, at the same time, regress to the illumination.
II Related work
Typically, color constancy algorithms are divided into two main categories, namely unsupervised methods and supervised methods. The former comprise methods with static parameter settings, which are based on low-level statistics [13, 14, 15, 16, 17], and methods using the physics-based dichromatic reflection model [18, 19, 20, 21], while the latter comprise data-driven approaches that learn to estimate the illuminant in a supervised manner using labeled data. Supervised methods can be further divided into two main categories: characterization-based methods and training-based methods. The former involve characterization of the camera response in one way or another, such as Gamut Mapping [22], which assumes that in a real-world scenario, for a given illuminant, only a limited number of colors can be observed. The latter involve methods that try to learn illumination directly from the scene [23, 24, 25, 26]. One group of training-based methods considers different illumination estimation approaches and learns a model that uses the best performing method or a combination of methods to estimate the illuminant of each input based on certain scene characteristics [24]. Another group of training-based methods uses deep learning to solve the illumination estimation problem.
The first attempt to use convolutional neural networks (CNNs) for solving the illuminant estimation problem was made by Bianco et al. [1], who adopted a CNN architecture operating on small local patches to overcome the limited number of training images available. In the testing phase, a map of local estimates is pooled to obtain one global illuminant estimate. For this approach, median pooling was shown to outperform other types of pooling techniques. Shi et al. [3] proposed a network with two interacting sub-networks to estimate the illumination. One sub-network, called the hypotheses network, is employed to generate multiple plausible illuminant estimates depending on the patches in the scene. The second sub-network, called the selection network, is trained to select the best estimate generated by the first sub-network. Das et al. [5] formulated the illumination estimation task as an image-to-image translation task and used a Generative Adversarial Network (GAN) to solve it. Barron [4] reformulated the problem of color constancy as a 2D spatial localization task, in order to directly learn how to discriminate between correctly white-balanced images and poorly white-balanced images. Another CNN-based approach was proposed by Hu et al. [2]. They introduced a novel pooling layer, namely the confidence-weighted pooling layer, in an end-to-end learning process. In their approach, patches in an image can carry different confidence weights according to the value they provide for color constancy estimation. In this deep model, pre-trained layers from SqueezeNet [11] and AlexNet [12] were used.
III Proposed approach
While the current state-of-the-art CNN-based methods use very deep models built on the convolutional layers of a pre-trained model, we argue that unsupervised pre-training of a convolutional autoencoder may avoid overfitting without the need to go very deep. Training a Convolutional AutoEncoder (CAE) to reconstruct images and using it to estimate the illumination allows us to use unlabeled data and thus obtain better parameters for the trained network. Learning to regenerate a large number of images from different cameras and sources should result in a model that is more camera and scene invariant. We propose two approaches based on autoencoders, named Color Constancy Convolutional AutoEncoder (C3AE) fine-tuned and C3AE composite-loss.
C3AE fine-tuned is a two-step approach. In the first step, an autoencoder is trained to reconstruct both labeled and unlabeled images to learn a latent representation for them using the binary cross-entropy loss. In the second step, the encoder part is fine-tuned to estimate the illumination using the recovery angular error (RAE) as the loss function. RAE is a typical error measure in color constancy (see Section IV-D).
In the C3AE composite-loss approach, we combine the two steps of C3AE fine-tuned into one semi-supervised process. We train an autoencoder with a code size (middle layer) composed of only three neurons and we reconstruct the images (labeled and unlabeled) while forcing at the same time the middle layer to regress to the desired illumination for the labeled samples. For this purpose, we modify the loss function of the autoencoder in the following manner:
$$\mathcal{L} = \frac{\alpha}{|D_l \cup D_u|} \sum_{x \in D_l \cup D_u} \mathcal{L}_{bce}(x, \hat{x}) + \frac{1-\alpha}{|D_l|} \sum_{x \in D_l} \frac{1}{90}\, \mathcal{L}_{ang}(\hat{e}_x, e_x), \qquad (3)$$

where $D_l$ is the labeled domain, $D_u$ is the unlabeled domain, $|\cdot|$ is the cardinality operator, i.e., the number of elements in a set, $\mathcal{L}_{bce}$ is the binary cross-entropy loss between an image $x$ and its reconstruction $\hat{x}$, and $\mathcal{L}_{ang}$ is the angular loss (given in (4)) between the estimated illumination $\hat{e}_x$ in the bottleneck of the autoencoder and the ground truth illumination $e_x$. The scaling by 1/90 makes the two losses of the same order of magnitude. The weight $\alpha$ is set as a hyperparameter. Intuitively, $\alpha$ encodes the weights of the two terms in the loss function. A small value means prioritizing the second term, i.e., learning to estimate the illumination, and a large value means prioritizing the first term, i.e., learning to reconstruct the images. To minimize Eq. (3), the autoencoder has to learn to reconstruct images from both the labeled and the unlabeled domain while matching the bottleneck as closely as possible to the ground truth illumination for the labeled domain. In the last stage, the encoder part is fine-tuned using the labeled samples only.
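The composite objective described above can be sketched in plain Python as follows. This is only an illustration under assumptions: the function names and the sample layout (flat lists of pixel values, a three-value bottleneck code) are hypothetical, and the split of the weight into `alpha` for reconstruction and `1 - alpha` for illumination is one reading of the text rather than a confirmed detail of the paper.

```python
import math

def binary_cross_entropy(target, output, eps=1e-7):
    """Mean BCE between two equally long lists of pixel values in [0, 1]."""
    total = 0.0
    for t, o in zip(target, output):
        o = min(max(o, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(o) + (1.0 - t) * math.log(1.0 - o)
    return total / len(target)

def angular_error_deg(estimate, truth):
    """Angle in degrees between two RGB illuminant vectors."""
    dot = sum(a * b for a, b in zip(estimate, truth))
    norm = math.sqrt(sum(a * a for a in estimate)) * math.sqrt(sum(b * b for b in truth))
    return math.degrees(math.acos(min(1.0, max(-1.0, dot / norm))))

def composite_loss(samples, alpha=0.5):
    """Reconstruction term over all samples plus angular term over labeled ones.

    Each sample is a dict with 'image', 'reconstruction', 'code' and, for
    labeled samples only, 'illuminant'. The alpha / (1 - alpha) weighting
    is an assumption of this sketch.
    """
    labeled = [s for s in samples if s.get("illuminant") is not None]
    recon = sum(
        binary_cross_entropy(s["image"], s["reconstruction"]) for s in samples
    ) / len(samples)
    angular = 0.0
    if labeled:
        # Dividing by 90 brings the angular term to the same order as the BCE.
        angular = sum(
            angular_error_deg(s["code"], s["illuminant"]) / 90.0 for s in labeled
        ) / len(labeled)
    return alpha * recon + (1.0 - alpha) * angular
```

Unlabeled samples contribute only to the reconstruction term, which is what lets the unlabeled images shape the representation while the three-neuron bottleneck is pulled toward the illuminant on labeled data.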
IV Experimental setup
IV-A Network architectures
We use a fully convolutional autoencoder which consists of four blocks of convolution, max-pooling, and dropout layers. The convolution filters are selected to be 32 of size 5×5 in the first two layers, 32 of size 4×4 in the third one, and 256 of size 3×3 in the fourth layer, with an additional convolutional layer in the middle and the corresponding symmetric layers in the decoder.
For C3AE fine-tuned, the middle layer size is 50. The training is run for 1000 epochs with a batch size of 10. For fine-tuning, in order to make the network suitable for illumination estimation, we add two layers on top of the trained encoder: one of size 15 and the other of size 3. The fine-tuning is run for 1000 epochs with a batch size of 20. For C3AE composite-loss, the middle layer size is 3 and the loss weight in Eq. (3) is equal to 0.5. Both trainings are conducted on image patches of size 64×64.
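To give a feel for the model size implied by the filter counts above, the following sketch counts the weights and biases of the four described encoder convolutions. It assumes three-channel RGB input and one bias per filter, and it deliberately leaves out the middle layer, the decoder, and the fine-tuning layers, whose shapes are not fully specified in the text, so it is not the total parameter count of the model.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a square 2D convolution."""
    return (kernel * kernel * in_channels + 1) * out_channels

# (kernel size, number of filters) for the four encoder blocks in the text.
encoder_blocks = [(5, 32), (5, 32), (4, 32), (3, 256)]

in_channels = 3  # assumption: RGB input patches
per_layer = []
for kernel, filters in encoder_blocks:
    per_layer.append(conv_params(kernel, in_channels, filters))
    in_channels = filters  # the next layer sees this layer's output channels

total = sum(per_layer)
```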
IV-B Image datasets
IV-B1 ColorChecker RECommended dataset
The ColorChecker RECommended dataset [6] (http://www.cs.sfu.ca/~colour/data/shi_gehler/) is an updated version of the Gehler-Shi dataset [7] with a new proposed "recommended" ground truth to use for evaluation. This dataset contains 568 high-quality mixed indoor and outdoor images acquired by two cameras: Canon 1D and Canon 5D. We use this dataset to evaluate the approaches in the first scenario, where the test set is similar to the training set.
IV-B2 INTEL-TUT2
INTEL-TUT2 is the second version of INTEL-TUT dataset [8]. The main strength of this dataset is that it contains several camera models and several types of scenes organized separately. We use this dataset in the second training scenario, where the models are trained only with images acquired by one camera and containing one type of scene. The models are then tested on the other cameras and scenes.
This publicly available dataset (http://urn.fi/urn:nbn:fi:csc-kata20170901151004490662) contains images taken with three cameras (namely Canon, Nikon, and Mobile). The images are divided into four sets: field (144 images per camera), lab printouts (300 images per camera), lab real scenes (4 images per camera), and field2. The last set, field2, contains only images taken by Canon and has 692 images in total. We use this last set for training and validation and the rest of the sets for testing.
IV-B3 Tiny ImageNet
As unlabeled data, we used Tiny ImageNet (https://tiny-imagenet.herokuapp.com), which is a smaller version of the original ImageNet [27]. We use 10k randomly selected images from this dataset. The diversity of ImageNet plays an essential role in this process. We believe that an autoencoder trained to reconstruct this dataset will encode a strong image dictionary. This will result in a stronger ability to generalize and help to build a robust illuminant estimator.
IV-C Evaluation procedure
For the first experiment, we used the ColorChecker RECommended dataset. Similarly to [1, 2], we used three-fold cross validation on the folds provided with the dataset: for each run, one fold is used for training, one for validation, and the remaining one for testing.
For the second experiment, we used only the Canon field2 set for training and validation (80% for training and 20% for validation). We constructed two test sets. The first one, referred to here as field, contains all the field images taken by the other camera models, i.e., Nikon and Mobile. The second set, referred to here as non-field, contains all the non-field images acquired by Nikon and Mobile. This allowed us to test both scene and camera invariance of the models.
As different camera models are used in the INTEL-TUT2 dataset, the variation in camera spectral sensitivity needs to be discounted. For this purpose, we utilize Color Conversion Matrix (CCM) based preprocessing [28] to learn a 3×3 CCM for each camera pair.
For all the comparative experiments, data augmentation was performed as specified in the original works [1, 2]. For our models, we first downscaled the color constancy dataset images to 1920×1080 and randomly cropped 64×64 patches from these downscaled images. The crops were rotated by a random angle between -30° and +30°, and, while training, we rescaled the patches and the corresponding ground truths by random RGB values in the range of [0.8, 1.2]. In testing, the images were first downscaled by 50% in both axes and then 5 random 64×64 patches were selected from the image. This allowed us to generate a map of local estimates. We took the median of these estimates as the global illumination estimate.
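The patch sampling, RGB rescaling, and median pooling described above might look roughly like the sketch below. The helper names are hypothetical, the median is taken per channel (the text does not say how the median of RGB vectors is defined), and rotation and the illuminant estimator itself are left out.

```python
import random
import statistics

def random_patches(height, width, size=64, count=5, seed=0):
    """Top-left corners of `count` random size x size crops inside the image."""
    rng = random.Random(seed)
    return [
        (rng.randint(0, height - size), rng.randint(0, width - size))
        for _ in range(count)
    ]

def pool_estimates(local_estimates):
    """Channel-wise median of per-patch RGB illuminant estimates (assumption)."""
    return [statistics.median(channel) for channel in zip(*local_estimates)]

def scale_augment(patch, illuminant, rng, low=0.8, high=1.2):
    """Rescale a patch (list of RGB pixels) and its ground truth by the
    same random per-channel gains drawn from [low, high]."""
    gains = [rng.uniform(low, high) for _ in range(3)]
    scaled_patch = [[v * g for v, g in zip(pixel, gains)] for pixel in patch]
    scaled_truth = [v * g for v, g in zip(illuminant, gains)]
    return scaled_patch, scaled_truth
```

Scaling the patch and its ground truth by the same gains keeps the label consistent with the image, so each augmented patch behaves like a scene under a slightly different illuminant.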
IV-D Loss and evaluation metrics
For better insights into the robustness of the proposed methods, we report the mean of the best 25%, the mean, the median, Tukey's trimean, and the mean of the worst 25% of the recovery angular error (RAE) [29] between the ground truth illuminant and the estimated illuminant:

$$\mathrm{RAE}(e, \hat{e}) = \cos^{-1}\left(\frac{e \cdot \hat{e}}{\|e\|\,\|\hat{e}\|}\right), \qquad (4)$$

where $e$ is the ground truth illumination for an image and $\hat{e}$ is the estimated illumination.
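A minimal sketch of the error measure and the five summary statistics reported in the tables is given below. The function names are hypothetical, and the quartile convention used for Tukey's trimean (Python's "inclusive" method) is an assumption, since the text does not specify one.

```python
import math
import statistics

def recovery_angular_error(truth, estimate):
    """Angle in degrees between ground-truth and estimated RGB illuminants."""
    dot = sum(a * b for a, b in zip(truth, estimate))
    norm = math.sqrt(sum(a * a for a in truth)) * math.sqrt(sum(b * b for b in estimate))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.degrees(math.acos(min(1.0, max(-1.0, dot / norm))))

def summarize(errors):
    """Best 25%, mean, median, trimean, and worst 25% of a list of errors."""
    ordered = sorted(errors)
    quarter = max(1, len(ordered) // 4)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "best25": statistics.mean(ordered[:quarter]),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "trimean": (q1 + 2 * q2 + q3) / 4,  # Tukey's trimean
        "worst25": statistics.mean(ordered[-quarter:]),
    }
```

Because the error depends only on the angle between the two vectors, scaling either illuminant by a positive constant leaves it unchanged, which matches the point that color constancy corrects chromaticity rather than brightness.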
V Experimental results
V-A Results on ColorChecker RECommended dataset
We first evaluated the accuracy of the approaches on the ColorChecker RECommended dataset, as shown in Table I. We provide results for the static methods Grey-World, White-Patch, Shades-of-Grey, and General Grey-World. The parameter values $n$, $p$, $\sigma$ are set as described in [15]. In addition, we compare with Pixel-based Gamut, Bright Pixels, Spatial Correlations and six convolutional approaches: Deep Specialized Network for Illuminant Estimation (DS-Net) [3], Bianco CNN [1], Fast Fourier Color Constancy (FFCC) [33], Convolutional Color Constancy (CCC) [4], Fully Convolutional Color Constancy With Confidence-Weighted Pooling (FC4) [2], and Color Constancy GANs (CC-GANs) [5].
In this training scenario, the training, validation, and test sets are similar in the sense that all of them contain images acquired with both camera models (Canon 1D and Canon 5D) and various types of scenes. In this experiment, we note that learning-based methods usually outperform statistic-based methods across all error metrics. This can be explained by the fact that statistic-based approaches rely on assumptions built into their models. These assumptions can be violated in some test samples and thus result in high error rates, especially in terms of the worst 25%.
In Table I, we also note that DS-Net, CCC, and FFCC achieve better error rates in terms of mean, median, and trimean than our proposed method C3AE and its variants. However, these methods are less stable and fail to generalize for many examples in the dataset. This can be seen through the worst 25% error metric: the mean of the worst 25% is at least 4.8° for these methods, compared to 3.9° and 4.0° for our methods. Furthermore, by comparing the number of parameters required by each model given in Table II, we see that C3AE achieves very competitive results while using less than 1% of the parameters of DS-Net.
Table I also shows that both of our proposed methods perform similarly to Bianco CNN w.r.t. all metrics, except for the mean metric, where the C3AE variants outperform Bianco CNN. The proposed approach shows competitive results compared to FC4, the error difference being below 1° for all the evaluation metrics, while using less than 10% of the parameters. By comparing the number of parameters required by each model in Table II, we see that C3AE and its variants use less than 1% of the parameters of FC4 (SqueezeNet).
C3AE fine-tuned and C3AE composite-loss achieve similar results, with C3AE fine-tuned performing better in terms of the mean error metric and C3AE composite-loss performing better in the mean of the worst 25%.
V-B Results on INTEL-TUT2 dataset
Table III reports the comparative results and the numbers of parameters for the CNN-based approaches: Bianco CNN, FC4 (SqueezeNet), C3AE fine-tuned, and C3AE composite-loss trained on the INTEL-TUT2 dataset. To investigate the effect of pre-training on the performance of our approaches, we also provide results for C3AE without pre-training. We provide the error metrics on three sets: the training set, and the field and non-field sets described in Section IV-C.
In this extreme scenario, the models are trained on field2 samples acquired with Canon. Then the testing is performed on images acquired with other cameras and other types of scenes. For all the methods, we note a significant difference between the training errors and the test errors, i.e., most of the error metrics in both test sets have increased by a factor of 2-3 compared to the training errors. We note a slightly lower factor for our two proposed methods, especially in terms of the worst 25%. We also note that although Bianco CNN has better training error rates than our methods, C3AE fine-tuned shows more generalization ability and outperforms Bianco CNN in almost all test error metrics. C3AE fine-tuned shows competitive results compared to FC4 while using only 10% of the parameters.
As we see in Table III, unsupervised pre-training yields a much better generalization ability than semi-supervised pre-training in almost all error metrics. In comparison with the method without pre-training, we note that pre-training indeed helps and yields more robust methods. This can be explained by the fact that the autoencoder was trained with a diverse dataset containing images acquired with multiple cameras. This resulted in a robust initialization for the algorithms, which in turn resulted in models that can better generalize to different cameras and scenes.
Figure 1 presents three samples from the INTEL-TUT2 dataset, alongside their corresponding corrections using C3AE fine-tuned and their ground truth.
VI Conclusion
In this paper, illumination estimation algorithms were evaluated and compared on ColorChecker RECommended dataset. In addition, we tested the generalization ability of these algorithms in an extreme scenario with the second version of INTEL-TUT dataset, where color constancy approaches were trained using images only from one field set acquired with one camera and tested on images acquired with different camera models and on different scenes. We found that their performance drops significantly and they fail to some extent to generalize.
We proposed a method, C3AE, that exploits convolutional autoencoders and unsupervised pre-training to improve the generalization ability. With the proposed approach, we achieved results comparable to the state-of-the-art methods using far fewer parameters.
Extensions of the proposed approach could include the use of other unsupervised pre-training techniques, such as variational convolutional autoencoders, in order to improve the generalization power from fewer examples.
References
- [1] S. Bianco, C. Cusano, and R. Schettini, “Color constancy using CNNs,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 81–89.
- [2] Y. Hu, B. Wang, and S. Lin, “FC4: Fully convolutional color constancy with confidence-weighted pooling,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4085–4094.
- [3] W. Shi, C. C. Loy, and X. Tang, “Deep specialized network for illuminant estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 371–387.
- [4] J. T. Barron, “Convolutional color constancy,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 379–387.
- [5] P. Das, A. S. Baslamisli, Y. Liu, S. Karaoglu, and T. Gevers, “Color constancy by GANs: An experimental survey,” Computing Research Repository, 2018.
- [6] G. Hemrit, G. Finlayson, A. Gijsenij, P. Gehler, S. Bianco, B. Funt, M. Drew, and L. Shi, “Rehabilitating the ColorChecker dataset for illuminant estimation,” in Color and Imaging Conference, 2018, pp. 350–353.
- [7] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp, “Bayesian color constancy revisited,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
- [8] C. Aytekin, J. Nikkanen, and M. Gabbouj, “A data set for camera-independent color constancy,” IEEE Transactions on Image Processing, vol. 27, pp. 530–544, 2018.
