TL;DR
This paper introduces BoCF, a new color constancy method that uses Bag-of-Features pooling to reduce parameters, aligns with statistical assumptions, and employs attention mechanisms to improve illumination estimation.
Contribution
The paper presents a novel Bag of Color Features approach with attention mechanisms, reducing parameters and improving accuracy in color constancy tasks.
Findings
Achieves competitive results with fewer parameters.
Effective on multiple benchmark datasets.
Attention variants enhance estimation accuracy.
Abstract
In this paper, we propose a novel color constancy approach, called Bag of Color Features (BoCF), building upon Bag-of-Features pooling. The proposed method substantially reduces the number of parameters needed for illumination estimation. At the same time, the proposed method is consistent with the color constancy assumption stating that global spatial information is not relevant for illumination estimation and local information (edges, etc.) is sufficient. Furthermore, BoCF is consistent with color constancy statistical approaches and can be interpreted as a learning-based generalization of many statistical approaches. To further improve the illumination estimation accuracy, we propose a novel attention mechanism for the BoCF model with two variants based on self-attention. The BoCF approach and its variants achieve results competitive with the state of the art while requiring…
| Method | # params | Best 25% | Mean | Med. | Tri. | Worst 25% |
|---|---|---|---|---|---|---|
| BoCF(2conv+50 words + no attention) | 13k | 0.4 | 2.2 | 1.6 | 2.0 | 5.3 |
| BoCF(2conv+150 words + no attention) | 20k | 0.3 | 2.1 | 1.5 | 1.6 | 5.1 |
| BoCF(2conv+200 words + no attention) | 23k | 0.3 | 2.0 | 1.5 | 1.6 | 5.2 |
| BoCF(3conv+150 words+ no attention) | 37k | 0.3 | 2.2 | 1.4 | 1.8 | 5.1 |
| BoCF(2conv+50 words + attention1) | 369k | 0.4 | 2.0 | 1.3 | 2.0 | 5.1 |
| BoCF(2conv+150 words + attention1) | 376k | 0.3 | 2.0 | 1.3 | 1.5 | 4.7 |
| BoCF(2conv+200 words + attention1) | 380k | 0.3 | 2.0 | 1.2 | 1.5 | 5.0 |
| BoCF(2conv+50 words + attention2) | 15k | 0.4 | 2.2 | 1.5 | 1.6 | 5.1 |
| BoCF(2conv+150 words + attention2) | 43k | 0.3 | 2.0 | 1.2 | 1.4 | 4.8 |
| BoCF(2conv+200 words + attention2) | 63k | 0.3 | 2.0 | 1.3 | 1.5 | 4.8 |
| Method | Best 25% | Mean | Med. | Tri. | Worst 25% |
|---|---|---|---|---|---|
| BoCF | 0.3 | 2.1 | 1.5 | 1.6 | 5.1 |
| BoCF-1 | 0.4 | 2.9 | 1.9 | 2.2 | 6.9 |
| BoCF-2 | 0.5 | 2.4 | 1.7 | 1.7 | 5.7 |
| Method | Statistic-based | Learning-based | Best 25% | Mean | Med. | Tri. | Worst 25% |
|---|---|---|---|---|---|---|---|
| Grey-World [24] | ✓ | – | 5.0 | 9.7 | 10 | 10 | 13.7 |
| White-Patch [23] | ✓ | – | 2.2 | 9.1 | 6.7 | 7.8 | 18.9 |
| Shades-of-Gray [52] | ✓ | – | 2.3 | 7.3 | 6.8 | 6.9 | 12.8 |
| General-gray world [24] | ✓ | – | 2.0 | 6.6 | 5.9 | 6.1 | 12.4 |
| Pixel-based Gamut [30] | ✓ | – | 1.7 | 6.0 | 4.4 | 4.9 | 12.9 |
| Top-down [53] | ✓ | – | 2.3 | 6.0 | 4.6 | 5.0 | 10.2 |
| Spatial Correlations [54] | ✓ | – | 1.9 | 5.7 | 4.8 | 5.1 | 10.9 |
| Bottom-up [53] | ✓ | – | 2.3 | 5.6 | 4.9 | 5.1 | 10.2 |
| Edge-based Gamut [30] | ✓ | – | 0.7 | 5.5 | 3.3 | 3.9 | 13.8 |
| CC-GANs (Pix2Pix) [35] | – | ✓ | 1.2 | 3.6 | 2.8 | 3.1 | 7.2 |
| CC-GANs (CycleGAN) [35] | – | ✓ | 0.7 | 3.4 | 2.6 | 2.8 | 7.3 |
| CC-GANs (StarGAN) [35] | – | ✓ | 1.7 | 5.7 | 4.9 | 5.2 | 10.5 |
| FFCC (model Q) [50] | – | ✓ | 0.3 | 2.0 | 1.1 | 1.4 | 5.1 |
| Cheng et al. 2015 [55] | – | ✓ | 0.4 | 2.4 | 1.7 | 1.7 | 5.9 |
| DS-Net [7] | – | ✓ | 0.3 | 1.9 | 1.1 | 1.4 | 4.8 |
| CCC[51] | – | ✓ | 0.3 | 2.0 | 1.2 | 1.4 | 4.8 |
| Bianco CNN [5] | – | ✓ | 0.8 | 2.6 | 2.0 | 2.1 | 4.0 |
| FC4(SqueezeNet) [6] | – | ✓ | 0.4 | 1.7 | 1.2 | 1.3 | 3.8 |
| BoCF(2conv+150 words + no attention) | – | ✓ | 0.3 | 2.1 | 1.5 | 1.6 | 5.1 |
| BoCF(2conv+150 words + attention1) | – | ✓ | 0.3 | 2.0 | 1.3 | 1.5 | 4.7 |
| BoCF(2conv+150 words + attention2) | – | ✓ | 0.3 | 2.0 | 1.2 | 1.4 | 4.8 |
| Method | Statistic-based | Learning-based | Best 25% | Mean | Med. | Tri. | Worst 25% |
|---|---|---|---|---|---|---|---|
| Grey-World [24] | ✓ | – | 0.9 | 4.1 | 3.2 | 3.4 | 9.0 |
| White-Patch [23] | ✓ | – | 1.9 | 10.6 | 10.6 | 10.5 | 19.4 |
| Shades-of-Gray [52] | ✓ | – | 0.8 | 3.4 | 2.6 | 2.7 | 7.4 |
| General-gray world [24] | ✓ | – | 0.7 | 3.2 | 2.4 | 2.5 | 7.1 |
| Pixel-based Gamut [30] | ✓ | – | 2.5 | 7.7 | 6.7 | 6.9 | 14.0 |
| Bright Pixels [56] | ✓ | – | 0.7 | 3.2 | 2.4 | 2.6 | 7.0 |
| Edge-based Gamut [30] | ✓ | – | 2.4 | 8.4 | 7.0 | 7.4 | 16.1 |
| Bayesian [11] | – | ✓ | 0.8 | 3.7 | 2.7 | 2.9 | 8.2 |
| Cheng et al. 2015 [55] | – | ✓ | 0.6 | 2.9 | 2.0 | 2.2 | 6.6 |
| DS-Net [7] | – | ✓ | 0.5 | 2.2 | 1.5 | 1.7 | 6.1 |
| CCC[51] | – | ✓ | 0.5 | 2.4 | 1.5 | 1.7 | 5.9 |
| Regression Tree [55] | – | ✓ | 0.5 | 2.4 | 1.6 | 1.7 | 5.5 |
| Bianco[5] | – | ✓ | 0.3 | 2.6 | 2.0 | 2.1 | 3.9 |
| FC4(SqueezeNet) [6] | – | ✓ | 0.5 | 2.2 | 1.5 | 1.7 | 5.2 |
| FC4(AlexNet) [6] | – | ✓ | 0.5 | 2.1 | 1.6 | 1.7 | 4.8 |
| BoCF(2conv+150 words + no attention) | – | ✓ | 0.6 | 2.5 | 1.6 | 1.8 | 5.6 |
| BoCF(2conv+150 words + attention1) | – | ✓ | 0.5 | 2.3 | 1.4 | 1.7 | 5.2 |
| BoCF(2conv+150 words + attention2) | – | ✓ | 0.5 | 2.3 | 1.5 | 1.7 | 5.1 |
| Method | Set | Best 25% | Mean | Med. | Tri. | Worst 25% |
|---|---|---|---|---|---|---|
| Bianco [5] | field | 1.1 | 4.5 | 3.7 | 3.8 | 9.2 |
| Bianco [5] | non-field | 1.8 | 6.2 | 5.3 | 5.5 | 12.4 |
| C3AE [57], fine-tuned | field | 1.6 | 4.4 | 4.0 | 4.2 | 7.9 |
| C3AE [57], fine-tuned | non-field | 1.6 | 5.2 | 4.6 | 4.7 | 10.1 |
| C3AE [57], composite-loss | field | 2.0 | 6.1 | 5.3 | 5.4 | 10.7 |
| C3AE [57], composite-loss | non-field | 1.9 | 6.2 | 5.3 | 5.4 | 14.4 |
| FC4 [6] | field | 1.7 | 4.3 | 4.1 | 4.2 | 7.4 |
| FC4 [6] | non-field | 1.5 | 4.8 | 4.2 | 4.3 | 9.0 |
| BoCF (150 w), no attention | field | 1.7 | 4.6 | 4.1 | 4.2 | 8.1 |
| BoCF (150 w), no attention | non-field | 1.5 | 4.9 | 4.2 | 4.4 | 9.5 |
| BoCF (150 w), attention1 | field | 1.9 | 4.5 | 4.1 | 4.2 | 7.3 |
| BoCF (150 w), attention1 | non-field | 1.5 | 4.9 | 4.2 | 4.3 | 9.0 |
| BoCF (150 w), attention2 | field | 1.7 | 4.4 | 4.1 | 4.2 | 7.5 |
| BoCF (150 w), attention2 | non-field | 1.5 | 4.9 | 4.3 | 4.4 | 9.1 |
Bag of Color Features For Color Constancy
Firas Laakom, Nikolaos Passalis, Jenni Raitoharju, Jarno Nikkanen, Anastasios Tefas, Alexandros Iosifidis, and Moncef Gabbouj
F. Laakom, N. Passalis, J. Raitoharju, and M. Gabbouj are with the Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland. A. Iosifidis is with the Department of Engineering, Electrical and Computer Engineering, Aarhus University, DK-8200 Aarhus, Denmark. A. Tefas is with the Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece. J. Nikkanen is with INTEL, Insinöörinkatu 41, 33720 Tampere, Finland.
Abstract
In this paper, we propose a novel color constancy approach, called Bag of Color Features (BoCF), building upon Bag-of-Features pooling. The proposed method substantially reduces the number of parameters needed for illumination estimation. At the same time, the proposed method is consistent with the color constancy assumption stating that global spatial information is not relevant for illumination estimation and local information (edges, etc.) is sufficient. Furthermore, BoCF is consistent with color constancy statistical approaches and can be interpreted as a learning-based generalization of many statistical approaches. To further improve the illumination estimation accuracy, we propose a novel attention mechanism for the BoCF model with two variants based on self-attention. The BoCF approach and its variants achieve results competitive with the state of the art while requiring far fewer parameters on three benchmark datasets: ColorChecker RECommended, INTEL-TUT version 2, and NUS8.
Index Terms:
Color constancy, illumination estimation, Bag of Features, attention mechanism
1 Introduction
Color constancy in general is the ability of an imaging system to discount the effects of illumination on the observed colors in a scene [1, 2]. When a person stands in a room lit by a colorful light, the Human Visual System (HVS) unconsciously removes the lighting effects and the colors are perceived as if they were illuminated by a neutral, white light. While this ability is very natural for the HVS, mimicking the same ability in a computer vision system is a challenging and under-constrained problem. Given a green pixel, one cannot assert whether it is a green pixel under a white illumination or a white pixel lit with a greenish illumination. Nonetheless, illumination estimation is considered an important component of many higher-level computer vision tasks such as object recognition and tracking. Thus, it has been extensively studied in order to develop reliable color constancy systems which can achieve illumination invariance to some extent [1, 3].
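The RGB image value $\rho(x, y)$ at position $(x, y)$ of an image can be expressed as a function of three key factors [3]: the illuminant distribution $I(x, y, \lambda)$, the surface reflectance $R(x, y, \lambda)$, and the camera sensitivity $S(\lambda)$, where $\lambda$ is the wavelength. This dependency is expressed as
$$\rho(x, y) = \int_{\lambda} I(x, y, \lambda) \, R(x, y, \lambda) \, S(\lambda) \, d\lambda. \tag{1}$$
Color constancy methods [3, 4] aim to estimate a uniform projection of $I(x, y, \lambda)$ on the sensor spectral sensitivities $S(\lambda)$, i.e.,
$$\mathbf{I} = I(x, y) = \int_{\lambda} I(x, y, \lambda) \, S(\lambda) \, d\lambda, \tag{2}$$
where $\mathbf{I}$ is the global illumination of the scene.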
Recently, deep learning approaches and Convolutional Neural Networks in particular have become dominant in almost all computer vision tasks, including color constancy [5, 6, 7, 8], due to their ability to take raw images directly as input and incorporate feature extraction in the learning process [9]. Despite their accuracy in estimating illumination across multiple datasets [10, 11, 6], deploying CNN-based approaches on low computational power devices, e.g., mobile devices, is still limited, since most of the high-accuracy deep models are computationally expensive [6, 7, 8], which makes them inefficient in terms of time and energy consumption. Additionally, most of the available datasets for illumination estimation are rather small-scale [12, 13, 10] and hence not suitable for training large models. For this reason, many state-of-the-art approaches [5, 6] rely on pre-trained networks to overcome this limitation. However, these pre-trained networks [14, 9] are originally trained for a classification task. Thus, they are usually agnostic to the illumination color. This makes their usage in color constancy counter-intuitive, as the illumination information is distorted in the early pre-trained layers. An alternative approach is to reduce the number of model parameters in order to use existing datasets, as shallower models, in general, need fewer examples to learn.
Furthermore, in [13, 15] it is argued that global spatial information is not an important feature in color constancy. The local information, i.e., the color distribution and the color gradient distribution (i.e., edges), can be sufficient to extract the illumination information [13]. Thus, using regular neural network configurations to extract deep features is counter-intuitive in this particular problem. To address these drawbacks and challenges, we propose in this paper a novel color constancy deep learning approach called Bag of Color Features (BoCF). BoCF uses Bag-of-Features Pooling [16], which takes advantage of the ability of CNNs to learn relevant shallow features while keeping the model suitable for low-power hardware. Furthermore, the proposed approach is consistent with the assumption that global spatial information is not relevant [13, 15] for illumination estimation.
Bag-of-Features Pooling is a neural extension [17, 16] of the famous Bag-of-Features model (BoF), also known as Bag-of-Visual Words (BoVW) [18, 19]. BoFs are widely used in computer vision tasks, such as action recognition [20], object detection/recognition, sequence classification [21], and information retrieval [22]. A BoF layer can be combined with convolutional layers to form a powerful convolutional architecture that is end-to-end trainable using the regular back-propagation algorithm [17].
The block diagram of the proposed BoCF model is illustrated in Figure 1. It consists of three main blocks: feature extraction block, Bag of Features block, and an estimation block. In the first block, regular convolutional layers are used to extract relevant features. Inspired by the assumption that second order gradient information is sufficient to extract the illumination information [13], we use only two convolutional layers to extract the features. In our experiments, we also study and validate this hypothesis empirically. In the second block, i.e., the Bag of Features block, the network learns the dictionary using back-propagation[17] over the non-linear transformation provided by the first block. This block outputs a histogram representation, which is fed to the last component, i.e., the estimation block, to regress to the scene illumination.
In most CNN-based approaches used to solve the color constancy problem [5, 6, 7, 8], fully connected layers are connected directly to a flattened version of the last convolutional layer output. This increases the number of parameters dramatically, as convolutional layer outputs usually have a high dimensionality. In the proposed method, we address this problem by introducing an intermediate pooling block, i.e., the Bag of Features block, between the last convolutional layer and the fully connected layers. The proposed model achieves comparable results to previous state-of-the-art illumination estimation methods while reducing the number of needed parameters by up to 95%. Additionally, the pooling process natively discards all global spatial information, which is, as discussed earlier, irrelevant for color constancy. Using only two convolutional layers in the first block limits the model to shallow features. These two properties make the proposed approach consistent with statistical approaches [13].
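To make the parameter argument concrete, the following sketch compares the size of the first fully connected layer when it is attached to a flattened feature map with the size of a BoF codebook plus a fully connected layer attached to the histogram. The feature-map shape (28×28×30), the hidden width (40), and the codebook size (150) are illustrative assumptions chosen for the arithmetic, not the configuration reported in the paper.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def bof_params(n_words: int, feat_dim: int) -> int:
    """Codebook centers plus one scaling factor per codeword (assumed layout)."""
    return n_words * feat_dim + n_words


# Hypothetical sizes, for illustration only.
height, width, channels = 28, 28, 30
hidden_units = 40
n_words = 150

# Route 1: flatten the feature map and connect it to the hidden layer.
flatten_route = dense_params(height * width * channels, hidden_units)

# Route 2: pool into a histogram of n_words bins, then connect to the hidden layer.
bof_route = bof_params(n_words, channels) + dense_params(n_words, hidden_units)

print(f"flatten route: {flatten_route} parameters")
print(f"BoF route:     {bof_route} parameters")
print(f"ratio:         {bof_route / flatten_route:.4f}")
```

Under these assumed sizes, the pooled route needs roughly 1% of the parameters of the flattened route, and its size no longer depends on the spatial resolution of the feature map.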
To further improve the performance of the proposed model, we also propose two variants of a self-attention mechanism for the BoCF model. In the first variant, we add an attention mechanism between the feature extraction block and the Bag of Features block. This mechanism allows the network to dynamically select parts of the image to use for estimating the illumination, while discarding the remaining parts. Thus, the network becomes robust to noise and irrelevant features. In the second variant, we add an attention mechanism on top of the histogram representation, i.e., between the Bag of Features block and the estimation block. In this way, we allow the network to learn to adaptively select the elements of the histogram which best encode the illuminant information. The model looks over the whole histogram after the spatial information has been discarded and generates a proper representation according to the current context (histogram). The introduced dynamics will be shown in the experiments to enhance the model performance with respect to all evaluation metrics and across all the datasets.
The main contributions of the paper are as follows:
- We propose a novel CNN-based color constancy algorithm, called BoCF, based on Bag-of-Features Pooling. The proposed model is both shallow and able to achieve competitive results across multiple datasets compared to the state of the art.
- We establish explicit links between BoCF and prior statistical methods for illumination estimation and show that the proposed method can be framed as a learning-based generalization of many statistical approaches. This approach provides the missing links between CNN-based approaches and static approaches.
- We propose two novel attention mechanisms for BoCF that can further improve the results. To the best of our knowledge, this is the first work which combines attention mechanisms with Bag-of-Features Pooling.
- The proposed method is extensively evaluated over three datasets, leading to competitive performance with respect to the existing state of the art while substantially reducing the number of parameters.
The rest of this paper is organized as follows. Section 2 provides the background of color constancy approaches and a brief review of the Bag-of-Features Pooling technique and the attention mechanism used in this work. Section 3 details the proposed approach along with the two attention-based variants. Section 4 introduces the datasets and the evaluation metrics used in this work along with the evaluation procedure. Section 5 presents the experimental results on three datasets: ColorChecker RECommended [12], NUS-8 [13], and INTEL-TUT version 2 [10]. In Section 6, we highlight the links between our approach and many existing methods and we show how our approach can be considered as a generic framework for expressing existing approaches. Section 7 concludes the paper.
2 Related work
2.1 Color constancy
Typically, two types of color constancy approaches are distinguished, namely static methods and supervised methods. The former are methods with static parameter settings that do not need any labeled image data for learning the model, while the latter are data-driven approaches that learn to estimate the illuminant in a supervised manner using labeled data.
2.1.1 Static methods
Static methods exploit the statistical or physical properties of a scene by making assumptions about the nature of colors. They can be classified into two categories: methods based on low-level statistics[23, 24, 25, 26] and methods based on the physics-based dichromatic reflection model [4, 27, 15, 28]. A number of approaches belonging to the first category were unified by Van de Weijer et al. [25] into a single framework expressed as follows:
$$\left( \int \left| \frac{\partial^{n} \rho_{\sigma}(x, y)}{\partial (x, y)^{n}} \right|^{p} dx \, dy \right)^{1/p} = k \, \mathbf{e}^{n,p,\sigma}, \tag{3}$$
where $n$ denotes the derivative order, $p$ the Minkowski norm, and $k$ the normalization constant for the illuminant estimate $\mathbf{e}^{n,p,\sigma}$. Also, $\rho_{\sigma}(x, y) = \rho(x, y) * G_{\sigma}$ denotes the image convolution with a Gaussian filter $G_{\sigma}$ with a scale parameter $\sigma$. This framework allows for deriving different algorithms simply by setting the appropriate values for $n$, $p$, and $\sigma$. The well-known Gray-World method [24], corresponding to $(n = 0, p = 1, \sigma = 0)$, assumes that under a neutral illumination the average reflectance in a scene is achromatic, and the illumination is estimated as the shift of the image average color from gray. White-Patch [23], corresponding to $(n = 0, p = \infty, \sigma = 0)$, assumes that the maximum values of the RGB color channels are caused by a perfectly reflecting surface in the scene. Therefore, the illumination components correspond to these maximum values. Besides the Gray-World and White-Patch methods, which make use of the color distribution in the scene to build their estimations, the Gray-Edge method [25] utilizes image derivatives. Instead of the global average color, Gray-Edge methods assume that the average color of edges or the gradient of edges is gray. The illuminant's color is then estimated as the shift of the average edge color from gray.
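The zeroth-order members of this framework (no derivative, no smoothing) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a non-negative linear RGB image of shape (H, W, 3); the function name `minkowski_illuminant` and the toy scene are hypothetical. The integral is replaced by a mean over pixels, which only changes the constant factor removed by the final normalization.

```python
import numpy as np


def minkowski_illuminant(image, p=1.0):
    """Illuminant estimate for derivative order 0 and no Gaussian smoothing.

    p = 1 gives Gray-World, p = inf gives White-Patch, and intermediate
    values give Shades-of-Gray. Returns a unit-length RGB vector.
    """
    pixels = np.asarray(image, dtype=np.float64).reshape(-1, 3)
    if np.isinf(p):
        estimate = pixels.max(axis=0)  # per-channel maximum
    else:
        estimate = np.mean(pixels ** p, axis=0) ** (1.0 / p)
    return estimate / np.linalg.norm(estimate)


# Toy scene: a uniform gray surface under a hypothetical reddish illuminant.
illuminant = np.array([0.9, 0.6, 0.4])
scene = np.full((4, 4, 3), 0.5) * illuminant

gray_world = minkowski_illuminant(scene, p=1)
shades_of_gray = minkowski_illuminant(scene, p=6)
white_patch = minkowski_illuminant(scene, p=np.inf)
```

On this toy scene the gray-surface assumption holds exactly, so all three settings recover the illuminant direction; on real images they differ, which is what motivates tuning or learning the aggregation.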
Physics-based dichromatic reflection models estimate the illumination by analyzing the scene and exploiting the physical interactions between the objects and the illumination. The main assumption of most methods in this category is that all pixels of a surface form a plane in RGB color space. As the scene contains multiple surfaces, this results in multiple planes. The intersection between these planes is used to compute the color of the light source [27]. Lee et al. [15] exploited the bright areas in the captured scene to obtain an estimate of the illuminant color. In this work, we establish links between our proposed approach, BoCF, and several static methods. We show that BoCF can be interpreted as a learning-based extension of several of these approaches.
2.1.2 Supervised methods
Supervised methods can be further divided into two main categories: characterization-based methods [29, 30] and training-based methods[31, 32, 5, 6]. The former involves ’light’ training processes in order to learn the characterization of the camera response in some way, while the latter involves methods that try to learn the illumination directly from the scene.
Gamut Mapping [29, 30] is one of the most famous characterization-based approaches. It assumes that for a given illumination condition, only a limited number of colors can be observed. Thus any unexpected variation in the observed colors is caused by the light source illuminant. The set of colors that can occur under a given illumination, called canonical gamut, is first learned in a supervised manner. In the evaluation, an input gamut which represents the set of colors used to acquire the scene is constructed. The illumination is then estimated by mapping this input gamut to the canonical gamut.
Another group of training-based methods combines different illumination estimation approaches and learns a model that uses the best performing method or a combination of methods to estimate the illuminant of each input based on the scene characteristics [31]. Bianco et al. used indoor/outdoor classification to select the optimal color constancy algorithm given an input image[32]. Lu et al. proposed an approach which exploits 3D scene information for estimating the color of a light source [33]. However, these methods tend to overfit and fail to generalize to all scene types.
The first attempt to use Convolutional Neural Networks (CNNs) for solving the illuminant estimation problem was made by Bianco et al. [5], who adopted a CNN architecture operating on small local patches to overcome the data shortage. In the testing phase, a map of local estimates is pooled to obtain one global illuminant estimate using median or mean pooling. Hu et al. [6] introduced a pooling layer, namely confidence-weighted pooling. In their fully convolutional network, they incorporate learning the confidence of each patch of the image in an end-to-end learning process. Patches in an image can carry different confidence weights according to their estimated accuracy in predicting the illumination. Shi et al. [7] proposed a network with two interacting sub-networks to estimate the illumination. One sub-network, called the hypothesis network, is used to generate multiple plausible illuminant estimations depending on the patches in the scene. The second sub-network, called the selection network, is trained to select the best estimate generated by the first sub-network. Inspired by the success of Generative Adversarial Networks (GANs) in image-to-image translation [34], Das et al. formulated the illumination estimation task as an image-to-image translation task [35] and used a GAN to solve it. However, these CNN-based methods suffer from certain weaknesses: computational complexity and disconnection from both the illumination assumption [13] and the prior static methods, e.g., Grey-World [24] and White-Patch [23]. This paper attempts to address these drawbacks by proposing a novel CNN approach, BoCF, which discards the global spatial information in agreement with [13] and [25], and is competitive with the training-based methods while using only 5% of the parameters.
2.2 Bag-of-Features Pooling
Passalis and Tefas proposed a Bag-of-Features Pooling (BoFP) layer [16, 17], which is a neural extension of the Bag-of-Features model (BoF). BoFP can be combined with convolutional layers to form a powerful architecture which can be trained end-to-end using the regular back-propagation algorithm [17, 36]. In this work, we use this pooling technique to learn the codebook of color features, hence the name Bag of Color Features (BoCF). This pooling discards all the global spatial information and outputs a fixed-length histogram representation. This allows us to reduce the large number of parameters usually needed when linking convolutional layers to fully connected layers. Furthermore, discarding global spatial information forces the network to learn to extract the illumination without global spatial inference, thus improving model robustness and adhering to the illumination assumption [13]. As an additional novel feature compared to prior works using Bag-of-Features Pooling [17, 36], we propose introducing an attention mechanism to enable the model to discard noise and focus only on relevant parts of the input representation. To the best of our knowledge, this is the first work which combines attention mechanisms with Bag-of-Features Pooling.
2.3 Attention mechanisms
Attention mechanisms were introduced in Natural Language Processing (NLP) [37] for sequence-to-sequence (seq2seq) models in order to tackle the problem of short-term memory faced in machine translators. They allow a machine translator to see the full information contained in the original input and then generate the proper translation for the current word. More specifically, they allow the model to focus on local or global features, as needed. Self-attention [38], also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. In other words, the attention mask is computed directly from the original sequence. This idea has been exported to many other problems in NLP and computer vision such as machine reading [39], text summarization [40, 41], and image description generation [42]. In [42], self-attention is applied to an image to enable the network to generate an attention mask and focus on the region of interest in the original image.
Attention in deep learning can be broadly interpreted as a mask of importance weights. In order to evaluate the importance of a single element, such as a pixel or a feature in general, for the final inference, one can form an attention vector by estimating how strongly the element is correlated with the other elements and use this attention vector as a mask when evaluating the final output [42]. Let $\mathbf{x} \in \mathbb{R}^{D}$ be a vector. The goal of a self-attention mechanism is to learn to generate a mask vector $\mathbf{v} \in \mathbb{R}^{D}$ depending only on $\mathbf{x}$, which encodes the importance weights of the elements of $\mathbf{x}$. Let $f$ be a mapping function between $\mathbf{x}$ and $\mathbf{v}$. The dependency can be expressed as follows:
$$\mathbf{v} = f(\mathbf{x}), \tag{4}$$
under the constraint:
$$\sum_{i=1}^{D} v_i = 1. \tag{5}$$
After computing the mask vector $\mathbf{v}$, the final output of the attention layer $\mathbf{y}$ is computed as follows:
$$\mathbf{y} = \mathbf{v} \odot \mathbf{x}, \tag{6}$$
where $\odot$ denotes the element-wise product.
The concept of attention, i.e., focusing on particular regions to extract the illumination information in color constancy, can be traced back to many statistical approaches. For example, White-Patch reduces this region to the pixel with the highest RGB values. Other methods, such as [15], focus on the bright areas in the captured scene, called specular highlights. Instead of making such a strong assumption on the relevant regions, in BoCF we allow the model to learn to extract these regions dynamically. To the best of our knowledge, this is the first work which uses attention mechanisms in the color constancy problem.
3 Proposed approach
In order to reduce the number of parameters needed to learn the illumination [6, 7], we propose a novel color constancy approach based on the Bag-of-Features Pooling [17], called herein the BoCF approach. The proposed approach along with the novel attention variants is illustrated in Figure 2. The proposed model has three main blocks, namely the feature extraction, the Bag of Features, and the illumination estimation blocks. In the first block, a nonlinear transformation of a raw image is obtained. In the second block, a histogram representation of this transform is compiled. This histogram is used in the third block to estimate the illumination.
3.1 Feature extraction
The feature extraction block takes a raw image as input and outputs a nonlinear transformation representing the image features. A CNN is used in this block. CNNs are known for their ability to extract relevant features directly from raw images. Technically, any CNN architecture can be used in this block. However, we observed in our experiments that only two convolutional layers, each followed by a downsampling layer (e.g., max-pooling), yield satisfactory results. This is in accordance with the assumption of statistical methods that second-order information is enough to estimate the illumination [13, 25].
After a raw image is fed to the feature extraction block, the output of the last convolutional layer is used to extract feature vectors that are subsequently fed to the next block. The number of extracted feature vectors depends on the size of the feature map and the used filter size as described in [17].
3.2 Bag of Features
The Bag of Features block is essentially the codebook (dictionary) learning component. The output features of the previous block are pooled using the Bag-of-Features Pooling and mapped to a final histogram representation. During training, the network optimizes the codebook using the traditional back-propagation. The output of this block is a histogram of a fixed size, i.e., the size of the codebook, which is a hyper-parameter that needs to be carefully tuned to avoid over-fitting. This approach discards all global spatial information. As described in [17], the Bag-of-Features Pooling is composed of two sub-layers: an RBF layer that measures the similarity of the input features to the RBF centers and an accumulation layer that builds the histogram of the quantized feature vectors. The normalized output of each RBF neuron can be expressed as
$$[\phi(\mathbf{x})]_k = \frac{\exp\left(-\|\mathbf{x} - \mathbf{c}_k\|_2 / m_k\right)}{\sum_{j=1}^{K} \exp\left(-\|\mathbf{x} - \mathbf{c}_j\|_2 / m_j\right)}, \tag{7}$$
where $\mathbf{x}$ is a feature vector, $\mathbf{c}_k$ is the center of the $k$-th RBF neuron, $K$ is the number of RBF neurons (the codebook size), $\exp$ is the exponential function, and $m_k$ is a scaling factor. The output of the RBF neurons is accumulated in the next layer, compiling the final representation of each image:
$$\mathbf{h} = \frac{1}{N} \sum_{j=1}^{N} \phi(\mathbf{x}_j), \tag{8}$$
where $N$ is the number of feature vectors $\mathbf{x}_j$ extracted from the last convolutional layer for the image.
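The two sub-layers described above can be sketched as a forward pass in NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `bof_pooling` and the array shapes are assumptions, and the similarity is assumed to be a softmax over negative Euclidean distances scaled per codeword, following the neural BoF formulation of [17].

```python
import numpy as np


def bof_pooling(features, centers, scales):
    """Soft-assign feature vectors to codewords and average the memberships.

    features: (N, D) feature vectors taken from the last convolutional layer
    centers:  (K, D) RBF centers, i.e. the learned codebook
    scales:   (K,)   positive scaling factor of each RBF neuron
    returns:  (K,)   histogram that sums to 1
    """
    # Distance of every feature vector to every codeword: shape (N, K).
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    logits = -dists / scales[None, :]
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    memberships = np.exp(logits)
    memberships /= memberships.sum(axis=1, keepdims=True)  # RBF layer output
    return memberships.mean(axis=0)  # accumulation layer


rng = np.random.default_rng(0)
features = rng.normal(size=(64, 8))  # e.g. an 8x8 map of 8-dimensional features
centers = rng.normal(size=(5, 8))    # hypothetical codebook of 5 words
scales = np.ones(5)

histogram = bof_pooling(features, centers, scales)

# Reordering the feature vectors (i.e. moving them around the image)
# leaves the histogram unchanged: spatial layout is discarded.
shuffled = features[rng.permutation(len(features))]
histogram_shuffled = bof_pooling(shuffled, centers, scales)
```

The comparison between `histogram` and `histogram_shuffled` illustrates the property the text relies on: the representation depends only on which features occur, not on where they occur.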
3.3 Illumination Estimation
The Bag of Features layer receives a transformation of the raw image and compiles its histogram representation. Then, this histogram is fed to a regressor that estimates the illumination. In this work, a multi-layer perceptron with one hidden layer is used for this purpose, although any other estimator with a differentiable loss function can be used.
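Let $\mathbf{h}$ be the histogram compiled by the second block. The intermediate layer output $\mathbf{z}$ can be computed as follows
$$\mathbf{z} = \sigma(\mathbf{W}_1 \mathbf{h} + \mathbf{b}_1), \tag{9}$$
where $\mathbf{W}_1$ is the weight matrix, $\mathbf{b}_1$ is the bias vector, and $\sigma$ is the Rectified Linear Units (ReLU) activation function [43]. The final estimate $\hat{\mathbf{I}}$ is computed as follows
$$\hat{\mathbf{I}} = \mathrm{softmax}(\mathbf{W}_2 \mathbf{z} + \mathbf{b}_2), \tag{10}$$
where $\mathbf{W}_2$ is the weight matrix, $\mathbf{b}_2$ is the bias vector, and $\mathrm{softmax}$ is the softmax activation function defined by
$$\mathrm{softmax}(\mathbf{u})_i = \frac{\exp(u_i)}{\sum_{j} \exp(u_j)}. \tag{11}$$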
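As a rough illustration of the estimation block, the sketch below runs a one-hidden-layer perceptron with a ReLU hidden layer and a softmax output on a histogram. The names, the hidden width of 40, and the random weights are assumptions for illustration; in the actual model the weights are learned jointly with the codebook.

```python
import numpy as np


def softmax(u):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(u - np.max(u))
    return e / e.sum()


def estimate_illuminant(hist, W1, b1, W2, b2):
    """Histogram -> ReLU hidden layer -> softmax over the three RGB channels."""
    hidden = np.maximum(0.0, W1 @ hist + b1)  # ReLU activation
    return softmax(W2 @ hidden + b2)


rng = np.random.default_rng(0)
n_words, n_hidden = 150, 40        # hypothetical sizes
hist = rng.random(n_words)
hist /= hist.sum()                 # a valid histogram sums to 1

W1 = rng.normal(scale=0.1, size=(n_hidden, n_words))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(3, n_hidden))
b2 = np.zeros(3)

rgb_estimate = estimate_illuminant(hist, W1, b1, W2, b2)
```

Because of the softmax, the three output components are positive and sum to one, so the output can be read as a normalized illuminant chromaticity rather than an absolute intensity.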
3.4 Attention mechanism for BoCF
We introduce a novel attention mechanism in the BoCF model to enable the algorithm to dynamically learn to focus on a specific region of interest in order to yield a confident output. We combine self-attention, described in Section 2.3, with the Bag-of-Features Pooling for the color constancy problem. We propose two variants of this mechanism which can be applied in our model. For the mapping function f in (Eq. 4), we use a fully connected layer with softmax activation.
In the first variant, we propose to apply attention on the nonlinear transformation of the image after the feature extraction block. This enables the model to learn to 'attend' to the region of interest in the mapping and to reduce noise before pooling. Applying attention at this stage increases the number of parameters sharply, since the attention mask needs as many elements as there are features.
In the second variant, we propose to apply the attention mechanism on the histogram representation of the BoCF, i.e., after the global spatial information is discarded. This enables the model to dynamically learn to 'attend' to the relevant parts of the histogram which encode the illuminant information. In this variant, the attention mask size is equal to the size of the histogram. Thus, the number of additional parameters is relatively small. Following the notation of (4) and (5), $\mathbf{x}$ is the histogram representation and the attention mask $\mathbf{v}$ is obtained via the fully connected layer as follows:
$$\mathbf{v} = \mathrm{softmax}(\mathbf{W} \mathbf{x} + \mathbf{b}), \tag{12}$$
where $\mathbf{W}$ is a weight matrix and $\mathbf{b}$ is the bias.
Using softmax as the activation of $f$ ensures that the masking constraint defined in (Eq. 5) is not violated. Finally, $\mathbf{y}$, the final output of the attention mechanism, is computed using the following equation

$$\mathbf{y} = \lambda\,(\mathbf{x} \odot \mathbf{v}) + (1 - \lambda)\,\mathbf{x},$$

where $\odot$ is the element-wise product operator and $\lambda$ is a weighting parameter between the masked histogram $\mathbf{x} \odot \mathbf{v}$, with $\mathbf{v}$ the attention mask, and the original histogram $\mathbf{x}$. $\lambda$ is a learnable parameter in our model. Outputting only the masked histogram, without $\lambda$, is another option. However, we determined experimentally that outputting the weighted sum of the original and the masked version is more robust and stable for gradient-based optimizers, since it is less susceptible to the random initialization of the attention weights.
The parameter $\lambda$ can be optimized by gradient descent in the back-propagation process along with the rest of the parameters. The gradient of the output of the attention block with respect to $\lambda$ can be obtained via the following equation

$$\frac{\partial \mathbf{y}}{\partial \lambda} = \mathbf{x} \odot \mathbf{v} - \mathbf{x}.$$
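A minimal sketch of the second attention variant, assuming the weighted sum is a convex combination of the masked and the original histogram. The exact form of the combination, the names `W`, `b`, and `lam`, and the random weights are assumptions made for illustration. The finite-difference check at the end confirms the analytic gradient of this assumed form.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def histogram_attention(x, W, b, lam):
    """Assumed form: y = lam * (x * v) + (1 - lam) * x, with v = softmax(W x + b)."""
    v = softmax(W @ x + b)        # mask: non-negative entries that sum to one
    masked = x * v                # element-wise product
    y = lam * masked + (1.0 - lam) * x
    dy_dlam = masked - x          # derivative of y with respect to lam
    return y, v, dy_dlam

rng = np.random.default_rng(0)
n = 50                            # histogram size = codebook size
x = rng.random(n)
x /= x.sum()
W, b = rng.normal(0, 0.1, (n, n)), np.zeros(n)   # untrained, hypothetical weights

y, v, grad = histogram_attention(x, W, b, lam=0.5)   # 0.5 is the initial value in Section 4.3

# Finite-difference check of the gradient with respect to lam.
eps = 1e-6
y_eps, _, _ = histogram_attention(x, W, b, lam=0.5 + eps)
print(np.allclose((y_eps - y) / eps, grad, atol=1e-6))
```

With `lam = 1` only the masked histogram is passed on, and with `lam = 0` the attention is bypassed entirely, which is why a learnable value in between softens the effect of a poorly initialized mask.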
4 Experimental setup
In this section, we present the experimental setup used in this work. In Subsection 4.1, we introduce the datasets used to test our models. In Subsection 4.2, we report the network architectures of the three blocks used in BoCF. In Subsection 4.3, we detail the evaluation process followed in our experiments. Finally, the evaluation metrics used are briefly described in Subsection 4.4.
4.1 Image datasets
4.1.1 ColorChecker RECommended dataset
The ColorChecker RECommended dataset [12] is a publicly available updated version of the Gehler-Shi dataset [11] (http://www.cs.sfu.ca/~colour/data/shi_gehler/) with a proposed (recommended) ground truth to use for evaluation. This dataset contains 568 high-quality indoor and outdoor images acquired by two cameras: Canon 1D and Canon 5D.
Similar to the works in [5, 6, 7, 8], we used three-fold cross validation to evaluate our algorithms on the ColorChecker RECommended dataset.
4.1.2 NUS-8 Camera Dataset
NUS-8 is a publicly available dataset (http://cvil.eecs.yorku.ca/projects/public_html/illuminant/illuminant.html) containing 1736 raw images from eight different camera models. Each camera has about 210 images. Following previous works [13, 6], we perform tests on each camera separately and report the mean of all the results for each evaluation metric. As a result, although the total number of images in the NUS-8 dataset is large, each experiment involves only about 210 images for both training and testing.
4.1.3 INTEL-TUT2
INTEL-TUT2 (http://urn.fi/urn:nbn:fi:csc-kata20170901151004490662) is the second version of the publicly available INTEL-TUT dataset [10]. The main particularity of this dataset is that it contains a large number of images taken by several cameras from different scenes. We use this dataset for an extreme testing protocol, the third protocol described in [10]. The models are trained with images acquired by one camera and containing one type of scene, and tested on the other cameras and the other scenes. This extreme test is useful to show the robustness of a given model and its ability to generalize across different cameras and scenes.
INTEL-TUT2 contains images acquired with three different cameras, namely Canon, Nikon, and Mobile. For each camera, the images are divided into four sets: field (144 images per camera), lab printouts (300 images per camera), lab real scenes (4 images per camera), and field2. The last set, field2, concerns only Canon and has a total of 692 images. Figure 3 shows some samples from the field, lab printouts, and lab real scenes sets of the three cameras, while Figure 4 displays samples from field2 for the Canon camera.
We used only the Canon field2 set for training and validation (80% for training and 20% for validation). We constructed two test sets. The first one, called field in this work, contains all the field images taken by the other camera models, i.e., Nikon and Mobile. The second set, called non-field in this work, contains all the non-field images acquired by Nikon and Mobile. Comparing the performance on these two sets allowed us to test both the scene and the camera invariance of the model. As we are using different camera models in the same experiments, the variation of camera spectral sensitivity needs to be discounted. For this purpose, we use the preprocessing of [44] to learn a Color Conversion Matrix (CCM) for each camera pair.
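To illustrate what learning a color conversion matrix for a camera pair can look like, the sketch below fits a 3×3 matrix by ordinary least squares on paired color measurements. The procedure of [44] is not reproduced here; the least-squares formulation, the function name `fit_ccm`, and the synthetic data are all assumptions.

```python
import numpy as np

def fit_ccm(src_rgb, dst_rgb):
    """Least-squares 3x3 matrix M such that src_rgb @ M ~= dst_rgb (rows are colours)."""
    M, *_ = np.linalg.lstsq(src_rgb, dst_rgb, rcond=None)
    return M

rng = np.random.default_rng(0)
true_M = np.array([[0.90, 0.10, 0.00],
                   [0.05, 1.00, 0.05],
                   [0.00, 0.20, 0.80]])     # hypothetical camera-to-camera mapping
src = rng.random((24, 3))                   # e.g. 24 colour patches seen by camera A
dst = src @ true_M                          # the same patches as seen by camera B (synthetic)

M = fit_ccm(src, dst)
print(np.allclose(M, true_M))
```

Once fitted, the matrix can be applied to every pixel of an image from the source camera (`pixels @ M`) so that images from both cameras live in a comparable color space.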
4.2 Network architectures
The BoCF network is composed of three blocks: the feature extraction block, the Bag of Features block, where the Bag-of-Features Pooling is applied, and the illumination estimation block, as described in Section 3. The feature extraction block consists of convolution layers followed by max pooling operators. We experiment with two and three convolution layers. Thirty convolution filters of size are used in both layers. Max-pooling with a window size of 2 is applied in both layers. For the codebook size, i.e., the number of RBF neurons in the Bag of Features block, we experiment with three values: 50, 150, and 200. The illumination estimation block consists of two fully connected layers: the first (hidden layer) has a size of 40 and takes the histogram representation as input, and the second (final output) has a size of 3 to output the illumination.
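The architecture above is small enough that its parameter count can be checked by hand. The sketch below counts the weights of a two-layer configuration under three assumptions that are not stated in the text: 3×3 convolution kernels, a three-channel input, and a Bag of Features block whose only parameters are the codebook centers.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k convolution layer."""
    return k * k * c_in * c_out + c_out

def bocf_params(n_words, n_conv=2, k=3, n_filters=30, n_hidden=40):
    """Approximate BoCF parameter count; k=3 is an assumed kernel size."""
    total, c_in = 0, 3                          # assumed RGB input
    for _ in range(n_conv):
        total += conv_params(k, c_in, n_filters)
        c_in = n_filters
    total += n_words * n_filters                # codebook centres (assumed only BoF parameters)
    total += n_words * n_hidden + n_hidden      # hidden fully connected layer
    total += n_hidden * 3 + 3                   # output layer
    return total

for n_words in (50, 150, 200):
    print(n_words, bocf_params(n_words))
```

Under these assumptions the totals are 12633, 19633, and 23133, which round to the 13k, 20k, and 23k reported for the two-layer models without attention. The sketch makes no attempt to account for the attention variants or the three-layer model.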
4.3 Evaluation procedure
To evaluate the proposed approach, we used 2 sets of experiments. In the first set, we evaluate different variants of the model to study the effect of the hyper-parameters and validate the effectiveness of each component in our model by conducting ablation studies. For this purpose, we used ColorChecker RECommended dataset. In the second set of experiments, we compared our approach with current state-of-the-art approaches on the three datasets.
For all testing scenarios, we augmented the datasets using the following process. As the size of the original raw images is high, we first randomly cropped patches of each image. This ensured getting meaningful patches. The crops were then rotated by a random angle between -30° and +30°. Finally, we rescaled the RGB values of each patch and its corresponding ground truth by a random factor in the range [0.8, 1.2]. Before feeding the sample to the network, we down-sampled it to . In testing, the images are resized to to fit the network model.
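The cropping and rescaling steps of this augmentation can be sketched as follows. The crop size, the use of one independent factor per color channel, and the function name `augment` are assumptions; the random rotation and the final down-sampling are left out to keep the example dependency-free.

```python
import numpy as np

def augment(image, illuminant, crop_size, rng):
    """Random square crop plus a random rescale of the patch and its ground truth."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    gain = rng.uniform(0.8, 1.2, size=3)     # assumed: one factor per channel
    return patch * gain, illuminant * gain   # the same factors go to the ground truth

rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))              # stand-in for a raw image
illum = np.array([0.3, 0.5, 0.2])            # stand-in ground-truth illuminant
patch, new_illum = augment(image, illum, crop_size=32, rng=rng)
print(patch.shape, new_illum)
```

Scaling the patch and the ground truth with the same factors keeps each training pair consistent: the augmented patch looks as if it were lit by the augmented illuminant.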
Our network was implemented in Keras [45] with Tensorflow backend [46]. We trained our network end-to-end by back-propagation. For optimization, Adam [47] was employed with a batch size of 15 and a learning rate of . The model was trained on image patches of size for 3000 epochs. The centers of the dictionary were initialized using the k-means algorithm as described in [17]. The parameter , discussed in Section 3.4, was initialized as 0.5.
4.4 Evaluation metrics
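We report the mean of the best 25%, the mean, the median, Tukey’s trimean, and the mean of the worst 25% of the recovery angular error (RAE) [48] between the ground truth illuminant and the estimated illuminant, defined as

$$\mathrm{RAE}(\boldsymbol{\rho}^{gt}, \boldsymbol{\rho}^{est}) = \cos^{-1}\left(\frac{\boldsymbol{\rho}^{gt} \cdot \boldsymbol{\rho}^{est}}{\|\boldsymbol{\rho}^{gt}\|\,\|\boldsymbol{\rho}^{est}\|}\right),$$

where $\boldsymbol{\rho}^{gt}$ is the ground truth illumination for a given image and $\boldsymbol{\rho}^{est}$ is the estimated illumination.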
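The five reported statistics can be computed from a list of per-image angular errors as sketched below. The angular error is the standard angle between the two illuminant vectors and the trimean is Tukey's (Q1 + 2·median + Q3)/4. How the 25% cut is rounded for small sets is an assumption, as are the function names.

```python
import numpy as np

def angular_error_deg(gt, est):
    """Angle in degrees between ground-truth and estimated illuminants (row-wise)."""
    cos = np.sum(gt * est, axis=-1) / (
        np.linalg.norm(gt, axis=-1) * np.linalg.norm(est, axis=-1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def summarize(errors):
    e = np.sort(np.asarray(errors, dtype=float))
    n = max(1, int(round(0.25 * len(e))))    # assumed rounding of the 25% cut
    q1, med, q3 = np.percentile(e, [25, 50, 75])
    return {
        "best25": e[:n].mean(),
        "mean": e.mean(),
        "median": med,
        "trimean": (q1 + 2 * med + q3) / 4,
        "worst25": e[-n:].mean(),
    }

gt = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
est = np.array([[0.0, 1.0, 0.0], [2.0, 2.0, 2.0]])
errs = angular_error_deg(gt, est)
print(errs)              # orthogonal vectors give 90 degrees, parallel vectors give ~0
print(summarize(errs))
```

The error depends only on the direction of the two vectors, so rescaling an estimate, as in the second row, leaves it unchanged.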
5 Experimental results
In this section, we provide the experimental evaluation of the proposed method and its variants. In Subsection 5.1, different topologies for the three blocks of BoCF are evaluated on the ColorChecker RECommended dataset and the effect of each block in our model is examined by reporting the results of the ablation studies. In Subsection 5.2, we compare the performance of the proposed models with different state-of-the-art algorithms over the three datasets.
5.1 BoCF performance evaluation
We first evaluated the accuracy of the different variants of BoCF on ColorChecker RECommended dataset. Table I presents the comparative results for BoCF using different topologies in the three blocks. We evaluate the model using different dictionary sizes in the second block (codewords), different numbers of convolution layers in the first block, and with/without attention.
Table I shows that the dictionary size in the Bag-of-Features Pooling block significantly affects the overall performance of the model. Using a larger codebook results in a higher risk of overfitting to the training data, while using a smaller codebook restricts the model to only a few codebook centers, which can decrease the overall performance of the model. Thus, the choice of this hyperparameter is critical for our model. The findings in Table I confirm this effect and highlight the importance of this hyperparameter. By comparing the model performance using different dictionary sizes, we can see that a dictionary of size 150 yields the best compromise between the number of parameters and the overall performance.
Using three convolutional layers instead of two in the first block yields slightly better median errors and worse trimean errors. However, to keep the model as shallow as possible, we opt for two convolutional layers.
Table I shows that models equipped with an attention mechanism perform better than models without attention almost consistently across all error metrics. This is expected, as attention mechanisms allow the model to focus on the relevant parts only and, as a result, the model becomes more robust to noise and to inadequate features. The performance boost obtained by both attention variants over the non-attention variant is most visible in the median and trimean errors.
By comparing the performance achieved by the two attention variants, we note that the first attention variant yields better performance in terms of the worst 25% error, while the second variant yields better median and trimean errors. It should also be remembered that the first variant applies attention over the feature map output of the first convolutional block. Thus, it dramatically increases the number of model parameters (roughly 17 to 28 times, depending on the codebook size) compared to the second variant (doubling the number of parameters), which applies the attention over the histogram.
Figure 5 presents a visualization of the attention weights [49] for both attention variants. The heat maps show which regions of the image each model pays attention to when producing its illumination estimate. We note a large difference between the two variants. The first attention variant tends to focus on regions with dense edges and sharp shapes, while the second focuses on uniform regions to estimate the illumination.
Ablation studies
To examine the effect of each block in our proposed approach, we conduct ablation studies on the ColorChecker RECommended dataset. Table II reports the results of the basic BoCF approach, the results achieved by removing the feature extraction block, and the results obtained by removing the estimation block, i.e., replacing the fully connected layer in the estimation block with a simple regression. We note that removing any block significantly decreases the overall performance of our models.
Comparing the model with and without the feature extraction block, we note a large drop in performance, especially in terms of the worst 25% error, which increases by 1.8° compared to 0.6° when the estimation block is removed.
5.2 Comparisons against state-of-the-art
We compare our BoCF approach with the state-of-the-art methods on the ColorChecker RECommended, NUS-8, and INTEL-TUT2 datasets, which have been widely adopted as benchmark datasets in the literature. Tables IV, V, and VI provide quantitative results for the ColorChecker RECommended, NUS-8, and INTEL-TUT2 datasets, respectively. We provide results for the static methods Grey-World, White-Patch, Shades-of-Grey, and General Grey-World. The parameter values , , are set as described in [25]. In addition, we compare against Pixel-based Gamut, Bright Pixels, Spatial Correlations, Bayesian Color Constancy [11], and six convolutional approaches: Deep Specialized Network for Illuminant Estimation (DS-Net) [7], Bianco CNN [5], Fast Fourier Color Constancy [50], Convolutional Color Constancy [51], Fully Convolutional Color Constancy With Confidence-Weighted Pooling (FC4) [6], and Color Constancy GANs (CC-GANs) [35]. The results for the ColorChecker RECommended and NUS-8 datasets were taken from related papers [35, 6].
From the ColorChecker RECommended and NUS-8 results in Tables IV and V, we note that learning-based methods usually outperform statistical methods across all error metrics. This can be explained by the fact that statistical approaches rely on assumptions built into their models. These assumptions can be violated in some testing samples, which results in high error rates, especially in terms of the worst 25% errors.
Table IV shows that the proposed method BoCF and its variants achieve competitive results on the ColorChecker RECommended dataset. The only models performing slightly better than BoCF are FC4 (SqueezeNet) and DS-Net. By comparing the number of parameters required by each model given in Table III, we see that BoCF achieves very competitive results while using less than 1% of the parameters of FC4 (SqueezeNet) and less than 0.1% of the parameters of DS-Net.
Compared to Bianco’s CNN, we note that our model performs better across all error metrics except for the worst 25% error metric. Bianco CNN operates on patches instead of the full image directly and this makes it more robust but, at the same time, it increases its time complexity as the network has to estimate many local estimates before outputting the global one.
Results for the NUS-8 dataset are similar to their counterparts on ColorChecker RECommended, as illustrated in Table V. Our models achieve comparable results with FC4 and overall better results compared to DS-Net across all error metrics. Bianco’s CNN outperforms all the other CNN-based methods. As discussed earlier, this can likely be explained by the fact that Bianco’s CNN operates on patches, while BoCF and FC4 produce global estimates directly.
Table VI reports the comparative results achieved on the INTEL-TUT2 dataset. We note that all the error rates are high, as this is an extreme testing scenario. The models are trained and validated using only one type of scene (field2 set) acquired by one camera model (Canon) and then evaluated over different scene types and different camera models not seen during training, as described in Section 4.1.3. The proposed BoCF model achieves better overall performance compared to Bianco’s CNN and the Color Constancy Convolutional AutoEncoder (C3AE) and competitive results compared to FC4.
By comparing the performance achieved by BoCF with and without attention, we note that both attention mechanisms proposed in this paper significantly boost the performance of our model for all datasets. It should also be mentioned that, despite requiring far fewer parameters, the second variant of our attention model, where the attention is applied over the histogram representation, performs slightly better than the first variant, where the attention is applied over the output of the feature extraction block.
6 Discussion
When comparing our approach to the competing methods, it must be pointed out that our approach can be linked to many previous static approaches. In Grey-World [24], one takes the average of the RGB channels of the image. In the proposed method, this corresponds to using the identity as a feature extractor and using equal weights in the estimation block. This way, all the histogram bins contribute equally to the estimation. White-Patch [23] takes the max across the color channels, which corresponds to giving a high weight to the histogram bin with the highest intensity and zero weight to the rest. Grey-Edge and its variants [25] correspond to using the first and second order derivatives as a feature extractor. Thus, the BoCF approach can be interpreted as a learning-based generalization of these statistical approaches. Instead of using the image directly, we allow the model to learn a suitable non-linear transformation of the original image through the feature extraction block, and instead of imposing a prior assumption on the contribution of each feature to the estimation, we allow the model to learn the mapping dynamically from the training data.
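For reference, the static baselines mentioned above fit into a single Minkowski-norm estimator, in which the exponent p = 1 gives Grey-World and p = ∞ gives White-Patch. The sketch below is this standard Shades-of-Grey formulation, not part of the BoCF code; the function name and the synthetic scene are assumptions.

```python
import numpy as np

def shades_of_grey(image, p=1.0):
    """Minkowski-norm illuminant estimate; p=1 is Grey-World, p=inf is White-Patch."""
    pixels = image.reshape(-1, 3).astype(float)
    if np.isinf(p):
        est = pixels.max(axis=0)                        # per-channel maximum
    else:
        est = np.mean(pixels ** p, axis=0) ** (1.0 / p)
    return est / np.linalg.norm(est)                    # keep only the direction

rng = np.random.default_rng(0)
scene = rng.random((32, 32, 3))            # synthetic reflectances, grey on average
light = np.array([0.8, 1.0, 0.6])          # hypothetical illuminant
image = scene * light

print(shades_of_grey(image, p=1))          # Grey-World estimate
print(shades_of_grey(image, p=np.inf))     # White-Patch estimate
print(light / np.linalg.norm(light))       # normalised ground truth
```

Both estimators use fixed pooling rules (mean or max) with fixed weights, whereas BoCF learns both the features to pool and the weights that map the pooled histogram to an illuminant.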
It is interesting to note that the attention variants in our approach can be tightly linked to the confidence maps in FC4 [6]. In FC4, confidence scores are assigned to each patch of the image and the final estimate is generated as a weighted sum of the local estimates, weighted by their corresponding scores. This way, the network learns to select which features contribute to the estimation and which parts should be discarded. Similarly, the attention mechanisms learn to dynamically pay attention to the parts encoding the illumination information and to discard the rest.
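The confidence-weighted pooling described above amounts to a weighted sum of local estimates followed by normalization. The toy example below illustrates the idea with hand-picked numbers; it is a sketch of the general principle, not the FC4 implementation, and all names and values are assumptions.

```python
import numpy as np

def confidence_weighted_estimate(local_estimates, confidences):
    """Global illuminant as the confidence-weighted sum of local estimates."""
    pooled = (confidences[:, None] * local_estimates).sum(axis=0)
    return pooled / np.linalg.norm(pooled)

local = np.array([[0.60, 0.60, 0.50],
                  [0.90, 0.30, 0.30],     # an unreliable local estimate
                  [0.58, 0.62, 0.52]])
conf = np.array([1.0, 0.0, 1.0])          # zero confidence discards the second patch

result = confidence_weighted_estimate(local, conf)
print(result)
```

The histogram attention plays a comparable role, except that the weights are applied to histogram bins rather than to spatial patches.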
7 Conclusion
In this paper, we proposed a novel color constancy method called BoCF, which is composed of three blocks. In the first block, called feature extraction, we employ convolutional layers to extract relevant features from the input image. In the second block, we apply Bag-of-Features Pooling to learn a codebook and output a histogram. The latter is fed into the last block, the estimation block, where the final illumination is estimated. This end-to-end model is evaluated and compared with prior works over three datasets: ColorChecker RECommended, NUS-8, and INTEL-TUT2. BoCF was able to achieve competitive results compared to state-of-the-art methods while reducing the number of parameters by up to 95%. In this paper, we also discussed links between the proposed method and statistics-based methods, and we showed how the proposed approach can be interpreted as a supervised extension of these approaches and can act as a generic framework for expressing existing approaches as well as developing new powerful methods.
In addition, we proposed combining the Bag-of-Features Pooling with two novel attention mechanisms. In the first variant, we apply attention over the nonlinear transform of the image after the feature extraction block. In the second variant, we apply attention over the histogram representation of the Bag-of-Features Pooling. These extensions are shown to improve the overall performance of our model.
In future work, extensions of the proposed approach could include exploring regularization techniques to ensure diversity in the learned dictionary and improve the generalization capability of the model.
Acknowledgments
This work was supported by the NSF-Business Finland Center for Visual and Decision Informatics (CVDI) project AMALIA, Dno 3333/31/2018, sponsored by Intel Finland.
References
- [1] M. Ebner, Color Constancy, 1st ed. Wiley Publishing, 2007.
- [2] J. J. M. Granzier, E. Brenner, and J. B. J. Smeets, “Can illumination estimates provide the basis for color constancy?” Journal of Vision, pp. 18–18, 2009.
- [3] K. Barnard, “Practical colour constancy,” Ph.D. dissertation, Burnaby, BC, Canada, 1999.
- [4] A. Gijsenij, T. Gevers, and J. van de Weijer, “Computational color constancy: Survey and experiments,” IEEE Transactions on Image Processing, pp. 2475–2489, 2011.
- [5] S. Bianco, C. Cusano, and R. Schettini, “Color constancy using CNNs,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 81–89.
- [6] Y. Hu, B. Wang, and S. Lin, “FC4: Fully convolutional color constancy with confidence-weighted pooling,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4085–4094.
- [7] W. Shi, C. C. Loy, and X. Tang, “Deep specialized network for illuminant estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 371–387.
- [8] Y. Hold-Geoffroy, K. Sunkavalli, S. Hadap, E. Gambaretto, and J.-F. Lalonde, “Deep outdoor illumination estimation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2373–2382.
