DeepIlluminance: Contextual Illuminance Estimation via Deep Neural   Networks

Jun Zhang; Tong Zheng; Shengping Zhang; Meng Wang

arXiv:1905.04791·cs.CV·July 12, 2019

TL;DR

DeepIlluminance introduces a deep neural network with contextual and refinement modules for more accurate scene illumination estimation, addressing local ambiguity issues and achieving competitive results on benchmark datasets.

Contribution

The paper presents a novel contextual deep network with a stage-wise training strategy for improved illuminant estimation in color constancy tasks.

Findings

01

Achieves competitive performance on illuminant estimation benchmarks.

02

Utilizes a center-surround architecture for local contextual feature extraction.

03

Employs a refinement network to enhance initial estimates.

Abstract

Computational color constancy refers to the estimation of the scene illumination and makes the perceived color relatively stable under varying illumination. In the past few years, deep Convolutional Neural Networks (CNNs) have delivered superior performance in illuminant estimation. Several representative methods formulate it as a multi-label prediction problem by learning the local appearance of image patches using CNNs. However, these approaches inevitably make incorrect estimations for the ambiguous patches affected by their neighborhood contexts. Inaccurate local estimates are likely to bring in degraded performance when combining into a global prediction. To address the above issues, we propose a contextual deep network for patch-based illuminant estimation equipped with refinement. First, the contextual net with a center-surround architecture extracts local contextual features…

Tables

Table 1. TABLE I: Performance comparison of our patch sampling method by selecting the bright and dark pixels against random sampling

Method	Mean	Med	Tri	Best-25%	Worst-25%	95th Pct
Ours	1.94	1.29	1.47	0.37	4.56	5.74
RS	2.24	1.57	1.68	0.52	5.25	6.82

Table 2. TABLE II: Performance comparison of different model variations on the reprocessed Color Checker dataset

Method	Mean	Med	Tri	Best-25%	Worst-25%	95th Pct
Central	2.16	1.52	1.66	0.44	5.17	6.43
Context (2-channel)	2.21	1.63	1.81	0.51	5.21	7.02
Context (siamese)	2.37	1.68	1.84	0.65	5.35	6.99
Context (pseudo-siamese)	1.99	1.41	1.52	0.40	4.59	5.82
Context (ours)	1.94	1.29	1.47	0.37	4.56	5.74

Table 3. TABLE III: Performance comparison of different model variations on the reprocessed Color Checker dataset

Method	Mean	Med	Tri	Best-25%	Worst-25%	95th Pct
Context	1.94	1.29	1.47	0.37	4.56	5.74
Refine ($\mathbf{e_{2}}$)	1.85	1.14	1.37	0.45	4.52	5.60
Refine ($\mathbf{e_{2}}+\mathbf{e_{3}}$)	1.82	1.11	1.29	0.35	4.36	5.44

Table 4. TABLE IV: Performance comparison of different variations of the training scheme on the reprocessed Color Checker dataset

Method	Mean	Med	Tri	Best-25%	Worst-25%	95th Pct
Scheme (1)	2.31	1.66	1.82	0.61	5.23	6.94
Scheme (2)	1.94	1.29	1.47	0.37	4.56	5.74
Scheme (3)	1.83	1.12	1.29	0.37	4.39	5.45
Scheme (4)	1.82	1.11	1.29	0.35	4.36	5.44

Table 5. TABLE V: Performance comparison on the reprocessed Color Checker dataset. The best three results are shown in red, green, and blue, respectively

Method	Mean	Med	Tri	Best-25%	Worst-25%	95th Pct
White-Patch [33]	7.55	5.68	6.35	1.45	16.12	-
Edge-based Gamut [34]	6.52	5.04	5.43	1.90	13.58	-
Gray-World [35]	6.36	6.28	6.28	2.33	10.58	11.3
1st-order Gray-Edge [36]	5.33	4.52	4.73	1.86	10.03	11.0
2nd-order Gray-Edge [36]	5.13	4.44	4.62	2.11	9.26	-
Shades-of-Gray [37]	4.93	4.01	4.23	1.14	10.20	11.9
Bayesian [8]	4.82	3.46	3.88	1.26	10.49	-
General Gray-World [3]	4.66	3.48	3.81	1.00	10.09	-
Intersection-based Gamut [34]	4.20	2.39	2.93	0.51	10.70	-
Pixel-Based Gamut [34]	4.20	2.33	2.91	0.50	10.72	14.1
Natural Image Statistics [38]	4.19	3.13	3.45	1.00	9.22	11.7
Bright Pixel [39]	3.98	2.61	-	-	-	-
Spatio-spectral (GenPrior) [40]	3.59	2.96	3.10	0.95	7.61	-
Cheng et al. [18]	3.52	2.14	2.47	0.50	8.74	-
Corrected-Moment (19 Color) [41]	3.50	2.60	-	-	-	8.60
Exemplar-based [9]	3.10	2.30	-	-	-	-
Corrected-Moment (19 Edge) [41]	2.80	2.00	-	-	-	6.90
Regression Tree [42]	2.42	1.65	1.75	0.38	5.87	-
NetColorChecker [16]	3.10	2.30	-	-	-	-
CNN [10]	2.36	1.98	-	-	-	-
Oh & Kim [27]	2.16	1.47	1.61	0.37	5.12	-
CCC (dist+ext) [15]	1.95	1.22	1.38	0.35	4.76	5.85
DS-Net (HypNet+SelNet) [13]	1.90	1.12	1.33	0.31	4.84	5.99
AlexNet-FC4 [11]	1.77	1.11	1.29	0.34	4.29	5.44
FFCC-thumb [12]	2.01	1.13	1.38	0.30	5.14	-
Ours	1.82	1.11	1.29	0.35	4.36	5.44

Table 6. TABLE VI: Performance comparison on the NUS 8-camera dataset. The best three results are shown in red, green, and blue, respectively

Method	Mean	Med	Tri	Best-25%	Worst-25%	Geomean
White-Patch [33]	10.62	10.58	10.49	1.86	19.45	8.43
Edge-based Gamut [34]	8.43	7.05	7.37	2.41	16.08	7.01
Pixel-based Gamut [34]	7.70	6.71	6.90	2.51	14.05	6.60
Intersection-based Gamut [34]	7.20	5.96	6.28	2.20	13.61	6.05
Gray-World [35]	4.14	3.20	3.39	0.90	9.00	3.25
Bayesian [8]	3.67	2.73	2.91	0.82	8.21	2.88
Natural Image Statistics [38]	3.71	2.60	2.84	0.79	8.47	2.83
Shades-of-Gray [37]	3.40	2.57	2.73	0.77	7.41	2.67
Spatio-spectral (ML) [40]	3.11	2.49	2.60	0.82	6.59	2.55
General Gray-World [3]	3.21	2.38	2.53	0.71	7.10	2.49
2nd-order Gray-Edge [36]	3.20	2.26	2.44	0.75	7.27	2.49
Bright Pixel [39]	3.17	2.41	2.55	0.69	7.02	2.48
1st-order Gray-Edge [36]	3.20	2.22	2.43	0.72	7.36	2.46
Spatio-spectral (GenPrior) [40]	2.96	2.33	2.47	0.80	6.18	2.43
Cheng et al. [18]	2.92	2.04	2.24	0.62	6.61	2.23
CCC (dist+ext) [15]	2.38	1.48	1.69	0.45	5.85	1.74
Oh & Kim [27]	2.36	2.09	-	-	4.16	-
Regression Tree [42]	2.36	1.59	1.74	0.49	5.54	1.78
DS-Net (HypNet+SelNet) [13]	2.24	1.46	1.68	0.48	6.08	1.74
AlexNet-FC4 [11]	2.12	1.53	1.67	0.48	4.78	1.66
FFCC-thumb [12]	2.06	1.39	1.53	0.39	4.80	-
Ours	2.17	1.50	1.67	0.47	5.16	1.51

Equations

$\mathbf{e_{1}}$

$\mathbf{P_{1}}$

$\mathbf{P_{2}}=\mathcal{M}\left(\mathbf{e_{2}}\circ\mathcal{R}\left(\mathbf{F};\mathbf{W_{3}}\right),\mathbf{P_{c}}\right)$

$\frac{1}{N}\sum_{i=1}^{N}\min\left(\left\|\mathbf{e}_{i}-\mathbf{e}_{i}^{\ast}\right\|_{2}^{2}\right)$

$\epsilon=\arccos\left(\frac{e\cdot e^{\ast}}{\left\|e\right\|\cdot\left\|e^{\ast}\right\|}\right)$

Code & Models

Repositories

pencilzhang/DeepIlluminance-computational-color-constancy
caffe2Official

Taxonomy

Topics: Color Science and Applications · Image Enhancement Techniques · Visual perception and processing mechanisms

Full text

DeepIlluminance: Contextual Illuminance Estimation via Deep Neural Networks

Jun Zhang, Tong Zheng, Shengping Zhang, and Meng Wang

J. Zhang, T. Zheng, and M. Wang are with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui, 230601 China. S. Zhang is with the School of Computer Science and Technology, Harbin Institute of Technology, Weihai, Shandong, 264209 China. Corresponding author: Jun Zhang (e-mail: [email protected])

Abstract

Computational color constancy refers to the estimation of the scene illumination and makes the perceived color relatively stable under varying illumination. In the past few years, deep Convolutional Neural Networks (CNNs) have delivered superior performance in illuminant estimation. Several representative methods formulate it as a multi-label prediction problem by learning the local appearance of image patches using CNNs. However, these approaches inevitably make incorrect estimations for ambiguous patches affected by their neighborhood contexts. Inaccurate local estimates are likely to degrade performance when combined into a global prediction. To address these issues, we propose a contextual deep network for patch-based illuminant estimation equipped with refinement. First, the contextual net with a center-surround architecture extracts local contextual features from image patches, and generates initial illuminant estimates and the corresponding color-corrected patches. The patches are sampled based on the observation that pixels with large color differences describe the illumination well. Then, the refinement net integrates the input patches with the corrected patches and uses intermediate features to improve the performance. To train such a network with numerous parameters, we propose a stage-wise training strategy, in which the features and the predicted illuminant from previous stages are provided to the next learning stage so that progressively finer estimates are recovered. Experiments show that our approach obtains competitive performance on two illuminant estimation benchmarks.

Index Terms:

Illuminant estimation, color constancy, local context, refinement, deep convolutional neural networks.

I Introduction

Computational color constancy [1] aims to estimate the unknown color of the illuminating light source given an RGB image, and then to correct the chromaticity of the light source using the illuminant estimate. It is an inherently ambiguous and technically ill-posed problem because both the spectral distribution of the illuminant and the scene reflectance are unknown. Nevertheless, it has attracted increasing interest in the vision community, since various high-level visual understanding tasks, such as material recognition [2], require discounting the illuminant to obtain the “true color” or reflectance of objects.

Early approaches are derived from image statistics or physical models that make a variety of assumptions about the image, such as gray-world [3], white-patch [4], and Lambertian surface [5]. These methods also assume that the illuminant in the image is spatially uniform. These assumptions are coarse approximations to the real-world cases and limit the performance.

To improve on these methods, another line of research learns a discriminative objective function based on hand-crafted features to estimate the scene illumination directly with machine learning techniques, such as neural networks [6], support vector regression [7], Bayesian estimation [8], and exemplar learning [9]. However, these methods fail on scenes where the colors of objects are inherently similar to those of the light sources.

Very recently, Convolutional Neural Networks (CNNs) have been employed to learn the relationship between the pixels and the chromaticity of the light source. These methods outperform previous ones by a wide margin [10, 11, 12, 13, 14, 15, 16] and can be roughly grouped into global approaches and local approaches. The state-of-the-art results [11, 12] are obtained by considering semantic information at a global level and exploiting different model components on different datasets. In addition, what is often overlooked among local approaches is that color constancy is achieved by taking into account contextual information between the image patch and its surrounding illumination [17]. Since small patches can be greatly affected by variation in their surroundings, local estimates are difficult to compute reliably, and the global prediction built from them degrades accordingly.

To address the aforementioned issues, we propose a novel patch-based deep network with local contextual information for illuminant estimation and refinement. First, rather than using random or uniform sampling strategies [14, 10, 11], we sample image patches based on bright and dark pixels selected from the color image by calculating and ranking the projection distances of all color pixels to the mean vector in the RGB domain. The rationale behind our method is that pixels with large color differences describe the illumination direction well [18]. The sampled patches are taken as inputs to our network, which comprises two cooperative sub-networks based on the VGG-16 architecture [19]: a feedforward contextual net and a refinement net, as shown in Figure 1.

The contextual net exploits both local features and neighbor contextual features to generate an initial illuminant estimate and correct the input patch via the diagonal transform [20]. The refinement net efficiently learns features over the joint input-output space by stacking the corrected patch and the original patch, and uses the intermediate features encoded in itself with skip connections to generate a finer illuminant estimation. Our approach allows for reevaluation of the illuminant color and features across the sampled patches. It is similar in spirit to some structured prediction methods [21, 22], which have made successive predictions with intermediate supervision to refine predictions.

Inspired by the conclusion [23] that the stage-wise training can avoid gradient diffusion and overfitting for deep networks by decoupling the feature extraction layers from the classification layer across successive stages, we propose a stage-wise training strategy by breaking down the entire network into two related sub-tasks, in which the predicted illuminant and the intermediate features are passed through the networks stage-by-stage. Finally, to obtain a global illuminant estimate, we use median pooling on all the local estimates. We show experimentally that the proposed framework achieves competitive performance over several state-of-the-art methods on two illuminant estimation benchmarks.
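The median pooling step mentioned above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `median_pool`, the channel-wise median, and the unit-length normalization before and after pooling are all assumptions made for the example.

```python
import numpy as np

def median_pool(local_estimates):
    """Combine per-patch RGB illuminant estimates into one global estimate.

    Assumption: the median is taken independently per channel.
    """
    est = np.asarray(local_estimates, dtype=np.float64)
    # Assumption: normalize each estimate so only the color direction matters.
    est = est / np.linalg.norm(est, axis=1, keepdims=True)
    pooled = np.median(est, axis=0)          # channel-wise median over patches
    return pooled / np.linalg.norm(pooled)   # return a unit-length illuminant

# Two consistent local estimates and one outlier (hypothetical values).
local = [[0.60, 0.70, 0.40], [0.62, 0.68, 0.39], [0.10, 0.20, 0.97]]
global_estimate = median_pool(local)
```

Compared with averaging, the median keeps a single badly estimated patch from pulling the global estimate toward the outlier.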

The main contributions of this work are listed as follows.

•

We propose a novel contextual deep network using a center-surround architecture with a refinement mechanism for illuminant estimation, which captures local contextual features for initial estimation and reevaluates the features in the joint input-output space used in conjunction with intermediate supervision for finer estimation. The proposed approach can be viewed as the first piece of work that shows how contextual information and successive refinement are critical to improve illuminant estimation, even without the use of semantic information [12, 11].

•

We sample image patches by selecting bright and dark pixels with large color differences in the RGB space. To the best of our knowledge, this is the first work to sample patches directly from the color domain for illuminant estimation.

•

We propose a stage-wise training strategy to leverage the initial estimation and intermediate supervision from the illuminant color, which serves to increase efficiency and reduce memory usage of our network while improving precision of illuminant estimation.

We review related work in Section II and present our approach in Section III, focusing on our contextual network architecture, refinement, and our training procedure. We present experiments and results in Section IV. In Section V discussion and further perspectives of this work are presented.

II Related work

Prior to the deep learning revolution, color constancy algorithms mainly relied on different assumptions and handcrafted features [24, 25]. In this section, we restrict ourselves mostly to recent methods that exploit CNNs. These methods can be roughly grouped as local approaches and global approaches.

II-A Local approaches

An early attempt [14] using CNNs is to extract conv features of non-overlapping patches based on AlexNet [26], and pass them to a support vector regression to estimate the illuminant color. To deal with non-uniform illumination, a multiple illuminant detector [10] using a kernel density estimator is proposed to determine if the image contains single or multiple illuminants. To handle the ambiguities of unknown reflections and local patch appearances, a deep specialized network [13] is presented where a hypotheses network generates two hypotheses of illuminants for a UV patch and a selection network adaptively picks the confident estimations from these hypotheses. Another noteworthy work [27] clusters the illuminants and then feeds them into a CNN with the new illumination labels. The final illuminant color is estimated by computing the weighted average of the cluster centers. These methods conduct the prediction in a large and diverse hypothesis space given limited training samples, making the results still unsatisfactory. More recently, an end-to-end fully convolutional network [11] is proposed to produce local estimates followed by a confidence weight pooling to generate the global prediction. This method implicitly takes advantage of human faces as high-confidence regions and achieves impressive performance based on two backbone models. In this work, we propose a CNN-based framework to take the contextual information into consideration and refine local estimates of image patches in the joint input-output space.

II-B Global approaches

Global illumination estimation based on whole images has been addressed by [16, 15, 12]. In the work of [16], three stacked CNNs are trained sequentially to obtain hierarchical features of the full image for illuminant estimation. Barron [15] and Barron & Tsai [12] formulated the task as a 2D spatial localization problem by learning conv filters in the log-chroma plane. These methods consider semantic information at a global level and predict the illuminant color at the cost of losing local details. In contrast, our work estimates the illuminant colors at a patch level by sampling semantically valuable local regions.

III Network architecture

We begin by describing the contextual net for initial estimation in Section III-A, followed by a detailed description of our refinement net in Section III-B. Finally, we present the stage-wise training method in Section III-C.

III-A Contextual network for initial estimation

We choose the VGG-16 network [19] as our backbone model, which is pre-trained on the ImageNet dataset [28] for object recognition. We replace the last layer of $1000$ units that predicts the ImageNet classes with a layer containing $3$ units, encoding the continuous illuminant color for RGB channels. In principle, the backbone model can be replaced by other advanced shallow or deep networks in our system.

As illustrated in Figure 1(a), the contextual net consists of two separate streams (central and surround), one fusion layer, and one decision net. The central stream takes an image patch $\mathbf{P_{c}}$ comprising selected bright and dark pixels as input and extracts its local features. The surround stream takes the surrounding neighborhood $\mathbf{P_{s}}$ of the central patch and extracts global features to provide larger contextual information. The kernel weights of the two streams are not shared. The fusion layer combines the last conv features from the two streams via element-wise summation and passes the result to the top decision net, which consists of three fully connected layers separated by ReLU layers and produces the initial local estimate $\mathbf{e_{1}}$. Finally, the initial estimate is used to generate the color-corrected patch $\mathbf{P_{1}}$ by applying the diagonal transform [20] to the original patch $\mathbf{P_{c}}$. The contextual net can be described mathematically by the following equations:

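$$\mathbf{e_{1}}=\mathcal{R}\left(\mathcal{F}\left(\mathbf{P_{c}};\mathbf{W_{c}}\right)+\mathcal{F}\left(\mathbf{P_{s}};\mathbf{W_{s}}\right);\mathbf{W_{1}}\right),$$

$$\mathbf{P_{1}}=\mathcal{M}\left(\mathbf{e_{1}},\mathbf{P_{c}}\right),$$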

where $\mathcal{F(\cdot)}$ denotes the output feature maps generated by the conv blocks of the two streams with weights $\mathbf{W_{c}}$ and $\mathbf{W_{s}}$ , $\mathcal{R(\cdot)}$ denotes illuminant regression by the fc layers with parameter $\mathbf{W_{1}}$ , and $\mathcal{M(\cdot)}$ denotes the diagonal transform.
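To make the role of $\mathcal{M(\cdot)}$ concrete, the sketch below applies a diagonal (von Kries-style) correction to a patch, dividing each channel by the corresponding component of the illuminant estimate. The function name `diagonal_transform`, the unit-length normalization, and the $\sqrt{3}$ scaling (chosen so that a neutral illuminant leaves the patch unchanged) are assumptions for illustration; the paper only states that the diagonal transform [20] is used.

```python
import numpy as np

def diagonal_transform(patch, illuminant, eps=1e-8):
    """Correct an H x W x 3 patch by per-channel division with the illuminant."""
    e = np.asarray(illuminant, dtype=np.float64)
    e = e / (np.linalg.norm(e) + eps)        # keep only the illuminant direction
    # Assumption: scale so that a neutral illuminant (1, 1, 1) gives gains of ~1.
    gains = 1.0 / (np.sqrt(3.0) * e + eps)
    return patch * gains                     # broadcasts over the channel axis

# A patch whose color equals the illuminant color should become achromatic.
patch = np.full((4, 4, 3), [0.8, 0.5, 0.3])
corrected = diagonal_transform(patch, [0.8, 0.5, 0.3])
unchanged = diagonal_transform(patch, [1.0, 1.0, 1.0])
```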

One reason for using such a center-surround architecture is that multi-scale information is known to be important in capturing spatial context in color constancy [17, 29]. Furthermore, by considering the central patch twice (i.e., in both the central stream and the surround stream), we implicitly put more focus on the pixels closer to the center of a patch, which can also improve the precision of illuminant estimation. Note that the proposed contextual network shares a similar structure with the pseudo-siamese network proposed in [30], except that the stream outputs are summed. As verified in experiments (Section IV-B), the proposed architecture improves the illuminant estimation accuracy compared with other variants.

III-B Refinement network for finer prediction

We also adopt the VGG-16 network as our fundamental building block for illuminant refinement. We take the network architecture further by stacking the refinement net and feeding the output of the contextual net as input, which provides the network with a mechanism for successive bottom-up processing in the joint input-output space and allows for the use of the intermediate features encoded in the refinement net.

Figure 1(b) shows a detailed illustration of the refinement net, which concatenates the corrected patch $\mathbf{P_{1}}$ and the original patch $\mathbf{P_{c}}$ along the third dimension via the long skip connection to generate new intermediate feature maps $\mathbf{F}=\mathcal{F}\left(\mathcal{CAT}\left(\mathbf{P_{c}},\mathbf{P_{1}}\right);\mathbf{W_{2}}\right)$ and the corresponding illuminant estimate $\mathbf{e_{2}}$, where $\mathcal{CAT}$ denotes the concatenation operator. Then, we append three extra fully connected layers (denoted as $fc6\_3$, $fc7\_3$, and $fc8\_3$) after the fifth conv block, which allows the intermediate features to be processed again to further predict the illuminant color $\mathbf{e_{3}}$ on the original patch. Finally, the new estimate is combined with $\mathbf{e_{2}}$ via element-wise product to improve the scale of the illuminant values, and even to correct a wrong estimate, yielding the final corrected patch $\mathbf{P_{2}}$:

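$$\mathbf{P_{2}}=\mathcal{M}\left(\mathbf{e_{2}}\circ\mathcal{R}\left(\mathbf{F};\mathbf{W_{3}}\right),\mathbf{P_{c}}\right),$$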

where $\circ$ is an element-wise product.
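The two data-level operations of the refinement net, stacking the patches along the channel axis and combining the two estimates by element-wise product, can be sketched as below. The helper names, the example values, and the final renormalization to unit length are assumptions for illustration; the network layers that actually produce $\mathbf{e_{2}}$ and $\mathbf{e_{3}}$ are not modeled here.

```python
import numpy as np

def stack_patches(p_c, p_1):
    """Concatenate original and corrected patches along the channel axis (H x W x 6)."""
    return np.concatenate([p_c, p_1], axis=2)

def combine_estimates(e_2, e_3):
    """Element-wise product of the two refinement estimates.

    Assumption: the result is renormalized to unit length for comparison.
    """
    e = np.asarray(e_2, dtype=np.float64) * np.asarray(e_3, dtype=np.float64)
    return e / np.linalg.norm(e)

rng = np.random.default_rng(0)
p_c = rng.random((224, 224, 3))               # stand-in for the original patch
p_1 = p_c / np.array([1.2, 1.0, 0.8])         # stand-in for the corrected patch
x = stack_patches(p_c, p_1)                   # 6-channel input to the refinement net
e_final = combine_estimates([0.60, 0.65, 0.46], [1.02, 1.00, 0.97])
```

A 6-channel input is consistent with Stage 3 of the training procedure, where the first conv layer of the refinement net is re-initialized rather than copied from VGG-16.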

Our method bears some similarity to structured prediction methods [21, 22] in the broad sense that successive predictions are fed back together with the input, but our work is tailored to illuminant estimation and the network is different. Such refinement implicitly encourages consistency of the output with the input by using the corrected patch as an illuminant prior; it also provides richer intermediate supervision and helps learn stage-specific refinements for predicting the illuminant color. Note that the fc weights are not shared across networks, and three losses are applied to the predictions separately using the same ground truth. The training procedure is described below.

III-C Training and implementation

The entire network is trained using ground-truth RGB labels, i.e. to perform illuminant regression. We use a Euclidean loss with the following learning objective function:

$$\frac{1}{N}\sum_{i=1}^{N}\min\left(\left\|\mathbf{e}_{i}-\mathbf{e}_{i}^{\ast}\right\|_{2}^{2}\right),$$

where $N$ is the number of training samples in a batch, $\mathbf{e}_{i}$ is the illuminant estimate for the $i$ -th patch, and $\mathbf{e}_{i}^{\ast}$ is the corresponding ground-truth illuminant color.
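As a small numerical illustration of this objective, the sketch below computes the mean squared Euclidean distance between predicted and ground-truth illuminants over a batch. The function name and the example values are hypothetical. Note that Caffe's built-in Euclidean loss layer includes an additional factor of $1/2$, which changes the scale of the loss but not its minimizer.

```python
import numpy as np

def euclidean_loss(pred, target):
    """Mean over the batch of the squared L2 distance between RGB illuminants."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    sq_dist = np.sum((pred - target) ** 2, axis=1)   # one value per patch
    return float(np.mean(sq_dist))

# Hypothetical batch of two patches.
pred = [[0.5, 0.6, 0.4], [0.3, 0.7, 0.5]]
gt = [[0.5, 0.5, 0.4], [0.3, 0.7, 0.3]]
loss = euclidean_loss(pred, gt)   # (0.01 + 0.04) / 2
```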

Our approach reduces illuminant estimation to a sequence of predictions. We propose a stage-wise training strategy shown in Figure 2, in which each stage is trained separately so that the features and the predicted illuminant at the early stages are provided to the next learning stage for finer estimation.

Stage-wise training has been studied previously [23] and has been shown to substantially improve performance in pose estimation [31].

•

Stage 1: the weights of the $fc8\_1$ layer are initialized with zero-mean Gaussians. Other layers are initialized with the pre-trained VGG-16 network [19]. The central and the surround streams are each fine-tuned independently to estimate the illuminant.

•

Stage 2: we only train the weights in the fc layers on top of the combination of output features $\mathbf{F_{c}}$ and $\mathbf{F_{s}}$ to generate an initial illuminant estimate $\mathbf{e_{1}}$ and its corrected patch $\mathbf{P_{1}}$ . The weights in the conv layers are transferred from the preceding stage and fixed at this stage.

•

Stage 3: the conv1-1 layer and the fc layers of the refinement net are initialized with zero-mean Gaussians. Other convolutional parameters are assigned with the VGG-16 weights. We train this stage with the contextual net fixed and predict the illuminant $\mathbf{e_{2}}$ by stacking the original patch $\mathbf{P_{c}}$ and its corrected patch $\mathbf{P_{1}}$ .

•

Stage 4: at the second refinement stage, we train the output layers (i.e., $fc6\_3$, $fc7\_3$, and $fc8\_3$) based on the intermediate features $\mathbf{F}$ obtained from Stage 3 to produce a new estimate $\mathbf{e_{3}}$, keeping the weights of all other layers learned in the preceding stages fixed.

We also tried to train the whole network in an end-to-end manner. However, it is hard to train such a deep network with the limited data available in the benchmark datasets. As reported in the experiments, we are able to achieve good solutions in a reasonable computation time using the stage-wise training strategy.

Implementation details: We implement the network in the Caffe framework [32] on a single NVIDIA Titan XP GPU. Standard stochastic gradient descent (SGD) is employed for optimization, with the initial learning rate, momentum, and weight decay set to $0.001$, $0.9$, and $0.0005$, respectively. We set the batch size to $23$ and the maximum number of iterations to $160K$, and decay the learning rate by a factor of $0.1$ every $50K$ iterations. The proposed network takes only $0.16$ s to estimate the illuminant color of a whole image. The source code is released at https://github.com/pencilzhang/DeepIlluminance-computational-color-constancy.git.
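The learning-rate schedule described above corresponds to a step decay. A minimal sketch, assuming the decay is applied at exact multiples of $50K$ iterations (as in Caffe's `step` policy, where the rate is `base_lr * gamma ^ floor(iter / stepsize)`); the function name `step_lr` is hypothetical:

```python
def step_lr(iteration, base_lr=0.001, gamma=0.1, step_size=50_000):
    """Learning rate after `iteration` steps under step decay."""
    return base_lr * gamma ** (iteration // step_size)

# Learning rate at a few points of the 160K-iteration run.
schedule = {it: step_lr(it) for it in (0, 49_999, 50_000, 100_000, 150_000)}
```

Under these settings the rate is reduced three times before training stops at $160K$ iterations.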

IV Experimental results

In this section, we experimentally evaluate the proposed approach, and compare with state-of-the-art methods on single illuminant estimation.

IV-A Setup

IV-A1 Datasets and preprocessing

We use the reprocessed Color Checker Dataset [8] and the NUS 8-camera dataset [18] for benchmarking. The reprocessed Color Checker Dataset contains $568$ raw images. The NUS 8-camera dataset consists of $8$ subsets captured by $8$ different cameras, where each subset contains $210$ images. For both datasets, the Macbeth Color Checker chart is placed in each image to estimate the ground-truth illuminant color and masked out for both training and testing. Following previous work [1], three-fold cross validation is used to evaluate the network on both datasets. Since the VGG-16 network is pre-trained on the ImageNet dataset, where images are gamma-corrected for display, we also apply a gamma correction of $\gamma=1/2.2$ on linear RGB images.
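The gamma correction step can be sketched as follows, assuming the linear RGB image has already been normalized to the range $[0, 1]$ (the normalization and the function name are assumptions, not details given in the paper):

```python
import numpy as np

def gamma_correct(linear_rgb, gamma=1 / 2.2):
    """Apply display gamma to a linear RGB image with values in [0, 1]."""
    x = np.clip(np.asarray(linear_rgb, dtype=np.float64), 0.0, 1.0)
    return x ** gamma

# Mid-gray in linear space becomes noticeably brighter after correction.
out = gamma_correct([0.0, 0.5, 1.0])
```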

IV-A2 Patch sampling

We project all the pixels of an image onto the mean vector and then rank the projection distances according to the method in [18]. The pixels ranking in the top $d\%$ by distance are selected as bright pixels, while the bottom $d\%$ are selected as dark pixels. Then, we randomly sample $M$ central patches of size $224\times 224$ that contain both bright and dark pixels, together with their surrounding patches, which are $2$ times the size of the central patches, as inputs to the network. We found that neither smaller nor larger patch sizes gave better accuracy or higher coverage. In this work, we set $M=15$ and try $d\in\left\{3.5,5,10\right\}$ in ascending order until the number of sampled patches meets the quantity requirement. By taking patches as inputs, we have a much larger number of training samples to train the network.
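The pixel selection step can be sketched as below. This is an illustration under stated assumptions, not the released code: the "projection distance" is interpreted as the scalar projection of each RGB pixel onto the direction of the mean RGB vector, thresholds are taken from percentiles, and the function name `select_bright_dark` is hypothetical. Sampling the $224\times 224$ patches around the selected pixels is omitted.

```python
import numpy as np

def select_bright_dark(image, d=3.5):
    """Return boolean H x W masks of the top-d% (bright) and bottom-d% (dark) pixels."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    mean_vec = pixels.mean(axis=0)
    mean_dir = mean_vec / np.linalg.norm(mean_vec)
    # Assumption: rank pixels by their scalar projection onto the mean direction.
    proj = pixels @ mean_dir
    lo, hi = np.percentile(proj, [d, 100.0 - d])
    shape = image.shape[:2]
    bright = (proj >= hi).reshape(shape)
    dark = (proj <= lo).reshape(shape)
    return bright, dark

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                 # stand-in for a gamma-corrected image
bright, dark = select_bright_dark(img, d=3.5)
```

If too few valid patches are found at $d=3.5$, the same function could be called again with $d=5$ and then $d=10$, following the ascending order described above.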

IV-A3 Metrics

We adopt the angular error $\epsilon$ between the estimated illumination $e$ and the ground truth $e^{\ast}$ as the performance measure:

$$\epsilon=\arccos\left(\frac{e\cdot e^{\ast}}{\left\|e\right\|\cdot\left\|e^{\ast}\right\|}\right).$$

We report the mean, median, tri-mean, and the means of the lowest-error $25\%$ and highest-error $25\%$ of the angular errors as evaluation metrics. In addition, we report the $95$th percentile for the reprocessed Color Checker dataset. For the NUS 8-camera dataset, we run $8$ different experiments, one on each camera subset, and additionally evaluate with the geometric mean (Geomean).
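The metric and its summary statistics can be sketched as follows. The sketch assumes errors are reported in degrees, as is conventional for these benchmarks, and uses Tukey's trimean $(Q_1+2Q_2+Q_3)/4$; the function names and the rounding used to pick the best and worst quarter are assumptions.

```python
import numpy as np

def angular_error_deg(e, e_gt):
    """Angle in degrees between estimated and ground-truth illuminants (row-wise)."""
    e = np.asarray(e, dtype=np.float64)
    e_gt = np.asarray(e_gt, dtype=np.float64)
    cos = np.sum(e * e_gt, axis=-1) / (
        np.linalg.norm(e, axis=-1) * np.linalg.norm(e_gt, axis=-1)
    )
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def summarize(errors):
    """Mean, median, trimean, best/worst 25% means, and 95th percentile."""
    err = np.sort(np.asarray(errors, dtype=np.float64))
    q1, q2, q3 = np.percentile(err, [25, 50, 75])
    k = max(1, int(round(0.25 * err.size)))   # size of the best/worst quarter
    return {
        "mean": err.mean(),
        "median": q2,
        "trimean": (q1 + 2 * q2 + q3) / 4,
        "best25": err[:k].mean(),
        "worst25": err[-k:].mean(),
        "pct95": np.percentile(err, 95),
    }

est = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
gt = [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
errs = angular_error_deg(est, gt)
stats = summarize(errs)
```

Because the error depends only on the angle between the two vectors, it is invariant to the overall brightness of either the estimate or the ground truth.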

IV-B Ablation studies

We analyze the contribution of the model components on the reprocessed Color Checker Dataset and summarize the results in the following section.

IV-B1 Influence of patch sampling

We compare the illuminant estimation results of our patch sampling method and the random sampling method. Both methods take the same size and number of patches as inputs. Since the refinement net has a substantial effect on the performance of local estimation, we only report results of the contextual net in this experiment. Table I summarizes the quantitative results, which show our method achieves considerable improvements over the random sampling method (‘RS’) in terms of all metrics.

Figure 3 shows two examples with sampled patches obtained by the two methods and their corresponding angular errors.

We can observe that the sampled patches based on our method can be estimated more accurately, and are more coherent with the actual brightness contrast pattern in the image, showing large color differences between pixels.

IV-B2 Effect of the center-surround architecture

To verify the necessity of our center-surround architecture in the contextual network, we compare five design choices in this experiment, including single central network, 2-channel (stacking the central and surrounding patches as a 2-channel image), Siamese, Pseudo-siamese (the same architecture with Siamese but with unshared weights), and our proposed contextual architecture. Results are reported in Table II.

From the results, we can see that the architectures with unshared weights (i.e., pseudo-siamese and ours) give a clear improvement over the other models (mean errors of $1.99$ and $1.94$ versus $2.16$ to $2.37$), and our model performs best on every metric.

To further understand what information is better preserved by the contextual representation, we compare the fused feature maps with the pooled conv5 feature maps of the single central stream, as shown in Figure 4.

We observe that the central stream alone is influenced by larger pixel intensity in the patches while the proposed contextual network can co-adjust the features of the central and surrounding patches.

IV-B3 Effect of the refinement

To better understand how our refinement components benefit the performance, we perform a detailed comparison of their performance. Table III shows the performance of several versions of refinement.

With the corrected patch stacked with the original patch, the performance of the subsequent refinement with $\mathbf{e_{2}}$ increases over the contextual network, which indicates that encompassing both the input and the initial estimate can expand the expressive power of hierarchical features over the joint space. We further see that the intermediate supervision in the further refinement with $\mathbf{e_{3}}$ does offer an improvement to the final illuminant estimation performance.

It is also interesting to observe the patches corrected early and refined stage-by-stage by the network. Two representative examples are visualized in Figure 5.

It is observed that the refinement net progressively improves the local estimates.

IV-B4 Comparison of training schemes

To further understand why the stage-wise training can achieve improvement, we explore different variants of training over the networks: (1) training the contextual net using the VGG-16 pre-trained weights, (2) training the fc layers of the contextual net on the basis of the off-the-shelf features obtained from the central and surround streams that are trained independently first, (3) training the refinement net in one stage, and (4) training the refinement net utilizing a stage-wise scheme.

The quantitative results are shown in Table IV.

It can be seen that directly fine-tuning the contextual net initialized with the VGG-16 pre-trained model (scheme 1) decreases the performance compared to training the fc layers on top of the off-the-shelf conv5 features (scheme 2). One explanation is that joint training does not fully co-adapt the two streams. Stage-wise training benefits from treating the conv layers and the fc layers separately, and gradient diffusion can be addressed by preventing the complex co-adaptation of the feature extraction layers with the decision layer, as demonstrated by [23]. We also observe that scheme (4) outperforms all other training schemes. This shows that stage-wise refinement is indeed crucial for gaining better performance. In other words, at the early stages of training the network learns to perform the prediction by extracting coarse properties of the scene illuminant. During the following stages, finer information is gradually provided to the network, and the intermediate features learned in the previous stage are re-used to perform better predictions.

IV-C Comparison with the state of the art

Our approach is compared with previous state-of-the-art methods on the reprocessed Color Checker dataset [8] and the NUS 8-camera dataset [18]. Most results of previous methods are taken directly from [11, 12, 13]; the remaining results are taken from the corresponding published works. Among these methods, the recent FC4 [11] and FFCC [12] are proposed with model variants based on different backbone models or features on the two datasets. For fair comparison, we use the results of their methods with basic components. Tables V and VI report the metric values of the compared methods on these two datasets, respectively. Entries for metric values not reported in the literature are left blank.

As shown in the tables, the proposed network achieves competitive performance in comparison to the state-of-the-art methods [15, 13, 11, 12]. Overall, our network performs better on the reprocessed Color Checker dataset than on the NUS 8-camera dataset. The reason is that the larger size of the reprocessed Color Checker dataset facilitates our learning. Moreover, our approach performs similarly to AlexNet-FC4 [11] (which trains on patches and improves local estimates via confidence-weighted pooling) across some error metrics, suggesting that the use of neighbor contexts and refinement for local patches is the driving force behind our algorithm's performance. However, we see a performance reduction in best-25% error compared with FFCC-thumb [12] on the two datasets, which is likely because the semantic information from object detection used in FFCC-thumb favors the lowest-error portion of the data. Our experiments suggest that color constancy algorithms may benefit from much larger datasets and high-level semantic cues. We show restored images of our approach against two recent state-of-the-art methods on some sample images in Figure 6.
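For readers unfamiliar with the error statistics discussed above, the following sketch shows how such summaries are commonly computed in the color constancy literature. It assumes the usual angular error between the estimated and ground-truth illuminant vectors, and it takes "best-25%" and "worst-25%" to be the means of the lowest and highest quarter of the per-image errors. The function names `angular_error` and `summarize` are hypothetical, and the exact set of metrics in Tables V and VI may differ.

```python
import math
import statistics


def angular_error(estimate, ground_truth):
    """Angle in degrees between two RGB illuminant vectors."""
    dot = sum(a * b for a, b in zip(estimate, ground_truth))
    norm_e = math.sqrt(sum(a * a for a in estimate))
    norm_g = math.sqrt(sum(b * b for b in ground_truth))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_e * norm_g)))
    return math.degrees(math.acos(cosine))


def summarize(errors):
    """Summarize per-image angular errors (assumed quartile-mean definitions)."""
    ordered = sorted(errors)
    k = max(1, len(ordered) // 4)  # number of images in each 25% tail
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "best25": statistics.mean(ordered[:k]),
        "worst25": statistics.mean(ordered[-k:]),
    }


stats = summarize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

Because the angular error ignores vector magnitude, an estimate that differs from the ground truth only by a scale factor has an error of zero; the best-25% statistic then describes how well a method does on its easiest quarter of the images.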

V Discussion

We propose a novel deep network for patch-based illuminant estimation. The network captures local contextual information using the central and surround convolution units, with a refinement mechanism that re-evaluates the initial estimates and features. Our patch sampling method can significantly improve the performance of the network. Learning features over the joint space of the input patch and the output illumination, together with the use of the intermediate features, is critical for training the network in a stage-wise fashion. There still exist difficult cases not handled perfectly by our approach, as shown in the red box of Figure 7.

Compared with the easy cases (shown in the blue box) restored by our approach, we find that non-solid color regions with glint and dark areas (especially achromatic ones), as well as bright and dark pixels located far apart, generally bias the sampled patches with illuminant ambiguities, so that the patches are not meaningful enough to represent the scene illumination of the whole image. Therefore, removing noisy data from the training set and exploiting more effective contextual schemes (e.g., tuned suppression [43]) or high-level semantic cues at a global level may alleviate such problems.

Bibliography


[1] A. Gijsenij, T. Gevers, and J. van de Weijer, “Computational color constancy: Survey and experiments,” TIP, vol. 20, no. 9, pp. 2475–2489, 2011.
[2] T. Wang, T. Ritschel, and N. Mitra, “Joint material and illumination estimation from photo sets in the wild,” in 3DV, 2018, pp. 22–31.
[3] K. Barnard, L. Martin, A. Coath, and B. Funt, “A comparison of computational color constancy algorithms. II. Experiments with image data,” TIP, vol. 11, no. 9, pp. 985–996, 2002.
[4] A. Gijsenij and T. Gevers, “Color constancy by local averaging,” in ICIAPW, 2007.
[5] G. D. Finlayson and G. Schaefer, “Solving for colour constancy using a constrained dichromatic reflection model,” IJCV, vol. 42, no. 3, pp. 127–144, 2001.
[6] V. C. Cardei, B. Funt, and K. Barnard, “Estimating the scene illumination chromaticity using a neural network,” JOSA A, vol. 19, no. 12, pp. 2374–2386, 2002.
[7] W. Xiong and B. Funt, “Estimating illumination chromaticity via support vector regression,” JIST, vol. 50, no. 4, pp. 341–348, 2006.
[8] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp, “Bayesian color constancy revisited,” in CVPR, 2008.