Boundary Aware Multi-Focus Image Fusion Using Deep Neural Network

Haoyu Ma; Juncheng Zhang; Shaojun Liu; Qingmin Liao

arXiv:1904.00198·cs.CV·November 5, 2019

TL;DR

This paper introduces a boundary aware deep neural network approach for multi-focus image fusion, significantly improving the quality of fused images especially near focus boundaries, outperforming existing methods.

Contribution

A novel boundary aware deep neural network architecture with specialized handling of boundary regions and a new dataset generation method for multi-focus image fusion.

Findings

01. Outperforms state-of-the-art methods quantitatively and qualitatively
02. Effectively handles boundary regions near focus/defocus boundaries
03. Uses a dual-network approach for different patch scenarios

Tables

Table 1: The quantitative comparison of different fusion methods on 24 widely used image pairs. The results of the proposed method without/with refinement are shown in the last two columns. The best average result of the compared methods is in bold. The number of image pairs that one method beats all the other methods is shown in the parentheses. Particularly, the without refinement version is ignored in this comparison.

| Metrics | NSCT | SR | NSCT-SR | GF | DSIFT | CNN | Proposed (without refinement) | Proposed (with refinement) |
|---|---|---|---|---|---|---|---|---|
| $Q_{MI}$ | 0.9407 (0) | 1.0807 (0) | 0.9651 (0) | 1.1001 (0) | 1.1902 (1) | 1.1561 (0) | 1.2000 | **1.2002** (23) |
| $Q_{G}$ | 0.6789 (0) | 0.6944 (0) | 0.6823 (0) | 0.7104 (0) | 0.7162 (0) | 0.7155 (5) | 0.7187 | **0.7188** (18) |
| $Q_{Y}$ | 0.9549 (0) | 0.9663 (0) | 0.9576 (0) | 0.9775 (1) | 0.9841 (2) | 0.9851 (6) | 0.9867 | **0.9871** (16) |
| $Q_{CB}$ | 0.7423 (0) | 0.7641 (0) | 0.7499 (0) | 0.7848 (0) | 0.8005 (7) | 0.8000 (1) | 0.8028 | **0.8042** (16) |

Equations

$$\overline{FS(m,n)} = \frac{1}{(2l+1)^{2}} \sum_{i=m-l}^{m+l} \sum_{j=n-l}^{n+l} FS(i,j).$$

$$0.2 < \overline{FS(m,n)} < 0.8,$$

$$ImgF = DM \cdot ImgS_{A} + (1 - DM) \cdot ImgS_{B}.$$

$$ImgS_{A} = FG_{Ori} \cdot GT + BG_{Blu} \cdot (1 - GT),$$

$$ImgS_{B} = FG_{Blu} \cdot GT + BG_{Ori} \cdot (1 - GT).$$

$$DM = \alpha \cdot \left| ImgS_{B} - ImgF \right|.$$

Full text

Abstract

Since it is usually difficult to capture an all-in-focus image of a 3D scene directly, various multi-focus image fusion methods are employed to generate it from several images focusing at different depths. However, the performance of existing methods is barely satisfactory and often degrades for areas near the focused/defocused boundary (FDB). In this paper, a boundary aware method using deep neural network is proposed to overcome this problem. (1) Aiming to acquire improved fusion images, a 2-channel deep network is proposed to better extract the relative defocus information of the two source images. (2) After analyzing the different situations for patches far away from and near the FDB, we use two networks to handle them respectively. (3) To simulate the reality more precisely, a new approach of dataset generation is designed. Experiments demonstrate that the proposed method outperforms the state-of-the-art methods, both qualitatively and quantitatively.

**Index Terms:** Image fusion, multi-focus fusion, convolutional neural network, deep learning

1 Introduction

When capturing an image of a 3D scene, it is difficult to keep all objects in focus because the depth of field is limited. However, in many image processing tasks, it is more convenient and effective to use all-in-focus images as the input. Multi-focus image fusion, a technique that generates an all-in-focus image from several images of the same scene focused at different depths, addresses this problem. Over the past decades, a large number of multi-focus image fusion algorithms have been proposed. Most of them can be broadly categorized into two groups, i.e., transform domain-based algorithms [1-7] and spatial domain-based algorithms [8-15].

The transform domain-based algorithms are usually based on multi-scale transform (MST) theories, such as the Laplacian pyramid (LP) [1], wavelet transform [2, 3], curvelet transform (CVT) [4], and non-subsampled contourlet transform (NSCT) [5]. These methods usually decompose the source images first, then extract and fuse the features of the images, and finally reconstruct the fused image. In addition to MST methods, some feature space-based methods have been proposed in recent years, such as independent component analysis (ICA) [6], sparse representation (SR) [7] and NSCT-SR [8]. Most of the transform domain methods are efficient, but the fused images are usually indistinct.

The spatial domain-based algorithms can be further divided into block-based, region-based and pixel-based fusion algorithms. The block-based algorithms usually divide an image into blocks, measure the spatial frequency or the sum-modified-Laplacian of each block [9], and then fuse the image blocks. In these algorithms, such as morphology-based fusion [10], the size of the image block has a great impact and is hard to decide. The region-based algorithms [11] are based on segmentation of the input images and usually depend heavily on segmentation accuracy. Recently, several pixel-based algorithms, including guided filtering (GF) [12] and dense SIFT (DSIFT) [13], have been proposed. They achieve improved results but usually suffer from the blocking effect.

For all of these traditional fusion methods, both transform domain-based and spatial domain-based, the defocus level descriptors and the comparison rules must be designed manually. However, real photos are quite complex, and the results of these methods are usually imperfect. In [14], a CNN is used to learn the defocus level descriptors and the comparison rules in a data-driven way. Unfortunately, the results of existing neural network-based approaches [14, 15] are still unsatisfactory, especially for the areas near the focused/defocused boundary (FDB), so heavy post processing is employed to ease this problem. There are three reasons. Firstly, the network structures used in [14, 15] could be improved for the fusion task. Secondly, the situations for areas far away from and near the FDB are quite different, so it is unwise to use a single network to tackle both. Thirdly, the training datasets used by [14, 15] are inconsistent with reality.

In this paper, a boundary aware multi-focus image fusion approach using deep neural networks is proposed to overcome these shortcomings, together with a new dataset generation method. The contributions are threefold. (1) Compared with existing networks for fusion [14, 15], a 2-channel structure is designed to better extract the relative defocus information of the two source images and improve the fusion results. (2) Instead of a single network, two networks are employed to tackle the different situations for areas far away from and near the FDB, respectively. As a consequence, the FDB of the fused image is clearer with the proposed method. (3) To better simulate reality, especially the situation near the FDB, a more reasonable approach to dataset generation is presented. Based on these three improvements, the proposed method obtains pleasing fusion results. Experiments demonstrate that the proposed method outperforms the state-of-the-art methods, both qualitatively and quantitatively.

2 Proposed Fusion Method

Our method consists of 3 steps: initial score map generation, score map refinement, as well as post processing and image fusion. The block diagram of our method is shown in Fig. 1. Firstly, a 2-channel network is proposed to generate an initial score map. Secondly, the initial score map is refined by two other networks that are specially designed for areas near and far away from the FDB. Thirdly, some simple post processing is employed to generate the decision map from the refined score map. Then the fusion result is obtained from the two source images according to the decision map.

2.1 Initial Score Map Generation

Pixel-wise multi-focus image fusion can be viewed as a classification problem, and the CNN is the classic model for image classification. Therefore it is natural to begin with a simple CNN. To get better fusion results, a network deeper than those in [14, 15] is needed. However, when a network is deepened, the degradation problem arises, because the optimization is not similarly easy for all systems. The residual learning framework [16] addresses this problem and achieves precise classification. Therefore, it is employed here.

Since only the comparative information of the input images is needed for the multi-focus image fusion task, a variety of structures might be applied. Tang [15] used a single-input network, which paid little attention to the relationship between the source images. Using the two source images together as input should be a better choice. Furthermore, for multi-focus image fusion, the two source images are taken from the same position, meaning that they are naturally aligned. The same result is also expected when the order of the two source images is changed. Therefore, for pixel-wise image fusion, the 2-channel model is more effective and flexible than the siamese model [17]. As demonstrated in Fig. 2, the two input source image patches are simply treated as one image patch with two channels.

2.2 Refinement

Even with the proposed network, the fusion results near the FDB are unsatisfactory. There are two reasons.

Firstly, the situations are quite different for patches far away from and near the FDB. Patches far away from the FDB are entirely focused or entirely defocused. Consequently, the sharpness of the patch can be used as a metric to generate the score map. In contrast, patches near the FDB contain both focused and defocused areas. Therefore, the sharpness of the patch is no longer suitable, and the position of the center pixel becomes more important, since the proposed method is pixel-wise. For example, the patch should be treated as focused if the center pixel is in the focused region, even if the defocused area covers most of the patch. In conclusion, it is unwise to handle these two different situations using a single network.

Secondly, patches near the FDB are far fewer than patches far away from the FDB, as the training patches are usually chosen randomly from the training images. This imbalance of the training patches also leads to poor performance near the FDB when using a single network.

Therefore, splitting the defocus level comparison task into two relatively independent tasks might be a better choice. Specifically, based on the initial score map, each patch is classified as either far away from or near the FDB. The patches are then processed by the normal refinement net and the boundary refinement net, respectively, to obtain a better score map. It is worth pointing out that the initial net already performs satisfactorily on the patches far away from the boundary. Therefore, the initial fusion net can serve as the normal refinement net directly without any loss in performance. This simplifies the flow of the proposed method: after the initial score map generation, only the patches near the FDB are processed by the boundary refinement net, and the initial focus scores at these positions are replaced.

The classification of patches near the FDB is described as follows. Based on the initial score map, the average focus score ( $FS$ ) at the centre pixel $(m,n)$ of a patch is calculated over a surrounding window of size $(2l+1)\times(2l+1)$:

$$\overline{FS(m,n)} = \frac{1}{(2l+1)^{2}} \sum_{i=m-l}^{m+l} \sum_{j=n-l}^{n+l} FS(i,j).$$

If the average focus score satisfies

$$0.2 < \overline{FS(m,n)} < 0.8,$$

the patch is taken as near the FDB.
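As an illustration of the near-FDB test, the following minimal sketch averages the focus scores in the window around a pixel and applies the 0.2/0.8 thresholds. The function names `mean_focus_score` and `is_near_fdb` are hypothetical, plain nested lists stand in for the score map, and border handling is not specified in the text, so the sketch simply assumes the window lies inside the map.

```python
def mean_focus_score(fs, m, n, l):
    """Average FS over the (2l+1) x (2l+1) window centred at (m, n).

    Assumption: the window lies fully inside the score map; the text
    does not say how borders are handled.
    """
    total = 0.0
    for i in range(m - l, m + l + 1):
        for j in range(n - l, n + l + 1):
            total += fs[i][j]
    return total / (2 * l + 1) ** 2


def is_near_fdb(fs, m, n, l, low=0.2, high=0.8):
    """True if the averaged score lies strictly between the two thresholds."""
    return low < mean_focus_score(fs, m, n, l) < high


# Toy 5x5 score map: three focused columns (1.0), two defocused columns (0.0).
score_map = [[1.0, 1.0, 1.0, 0.0, 0.0] for _ in range(5)]

print(is_near_fdb(score_map, 2, 2, 1))  # window straddles the boundary
print(is_near_fdb(score_map, 2, 1, 1))  # window is entirely focused
```

In the toy map, the window around (2, 2) mixes focused and defocused scores, so that pixel would be routed to the boundary refinement net, while the window around (2, 1) is uniformly focused and keeps its initial score.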

2.3 Post Processing and Image Fusion

After the refinement, some simple post processing, such as binarization and small region removal [14], is still needed. Specifically, binarization is applied to the focus score map to obtain a decision map, and small regions inside the focused or defocused area are removed. These small steps bring some improvement to the final result, making it more visually pleasing. We intend the networks to do the main job; therefore, the additional post processing used in existing works, such as guided filtering [14], is not needed.

Once the decision map is obtained, the fusion image ( $ImgF$ ) can be directly generated from the decision map ( $DM$ ) and the two source images ( $ImgS$ ) as follows.

$$ImgF = DM \cdot ImgS_{A} + (1 - DM) \cdot ImgS_{B}.$$
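The binarization and the pixel-wise fusion rule $ImgF = DM \cdot ImgS_{A} + (1 - DM) \cdot ImgS_{B}$ can be sketched as below. This is only a sketch under stated assumptions: the 0.5 binarization threshold is not given in the text, the helper names `binarize` and `fuse` are hypothetical, small region removal is omitted, and single-channel images are represented as nested lists.

```python
def binarize(score_map, threshold=0.5):
    """Turn a focus score map into a 0/1 decision map.

    Assumption: the 0.5 threshold is illustrative; the text does not state one.
    """
    return [[1 if s > threshold else 0 for s in row] for row in score_map]


def fuse(dm, img_a, img_b):
    """Apply ImgF = DM * ImgS_A + (1 - DM) * ImgS_B at every pixel."""
    return [
        [d * a + (1 - d) * b for d, a, b in zip(dm_row, a_row, b_row)]
        for dm_row, a_row, b_row in zip(dm, img_a, img_b)
    ]


scores = [[0.9, 0.7, 0.1],
          [0.8, 0.4, 0.2]]
img_a = [[10, 20, 30],
         [40, 50, 60]]
img_b = [[11, 21, 31],
         [41, 51, 61]]

dm = binarize(scores)
print(dm)                      # [[1, 1, 0], [1, 0, 0]]
print(fuse(dm, img_a, img_b))  # [[10, 20, 31], [40, 51, 61]]
```

With a binary decision map, every fused pixel is copied from exactly one source image, which is why errors in the map near the FDB show up directly in the fused result.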

The score maps after each step are shown in Fig. 3. As shown in Fig. 3(c), the score map after the initial fusion net is decent, but imperfections exist near the FDB. Fig. 3(d) shows that the refinement net fixes these defects, and the result near the FDB becomes better. Then the post processing removes the small regions inside, and the decision map is given in Fig. 3(e). The fusion image is shown in Fig. 3(f).

3 Experiments

In this section, a new approach to generating the training dataset for deep learning-based multi-focus image fusion is proposed first. Then the network details and the training settings are discussed. Next, the comparative experiments with existing approaches are set up. Finally, the results of the proposed fusion method are shown and discussed.

3.1 Proposed Dataset Generation Method

A good training dataset should represent the normal and comprehensive situations of the task. In other fusion methods based on machine learning [14, 15], several approaches to generating training images have been used. Unfortunately, none of them simulates the fact that the FDB usually coincides with the object boundary [18].

To match reality, the best choice would be to use real photos. However, there are only a few multi-focus fusion source images, and the ground truth needs to be labeled manually. Therefore, a feasible method is to generate artificial training images that are similar to real ones yet easy to obtain. A foreground image dataset with ground truth is used, and some images without obvious defocus are chosen as the background dataset. Both the original foreground ( $FG$ ) and background ( $BG$ ) images are first processed by Gaussian filters to obtain their blurred versions. Then the source image pairs are obtained according to the ground truth ( $GT$ ) pixel by pixel, where the subscripts $Ori$ and $Blu$ denote the original and blurred versions:

$$ImgS_{A} = FG_{Ori} \cdot GT + BG_{Blu} \cdot (1 - GT),$$

$$ImgS_{B} = FG_{Blu} \cdot GT + BG_{Ori} \cdot (1 - GT).$$
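The pair synthesis $ImgS_{A} = FG_{Ori} \cdot GT + BG_{Blu} \cdot (1 - GT)$ and $ImgS_{B} = FG_{Blu} \cdot GT + BG_{Ori} \cdot (1 - GT)$ can be sketched as follows. The sketch assumes the blurred images have already been produced by some Gaussian filter (its parameters are not given in the text), that $GT$ takes values in [0, 1], and that images are single-channel nested lists; `composite` and `make_source_pair` are hypothetical names.

```python
def composite(fg, bg, gt):
    """Per-pixel blend: fg * gt + bg * (1 - gt)."""
    return [
        [f * g + b * (1 - g) for f, b, g in zip(f_row, b_row, g_row)]
        for f_row, b_row, g_row in zip(fg, bg, gt)
    ]


def make_source_pair(fg_ori, fg_blu, bg_ori, bg_blu, gt):
    """Build (ImgS_A, ImgS_B) from original and pre-blurred layers."""
    img_a = composite(fg_ori, bg_blu, gt)  # foreground sharp, background blurred
    img_b = composite(fg_blu, bg_ori, gt)  # foreground blurred, background sharp
    return img_a, img_b


# Toy 1x3 example with made-up intensities; the blurred layers are placeholders.
fg_ori = [[100.0, 100.0, 100.0]]
fg_blu = [[90.0, 90.0, 90.0]]
bg_ori = [[20.0, 20.0, 20.0]]
bg_blu = [[30.0, 30.0, 30.0]]
gt = [[1.0, 0.5, 0.0]]  # foreground, mixed boundary pixel, background

img_a, img_b = make_source_pair(fg_ori, fg_blu, bg_ori, bg_blu, gt)
print(img_a)
print(img_b)
```

Because both source images are cut along the same $GT$, the focused/defocused boundary in the synthetic pair coincides with the object boundary, which is the property the dataset is designed to reproduce.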

Specifically, the Alpha Matting dataset [19], which contains 27 images with ground truth, is used as the foreground dataset, and 700 background images from the COCO 2017 dataset [20] are used as the background dataset. Each foreground image is combined with each background image, so 18,900 pairs of source images are obtained. For each source image pair, we randomly select 10 patch pairs. A training pair is labeled 1 if the center pixel is focused in source image A and defocused in source image B, and 0 otherwise. We then exchange the channels of each patch pair and assign it the opposite label, so that changing the input order leads to a consistent result. Each source image pair is augmented by rotating and flipping. Finally, there are 2,268,000 training samples in total.
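The sample count can be checked with a few lines of arithmetic. The counts below are taken from the text; the augmentation factor is not stated there, so the sketch only derives the factor that the reported total would imply.

```python
foregrounds = 27         # Alpha Matting images with ground truth
backgrounds = 700        # COCO 2017 background images
patches_per_pair = 10    # random patch pairs per source image pair
orderings = 2            # each patch pair is also used with channels swapped

image_pairs = foregrounds * backgrounds
samples_before_aug = image_pairs * patches_per_pair * orderings

reported_total = 2_268_000
# Inferred, not stated in the text: the rotation/flip augmentation factor
# that would reconcile the counts above with the reported total.
implied_aug_factor = reported_total // samples_before_aug

print(image_pairs)          # 18900
print(samples_before_aug)   # 378000
print(implied_aug_factor)   # 6
```

The reported total is therefore consistent with six augmented variants per sample, although which rotations and flips are used is not specified.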

Additionally, the foreground images from Alpha Matting are larger than the background images from the COCO dataset, so the foreground images are downscaled by half at the very start.

3.2 Network Details and Training Settings

In our implementation, typical 56-layer ResNet (Res56) networks are employed. The influences of network depth and patch size were tested: performance improves with a deeper network or larger patches, but at the cost of more time and memory. The chosen 56-layer networks with an input patch size of 64 × 64 × 2 produce satisfactory results.

The same structure is applied to the initial fusion net and the boundary refinement net, and the training is carried out on different sample sets. The initial fusion net is trained on the full sample set, while the boundary refinement net is trained only on the samples that are near the FDB. Only patches having at least 10 percent focused pixels and 10 percent defocused pixels are chosen.

As for network settings, the standard setup of ResNet on CIFAR-10 [16] is used, since CIFAR-10 has inputs similar to ours. Stochastic gradient descent (SGD) is applied to train the models with the softmax loss function. 15 percent of the training examples are randomly held out for validation. The initial learning rate is set to 0.001 and reduced after 80, 120, 160, and 180 epochs. 200 epochs are used in total, and the batch size is 128. Therefore, there are about 15,000 iterations per epoch.
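The figure of about 15,000 iterations per epoch follows from the sample count, the held-out fraction, and the batch size, as the short check below shows for the full sample set. It assumes that the held-out 15 percent is excluded from training and that a final partial batch counts as one iteration.

```python
import math

total_samples = 2_268_000
held_out_percent = 15    # randomly held out for validation
batch_size = 128
epochs = 200

# Integer arithmetic avoids floating-point rounding in the split.
train_samples = total_samples * (100 - held_out_percent) // 100
iters_per_epoch = math.ceil(train_samples / batch_size)

print(train_samples)             # 1927800
print(iters_per_epoch)           # 15061
print(iters_per_epoch * epochs)  # 3012200
```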

3.3 Comparison Settings

We compare the proposed approach with 6 other multi-focus fusion methods, including NSCT [5], SR [7], NSCT-SR [8], GF [12], DSIFT [13], and CNN [14].

The comparison is conducted on 24 pairs of multi-focus images: 20 pairs from the “Lytro” dataset, the most frequently used dataset for image fusion, and four other commonly used pairs, “clock”, “book”, “lab” and “flower”.

A few objective metrics for fusion image quality assessment are used to evaluate the results: normalized mutual information ( $Q_{NMI}$ ) [21], gradient-based fusion performance ( $Q_{G}$ ) [22], Yang’s metric ( $Q_{Y}$ ) [23], Chen-Blum metric ( $Q_{CB}$ ) [24].

3.4 Experimental Results and Analysis

Fig. 4 shows the visual comparison on “clock”. Compared with other approaches, the proposed method performs well in general. Moreover, both the edge of the front clock and the left side of the number “8” on the rear clock are clear in our result, as shown in the enlarged squares.

Fig. 5 shows another example, “lytro-05”. A difference map ( $DM$ , not to be confused with the decision map of Section 2.3) is computed to make the visual differences easier to see:

$$DM = \alpha \cdot \left| ImgS_{B} - ImgF \right|.$$

Here $\alpha$ is set to 10 to make the differences more visible. The difference maps are converted to gray scale, and a pseudo-color transformation is then applied. The difference map is expected to be red in the area where the source image is focused, and blue where the source image is blurred. Source image B is blurred on the iron grid and clear in the remaining area. Compared with existing approaches, the difference map of our fusion result is distinct near the FDB. There is less incorrect division in the dark part of the difference map, and the bright area is more continuous and clearer than in the others.
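The difference map $DM = \alpha \cdot |ImgS_{B} - ImgF|$ is a simple per-pixel operation, sketched below on toy 8-bit values. Clipping to 255 is an assumption added here so that the result stays displayable; the text does not say how large values are handled, and `difference_map` is a hypothetical name.

```python
def difference_map(img_b, img_f, alpha=10, max_value=255):
    """Compute alpha * |ImgS_B - ImgF| per pixel.

    Assumption: values are clipped to max_value for display; the text
    does not describe any clipping.
    """
    return [
        [min(max_value, alpha * abs(b - f)) for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(img_b, img_f)
    ]


img_b = [[120, 80, 10]]   # toy source image B
img_f = [[120, 86, 60]]   # toy fused image

print(difference_map(img_b, img_f))  # [[0, 60, 255]]
```

Pixels that the fused image copies from source image B give a difference of zero, so large values mark the regions taken from source image A.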

The quantitative comparisons are shown in Table 1, with the results of the proposed method without and with refinement in the last two columns. For all four quality metrics, a larger value means a better fusion result. The values shown are averages over the 24 pairs of images, and the best average result of the compared methods is in bold. The number of image pairs on which a method beats all the other methods is shown in parentheses. The version without refinement is excluded from this count.

The proposed method without refinement already outperforms the existing methods, and the refinement improves the results further, especially $Q_{CB}$ . This indicates that both the modification of the network structure and the refinement are effective. As can be seen, the proposed approach outperforms the other fusion methods on all four quality metrics.

4 Conclusions

In this paper, we present a boundary aware multi-focus fusion approach based on deep neural networks. The proposed method utilizes residual networks, and a 2-channel model is applied to extract more useful information directly from the source images. Moreover, independent refinement networks are employed after the initial net to deal with the different situations for areas near and far away from the FDB, respectively. Furthermore, a new way to generate training samples is proposed to better approximate reality. Based on all these improvements, the proposed method obtains promising results and outperforms the state-of-the-art methods, both qualitatively and quantitatively.

5 Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant No. 61771276.

Bibliography

[1] P. J. Burt and E. H. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Transactions on Communications, vol. 31, no. 4, pp. 532–540, 1983.
[2] J. J. Lewis, R. J. O’Callaghan, S. G. Nikolov, D. R. Bull, and N. Canagarajah, “Pixel- and region-based image fusion with complex wavelets,” Inf. Fusion, vol. 8, no. 2, pp. 119–130, 2007.
[3] Y. P. Liu, J. Jin, Q. Wang, Y. Shen, and X. Q. Dong, “Region level based multi-focus image fusion using quaternion wavelet and normalized cut,” Signal Processing, vol. 97, pp. 9–30, 2014.
[4] F. Nencini, A. Garzelli, S. Baronti, and L. Alparone, “Remote sensing image fusion using the curvelet transform,” Inf. Fusion, vol. 8, no. 2, pp. 143–156, 2007.
[5] Q. Zhang and B. L. Guo, “Multifocus image fusion using the nonsubsampled contourlet transform,” Signal Processing, vol. 89, no. 7, pp. 1334–1346, 2009.
[6] N. Mitianoudis and T. Stathaki, “Pixel-based and region-based image fusion schemes using ICA bases,” Inf. Fusion, vol. 8, no. 2, pp. 131–142, 2007.
[7] B. Yang and S. T. Li, “Multifocus image fusion and restoration with sparse representation,” IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 4, pp. 884–892, 2010.
[8] Y. Liu, S. P. Liu, and Z. F. Wang, “A general framework for image fusion based on multi-scale transform and sparse representation,” Inf. Fusion, vol. 24, pp. 147–164, 2015.