Multispectral and Hyperspectral Image Fusion Using a 3-D-Convolutional   Neural Network

Frosti Palsson; Johannes R. Sveinsson; Magnus O. Ulfarsson

arXiv:1706.05249·cs.CV·June 21, 2017

Multispectral and Hyperspectral Image Fusion Using a 3-D-Convolutional Neural Network

Frosti Palsson, Johannes R. Sveinsson, Magnus O. Ulfarsson

PDF

TL;DR

This paper introduces a 3-D convolutional neural network approach to fuse multispectral and hyperspectral images, achieving high-resolution hyperspectral images with reduced noise sensitivity and computational efficiency.

Contribution

The paper presents a novel 3-D-CNN based fusion method with dimensionality reduction, improving robustness and efficiency over traditional techniques.

Findings

01

Outperforms conventional fusion methods in accuracy

02

Effective noise robustness demonstrated

03

Reduces computational time significantly

Abstract

In this paper, we propose a method using a three dimensional convolutional neural network (3-D-CNN) to fuse together multispectral (MS) and hyperspectral (HS) images to obtain a high resolution hyperspectral image. Dimensionality reduction of the hyperspectral image is performed prior to fusion in order to significantly reduce the computational time and make the method more robust to noise. Experiments are performed on a data set simulated using a real hyperspectral image. The results obtained show that the proposed approach is very promising when compared to conventional methods. This is especially true when the hyperspectral image is corrupted by additive noise.

Figures7

Click any figure to enlarge with its caption.

Tables3

Table 1. TABLE I: 3D-CNN architecture. The numbers in parenthesis following zero-padding layers indicate the number of zeros added to each dimension. The numbers in parenthesis after convolution3D, indicate number of filters and the filter size of each dimension. r 𝑟 r indicates the number of PCs. Finally, the number following Gaussian noise denotes the noise variance.

layer #	Type	Activation
1	zero-padding3D (1,1,1)	none
2	convolution3D (32,3,3,3)	ReLU
3	Gaussian noise (0.5)	none
4	zero-padding3D (1,1,1)	none
5	convolution3D (64,3,3,3)	ReLU
6	Gaussian noise (0.5)	none
7	convolution3D ( $𝐫$ ,1,1,1)	none

Table 2. TABLE II: Quantitative quality evaluation results, without and with additive Gaussian noise (SNR=20dB). For the proposed and MAP1 methods, 10 PCs were used in reduced PCA. The CPU time is given in seconds. Six trials were performed for the noisy case, and also for the proposed method without noise.

Method	ERGAS	SAM	SSIM	CPU time
MAP1[4]	2.806	3.711	0.971	45
MAP2[5]	2.17	3.26	0.978	9
3D-CNN	1.676 $\pm$ 0.02	2.730 $\pm$ 0.02	0.988 $\pm$ 1.14e-4	978 $\pm$ 8
Noisy HS image (SNR=20dB)
MAP1[4]	3.95 $\pm$ 0.004	7.42 $\pm$ 0.005	0.89 $\pm$ 6.98e-4	45 $\pm$ 0.4
MAP2[5]	2.23 $\pm$ 0.002	3.46 $\pm$ 0.003	0.98 $\pm$ 6.06e-5	9 $\pm$ 0.4
3D-CNN	1.79 $\pm$ 0.05	3.03 $\pm$ 0.02	0.99 $\pm$ 3.3e-4	989 $\pm$ 10

Table 3. TABLE III: Performance of all methods w.r.t. to the interpolation filter used. Bicubic, bilinear and nearest neighbor interpolation is considered. One trial was performed for the proposed method.

Method	ERGAS	SAM	ERGAS	SAM	ERGAS	SAM
	Bicubic		Bilinear		Nearest
MAP1	2.806	3.711	3.080	4.721	5.680	5.501
MAP2	2.170	3.260	2.233	3.468	5.234	5.193
3D-CNN	1.676	2.730	2.069	3.022	3.104	3.858

Equations6

y_{i, j, k}^{n} = σ (b^{n} + (H^{n} * x)_{i, j, k}),

y_{i, j, k}^{n} = σ (b^{n} + (H^{n} * x)_{i, j, k}),

X^{HS} = V D U^{T} = G U^{T},

X^{HS} = V D U^{T} = G U^{T},

\hat{X}_{HR}^{HS} = \hat{G}_{HR} U^{T} = [\hat{G}_{HR}^{r} \tilde{G}^{q - r}] U^{T},

\hat{X}_{HR}^{HS} = \hat{G}_{HR} U^{T} = [\hat{G}_{HR}^{r} \tilde{G}^{q - r}] U^{T},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Multispectral and Hyperspectral Image Fusion Using a 3-D-Convolutional Neural Network

Frosti Palsson, Johannes R. Sveinsson, and Magnus O. Ulfarsson The authors are with the Faculty of Electrical and Computer Engineering, University of Iceland, Reykjavik, E-mail: ([email protected], [email protected], [email protected])This work was supported in part by the Icelandic Research Fund under Grant 174075-05 and in part by the University of Iceland Research Fund.

Abstract

In this paper, we propose a method using a three dimensional convolutional neural network (3-D-CNN) to fuse together multispectral (MS) and hyperspectral (HS) images to obtain a high resolution hyperspectral image. Dimensionality reduction of the hyperspectral image is performed prior to fusion in order to significantly reduce the computational time and make the method more robust to noise. Experiments are performed on a data set simulated using a real hyperspectral image. The results obtained show that the proposed approach is very promising when compared to conventional methods. This is especially true when the hyperspectral image is corrupted by additive noise.

Index Terms:

Image fusion, deep learning, convolutional neural networks, multispectral, hyperspectral.

I Introduction

Pansharpening, which is the fusion of a multispectral (MS) image and a wide-band panchromatic (PAN) image, is an important technique in remote sensing. Ideally, the fused image should contain all the spectral information from the MS image and all the spatial details from the PAN image.

With advances in sensor development, the fusion of a high spatial resolution MS image and a low spatial resolution hyperspectral (HS) image (MS/HS fusion) is becoming relevant for many applications. A typical HS image contains hundreds of spectral reflectance bands, making the spectral information content very high. This allows for the identification of different materials based on their spectral signature, which is useful for applications such as classification of land cover types.

Numerous pansharpening methods have been proposed in recent years and they are often categorized either as component substitution (CS) or multi-resolution analysis (MRA) methods. The CS and MRA methods can generally be described using a simple detail injection framework [1]. Apart from the CS and MRA methods there are model based methods such as [2, 3], and methods based on statistical inference such as [4, 5].

Although MS/HS fusion is a relatively new topic in remote sensing, there are several publications on the topic, including [6], which used sparse coding and spectral unmixing. A method using coupled non-negative matrix factorization and spectral unmixing was given in [7]. In [8], a method based on a 3D-wavelet transform was proposed. A common approach to the MS/HS fusion problem is to view it as a number of pansharpening sub-problems where each spectral band of the MS image takes the role of the PAN image [9, 10].

In the past decade, methods based on Deep Learning (DL) have in many cases outperformed traditional signal processing approaches in areas such as speech and pattern recognition [11]. The main component of DL is the artificial neural network (NN). More specifically, the so-called convolutional neural network (CNN) has been shown to be effective in pattern recognition and related areas [12].

DL based methods have previously been used to solve the pansharpening problem [13, 14]. Here, we propose a MS/HS fusion method using DL.

The method is based on training a 3D-CNN for learning filters used to fuse the MS and HS images. Since the method is based on supervised learning, it requires a target HS image, which is not available. Therefore, the input data need to be spatially decimated (low-pass filtered and downsampled) in order to use the observed HS image as the target image. The assumption being made here is that the relationship between the input and target data, learned by the 3D-CNN at a lower resolution scale, also applies for a higher resolution scale.

To make the fusion problem more computationally efficient, the dimensionality of the HS image is reduced prior to the fusion stage using principal component analysis (PCA) [15]. This is an important step of the proposed method and it depends on the assumption that the spectral singular vectors of the lower resolution HS image are identical to those of the higher resolution HS image that we want to estimate. By comparing our approach to the conventional methods given in [4, 5], it is demonstrated that the proposed method gives better results according to three quantitative quality metrics.

Another advantage of the proposed method is that the 3D-CNN learns the decimation filter in an automatic manner. In other words, the method is relatively insensitive to the choice of decimation filter used to prepare the training samples for the 3D-CNN. It also produces images that are free of artifacts, such as halos and ringing artifacts, often seen when using conventional methods.

The outline of this paper is as follows. In the next section, we briefly discuss CNNs. In Section III, the proposed method is described in detail. In Section IV, we present experimental results, and finally, in Section V, the conclusion is drawn.

II Convolutional Neural Networks

CNNs consist of convolutional layers and a convolutional layer consists of a number of hidden layers that contain a number of neurons. The main idea behind CNNs is the concept of a local receptive field [16], that is associated with each neuron in a hidden layer. The input to a convolutional layer is an image of one or more channels. Each neuron in a hidden layer receives input from a rectangular subset of the input image, which is the neuron’s receptive field.

By sliding the receptive field over the input image, and after each shift connecting to a new neuron in the hidden layer, the neurons of the hidden layer provide a complete tiling of the input image. All the neurons in the hidden layer share their weights and bias and therefore they can detect the same feature at different locations in the input image.

The output of a hidden layer is called a feature map and the shared weights are called a filter. A single convolutional layer can have many such feature maps and hence can learn several different filters that detect different distinct features. Between convolutional layers there can be so-called pooling layers, which sub-sample the feature maps according to some function, e.g., maximum value (max-pooling). This simplifies the feature maps and greatly reduces the number of parameters in the network as the number of convolutional layers grows.

The main benefit of the CNN architecture is that much fewer parameters (weights and biases) need to be learned than for a conventional fully connected NN. This is due to the shift-invariance, i.e., the shared weights of the locally connected neurons in the hidden layers of the CNN and enables the construction of deeper networks that can learn much faster, without sacrificing performance.

In a 3D-CNN, which has 3D filters and 3D receptive fields, the output of the $n$ th feature map ${\bf y}^{n}$ at location $\{i,j,k\}$ is given by

[TABLE]

where $*$ denotes 3D-convolution, $b^{n}$ and ${\bf H}^{n}$ are the shared bias and filter (shared weights), respectively, $\sigma$ denotes the non-linear activation function and ${\bf x}$ is the input.

III Proposed Method

In this section, we first describe the proposed method and then discuss the chosen architecture of the 3D-CNN.

III-A General Outline of the Method

The observed MS image is denoted by ${\bf X}^{\text{MS}}$ and is of dimension $M\times N\times P$ , where $P$ is the number of spectral bands. The observed $m\times n\times q$ HS image is denoted by ${\bf X}^{\text{HS}}$ . The training of the 3D-CNN is performed as shown in Fig. 1 and described below.

To simplify the notation, we use the same symbols for 3D-matrices, i.e., images, and normal matrices. The implicit reshaping of a 3D image into a matrix with vectorized images (bands) in the columns is assumed. A tilde above a symbol denotes interpolation (upsampling followed by spatial filtering), a hat above a symbol denotes an estimate, and concatenation/stacking of matrices/images is denoted by square brackets, e.g., $[{\bf X}~{}{\bf Y}]$ .

Dimensionality reduction of ${\bf X}^{\text{HS}}$ using PCA. Singular value decomposition gives

[TABLE]

where the $mn\times q$ matrix ${\bf G}={\bf V}{\bf D}$ contains the spatial loadings and the $q\times q$ matrix ${\bf U}$ contains the spectral singular vectors. The first $r$ columns of ${\bf G}$ are used to form an $m\times n\times r$ image, ${\bf G}^{r}$ . 2. 2.

The image ${\bf X}^{\text{MS}}$ is spatially decimated by the resolution factor between the MS and HS images, using a bicubic decimation filter, to yield ${\bf X}_{\text{LR}}^{\text{MS}}$ , of dimension $m\times n\times P$ . Similarly, ${\bf G}^{r}$ , is spatially decimated and then interpolated (using a bicubic filter) to the dimension of ${\bf X}_{\text{LR}}^{\text{MS}}$ , yielding $\tilde{{\bf G}}_{\text{LR}}^{r}$ . 3. 3.

${\bf X}_{\text{LR}}^{\text{MS}}$ and $\tilde{{\bf G}}_{\text{LR}}^{r}$ are stacked to obtain the $m\times n\times(r+P)$ input image ${\bf X}_{\text{LR}}=[{\bf X}_{\text{LR}}^{\text{MS}}~{}\tilde{{\bf G}}_{\text{LR}}^{r}]$ . The target data are the first $r$ bands of ${\bf G}$ , i.e., ${\bf G}^{r}$ . ${\bf X}_{\text{LR}}$ and ${\bf G}^{r}$ , are randomly divided into a number of matching patches (overlapping) of size ${7\times 7}$ pixels, of depth $r+P$ and $r$ , for inputs and targets, respectively.

The fusion part of the method is depicted in Fig. 2. The trained 3D-CNN can accept the entire input data at once, without having to break it down into patches, since it has learned all its filters. The input ${\bf X}$ to the trained 3D-CNN consists of the stacked MS image, ${\bf X}^{\text{MS}}$ , and the first $r$ spatial loadings of ${\bf G}$ , which have been interpolated to the size of ${\bf X}^{\text{MS}}$ , i.e., ${\bf X}=[{\bf X}^{\text{MS}}~{}\tilde{{\bf G}}^{r}]$ . The output of the 3D-CNN is the estimated high resolution loadings, $\hat{{\bf G}}_{\text{HR}}^{r}$ .

The final step is the reconstruction of the estimated high resolution HS image, $\hat{{\bf X}}_{\text{HR}}^{\text{HS}}$ , via

[TABLE]

where $\hat{{\bf G}}_{\text{HR}}=[\hat{{\bf G}}_{\text{HR}}^{r}~{}\tilde{{\bf G}}^{q-r}]$ and $\tilde{{\bf G}}^{q-r}$ are the remaining $q-r$ interpolated spatial loadings obtained from the observed HS image, ${\bf X}^{\text{HS}}$ , and the matrix ${\bf U}$ contains the spectral singular vectors of ${\bf X}^{\text{HS}}$ .

There are two options for the reconstruction of the estimated fused image. The first option is the one described above, where the first $r$ loadings in $\tilde{{\bf G}}$ are replaced by the high resolution estimate $\hat{{\bf G}}_{\text{HR}}^{r}$ . If the HS image is noisy, a second option is to retain only the first $r$ PCs, i.e., performing the inverse PCA transform using the reduced $\hat{{\bf G}}_{\text{HR}}^{r}$ and ${\bf U}^{r}$ matrices, yielding $\hat{{\bf X}}_{\text{HR}}^{\text{HS}}=\hat{{\bf G}}_{\text{HR}}^{r}{{\bf U}^{r}}^{T}$ , where ${\bf U}^{r}$ denotes the reduced matrix ${\bf U}$ .

III-B CNN Architecture

In this work, we have decided to use a 3D-CNN architecture since an HS image has two spatial dimensions and one spectral dimension, and a 3D-CNN learns spectral-spatial features. If the input to a convolutional layer is an $M\times N\times P$ image, and the filter size is $i\times j\times k$ , the resulting feature map is of size $M-i+1\times N-j+1\times P-k+1$ . To preserve the size of the input image through the layers of the 3D-CNN, and to avoid boundary artifacts due to the convolution operations, the input to a convolutional layer with filter size $i\times j\times k$ , needs to be zero-padded by $(i-1)/2$ zeros at each end of the first dimension, $(j-1)/2$ for the second dimension, and $(k-1)/2$ for the third dimension.

The 3D-CNN used in our experiments has 3 convolutional layers with 32, 64 and $r$ filters, respectively, where $r$ is the number of spatial loadings to sharpen. The filter sizes for the first two convolutional layers were chosen equal to $3\times 3\times 3$ , and $1\times 1\times 1$ for the last convolutional layer. Each convolutional layer is preceded by a zero-padding layer and followed by a Gaussian noise regularization layer (except for the output layer), which adds zero-mean Gaussian noise to the output of the previous layer. This helps to reduce overfitting in the network, and is a form of random data augmentation [17]. The first two convolutional layers have rectified linear unit (ReLU) activation functions, i.e., $\sigma(x)=\text{max}(x,0)$ , while the output layer has linear activation.

The input shape for the convolutional layers is flexible. In other words, the 3D-CNN can be trained using a specific patch size, and when the CNN has been trained, the entire input can be estimated at once. This can be very memory consuming if the input image is large, and therefore PCA dimensionality reduction helps to significantly reduce the memory overhead in the fusion process. The layers of the CNN are summarized in Table I.

IV Experiments

IV-A Simulated Data

The HS image used in the experiments is the ROSIS Pavia center dataset 111The ROSIS data was kindly provided by Prof. P. Gamba from the University of Pavia, Italy.. Its dimension is $512\times 512$ pixels with 102 spectral bands. However, there is a blank strip along the left side and thus we only use 480 pixels along the row dimension.

The MS image is simulated from the HS image by averaging bands of the HS image according to the spectral response profiles of the R, G, B and NIR bands of the IKONOS MS sensor. We spatially decimate the observed HS image by a factor of 4 using a bicubic decimation filter to obtain the lower spatial resolution HS image. This yields the images to be fused, i.e., an MS image of dimension $512\times 480$ pixels with 4 spectral bands, and an HS image of dimension of $128\times 120$ pixels and 102 spectral bands. The original HS image is used as the reference image for the quantitative quality evaluation.

The method was implemented in the Python programming language using the Keras DL library which runs on top of the Theano backend 222http://keras.io, https://github.com/Theano and the computations were performed using an Intel i5-2400 [email protected] GHz with 16GB of RAM.

IV-B Results

As described in Section III-A, the simulated MS image and HS spatial loadings are spatially decimated and used as the input to the 3D-CNN. The target data are the first $r$ spatial loadings of the HS image. The training data consist of 8192 randomly chosen and matched patches of spatial size $7\times 7$ pixels, from the input and target data.

The network objective function is the mean squared error (MSE) between the target patch and the estimated patch. We use the adaptive moment estimation (ADAM) [18] optimizer and we use the values for the optimizer parameters as given in [18]. The number of training epochs is equal to 50, and it was assured that the objective function had fully converged during the training. Finally, the batch size is equal to 5, and the variance for the Gaussian noise regularization layer is equal to 0.5. The batch size is the number of samples that are propagated through the CNN at each time. We found that a small batch size gives better results and faster convergence.

The proposed method is compared to the method in [4], and the extended version of that method in [5]. We refer to these methods as MAP1 and MAP2, respectively. Both comparison methods are based on maximum a posteriori (MAP) estimation of wavelet coefficients, while the MAP2 method is identical to MAP1, except for PCA dimensionality reduction, in a similar manner to the proposed method.

We begin by investigating the effect of the number of sharpened PCs (spatial loadings) on the performance of the proposed and MAP2 methods in terms of the ERGAS [19] metric. The following number of PCs are considered: 2, 6, 10, 15, 20, 25 and 30. The results of this experiment are shown in Fig. 3. According to the figure, 10 PCs give optimal results for both methods. In the following experiments, 10 PCs are used for these methods.

The next experiment is the evaluation of the fusion performance for all methods in terms of the ERGAS [19], SAM [20] and SSIM [21] quantitative quality metrics, without and with additive zero-mean Gaussian noise (SNR=20dB). The results of the quantitative quality evaluation are summarized in Table II. The upper half of the table gives the results without added noise. As shown there, the proposed method significantly outperforms the MAP1 and MA2 methods according to all three quality metrics. Of the comparison methods, MAP2 performs much better than MAP1. Obviously, the proposed method is more costly than the comparison methods in terms of computation time. By using a powerful graphical processing unit (GPU), the training time could be reduced by up to the order of magnitude 2, making the proposed method competitive in terms of computation time. Fig. 5 depicts only a small portion of the 102th band of the interpolated, reference, and estimated HS image for all methods. Visual inspection shows that the proposed method gives the best results.

The lower half of Table II summarizes the results obtained with a zero-mean Gaussian noise added to the HS image. Again, the proposed method performs significantly better in terms of the quality metrics. However, its noise tolerance is similar, or slightly less than for the MAP2 method. The MAP1 method, which does not use PCA prior to the fusion, performs significantly worse than the other methods in the presence of noise.

Next, the performance of all methods, in terms of the ERGAS metric, is compared when the SNR varies due to additive Gaussian noise, from 10 to 30 dB, in increments of 5 dB. The result of this experiment is shown in Fig. 4. The plot clearly emphasizes what was observed in the previous experiment. The proposed method performs best and the MAP1 method performs significantly worse.

Finally, the sensitivity of all methods w.r.t. the decimation filter used, is investigated. Three types of decimation filters are considered, i.e., bicubic, bilinear and nearest neighbor. The results for the methods measured by the ERGAS and SAM metrics are summarized in Table III. According to the table, bicubic decimation gives the best results for all the methods. Using bilinear decimation degrades the performance of the methods in terms of the ERGAS and SAM metrics, however, the proposed method and MAP2 are less affected than the MAP1 method. Finally, nearest neighbor decimation degrades the performance significantly more for the MAP1 and MAP2 methods, than for the proposed method, when compared to the results obtained using the bicubic decimation.

V Conclusions

In this paper, we proposed a new method for the fusion of MS and HS images using a 3D-CNN. An important component of the method is dimensionality reduction via PCA prior to the fusion. This decreases the computational cost significantly while having no impact on the quality of the fused image. In the presence of noise, the dimensionality reduction can improve the result. The proposed method is compared to two methods based on MAP estimation. Experiments using a simulated dataset demonstrated that the proposed method gives good results and is also tolerant to noise in the HS image.

Bibliography21

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] G. Vivone, L. Alparone, J. Chanussot, M. Dalla Mura, A. Garzelli, G. Licciardi, R. Restaino, and L. Wald, “A critical comparison among pansharpening algorithms,” IEEE Transactions on Geoscience and Remote Sensing , vol. 53, no. 5, pp. 2565–2586, May 2015.
2[2] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, “A new pansharpening algorithm based on total variation,” IEEE Geoscience and Remote Sensing Letters , vol. 11, no. 1, pp. 318–322, Jan. 2014.
3[3] X. He, L. Condat, J. Bioucas-Dias, J. Chanussot, and J. Xia, “A new pansharpening method based on spatial and spectral sparsity priors,” IEEE Transactions on Image Processing , vol. 23, no. 9, pp. 4160–4174, Sept. 2014.
4[4] Y. Zhang, S. De Backer, and P. Scheunders, “Noise-resistant wavelet-based bayesian fusion of multispectral and hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing , vol. 47, no. 11, pp. 3834–3843, Nov. 2009.
5[5] F. Palsson, J. R. Sveinsson, M. O. Ulfarsson, and J. A. Benediktsson, “Model-based fusion of multi- and hyperspectral images using PCA and wavelets,” IEEE Transactions on Geoscience and Remote Sensing , vol. 53, no. 5, pp. 2652–2663, May 2015.
6[6] Z. H. Nezhad, A. Karami, R. Heylen, and P. Scheunders, “Fusion of hyperspectral and multispectral images using spectral unmixing and sparse coding,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing , vol. 9, no. 6, pp. 2377–2389, June 2016.
7[7] Y. Zhang, Y. Wang, Y. Liu, C. Zhang, M. He, and S. Mei, “Hyperspectral and multispectral image fusion using CNMF with minimum endmember simplex volume and abundance sparsity constraints,” in 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) , July 2015, pp. 1929–1932.
8[8] Y. Zhang and M. He, “Multi-spectral and hyperspectral image fusion using 3-d wavelet transform,” Journal of Electronics (China) , vol. 24, no. 2, pp. 218–224, 2007.