Deep Learning for Multiple-Image Super-Resolution

Michal Kawulok; Pawel Benecki; Szymon Piechaczek; Krzysztof Hrynczenko; Daniel Kostrzewa; Jakub Nalepa

arXiv:1903.00440·cs.CV·June 24, 2020


TL;DR

This paper introduces a novel deep learning-based method for multiple-image super-resolution, leveraging image fusion and neural networks to improve resolution beyond existing single-image and multi-image techniques.

Contribution

It presents a new deep learning approach that combines multiple-image fusion with low-to-high resolution mapping, outperforming current state-of-the-art methods.

Findings

01 Outperforms existing SRR methods in experiments

02 Effective fusion of multiple low-resolution images

03 Deep learning enhances super-resolution accuracy


Tables

Table I: Reconstruction accuracy and processing times for artificially degraded images (best scores are marked as bold).

Algorithm	IFC	PSNR	PSNR_HF	PSNR_LS	SSIM	UIQI	VIF	Time (s)
SR-DWT [4]	2.281	28.833	40.580	38.613	0.813	0.757	0.458	4
ResNet [15]	2.517	28.773	34.038	33.470	0.823	0.749	0.453	30
GPA [19]	2.436	28.054	32.924	32.522	0.792	0.712	0.422	15
SR-ADE [20]	2.289	27.237	32.049	31.742	0.756	0.666	0.378	17
EvoIM [11]	3.190	31.185	39.067	38.166	0.863	0.801	0.561	4
EvoNet^A	2.979	32.929	41.522	41.437	0.919	0.864	0.596	161
EvoNet	3.256	35.065	44.839	44.645	0.948	0.902	0.661	118
EvoNet^A: image registration performed for ResNet outputs

Table II: Reconstruction accuracy for three Sentinel-2 images (the best scores are marked as bold).

	Algorithm	IFC	PSNR	PSNR_HF	PSNR_LS	SSIM	UIQI	VIF
Sydney	SR-DWT [4]	1.146	14.883	32.306	29.717	0.345	0.284	0.125
	ResNet [15]	1.070	14.533	32.609	31.029	0.292	0.176	0.105
	GPA [19]	1.191	16.619	31.928	30.710	0.398	0.236	0.121
	SR-ADE [20]	1.375	17.250	30.349	29.289	0.467	0.302	0.132
	EvoIM [11]	1.271	16.384	34.657	32.560	0.429	0.314	0.129
	EvoNet	1.387	16.722	34.349	32.607	0.487	0.334	0.139
Bushehr	SR-DWT [4]	1.032	15.432	36.475	34.403	0.344	0.199	0.087
	ResNet [15]	1.194	15.481	37.072	35.997	0.424	0.233	0.098
	GPA [19]	1.285	14.827	35.135	34.168	0.474	0.253	0.114
	SR-ADE [20]	1.185	14.704	33.804	32.963	0.458	0.218	0.102
	EvoIM [11]	1.134	14.470	37.237	35.956	0.362	0.227	0.098
	EvoNet	1.261	14.528	36.878	35.739	0.433	0.261	0.109
Bandar Abbas	SR-DWT [4]	1.031	18.697	36.021	34.542	0.419	0.221	0.092
	ResNet [15]	1.395	19.385	38.714	37.657	0.561	0.292	0.130
	GPA [19]	1.419	16.414	35.736	34.900	0.551	0.292	0.140
	SR-ADE [20]	1.305	16.187	33.381	32.634	0.521	0.249	0.124
	EvoIM [11]	1.148	16.068	37.158	35.909	0.414	0.255	0.114
	EvoNet	1.494	16.226	39.350	38.162	0.527	0.318	0.153
Mean scores	SR-DWT [4]	1.070	16.337	34.934	32.887	0.369	0.234	0.101
	ResNet [15]	1.220	16.467	36.132	34.894	0.426	0.234	0.111
	GPA [19]	1.299	15.953	34.266	33.259	0.474	0.260	0.125
	SR-ADE [20]	1.288	16.047	32.512	31.629	0.482	0.256	0.119
	EvoIM [11]	1.184	15.641	36.351	34.809	0.402	0.265	0.114
	EvoNet	1.381	15.825	36.859	35.503	0.482	0.304	0.134

Equations

$$\Delta\mathcal{X}=-\beta\left[\bm{B}^{\prime}\bm{A}^{T}\,\mathrm{sgn}\left(\bm{A}\bm{B}\mathcal{X}_{n}-\bm{A}\mathcal{X}_{0}\right)+\lambda\frac{\delta U(\mathcal{X})}{\delta\mathcal{X}}(\mathcal{X}_{n})\right],$$


Code & Models

Repositories

ajinkya933/Image_repo

Full text

Deep Learning for Multiple-Image Super-Resolution

Michal Kawulok, Pawel Benecki, Szymon Piechaczek, Krzysztof Hrynczenko, Daniel Kostrzewa, and Jakub Nalepa

The reported work was funded by the European Space Agency (SuperDeep project, realized by Future Processing). MK and JN were partially supported by the National Science Centre under Grant DEC-2017/25/B/ST6/00474. PB and DK were supported by the Silesian University of Technology, Poland, funds no. BKM-509/RAu2/2017. M. Kawulok, P. Benecki, S. Piechaczek, K. Hrynczenko, D. Kostrzewa, and J. Nalepa are with Future Processing, Gliwice, Poland and with Silesian University of Technology, Gliwice, Poland (e-mail: [email protected]).

Abstract

Super-resolution reconstruction (SRR) is a process aimed at enhancing the spatial resolution of images, either from a single observation, based on the learned relation between low and high resolution, or from multiple images presenting the same scene. SRR is particularly valuable if it is infeasible to acquire images at the desired resolution, but many images of the same scene are available at lower resolution; this is inherent to a variety of remote sensing scenarios. Recently, we have witnessed substantial improvement in single-image SRR attributed to the use of deep neural networks for learning the relation between low and high resolution. Importantly, deep learning has not been exploited for multiple-image SRR, which benefits from information fusion and in general allows for achieving higher reconstruction accuracy. In this letter, we introduce a new method which combines the advantages of multiple-image fusion with learning the low-to-high resolution mapping using deep networks. The reported experimental results indicate that our algorithm outperforms the state-of-the-art SRR methods, including those that operate from a single image, as well as those that perform multiple-image fusion.

Index Terms:

Super-resolution, deep learning, convolutional neural networks, image processing

I Introduction

Super-resolution reconstruction (SRR) is aimed at generating a high-resolution (HR) image from a single or multiple low-resolution (LR) observations. In many cases, SRR algorithms are the only way to obtain images of sufficient spatial resolution, as HR data may not be available due to high acquisition costs or sensor limitations. Such situations are inherent to remote sensing, in particular to satellite imaging for Earth observation purposes.

The existing approaches towards SRR can be categorized into single-image and multiple-image methods. The former consist in learning the LR-HR relation from a large number of examples. This relation allows us to reconstruct an HR image from an LR scene (unseen during training). Multiple-image SRR is based on information fusion, which benefits from the differences (mainly subpixel shifts) between LR images. In general, these approaches allow for more accurate reconstruction than single-image SRR, as they combine more data extracted from the analyzed scene. The recent advancements in deep learning, especially in deep convolutional neural networks (CNNs), have greatly improved single-image SRR; however, correct fusion of multiple LR images still offers higher reconstruction accuracy. Despite that, to the best of our knowledge, deep learning has not been employed for multiple-image SRR.

In this letter, our contribution lies in combining the advantages of single-image SRR based on deep learning with the benefits of information fusion offered by multiple-image reconstruction (Section II presents the related work). We introduce EvoNet (Section III), which employs a deep residual network, more specifically ResNet [15], to enhance the capabilities of the evolutionary imaging model (EvoIM) [11] for multiple-image SRR. The results of our extensive experimental validation (Section IV), focused on satellite imaging, are highly encouraging: they show that EvoNet renders qualitatively and quantitatively better outcomes than the state-of-the-art techniques for single-image and multiple-image SRR.

II Related Work

In this section, we outline the state of the art in multiple-image SRR (Section II-A), and we present the recent advancements in using deep learning for SRR (Section II-B).

II-A Multiple-image super-resolution reconstruction

Existing techniques for multiple-image SRR are based on the premise that each LR observation $\mathcal{I}_{i}^{(l)}$ in a set $\bm{I}^{(l)}=\left\{\mathcal{I}_{i}^{(l)}:i\in\left\{1,2,\cdots,N\right\}\right\}$ has been derived from an original HR image $\mathcal{I}^{(h)}$ , degraded using an assumed imaging model (IM) that usually includes image warping, blurring, decimation, and contamination with noise. The reconstruction consists in reversing that degradation process, which requires solving an ill-posed optimization problem; therefore, most SRR techniques employ some regularization to provide spatial smoothness of the reconstructed HR image $\mathcal{I}^{(sr)}$ . In one of the earliest approaches, Irani and Peleg performed SRR relying on image registration (hence reducing the IM to subpixel shifts) [10]. A hierarchical subpixel displacement estimation was combined with the Bayesian reconstruction in the gradient projection algorithm (GPA) [19]. Another popular optimization technique applied here is the projection onto convex sets [1], which consists in updating the HR target image iteratively based on the error measured between $\mathcal{I}^{(l)}$ and a downsampled version of the reconstruction outcome $\mathcal{I}^{(sr)}$ , degraded using the assumed IM. Farsiu et al. introduced fast and robust super-resolution (FRSR) [8] based on maximum likelihood estimation coupled with simplified regularization. Importantly, the error is measured in the HR coordinates, thus avoiding the expensive scaling operation. Among other methods, an adaptive Wiener filter [9] and Markov random fields [16] were used to specify the IM. Zhu et al. proposed adaptive detail enhancement (SR-ADE) [20] for reconstructing satellite images, in which a bilateral filter is employed to decompose the input images and amplify the high-frequency detail information.
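The imaging model described above (warping, blurring, decimation, noise) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `degrade`, the box blur, the wrap-around integer shifts, and all parameter values are assumptions chosen only to keep the example short.

```python
import numpy as np

def degrade(hr, shift, scale=2, blur_size=3, noise_sigma=0.01, rng=None):
    """Simulate one LR observation from an HR image (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # Warp: translate by whole HR pixels (with wrap-around), which corresponds
    # to a subpixel shift once the image is decimated to LR coordinates.
    warped = np.roll(hr, shift, axis=(0, 1)).astype(float)
    # Blur: box filter built from shifted copies, a stand-in for the sensor PSF.
    r = blur_size // 2
    blurred = np.zeros_like(warped)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            blurred += np.roll(warped, (dy, dx), axis=(0, 1))
    blurred /= blur_size ** 2
    # Decimate: keep every `scale`-th pixel in both directions.
    lr = blurred[::scale, ::scale]
    # Noise: additive Gaussian noise.
    return lr + rng.normal(0.0, noise_sigma, lr.shape)

# Build a set of N = 4 LR observations with different shifts.
rng = np.random.default_rng(0)
hr = rng.random((64, 64))
shifts = [(0, 0), (0, 1), (1, 0), (1, 1)]
lr_set = [degrade(hr, s, rng=rng) for s in shifts]
```

Multiple-image SRR attempts to invert this chain; because different shifts sample different HR pixels, the LR set carries more information about the scene than any single observation.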

Recently, we proposed the EvoIM method [11, 12], which employs a genetic algorithm to optimize the hyper-parameters that control the IM used in FRSR [8], and to evolve the convolution kernels instead of the Gaussian blur used in FRSR. We showed that the reconstruction process can be effectively adapted to different imaging conditions—in particular, we used Sentinel-2 images at original resolution as LR inputs, and compared the reconstruction outcome with SPOT images presenting the same region.

II-B Deep learning for single-image super-resolution

Inspired by earlier approaches based on sparse coding [3], Dong et al. proposed super-resolution CNN (SRCNN) [5], followed by its faster version (FSRCNN) [6], for learning the LR-to-HR mapping from a number of LR–HR image pairs. Despite its relatively simple architecture, SRCNN outperforms the state-of-the-art example-based methods. Liebel and Korner have successfully trained SRCNN with Sentinel-2 images, improving its capacity for enhancing satellite data [17]. The same architecture was used to improve spatial resolution of sea surface temperature maps [7]. Kim et al. addressed certain limitations of SRCNN with a very deep super-resolution network [13] which can be efficiently trained relying on fast residual learning. The domain expertise was exploited using a sparse coding network [18], which achieves high training speed and model compactness. Lai et al. proposed deep Laplacian pyramid networks with progressive upsampling [14], aimed at achieving high processing speed. Recently, generative adversarial networks (GANs) have been actively explored for SRR [15]. GANs are composed of a generator (ResNet in [15]), trained to perform SRR, whose outcome is classified by a discriminator, trained to distinguish between the images reconstructed by the generator and the real HR images (used for reference). In this way, the generator is encouraged to generate images that are hard to distinguish from the real ones, and thus it also learns to avoid artifacts.

III The proposed EvoNet algorithm

A flowchart of the proposed method is presented in Fig. 1. First, each of the LR input images ( $\mathcal{I}_{i}^{(l)}$ ) is subject to single-image SRR using ResNet. This step produces a set of $N$ images $\bm{I}^{(rn)}=\{\mathcal{I}_{i}^{(rn)}\}$ , whose dimensions are $2\times$ larger than those of $\mathcal{I}_{i}^{(l)}$ . In parallel, the LR input set $\bm{I}^{(l)}$ undergoes image registration to determine subpixel shifts between the images. The obtained single-image SRR outcomes ( $\bm{I}^{(rn)}$ ) alongside the subpixel shifts allow for composing the initial HR image $\mathcal{X}_{0}$ using the median shift-and-add method (the dimensions are increased again $2\times$ , hence $4\times$ compared with $\mathcal{I}_{i}^{(l)}$ ). Finally, $\mathcal{X}_{0}$ is subject to the iterative EvoIM process, which produces the final reconstruction outcome $\mathcal{I}^{(sr)}$ .
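The median shift-and-add step that composes $\mathcal{X}_{0}$ can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes that the registered shifts have already been rounded to integer offsets on the $2\times$ finer HR grid, and the name `median_shift_and_add` and the zero fill for unobserved pixels are hypothetical choices.

```python
import numpy as np

def median_shift_and_add(images, shifts, scale=2):
    """Place each image on a `scale`-times finer grid and take the per-pixel median.

    images: list of equally sized 2-D arrays (e.g. the ResNet outputs).
    shifts: list of (dy, dx) integer offsets on the fine grid (assumed given).
    Returns the fused image and the number of samples that hit each pixel.
    """
    h, w = images[0].shape
    stack = np.full((len(images), h * scale, w * scale), np.nan)
    for k, (img, (sy, sx)) in enumerate(zip(images, shifts)):
        # Every pixel of image k lands on one fine-grid position.
        stack[k, sy % scale::scale, sx % scale::scale] = img
    counts = np.sum(~np.isnan(stack), axis=0)
    x0 = np.zeros((h * scale, w * scale))
    filled = counts > 0
    # Median over the samples that actually landed on each fine-grid pixel.
    x0[filled] = np.nanmedian(stack[:, filled], axis=0)
    return x0, counts

# Four enhanced images with distinct offsets fill the whole fine grid.
rng = np.random.default_rng(0)
enhanced = [rng.random((16, 16)) for _ in range(4)]
x0, counts = median_shift_and_add(enhanced, [(0, 0), (0, 1), (1, 0), (1, 1)])
```

The `counts` array plays the role of a per-pixel measurement count, which is the kind of information a diagonal weighting matrix in the subsequent fusion step would carry.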

III-A Residual neural network applied to the input images

Each LR image $\mathcal{I}_{i}^{(l)}$ is independently enhanced using ResNet to obtain higher-quality input data ( $\mathcal{I}_{i}^{(rn)}$ ) for further multiple-image fusion. For this purpose, we exploit the architecture described in [15], which is composed of 16 residual blocks with skip connections and is trained with the mean square error (MSE) as the loss function (during training, ResNet is guided to reduce MSE between each HR image and the reconstruction outcome obtained from the artificially-degraded HR image). For EvoNet, we modify the final layer, which determines the upscaling factor ( $2\times$ in our case, compared with $4\times$ in [15]).

III-B Multiple-image fusion

The EvoIM process, which we employ for multiple-image fusion, consists in iterative filtering of an HR image $\mathcal{X}_{0}$ , composed of registered LR inputs. In EvoNet, we register the original $\mathcal{I}_{i}^{(l)}$ images, before they are processed with ResNet (the ResNet reconstruction does not introduce any information that may contribute to better assessment of the displacement values). As the dimensions of the ResNet outputs are $2\times$ larger than those of $\mathcal{I}_{i}^{(l)}$ , the computed shift values are multiplied by 2 to compose $\mathcal{X}_{0}$ . Subsequently, EvoIM solves the optimization problem (analogously to the FRSR method). The update step $\Delta\mathcal{X}=\mathcal{X}_{n+1}-\mathcal{X}_{n}$ is computed as:

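$$\Delta\mathcal{X}=-\beta\left[\bm{B}^{\prime}\bm{A}^{T}\,\mathrm{sgn}\left(\bm{A}\bm{B}\mathcal{X}_{n}-\bm{A}\mathcal{X}_{0}\right)+\lambda\frac{\delta U(\mathcal{X})}{\delta\mathcal{X}}(\mathcal{X}_{n})\right],$$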

where $\beta$ is a hyper-parameter that controls the update step, $\bm{A}$ is a diagonal matrix representing the number of the LR measurements that contributed to $\mathcal{X}_{0}$ , $U(\mathcal{X})$ is the regularization term controlled with the $\lambda$ hyper-parameter, while $\bm{B}$ and $\bm{B}^{\prime}$ are $5\times 5$ convolution kernels (in FRSR, $\bm{B}$ is the Gaussian blur and $\bm{B}^{\prime}=\bm{B}^{T}$ ). The hyper-parameters alongside the convolution kernels are optimized during the EvoIM evolutionary training. Importantly, ResNet and EvoIM are trained separately before they are combined within the EvoNet framework.
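A single update step of this form can be sketched in NumPy as below. The sketch rests on several assumptions that are not taken from the paper: the diagonal matrix $\bm{A}$ is represented as an element-wise weight map, the regularizer is a simple quadratic smoothness term (a stand-in, not the one used in FRSR or EvoIM), the kernels are a fixed Gaussian rather than evolved ones, and all names and hyper-parameter values are hypothetical.

```python
import numpy as np

def conv2d_same(img, kernel):
    """2-D convolution with zero padding; output has the same shape as img."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    flipped = kernel[::-1, ::-1]
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

# Stand-in regularizer (an assumption): U(X) = 0.5 * sum of squared neighbour
# differences, whose gradient is the negative discrete Laplacian of X.
NEG_LAPLACIAN = np.array([[0.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 0.0]])

def grad_smoothness(x):
    return conv2d_same(x, NEG_LAPLACIAN)

def update_step(x_n, x0, a, b, b_prime, beta, lam, grad_u=grad_smoothness):
    """Delta X = -beta * [B' A^T sgn(A B X_n - A X_0) + lam * dU/dX (X_n)]."""
    # A is diagonal, so it acts as an element-wise weight map and A^T = A.
    residual_sign = np.sign(a * conv2d_same(x_n, b) - a * x0)
    data_term = conv2d_same(a * residual_sign, b_prime)
    return -beta * (data_term + lam * grad_u(x_n))

# Usage with a 5x5 Gaussian blur, as in FRSR (EvoIM would evolve b and b_prime).
rng = np.random.default_rng(0)
x0 = rng.random((32, 32))
a = np.ones_like(x0)                      # every pixel observed once (assumed)
g = np.exp(-0.5 * np.arange(-2, 3) ** 2.0)
b = np.outer(g, g)
b /= b.sum()
b_prime = b[::-1, ::-1]                   # adjoint of convolving with b
x = x0.copy()
for _ in range(10):
    x = x + update_step(x, x0, a, b, b_prime, beta=0.05, lam=0.01)
```

Because the data term uses the sign of the residual, each pixel moves by a bounded amount per iteration, which is what makes this family of updates robust to outliers in the fused image.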

IV Experiments

For validation, we used three types of data in the test set, namely: (i) artificially-degraded (AD) images: 10 scenes, for each a set $\bm{I}^{(l)}$ obtained from an HR image $\mathcal{I}^{(h)}$ with $N=4$ different subpixel shifts applied before further degradation, each $\mathcal{I}_{i}^{(l)}$ of size $500\times 500$ pixels; (ii) real satellite (RS $+$ ) images of the same region, acquired at different resolutions: we used three Sentinel-2 scenes as LR ( $N=10$ LR images in each scene), two of which are matched with SPOT images (presenting Bushehr, Iran, LR of size $300\times 291$ pixels, and Bandar Abbas, Iran, $240\times 266$ pixels) and one of which is matched with a Digital Globe WorldView-4 image (Sydney, Australia, $92\times 90$ pixels); and (iii) real satellite images available without any higher-resolution reference (RS $-$ , over 20 scenes). For AD and RS $+$ , we quantify the reconstruction quality based on the similarity between $\mathcal{I}^{(h)}$ and $\mathcal{I}^{(sr)}$ , and for RS $-$ , we rely exclusively on subjective qualitative assessment (as no reference is available). The reconstruction outcome is evaluated quantitatively at dimensions $2\times$ larger than those of the input LR images (EvoNet and ResNet enlarge LR images $4\times$ , so we downscale these outcomes $2\times$ for fair comparison with the remaining methods). For RS $+$ , $\mathcal{I}^{(sr)}$ is compared with Digital Globe and SPOT images, downscaled to fit the dimensions of $\mathcal{I}^{(sr)}$ . In addition to peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), we measure the similarity using more advanced metrics [2]: information fidelity criterion (IFC), visual information fidelity (VIF), universal image quality index (UIQI), and PSNR for images treated with a high-pass filter (PSNR_HF) and local standard deviation (PSNR_LS). For all these metrics, higher values indicate higher similarity between the reconstruction outcome and the reference image.
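As a concrete reference for the simplest of these metrics, PSNR can be computed from the MSE as $10\log_{10}(\mathrm{peak}^{2}/\mathrm{MSE})$. The sketch below also includes a $2\times$ downscaling helper; the block-averaging used there is an assumption, since the text does not state which downscaling method was applied, and both function names are hypothetical.

```python
import numpy as np

def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def downscale2x(img):
    """Halve both dimensions by averaging 2x2 blocks (assumed method)."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# Compare a 4x reconstruction against a 2x reference after downscaling it 2x.
rng = np.random.default_rng(0)
reference_2x = rng.random((40, 40))
reconstruction_4x = np.kron(reference_2x, np.ones((2, 2))) + 0.01
score = psnr(reference_2x, downscale2x(reconstruction_4x))
```

Since PSNR is a monotonic function of MSE, a network trained with an MSE loss is directly optimized for it, which is worth keeping in mind when PSNR disagrees with the other metrics.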

EvoNet is compared with two single-image SRR methods: SRR based on wavelet transform (SR-DWT) [4] and ResNet [15], and with three multiple-image ones: GPA [19], SR-ADE [20], and EvoIM [11]. EvoIM (also exploited in EvoNet) was trained separately for artificially-degraded images and for real satellite data, as reported in [11], using PSNR_HF [2] as the fitness function (there were no overlaps between training and test sets). ResNet was trained using images from the DIV2K dataset (available at https://data.vision.ee.ethz.ch/cvl/DIV2K). We implemented all the investigated algorithms in C++, and we used Python with Keras to implement ResNet. The experiments were run on an Intel i5 3.2 GHz computer with 16 GB RAM, and ResNet was trained on a GTX 1060 6 GB GPU.

In Table I, we report the reconstruction accuracy for AD images alongside the processing times. For fair comparison, all the reconstruction tests were run on a CPU, which explains the long processing times of ResNet and EvoNet (GPU was used only for training ResNet). EvoNet allows for the most accurate reconstruction, consistently rendering the best scores, and multiple-image EvoIM renders higher scores than single-image SR-DWT and ResNet. Examples of reconstruction are presented in Fig. 2: the outcome of ResNet is more blurred than that of EvoNet, with fewer details visible, and EvoIM produces clearly more artifacts; overall, EvoNet renders a very plausible outcome, which most resembles the HR image. We have also tried to register the images after they are processed with ResNet (EvoNet^A in Table I). As expected, this decreases the reconstruction accuracy while extending the processing time.

Quantitative results obtained for RS $+$ images are reported in Table II (we also show the values averaged over three images). It can be seen that for Sydney and Bandar Abbas, EvoNet renders the highest scores for most metrics (including IFC and VIF, which were found most meaningful for assessing SRR [2]). For Bushehr, the scores differ less among the methods, and the metrics are not consistent in indicating the most accurate method, possibly because this image contains more plain areas compared with the two remaining scenes. Average PSNR is highest for ResNet, which can be caused by using MSE as the loss function for training (PSNR is based on MSE). For all other metrics, EvoNet achieves the highest mean score (on SSIM, it is tied with SR-ADE). From Fig. 3, it can be seen that the quantitative results are coherent with the visual assessment: all the methods increase the interpretation capacities compared with LR, and the outcome obtained using EvoNet recovers more details than ResNet, without introducing the artifacts visible for EvoIM.

The outcomes obtained for RS $-$ images (without any HR reference) generally confirm our observations discussed for RS $+$ images. In Fig. 4, we show an interesting example of reconstruction from Lunar Reconnaissance Orbiter Camera images. It is worth noting that these LR images contain some artifacts in the form of faint vertical stripes, which result from the sensor characteristics (the images were not preprocessed). In this case, not only does EvoNet render the highest reconstruction quality, but it also manages to make these artifacts less apparent compared with EvoIM and ResNet (this can be explained by the fact that ResNet changes the artifacts to be grid-like, which can be further reduced during the fusion).

V Conclusions

In this letter, we proposed a novel method for multiple-image super-resolution which exploits the recent advancements in deep learning. We demonstrated that applying the ResNet deep CNN to enhance each individual LR image before performing the multiple-image fusion can substantially improve the final super-resolved image. The reported quantitative and qualitative results indicate that the proposed approach is highly competitive with the state of the art in both single-image and multiple-image super-resolution.

Our ongoing work is aimed at developing deep architectures for learning the entire process of multiple-image reconstruction, possibly including image registration.

Bibliography

[1] T. Akgun, Y. Altunbasak, and R. M. Mersereau, “Super-resolution reconstruction of hyperspectral images,” IEEE Trans. on Image Process., vol. 14, no. 11, pp. 1860–1875, 2005.
[2] P. Benecki, M. Kawulok, D. Kostrzewa, and L. Skonieczny, “Evaluating super-resolution reconstruction of satellite images,” Acta Astronautica, vol. 153, pp. 15–25, 2018.
[3] H. Chavez-Roman and V. Ponomaryov, “Super resolution image generation using wavelet domain interpolation with edge extraction via a sparse representation,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 10, pp. 1777–1781, Oct 2014.
[4] H. Demirel and G. Anbarjafari, “Discrete wavelet transform-based satellite image resolution enhancement,” IEEE Trans. on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 1997–2004, 2011.
[5] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. ECCV. Springer, 2014, pp. 184–199.
[6] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proc. ECCV. Springer, 2016, pp. 391–407.
[7] A. Ducournau and R. Fablet, “Deep learning for ocean remote sensing: An application of convolutional neural networks for super-resolution on satellite-derived SST data,” in Proc. WPRRS, 2016, pp. 1–6.
[8] S. Farsiu, M. D. Robinson, M. Elad, and P. Milanfar, “Fast and robust multiframe super resolution,” IEEE Trans. on Image Process., vol. 13, no. 10, pp. 1327–1344, 2004.