Ensemble Super-Resolution with A Reference Dataset
Junjun Jiang, Yi Yu, Zheng Wang, Suhua Tang, Ruimin Hu, and Jiayi Ma

TL;DR
This paper introduces an ensemble learning approach for single image super-resolution that uses a reference dataset to optimize component weights, outperforming existing methods.
Contribution
It proposes a MAP-based framework utilizing a reference dataset to determine optimal ensemble weights for super-resolution, unifying and improving upon prior SR methods.
Findings
Outperforms state-of-the-art non-deep learning methods.
Surpasses recent deep learning super-resolution techniques.
Demonstrates effectiveness on multiple public datasets.
| Dataset | SET14 | | | | | |
|---|---|---|---|---|---|---|
| Scale | ×2 | | ×3 | | ×4 | |
| Metric | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM |
| Bicubic | 30.24 | 0.8688 | 27.55 | 0.7742 | 26.00 | 0.7027 |
| Kim [31] | 32.14 | 0.9032 | 28.96 | 0.8144 | 27.18 | 0.7440 |
| SelfExSR [53] | 32.22 | 0.9034 | 29.16 | 0.8196 | 27.40 | 0.7518 |
| A+ [39] | 32.28 | 0.9056 | 29.13 | 0.8188 | 27.32 | 0.7491 |
| IA [69] | 32.83 | 0.9110 | 29.63 | 0.8296 | 27.85 | 0.7643 |
| SRCNN [52] | 32.42 | 0.9063 | 29.28 | 0.8209 | 27.49 | 0.7503 |
| CSCN [58] | 32.56 | 0.9074 | 29.41 | 0.8238 | 27.64 | 0.7587 |
| CSCN-MV [58] | 32.80 | 0.9101 | 29.57 | 0.8263 | 27.81 | 0.7619 |
| VDSR [54] | 33.03 | 0.9124 | 29.78 | 0.8314 | 28.01 | 0.7674 |
| DRCN [55] | 33.04 | 0.9118 | 29.77 | 0.8312 | 28.02 | 0.7570 |
| ESCN [66] | 32.67 | 0.9093 | 29.51 | 0.8264 | 27.75 | 0.7611 |
| RefESR | 33.16 | 0.9134 | 29.90 | 0.8338 | 28.14 | 0.7702 |
| Dataset | SET5 | | | | | |
|---|---|---|---|---|---|---|
| Scale | ×2 | | ×3 | | ×4 | |
| Metric | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM |
| Bicubic | 33.66 | 0.9299 | 30.39 | 0.8682 | 28.42 | 0.8104 |
| Kim [31] | 36.24 | 0.9518 | 32.30 | 0.9032 | 30.07 | 0.8542 |
| SelfExSR [53] | 36.49 | 0.9537 | 32.58 | 0.9093 | 30.31 | 0.8619 |
| A+ [39] | 36.54 | 0.9544 | 32.59 | 0.9088 | 30.28 | 0.8603 |
| IA [69] | 37.37 | 0.9582 | 33.43 | 0.9186 | 31.05 | 0.8764 |
| SRCNN [52] | 36.66 | 0.9542 | 32.58 | 0.9093 | 30.86 | 0.8732 |
| CSCN [58] | 36.93 | 0.9552 | 33.10 | 0.9144 | 30.86 | 0.8732 |
| CSCN-MV [58] | 37.21 | 0.9571 | 33.34 | 0.9173 | 31.14 | 0.8189 |
| VDSR [54] | 37.53 | 0.9587 | 33.66 | 0.9213 | 31.35 | 0.8838 |
| DRCN [55] | 37.63 | 0.9588 | 33.82 | 0.9226 | 31.53 | 0.8854 |
| ESCN [66] | 37.14 | 0.9571 | 33.28 | 0.9173 | 31.02 | 0.8774 |
| RefESR | 37.71 | 0.9593 | 33.87 | 0.9224 | 31.55 | 0.8848 |
| Dataset | Urban100 | | | | | |
|---|---|---|---|---|---|---|
| Scale | ×2 | | ×3 | | ×4 | |
| Metric | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM |
| Bicubic | 26.88 | 0.8403 | 24.46 | 0.7349 | 23.14 | 0.6577 |
| Kim [31] | 28.71 | 0.8942 | 25.24 | 0.7761 | 23.53 | 0.6790 |
| SelfExSR [53] | 29.54 | 0.8967 | 26.44 | 0.8088 | 24.79 | 0.7374 |
| A+ [39] | 29.20 | 0.8938 | 26.03 | 0.7973 | 24.32 | 0.7183 |
| IA [69] | 29.93 | 0.9077 | 26.71 | 0.8106 | 24.93 | 0.7416 |
| SRCNN [52] | 29.50 | 0.8946 | 26.24 | 0.7989 | 24.52 | 0.7221 |
| CSCN [58] | 29.14 | 0.8988 | 25.58 | 0.7858 | 23.80 | 0.6924 |
| CSCN-MV [58] | 29.30 | 0.9015 | 25.70 | 0.7903 | 23.91 | 0.6984 |
| VDSR [54] | 30.76 | 0.9140 | 27.14 | 0.8279 | 25.18 | 0.7524 |
| DRCN [55] | 30.75 | 0.9133 | 27.15 | 0.8276 | 25.14 | 0.7510 |
| ESCN [66] | 29.25 | 0.8986 | 25.72 | 0.7912 | 23.99 | 0.6975 |
| RefESR | 30.88 | 0.9150 | 27.26 | 0.8285 | 25.28 | 0.7529 |
| Method | PSNR | SSIM |
|---|---|---|
| Best Component Super-Resolver | 29.78 | 0.8314 |
| Without Reconstruction Constraint | 29.89 | 0.8337 |
| Without Weights Prior | 29.70 | 0.8304 |
| Ensemble Via Averaging | 29.71 | 0.8301 |
| The Proposed Method | 29.90 | 0.8338 |
| Scale | ×2 | | ×3 | | ×4 | |
|---|---|---|---|---|---|---|
| Metric | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM |
| Dataset | SET14 | |||||
| EDSR [73] | 33.68 | 0.9172 | 30.34 | 0.8434 | 28.66 | 0.7845 |
| RefESR | 33.85 | 0.9194 | 30.45 | 0.8454 | 28.75 | 0.7862 |
| RefE2SR | 33.95 | 0.9203 | 30.61 | 0.8470 | 28.91 | 0.7873 |
| Dataset | SET5 | |||||
| EDSR [73] | 38.11 | 0.9601 | 34.64 | 0.9282 | 32.46 | 0.8968 |
| RefESR | 38.16 | 0.9607 | 34.66 | 0.9285 | 32.48 | 0.8970 |
| RefE2SR | 38.26 | 0.9611 | 34.92 | 0.9299 | 32.77 | 0.8996 |
| Dataset | Urban100 | |||||
| EDSR [73] | 32.93 | 0.9351 | 28.80 | 0.8653 | 26.64 | 0.9033 |
| RefESR | 33.02 | 0.9373 | 28.89 | 0.8672 | 26.73 | 0.9039 |
| RefE2SR | 33.19 | 0.9378 | 29.03 | 0.8698 | 26.96 | 0.9086 |
| Methods | | | | | | |
|---|---|---|---|---|---|---|
| | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM |
| Wang [74] | 27.73 | 0.7642 | 27.93 | 0.7564 | 27.01 | 0.7251 |
| NE [24] | 30.73 | 0.8587 | 29.19 | 0.8065 | 27.92 | 0.7682 |
| LSR [75] | 32.12 | 0.8969 | 28.70 | 0.7469 | 24.44 | 0.5269 |
| SR [17] | 32.21 | 0.8983 | 28.37 | 0.7238 | 23.96 | 0.4903 |
| LcR [76] | 32.23 | 0.8981 | 30.09 | 0.8275 | 30.29 | 0.8449 |
| SSR [77] | 32.34 | 0.8992 | 29.82 | 0.8445 | 28.56 | 0.8022 |
| DRP [78] | 32.60 | 0.9213 | 27.79 | 0.7102 | 23.21 | 0.4585 |
| RefESR | 33.13 | 0.9252 | 30.67 | 0.8500 | 30.98 | 0.8624 |
| Gains | 0.53 | 0.0039 | 0.58 | 0.0055 | 0.69 | 0.0175 |
J. Jiang is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and is also with the Peng Cheng Laboratory, Shenzhen, China ([email protected]). Y. Yu and Z. Wang are with the Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo 101-8430, Japan ({yiyu, wangz}@nii.ac.jp). S. Tang is with the Department of Communication Engineering and Informatics, The University of Electro-Communications, Tokyo 182-8585, Japan ([email protected]). R. Hu is with the National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, 430072, China ([email protected]). J. Ma is with the Electronic Information School, Wuhan University, Wuhan 430072, China ([email protected]).
Abstract
By developing sophisticated image priors or designing deep(er) architectures, a variety of image Super-Resolution (SR) approaches have been proposed recently and have achieved very promising performance. A natural question is whether these methods can be reformulated into a unifying framework, and whether such a framework assists SR reconstruction. In this paper, we present a simple but effective single image SR method based on ensemble learning, which can produce better performance than any of the SR methods to be ensembled (called component super-resolvers). Based on the assumption that a better component super-resolver should have a larger ensemble weight when performing SR reconstruction, we present a Maximum A Posteriori (MAP) estimation framework for the inference of optimal ensemble weights. Specifically, we introduce a reference dataset, composed of High-Resolution (HR) and Low-Resolution (LR) image pairs, to measure the super-resolution abilities (prior knowledge) of different component super-resolvers. To obtain the optimal ensemble weights, we propose to incorporate the reconstruction constraint, which states that the degraded HR image should be equal to the LR observation, as well as the prior knowledge of ensemble weights into the MAP estimation framework. Moreover, the proposed optimization problem admits an analytical solution. We study the performance of the proposed method by comparing it with different competitive approaches, including four state-of-the-art non-deep learning based methods, four latest deep learning based methods and one ensemble learning based method, and demonstrate its effectiveness and superiority on three public datasets.
Index Terms:
Super-resolution, ensemble learning, reference dataset, deep learning, Maximum A Posteriori (MAP).
I Introduction
Image Super-Resolution (SR) is a class of image processing techniques that infer a High-Resolution (HR) image from one or a sequence of Low-Resolution (LR) images [1]. It can transcend the limitations of current optical imaging systems, and has been widely applied in medical and remote sensing imaging, digital photography, depth based 3D reconstruction, and intelligent video surveillance systems [2, 3, 4].
The SR problem is a severely ill-posed inverse problem due to information loss during the image degradation process, e.g., image blurring, aliasing from subsampling, and noise. How to reconstruct a visually pleasing HR image from an LR one remains an extremely challenging task. Priors such as piecewise smoothness [5, 6, 7], sharp edges [8, 9], textures [10], local/nonlocal similar patterns [11, 12, 13, 14], low-rank constraints [15, 16], and sparse representations under certain transformations [17, 18, 19, 20, 21] have been investigated to regularize the SR reconstruction procedure. Generally speaking, current methods fall into two categories: multi-frame reconstruction approaches and learning-based single image SR approaches.
By making full use of the inter-frame complementary information, multi-frame reconstruction based SR approaches take a sequence of LR images of the same scene and fuse them to produce an HR output or a sequence of HR outputs. However, sub-pixel registration is an exceedingly difficult problem and the magnification factor is limited in practice [22]. Learning-based single image SR methods aim at learning the relationship between LR and HR example pairs, and then applying the learned transformation to predict the missing details of an observed LR image. In this paper, we focus on the single image SR problem.
Since the pioneering work of Freeman et al. [23], the single image SR problem has been increasingly studied and has attracted great research interest in recent decades. For example, Chang et al. [24] introduced the locally linear embedding [25] based manifold learning theory into the SR problem for the first time, after which a series of neighbor embedding algorithms were proposed [8, 26, 27, 28, 29]. They can well exploit the local manifold structure of the image patch space. Yang et al. [17] proposed to use a sparse representation algorithm to adaptively choose the most relevant neighbors, avoiding the over- or under-fitting of neighbor embedding based methods and obtaining better results [30, 31, 32, 33]. In order to overcome the inconsistency between the LR and HR spaces, quite a few coupled learning based methods have also been developed recently [34, 35, 36]. In essence, they learn the relationship from one domain/space to another, i.e., from the LR space to the corresponding HR one. The approach of Timofte et al. [37] leverages a divide-and-conquer strategy to learn the mapping relationship between the LR and HR samples in multiple local neighbor spaces, yielding a fast single image SR method based on Anchored Neighborhood Regression (ANR). To further enhance the quality of the mapping relationship, they combined ANR with the simple function based method [38] and proposed the Adjusted ANR (A+ for short) approach [39]. A+ studies the mapping relationship between the LR and HR samples in a much denser sample space, which can guarantee the performance of local linear regression. In addition to the work of [37, 38, 39], some regression algorithms have also been developed to directly learn the relationship between the LR samples and HR samples in a coarse-to-fine [40, 41], sparse [42, 43, 44], collaborative [45, 46], adaptive [9], local [47, 48], pairwise [49] or structured [50] manner.
The above mentioned algorithms are simple, fast, and can well characterize the potential mapping between the LR and HR spaces (especially the local image patch space), and thus they produced very favorable performance.
Over the past few years, deep learning, the re-emergence of neural networks, has been applied with great success in a multitude of fields, such as self-driving cars, computer vision, speech recognition, and machine translation, and has achieved significant and impressive results [51]. Most recently, this technology has also been introduced to solve the image SR problem by learning the mapping relationship between the LR and HR samples in an end-to-end manner [52, 53, 54, 55, 56, 57, 58, 59]. Super-Resolution using Deep Convolutional Networks (SRCNN) [52], Cascade of Sparse Coding based Networks (CSCN) [58], Very Deep Convolutional Networks (VDSR) [54], and Deeply-Recursive Convolutional Networks (DRCN) [55] each carefully design a different network structure to meet the challenge of SR reconstruction. Specifically, SRCNN [52] constructs a network with three convolutional layers, while CSCN [58] cascades sparse coding networks. In [54], VDSR uses a deep model with up to 20 weight layers to predict the residual image between the HR images and the LR ones. With this very deep network, it can use a large receptive field and take a large image context into account, thus capturing the image structure well, especially when the scale factor increases. DRCN [55] recursively applies the same convolutional network as many times as desired without introducing additional parameters for the additional convolutions. To improve perceptual quality, a number of photo-realistic methods based on Generative Adversarial Networks (GAN) [60] have also been presented recently [61, 62].
However, the aforementioned methods, based on different shallow prior models (local manifold structure prior or sparse prior) or different deep networks, each have their own advantages and capture different image details. Over the years, we have witnessed a constant effort to design better-performing methods for the SR problem. A natural question is whether these methods can be reformulated into a unifying framework, and whether such a framework assists the SR task.
One very natural idea is to integrate the outputs of different SR methods (we call the SR algorithms to be ensembled component super-resolvers in the following) in an ensemble learning framework and produce an output that is better than those of all component super-resolvers. Then, given a number of results obtained by the component super-resolvers, how should they be ensembled to produce a better result? The most obvious way is to average all the component super-resolvers equally. However, ensemble learning theory [63] has shown that it may be better to combine some instead of all of the learners. That is to say, when we know in advance that the performance of one component super-resolver is poor, we can remove it or assign it a relatively small ensemble weight in advance. So, the remaining question is how to determine whether a component super-resolver is superior or not. In other words, how to determine the ensemble weights is the essential problem in ensemble learning based SR.
In this paper, we contribute a simple but effective Ensemble learning SR algorithm with a Reference dataset, denoted as RefESR for short. Our method is inspired by external dataset based models. Unlike previous methods that learn prior knowledge for the parameters of one statistical model or the desired HR images, our method directly learns the SR abilities of different methods and uses them to guide the optimization of the ensemble parameters, i.e., the ensemble (or combination) weights. To estimate the optimal ensemble weights, the proposed RefESR method considers both the posterior reconstruction error deduced from the image degradation model and the ensemble weight prior learned from an additional reference dataset, and formulates them in a Maximum A Posteriori (MAP) framework. Moreover, we introduce a simple method to obtain an analytical solution for the ensemble parameters. Fig. 1 shows the pipeline of the proposed RefESR algorithm. To the best of our knowledge, this is the first work to leverage an additional reference dataset to guide the SR reconstruction. Although many previous works have used an additional dataset to exploit the natural image prior, our proposed method directly leverages a reference dataset to obtain the SR ability (in terms of objective quality) of different component super-resolvers, and applies it to guide the subsequent SR reconstruction. Experimental results demonstrate that our RefESR method is better than state-of-the-art deep learning based SR methods. Moreover, our method is very general: it can ensemble the best methods fed into our framework to improve the SR performance, and is thus expected to achieve the best reconstruction results among them.
The remainder of this paper is organized as follows. In Section II, we present some related works on ensemble learning based SR approaches. Section III introduces the proposed ensemble SR framework and the objective function optimization method in detail. The experimental results are presented in Section IV. Further analysis and discussion of the proposed ensemble learning framework are presented in Section V. Finally, we conclude this work in Section VI.
II Related Work
In statistics and machine learning, ensemble learning is a powerful way to produce better performance than could be obtained from any of the component methods. It has been widely applied in the fields of data mining and pattern recognition [64]. Although ensemble learning has achieved great success in machine learning problems, it had not been applied to image SR until recently, when two ensemble learning related SR methods were proposed.
In [65], a video SR method is presented. The authors decompose the video SR task into two stages: draft-ensemble generation and determination of the optimal draft via a convolutional neural network. In essence, they leverage deep networks to select the candidate HR samples in the patch space, which is the general idea of many learning-based SR methods. Though it is termed ensemble-based, it is not strictly an ensemble SR method, because selecting the best samples for the subsequent reconstruction is the basic idea of many learning based SR methods [23, 24, 17]. The other work, by Wang et al. [66], introduced ensemble learning into the SR problem and proposed an ensemble based deep network method for image SR. It focuses on one deep learning based SR method, and generates different models by using different initializations of one specific neural network. Specifically, they took sparse coding based networks [58] as the baseline, and developed Ensemble based Sparse Coding Networks (ESCN) by changing the initializations of SCN [58]. In ESCN, the ensemble weights are adaptively determined by a back-projection model.
The ESCN based SR method achieves better performance than the original SCN method [58]; however, it has two limitations. Firstly, it essentially integrates only one deep learning model, the SCN based neural network, with multiple outputs under different initial conditions. Unfortunately, due to the limited capacity of the same network, the complementary information obtained by only changing the initialization is insufficient, and thus the improvement of the ensemble result is limited. Secondly, it only considers the reconstruction constraint when determining the ensemble weights, and no other prior is taken into consideration. Their model is actually ill-posed, and many solutions satisfy its objective function. Their experiments also show that the optimal ensemble weights and average weights obtain almost the same results. Therefore, considering only the reconstruction constraint is not really effective. In contrast, our proposed method ensembles a variety of different methods, including traditional state-of-the-art learning based methods and deep learning based methods with different neural networks that have emerged in recent years. Moreover, we introduce a reference dataset to measure the performance of different SR methods, which can be seen as the model prior and is incorporated into our objective function as a regularization term.
III Proposed Method
In this section, we present the proposed RefESR method in detail. We first give the problem definition of RefESR in a Bayesian framework. Then, we show how to model the reconstruction constraint and the prior of the ensemble weights, and derive the objective function of our proposed RefESR method. After that, we describe an analytical way to solve the optimization problem.
III-A Problem Setup
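In our proposed ensemble learning based SR method, we can obtain the SR reconstruction results, $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N$, of different methods, $f_1, f_2, \ldots, f_N$, for the observed LR image, $\mathbf{x}$. Here, $f_n$ can be seen as the $n$-th SR model. Given $\{\mathbf{y}_n\}_{n=1}^{N}$ and $\mathbf{x}$ in the ensemble SR framework, our aim is to infer the optimal ensemble weights, $\mathbf{w} = [w_1, w_2, \ldots, w_N]^T$, where $w_n$ is associated with the $n$-th SR model $f_n$. After obtaining the optimal ensemble weights, we can predict the HR output of the LR input by
$$\mathbf{y} = \sum_{n=1}^{N} w_n \mathbf{y}_n. \tag{1}$$
Under the Bayesian framework, the regularized SR problem is related to a probabilistic model as follows:
$$P(\mathbf{w} \mid \mathbf{x}, \{\mathbf{y}_n\}) = \frac{P(\mathbf{x} \mid \{\mathbf{y}_n\}, \mathbf{w})\, P(\mathbf{w})}{P(\mathbf{x} \mid \{\mathbf{y}_n\})}.$$
Notice that the marginal likelihood, $P(\mathbf{x} \mid \{\mathbf{y}_n\})$, does not depend on $\mathbf{w}$. With the observation of $\mathbf{x}$ and $\{\mathbf{y}_n\}$, the MAP estimation of $\mathbf{w}$ can be formulated as,
$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \left\{ \log P(\mathbf{x} \mid \{\mathbf{y}_n\}, \mathbf{w}) + \log P(\mathbf{w}) \right\}, \tag{2}$$
where the first term is the likelihood term and the second term denotes the prior knowledge of $\mathbf{w}$. Once the likelihood term and the prior term are defined, we can maximize the objective function in Eq. (2) to obtain the optimal ensemble weights $\mathbf{w}^{*}$. With the optimal ensemble weights, we can infer the target HR output.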
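The weighted combination described above (the predicted HR output as a weighted sum of the component outputs) can be sketched in a few lines of NumPy. This is only an illustration: the function name `ensemble` and the random toy arrays are assumptions made here, not part of the paper.

```python
import numpy as np

def ensemble(outputs, weights):
    """Weighted sum of component HR estimates: sum_n w_n * y_n."""
    outputs = np.asarray(outputs, dtype=np.float64)   # shape (N, H, W)
    weights = np.asarray(weights, dtype=np.float64)   # shape (N,)
    if outputs.shape[0] != weights.shape[0]:
        raise ValueError("one weight per component output is required")
    # tensordot contracts the component axis, leaving an (H, W) image
    return np.tensordot(weights, outputs, axes=1)

# Toy data (hypothetical): three 4x4 "HR estimates" from three super-resolvers
rng = np.random.default_rng(0)
ys = rng.random((3, 4, 4))

# Equal weights reduce to plain averaging of the component outputs
uniform = ensemble(ys, [1 / 3, 1 / 3, 1 / 3])
```

With equal weights the result is the plain average of the components; putting all weight on one component returns that component unchanged.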
III-B Reconstruction Constraint Modeling
For the single image SR problem, the relationship between the HR image $\mathbf{y}$ and the LR one $\mathbf{x}$ can be modeled by the observation model [67]:
$$\mathbf{x} = \mathbf{D}\mathbf{B}\mathbf{y} + \mathbf{v}. \tag{3}$$
Here, the matrix $\mathbf{B}$ denotes a blurring operator, the matrix $\mathbf{D}$ denotes the down-sampling operator, and $\mathbf{v}$ denotes the additive Gaussian white noise. If we use the matrix $\mathbf{H}$ to denote the blurring and downsampling processes together (i.e., $\mathbf{H}$ stands for the degradation operations), Eq. (3) can be rewritten as [1]:
$$\mathbf{x} = \mathbf{H}\mathbf{y} + \mathbf{v}. \tag{4}$$
Since the matrix H has far fewer rows than columns, Eq. (4) is ill-posed and has an infinite number of solutions. Therefore, in order to recover a reasonable HR image, SR approaches typically try to find and model an appropriate prior knowledge of natural images. For example, gradient prior, self-similarity property (that some salient features repeat across different scales within an image), or the coupled LR/HR patches based algorithms have been used to effectively model the prior for building the inverse recovery mapping problem.
Developing sophisticated image priors has been the focus of much single image SR research in the past decade. In contrast, the reconstruction constraint, which states that the degenerated HR image should be equal to the LR observation one, has received relatively little attention. Some algorithms do not enforce x = Hy at all. The representative ANR [37], A+ [39], and recently proposed deep learning based methods [52, 53, 54, 55] all ignore this reconstruction constraint.
To this end, in our ensemble learning based SR framework, we introduce this reconstruction constraint into our objective function. Specifically, we require that the blurred and downsampled HR ensemble output approximately equal the LR input image. We assume that the difference between the degraded ensemble HR output and the LR input image, i.e., the reconstruction error, obeys a Gaussian distribution; thus the likelihood term can be written as follows
$$P(\mathbf{x} \mid \{\mathbf{y}_n\}, \mathbf{w}) \propto \exp\left( -\frac{\left\| \mathbf{x} - \mathbf{H} \sum_{n=1}^{N} w_n \mathbf{y}_n \right\|_2^2}{2\sigma_v^2} \right), \tag{5}$$
where $\sigma_v$ denotes the standard deviation of the noise.
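To make the reconstruction constraint concrete, the sketch below applies a degradation operator to an HR estimate and measures the squared reconstruction error against the LR input. It is an illustration under a stated assumption: `degrade` uses block averaging as a simple linear blur-and-downsample stand-in, whereas the experiments in this paper use bicubic downsampling. The function names are hypothetical.

```python
import numpy as np

def degrade(hr, scale):
    """ASSUMED stand-in for the degradation operator H: average each
    scale x scale block (a linear blur followed by decimation)."""
    h, w = hr.shape
    h, w = h - h % scale, w - w % scale          # crop to a multiple of scale
    blocks = hr[:h, :w].reshape(h // scale, scale, w // scale, scale)
    return blocks.mean(axis=(1, 3))

def reconstruction_error(lr, hr_estimate, scale):
    """Squared L2 distance between the LR input and the degraded HR estimate."""
    return float(np.sum((lr - degrade(hr_estimate, scale)) ** 2))

# Toy example: an LR image generated from a known HR image is perfectly
# consistent with that HR image under the same operator.
rng = np.random.default_rng(0)
hr = rng.random((8, 8))
lr = degrade(hr, 2)
```

Because this operator is linear, degrading a weighted sum of component outputs equals the weighted sum of the degraded components, which is the property the optimization later relies on.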
III-C Prior Modeling of Ensemble Weights
The aforementioned reconstruction constraint can be seen as the specific regularization for the ensemble weights of an observed LR image. In this subsection, we propose to regularize the ensemble weights by defining another prior of the ensemble weights, thus overcoming the ill-posed solution of Eq. (5).
In practice, the performance of each component super-resolver on a given input is unknown. However, we can obtain their SR results on a reference dataset, which can be used to approximate that performance. Specifically, we introduce an additional reference dataset and test the performance of the component super-resolvers on it. Their reconstruction quality evaluations can then be obtained by combining their performances at different magnification factors, e.g., 2, 3, and 4 in our experiments,
[TABLE]
We denote by $\mathrm{PSNR}_n^{s}$ and $\mathrm{SSIM}_n^{s}$ the mean Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [68] results of the $n$-th component super-resolver at scale $s$, respectively. It is worth mentioning that more measurements can be incorporated to obtain the performance score. Our basic assumption is that a method obtaining better performance on the reference dataset should get a relatively larger weight when reconstructing the HR output image of an LR input in the ensemble framework. Fig. 2 shows the process of obtaining the ensemble weight prior.
Therefore, given the performance of the component super-resolvers on the reference dataset, we define the $n$-th element of the reference weight vector $\mathbf{w}^{ref}$ as follows,
[TABLE]
where $\sigma$ is the bandwidth parameter, and $p_{\max}$ is the best performance of the component super-resolvers, $p_{\max} = \max_n p_n$, with $p_n$ the performance score of the $n$-th component super-resolver. The numerator represents the performance similarity between the $n$-th component super-resolver and the best method, while the denominator is a normalization constant that guarantees that the elements of $\mathbf{w}^{ref}$ sum to one. The bandwidth parameter $\sigma$ is crucial for the following SR task. Very large or very small values will be detrimental to the final result. As shown in Fig. 3, when the value of $\sigma$ is too small, the best component super-resolver will dominate the SR reconstruction, i.e., the weight of the best component super-resolver will be close to 1, while the weights of the other component super-resolvers are almost 0. In contrast, when the value of $\sigma$ is too large, all the component super-resolvers will contribute equally to the SR reconstruction, i.e., the component super-resolvers are assigned the same weight. For more detailed analysis, please refer to the experimental section.
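The effect of the bandwidth can be seen in a small numerical sketch. The exact kernel is not recoverable from the text here, so the code below assumes a Gaussian kernel on the gap to the best score; this is an assumption, not necessarily the form used in the paper. The function name `reference_weights` and the three scores are likewise illustrative only.

```python
import numpy as np

def reference_weights(scores, bandwidth):
    """Turn per-method performance scores into prior weights that sum to one.

    ASSUMPTION: similarity to the best method is measured with a Gaussian
    kernel on the score gap; the paper's exact kernel may differ.
    """
    scores = np.asarray(scores, dtype=np.float64)
    gap = scores.max() - scores                       # distance to the best method
    kernel = np.exp(-(gap ** 2) / (2.0 * bandwidth ** 2))
    return kernel / kernel.sum()                      # normalize to sum to one

# Illustrative scores only (three hypothetical component super-resolvers)
scores = [29.13, 29.28, 29.78]

w_sharp = reference_weights(scores, 0.01)   # tiny bandwidth: best method dominates
w_flat = reference_weights(scores, 100.0)   # huge bandwidth: near-uniform weights
```

Under this assumed kernel, a very small bandwidth places essentially all weight on the best-scoring method, while a very large bandwidth approaches equal weighting, matching the two extremes described above.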
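Note that $\mathbf{w}^{ref}$ denotes the prior weights learned from the reference dataset. Our aim is to obtain an input-specific ensemble weight vector $\mathbf{w}$ that does not differ too much from $\mathbf{w}^{ref}$. Thus, we define the prior probability of $\mathbf{w}$ by a Gaussian model due to its simplicity:
$$P(\mathbf{w}) \propto \exp\left( -\frac{\left\| \mathbf{w} - \mathbf{w}^{ref} \right\|_2^2}{2\sigma_w^2} \right), \tag{7}$$
where $\sigma_w$ is a scale parameter for the prior distribution of the ensemble weights $\mathbf{w}$.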
III-D Objective Function
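By substituting Eq. (5) and Eq. (7) into Eq. (2) and dropping some constant terms, we have
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \left\| \mathbf{x} - \mathbf{H} \sum_{n=1}^{N} w_n \mathbf{y}_n \right\|_2^2 + \lambda \left\| \mathbf{w} - \mathbf{w}^{ref} \right\|_2^2. \tag{8}$$
The first term is the reconstruction error, while the second is the difference between the pre-learned weight vector and the optimal weight vector to be estimated. The regularization parameter $\lambda$ is related to the noise standard deviation $\sigma_v$ and the prior scale $\sigma_w$ by $\lambda = \sigma_v^2 / \sigma_w^2$, and is used to balance the contributions of the reconstruction error and the prior knowledge of $\mathbf{w}$.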
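For readers who want the intermediate step, the following sketch spells out how a Gaussian likelihood and a Gaussian prior lead to the regularized least-squares form. The symbols $\sigma_v$ (noise standard deviation) and $\sigma_w$ (prior scale) are labels chosen here for illustration, and normalization constants are absorbed into "const".

```latex
\begin{aligned}
-\log\!\big[ P(\mathbf{x} \mid \{\mathbf{y}_n\}, \mathbf{w})\, P(\mathbf{w}) \big]
  &= \frac{1}{2\sigma_v^2} \Big\| \mathbf{x} - \mathbf{H} \sum_{n=1}^{N} w_n \mathbf{y}_n \Big\|_2^2
   + \frac{1}{2\sigma_w^2} \big\| \mathbf{w} - \mathbf{w}^{ref} \big\|_2^2 + \text{const}, \\
2\sigma_v^2 \cdot \big( -\log[\,\cdot\,] \big)
  &= \Big\| \mathbf{x} - \mathbf{H} \sum_{n=1}^{N} w_n \mathbf{y}_n \Big\|_2^2
   + \underbrace{\frac{\sigma_v^2}{\sigma_w^2}}_{\lambda} \big\| \mathbf{w} - \mathbf{w}^{ref} \big\|_2^2 + \text{const}.
\end{aligned}
```

Maximizing the posterior is therefore equivalent to minimizing the two-term objective, and a noisier observation (larger $\sigma_v$) or a tighter prior (smaller $\sigma_w$) increases $\lambda$ and pulls the solution toward the reference weights.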
In order to make the ensemble SR results interpretable, we propose to incorporate a sum-to-one constraint into the objective function. Thus, we have
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \left\| \mathbf{x} - \mathbf{H} \sum_{n=1}^{N} w_n \mathbf{y}_n \right\|_2^2 + \lambda \left\| \mathbf{w} - \mathbf{w}^{ref} \right\|_2^2, \quad \text{s.t.} \ \sum_{n=1}^{N} w_n = 1. \tag{9}$$
To obtain an optimal ensemble weight vector, we simultaneously take into consideration the input-dependent reconstruction constraint and the prior of the ensemble methods learned from a reference dataset. The first term can be seen as a global reconstruction constraint, which enforces consistency between the degraded HR estimate and the input LR image. For patch based SR methods [24, 37, 39], the averaged and fused HR estimate may not perfectly satisfy the global reconstruction constraint [17, 70]. In other words, these patch based SR methods reconstruct the HR image locally (patchwise) and ignore the global information. By adding this global reconstruction constraint, our method encourages the degraded HR image (Hy) to match the observed LR image (x), thus capturing more information about the global structure of the target HR image. Therefore, the proposed ensemble model avoids both the lack of flexibility caused by the absence of a data-based reconstruction constraint and the non-uniqueness of the solution caused by ill-posedness.
III-E Optimization
Because the blurring and downsampling processes are linear operators, we have
$$\mathbf{H} \sum_{n=1}^{N} w_n \mathbf{y}_n = \sum_{n=1}^{N} w_n \mathbf{H} \mathbf{y}_n = \mathbf{Y}\mathbf{w}. \tag{10}$$
Each column of $\mathbf{Y}$ denotes one downsampled HR output, $\mathbf{H}\mathbf{y}_n$. By substituting Eq. (10) into Eq. (9), the objective function can be rewritten in the following matrix form,
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \left\| \mathbf{x} - \mathbf{Y}\mathbf{w} \right\|_2^2 + \lambda \left\| \mathbf{w} - \mathbf{w}^{ref} \right\|_2^2, \quad \text{s.t.} \ \mathbf{1}^T \mathbf{w} = 1. \tag{11}$$
Eq. (11) can be written as,
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \left\| \mathbf{x}' - \mathbf{Y}'\mathbf{w} \right\|_2^2, \quad \text{s.t.} \ \mathbf{1}^T \mathbf{w} = 1, \tag{12}$$
where $\mathbf{x}' = \begin{bmatrix} \mathbf{x} \\ \sqrt{\lambda}\, \mathbf{w}^{ref} \end{bmatrix}$, $\mathbf{Y}' = \begin{bmatrix} \mathbf{Y} \\ \sqrt{\lambda}\, \mathbf{I} \end{bmatrix}$, and $\mathbf{I}$ is a unit matrix of size $N \times N$.
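The stacking step can be checked numerically: appending the scaled reference weights to the LR vector and the scaled identity to the matrix of degraded outputs turns the two-term objective into a single squared norm. The sketch below uses random toy data; all sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, lam = 12, 3, 0.5                 # LR pixels, components, regularization (toy values)
Y = rng.random((M, N))                 # columns play the role of degraded outputs H y_n
x = rng.random(M)                      # vectorized LR input
w_ref = np.full(N, 1.0 / N)            # reference (prior) weights
w = rng.random(N)                      # any candidate weight vector

# Augmented quantities: x' = [x; sqrt(lam) w_ref], Y' = [Y; sqrt(lam) I]
x_aug = np.concatenate([x, np.sqrt(lam) * w_ref])
Y_aug = np.vstack([Y, np.sqrt(lam) * np.eye(N)])

# Reconstruction error plus weighted prior deviation ...
two_term = np.sum((x - Y @ w) ** 2) + lam * np.sum((w - w_ref) ** 2)
# ... equals the single stacked least-squares residual
stacked = np.sum((x_aug - Y_aug @ w) ** 2)
```

The equality holds for every weight vector, not only the optimum, so the two formulations share the same minimizer under the same constraint.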
Apparently, Eq. (12) is a constrained linear least squares problem. Following the work of [25], we first define a local Gram matrix $\mathbf{G}$ for $\mathbf{x}'$,
$$\mathbf{G} = \left( \mathbf{x}' \mathbf{1}^T - \mathbf{Y}' \right)^T \left( \mathbf{x}' \mathbf{1}^T - \mathbf{Y}' \right), \tag{13}$$
where $\mathbf{1}$ is a column vector of ones. Then, the problem (12) has the following analytical solution:
$$\mathbf{w}^{*} = \frac{\mathbf{G}^{-1} \mathbf{1}}{\mathbf{1}^T \mathbf{G}^{-1} \mathbf{1}}. \tag{14}$$
Upon acquiring the optimal ensemble weights $\mathbf{w}^{*}$, we can simply combine the results of the component super-resolvers through Eq. (1). It is worth noting that the objective functions of Cevikalp et al. [71] and of our proposed method are both essentially constrained least squares problems, as proposed in [25]. The work of [71] tries to obtain the optimal combination weights of different classifiers to achieve the best classification performance, while our method focuses on the image SR problem, and tries to obtain the optimal combination (ensemble) weights with the global reconstruction constraint as well as the prior on the weights. In this sense, they are different, though they use the same optimization method to solve their objective functions. In fact, in the field of image processing and computer vision, the objective function of many methods is very simple, i.e., a constrained least squares problem. The difference lies in the constraints (prior knowledge) that different methods use to regularize the solutions. How to find good prior knowledge and how to model it effectively is the key to the success of an algorithm. The novelty of the proposed method is the introduction of a reference dataset and its use to produce prior knowledge that regularizes the combination (ensemble) weights.
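The closed-form solution of a sum-to-one constrained least-squares problem via a local Gram matrix (the construction used in locally linear embedding [25]) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `solve_ensemble_weights` and the random toy data are assumptions, and the Gram matrix is assumed to be invertible.

```python
import numpy as np

def solve_ensemble_weights(x, Y, w_ref, lam):
    """Minimize ||x - Y w||^2 + lam * ||w - w_ref||^2 subject to sum(w) = 1.

    x: vectorized LR input, shape (M,)
    Y: degraded component outputs as columns, shape (M, N)
    w_ref: reference (prior) weights, shape (N,)
    """
    n = Y.shape[1]
    # Stack the prior term under the data term
    x_aug = np.concatenate([x, np.sqrt(lam) * w_ref])
    Y_aug = np.vstack([Y, np.sqrt(lam) * np.eye(n)])
    # Local Gram matrix of the differences between x' and each column of Y'
    diff = x_aug[:, None] - Y_aug
    gram = diff.T @ diff
    # G^{-1} 1 via a linear solve (assumes gram is non-singular)
    g = np.linalg.solve(gram, np.ones(n))
    return g / g.sum()                   # rescale so the weights sum to one

# Toy data (hypothetical sizes and values)
rng = np.random.default_rng(2)
x = rng.random(12)
Y = rng.random((12, 3))
w_ref = np.array([0.2, 0.3, 0.5])
lam = 0.5
w_star = solve_ensemble_weights(x, Y, w_ref, lam)
```

Using a linear solve instead of an explicit matrix inverse is a common numerical choice. If the Gram matrix were close to singular, a small diagonal regularizer could be added, although that goes beyond what the text describes.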
IV Experimental Results
In this section, we present the experimental settings used to evaluate the proposed RefESR approach and show the reconstruction results generated by carrying out SR experiments on three public general image databases and some face image databases.
IV-A Experimental Setup
Database. To test the performance, we use three commonly used image sets, SET5, SET14, and Urban100, as the testing images (see footnote 1). Like many state-of-the-art single image SR methods [53, 39, 69, 52, 58, 54, 55], in our experiments the original HR images are degraded by Bicubic interpolation (i.e., the imresize function in Matlab) with factors of 2, 3, and 4 to generate the corresponding LR images. It should be noted that if the degradation process of the input LR image is unknown, which corresponds to blind image SR, the performance of our method will drop sharply because of the mismatch between the true image degradation and the simulated degradation of the training dataset [72].
Footnote 1: SET14 includes 14 different scenes and was first used by Zeyde et al. [32] to show their results, SET5 includes 5 different scenes and was used by Bevilacqua et al. [26], and Urban100 was created by Huang et al. [53] and contains 100 HR images with a variety of real-world structures, such as urban, city, and architecture scenes. The length and width of the original HR images for the SET14 (the first column), SET5 (the second column), and Urban100 (the last two columns) databases all range from 200 pixels to 600 pixels.
Note that there are some contextual connections between the images in the reference set and the images in the test set. The importance of such connections has been confirmed by many domain-specific image SR methods, e.g., face hallucination and text SR. When super-resolving LR face images, a good general image SR method trained on diverse general images is usually worse than a domain-specific face image SR method trained on face images. In this paper, we consider only the general image SR problem, so the reference dataset should be as diverse as possible.
Implementation Details. To ensemble different component super-resolvers, we first select some state-of-the-art SR algorithms, which include four non-deep learning methods, i.e., Kim [31], SelfExSR [53], A+ [39], and IA [69], and five deep learning based methods, i.e., SRCNN [52], CSCN [58], CSCN-MV [58], VDSR [54], and DRCN [55] (see footnote 2). Then we test their performance on the reference dataset. Because we know the ground truth of each input LR image, the SR abilities of these algorithms can be measured by objective metrics such as PSNR, SSIM, or their combination. The reference weight vector (calculated by Eq. (6)) is then applied to regularize the optimization of the ensemble weights.
Footnote 2: We select these nine methods for their representativeness, good performance, and the public availability of their source codes.
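PSNR, one of the objective metrics mentioned above, can be computed as in the following minimal sketch. It assumes 8-bit intensities with a peak value of 255 and images given as flat sequences of pixel values; the function name `psnr` and the sample values are hypothetical, and details such as which color channel is evaluated or whether borders are cropped are not specified here.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Every pixel is off by one gray level, so the MSE is exactly 1.
score = psnr([10, 20], [11, 19])
```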
In the testing phase, we first reconstruct the HR images with the above-mentioned component super-resolvers. The optimal ensemble weights are then obtained by Eq. (14). The final HR output is constructed by combining the HR result images of the different component super-resolvers with the optimal ensemble weights.
IV-B Parameter Analysis
In this subsection, we analyze the effect of the model parameters on the performance of RefESR, and validate the reconstruction constraint and the reference ensemble weight prior used in the proposed method. In particular, we conduct experiments on the SET14 testing image set with a magnification factor of 3. Similar conclusions can be drawn for the other cases, so we do not show them one by one. From the objective function (9) of our method, we learn that the bandwidth parameter and the regularization parameter have a great impact on the performance of the algorithm.
Fig. 4 and Fig. 5 show the performance of our method as one parameter varies while the other is set to its optimal value. As shown in Fig. 4, we can draw at least the following two conclusions. (i) The ensemble SR reconstruction is effective. This can be concluded by comparing two parameter settings: in the first, almost only the best component super-resolver is active (the 8-th method, i.e., VDSR [54]; please refer to the top-left of Fig. 3), while in the second, only three component super-resolvers (the 4-th, 8-th, and 9-th methods, i.e., IA [69], VDSR [54], and DRCN [55]) contribute to the final result. (ii) The prior knowledge learned from the reference dataset is effective. This can be concluded by comparing against the setting in which all the component super-resolvers are treated equally, i.e., the ensemble weights are set to the same value (please refer to the bottom-right of Fig. 3), where the performance is worse. The reason is that a poor component super-resolver with unreasonable reconstruction results pulls down the overall reconstruction performance.
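The two extremes discussed above (one dominant component versus equal weights) are what a softmax-style mapping from reference-set scores to weights produces as its bandwidth changes. The sketch below uses such a mapping only as an assumed stand-in for Eq. (6), whose exact form is not restated in this section; the function name `reference_weights`, the scores, and the bandwidth values are hypothetical.

```python
import math


def reference_weights(scores, sigma):
    """Map quality scores (e.g., PSNR in dB) to weights that sum to one.

    Assumed form: w_n is proportional to exp((s_n - max(s)) / sigma),
    where sigma plays the role of a bandwidth.
    """
    best = max(scores)
    # Subtracting the best score keeps the exponentials in a safe range.
    raw = [math.exp((s - best) / sigma) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


scores = [29.0, 29.2, 29.6]  # hypothetical reference-set PSNR values

sharp = reference_weights(scores, sigma=0.01)    # small bandwidth: best method dominates
flat = reference_weights(scores, sigma=1000.0)   # large bandwidth: nearly equal weights
```

With a small bandwidth almost all of the weight goes to the best-scoring component, and with a very large bandwidth the weights approach a plain average, which is the regime in which a poor component can pull the ensemble down.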
From Fig. 5, we can see that the performance first increases with the value of the regularization parameter $\lambda$ and then slightly decreases. This indicates that the prior knowledge of the reference ensemble weights is very effective for SR reconstruction. When $\lambda = 0$, the method reduces to the case of considering only the reconstruction constraint. There is a 0.2 dB gain of the proposed method over the variant that neglects the prior knowledge of the reference ensemble weights. The later decrease is caused by overemphasizing the prior knowledge of the reference ensemble weights while neglecting the reconstruction constraint. This verifies our motivation of simultaneously taking into consideration the reconstruction constraint (which favors the degradation model) and the prior knowledge generated from the reference dataset.
For ease of comparison, in Table IV we tabulate the performance of the above-mentioned cases: RefESR without the reconstruction constraint, RefESR without the weight prior, ensemble via averaging, and the proposed RefESR method. In the second row, we also list the performance of the best component super-resolver, i.e., VDSR [54]. Using only the reconstruction constraint and averaging-based ensembling obtain similar results, which is consistent with Wang et al.’s results (see Table 4 in [66]). This also shows that it is not enough to consider the reconstruction constraint alone. Comparing RefESR without the reconstruction constraint against RefESR without the weight prior indicates that the ensemble weight prior is effective and relatively more important than the reconstruction constraint. This is mainly because the component super-resolvers used in our experiments are very competitive and have very good SR performance, so they essentially satisfy the reconstruction constraint already. By incorporating the prior knowledge of ensemble weights, our method obtains a considerable gain, i.e., 0.2 dB. Because image SR is a very hot topic and has become a test bed for many emerging models and algorithms, strong new methods are constantly being presented, and it is therefore very difficult for a new method to obtain a large gain over previous ones. From Table IV, we can also see that simply averaging the results of the different methods sacrifices the final ensemble performance, e.g., a 0.07 dB decrease compared with VDSR [54]. This once again shows the effectiveness of adaptively assigning different ensemble weights to different component super-resolvers.
Under the above optimal parameter settings, we examine the final ensemble weights of the different testing images in SET14. As shown in Fig. 6, the three best component super-resolvers, IA [69], VDSR [54], and DRCN [55], dominate the SR reconstruction. The better the SR performance of a method on the reference dataset, the larger its ensemble weight. This verifies our assumption that a method with better performance on the reference dataset should get a relatively larger weight when reconstructing the HR output image of an LR input in the ensemble framework. Moreover, the results also show that some low-quality component super-resolvers do not contribute substantially to the results. When we select only the three methods that play dominant roles (i.e., those whose ensemble weights are relatively large), we find that this has little impact on the final performance of the proposed algorithm. This is consistent with the observation that it may be better to combine some instead of all of the component super-resolvers.
IV-C Compare with State-of-the-art
To verify the effectiveness of the proposed RefESR method, we provide quantitative and qualitative comparisons with the nine component super-resolvers and Wang et al.’s ESCN method [66] over SET5, SET14, and Urban100 for different upscaling factors. We also include the results of Bicubic interpolation, which can be seen as the baseline. In Table I, Table II, and Table III, we show the PSNR and SSIM of the ten comparison methods and our RefESR method. All the values in the tables are averages over all the images within a dataset. From the results, we can see that our method outperforms almost all existing methods, including the most competitive deep learning based methods, on all datasets and scale factors in terms of PSNR. Only in two situations is our RefESR method slightly worse than DRCN [55], in terms of SSIM. The visual comparisons of three typical images are shown in Fig. 7, Fig. 8, and Fig. 9. To make the comparison clearer, we also give magnified views of local regions (marked by red boxes). Our method produces relatively sharper boundaries and is free of ringing artifacts. This can be explained by the following two reasons: (i) Through the ensemble strategy, it is possible to emphasize the strengths of the approaches with superior performance while suppressing the weaknesses of the approaches with poor performance. (ii) The reconstruction artifacts of one component super-resolver, such as ringing artifacts, can be weakened by fusing multiple results. However, if all methods produce ringing artifacts in the same region, the ensemble result cannot eliminate these artifacts.
Image SR is a very hot topic and has become a test bed for many emerging models and algorithms, especially the recently very popular deep learning techniques. A new algorithm is released on arXiv almost every few days. While this paper was being prepared, a series of deep learning based SR algorithms were released and achieved very good performance. To further demonstrate the effectiveness of the proposed ensemble learning framework, we additionally ensemble the most competitive method, EDSR [73], with the aforementioned nine component super-resolvers. Table V shows the results of EDSR and the proposed RefESR. In addition, we also give the results of combining the geometric ensemble strategy with the proposed ensemble strategy, which is denoted as RefE2SR. From these results, we observe that: (i) Although the performance of EDSR is already very good, the proposed ensemble framework can still improve the final results. This shows that EDSR and the other methods still have a certain degree of complementarity. (ii) RefE2SR is better than RefESR. This can be explained as follows: when the geometric ensemble strategy is applied to the component super-resolvers, their performance improves, and with these improved SR results our proposed method can further raise the overall performance of the combined ensemble strategy.
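The geometric ensemble strategy is not detailed in this section. The sketch below assumes it refers to the geometric self-ensemble associated with EDSR [73]: the LR input is flipped and rotated in eight ways, each version is super-resolved, the transform is undone on each output, and the results are averaged. The callable `super_resolve` and the nearest-neighbor stand-in used in the demo are hypothetical.

```python
import numpy as np


def geometric_self_ensemble(lr, super_resolve):
    """Average SR outputs over the 8 flip/rotation variants of a 2-D LR image."""
    outputs = []
    for flip in (False, True):
        for k in range(4):
            # Forward transform: rotate by k * 90 degrees, then optionally flip.
            t = np.rot90(lr, k)
            if flip:
                t = np.fliplr(t)
            sr = super_resolve(t)
            # Inverse transform in reverse order: undo the flip, then the rotation.
            if flip:
                sr = np.fliplr(sr)
            outputs.append(np.rot90(sr, -k))
    return np.mean(outputs, axis=0)


def nearest_x2(img):
    """Stand-in super-resolver: 2x nearest-neighbor upscaling."""
    return np.kron(img, np.ones((2, 2)))


lr = np.arange(12.0).reshape(3, 4)
out = geometric_self_ensemble(lr, nearest_x2)
```

Because nearest-neighbor upscaling commutes with flips and 90-degree rotations, the averaged output here equals a single pass of the stand-in; a learned super-resolver is generally not equivariant in this way, which is why averaging the eight outputs can change the result.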
IV-D Ensemble Super-Resolution Results with Face Images
In order to verify the universality of our proposed ensemble framework, we test the proposed RefESR method on the task of face image SR, a.k.a. face hallucination [79]. Similarly, the performance of different face SR algorithms is learned through a reference set, i.e., their ensemble weights are estimated, and then the reconstruction results of the different algorithms on newly observed LR faces are integrated based on the estimated weights.
The reference face dataset consists of 600 images of 600 subjects, of which 200 subjects are from the CAS-PEAL-R1 face database [80], 100 subjects are from the CUHK face database [81], 200 subjects are from the COX-S2V face database [82], and 100 subjects are from the Scface face database [83]. To evaluate the performance of the component super-resolvers, we additionally collect 20, 10, 20, and 10 face images from these four databases to form the evaluation dataset. For testing, we capture 42 High-Definition (HD) images, whose face images are very different from the face images in the reference face dataset. Some example images are shown in Fig. 10. In our experiments, the component super-resolvers for face images include Wang et al.’s eigentransformation method [74], neighbor embedding (NE) [24], least squares representation (LSR) [75], sparse representation (SR) [17], locality-constrained representation (LcR) [76], smooth sparse representation (SSR) [77], and dual regularization prior (DRP) [78].
Similarly, we apply the reference face dataset to train the component super-resolvers and use the evaluation dataset to obtain their performance in terms of PSNR and SSIM. The reference weight vector can then be calculated according to Eq. (6). Based on this prior knowledge, we can obtain the optimal ensemble weight vector for each input LR face image by solving the objective function (9). In addition, we also conduct some experiments to test the robustness of our method when the input is contaminated by noise. Our intuition is as follows: given a noisy input, if the resulting images generated by different algorithms are not optimal (they may contain noise), then the noise can be smoothed through the fusion of the different results.
The table of face SR results reports the performance (in terms of average PSNR and SSIM) of the different component super-resolvers and the proposed RefESR method under different noise levels. RefESR achieves the best average PSNR and SSIM results. The gains of the proposed method over the second best method are obvious, greater than 0.5 dB in terms of PSNR. In addition, we observe that the advantage of the proposed method becomes more obvious as the noise increases. In particular, when the input is noiseless, the PSNR gain of the proposed method over the second best method is 0.53 dB. When the input is contaminated by the two levels of noise, the gains are 0.58 dB and 0.69 dB, respectively. We attribute this to the advantages of ensemble learning, which can reduce the uncertainties caused by noise in the different methods. Fig. 11 shows some visual comparison results of the component super-resolvers and the proposed method. From these results, we observe that the proposed RefESR method can remove most of the noise while maintaining the main structural information well.
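The noise-suppression effect can be made concrete under a simplifying assumption that the paper does not state: suppose that, at a given pixel, the error of the $n$-th component result is zero-mean noise $e_n$ with variance $\sigma_n^2$, that these errors are independent across the $N$ components, and that the ensemble weights $w_n$ are nonnegative and sum to one. Then the variance of the fused error is bounded by the largest component variance:

```latex
\mathrm{Var}\!\left(\sum_{n=1}^{N} w_n e_n\right)
  = \sum_{n=1}^{N} w_n^{2}\,\sigma_n^{2}
  \;\le\; \Big(\max_{n}\sigma_n^{2}\Big)\sum_{n=1}^{N} w_n^{2}
  \;\le\; \max_{n}\sigma_n^{2},
\qquad
\text{and for } w_n = \tfrac{1}{N},\ \sigma_n = \sigma:\quad
\mathrm{Var} = \frac{\sigma^{2}}{N}.
```

In practice the component errors are correlated, since all methods start from the same noisy input, so the actual reduction is smaller than this idealized bound suggests.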
V Discussion
In this section, we provide a deeper analysis of the proposed ensemble learning framework, so that readers can better understand our idea.
Time complexity. By ensembling the results of some state-of-the-art methods, we can expect better reconstruction performance. This will also result in very high computational complexity. Despite the efficient solution of the optimization procedures of ESR, which take around 0.06 seconds for each image, the computational complexity of our method is high because the total running time is the sum of (i) all component super-resolvers and (ii) the optimization procedures of ESR. Therefore, the computational complexity will be a bottleneck for our approach in practical applications.
Theoretical guarantee. Another drawback of the proposed algorithm is that there is no theoretical guarantee to produce a better result by ensembling different methods, which is also the limitation of conventional ensemble learning based machine learning methods [64]. From the experiments, we learn that in most cases our RefESR method beats all the comparison methods. However, under some situations, our RefESR method is worse than the best comparison method. Therefore, in the future we will consider the learning of a safe prediction from multiple component super-resolvers, which is not worse than the performance of all component super-resolvers.
Model universality. Different methods suit different kinds of test images. For example, there are SR algorithms for general images and SR algorithms for specific images such as digital characters, faces, and irises. SR models trained on general images are not suitable for the reconstruction of specific images, and vice versa. Furthermore, the ensemble weight prior obtained for the different methods from general images may not reflect their SR ability on specific images. In this paper, the proposed ensemble learning based SR method is applied to general image and face image SR tasks. The experiments indicate that the proposed method is effective in improving the performance of existing generic image SR algorithms and face image SR algorithms. In summary, the proposed framework is universal in the sense that, given a reference dataset, it can improve the performance of existing SR methods when the input image belongs to the same class as the reference dataset.
Choice of component super-resolvers. In this paper, we do not consider the complementarity of different methods, but directly select several representative methods in the current SR field, including four shallow learning-based methods and five deep learning-based methods. We also believe that the characteristics of the different algorithms should be considered when choosing component super-resolvers. Ensembling component super-resolvers with different characteristics is more likely to improve the final ensemble performance.
Global reconstruction constraint. As shown in many previous works [84, 32, 17], global reconstruction constraint, which claims that the degenerated HR estimation should be consistent with the observed LR image [17, 70], is very effective for enhancing the final super-resolved results by an iterative back projection strategy. In our experiments, we have found that if the performance of the component super-resolver is good enough, the improvement brought by reconstruction constraint is very limited. In other words, when the component super-resolver is good enough, it can basically meet the reconstruction constraint.
VI Conclusion
In this paper, we present a novel framework based on ensemble learning to solve the single image SR problem. It introduces a reference dataset and incorporates a learned prior for each component super-resolver to regularize the optimization of the ensemble weights. This prior states that a method with better performance on the reference dataset should get a relatively larger weight when reconstructing the HR output image of an LR input in the ensemble framework. We simultaneously model this learned prior of the ensemble weights and the reconstruction constraint, which states that the degraded HR image should be equal to the LR observation, in an MAP formulation. Finally, we present an analytical solution to the constrained least squares problem induced by the MAP framework. Results show the effectiveness of the introduced prior knowledge of ensemble weights learned from a reference dataset.
Acknowledgment
We would like to thank Dr. Zehao Huang, the author of [66], for kindly providing the results of the ESCN algorithm. We would also like to thank Dr. Jingang Shi, the author of [78], for kindly providing the source code of the DRP based face super-resolution approach.
References
- [1] S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 21–36, 2003.
- [2] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “A comprehensive survey to face hallucination,” Int. J. Comput. Vis., vol. 106, no. 1, pp. 9–30, 2014.
- [3] X. Liu, D. Zhai, R. Chen, X. Ji, D. Zhao, and W. Gao, “Depth super-resolution via joint color-guided internal and external regularizations,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 1636–1645, 2019.
- [4] K. Jiang, Z. Wang, P. Yi, J. Jiang, J. Xiao, and Y. Yao, “Deep distillation recursive network for remote sensing imagery super-resolution,” Remote Sensing, vol. 10, no. 11, p. 1700, 2018.
- [5] H. A. Aly and E. Dubois, “Image up-sampling using total-variation regularization with a new observation model,” IEEE Trans. Image Process., vol. 14, no. 10, pp. 1647–1659, 2005.
- [6] X. Liu, D. Zhai, D. Zhao, G. Zhai, and W. Gao, “Progressive image denoising through hybrid graph Laplacian regularization: A unified framework,” IEEE Trans. Image Process., vol. 23, no. 4, pp. 1491–1503, 2014.
- [7] X. Liu, G. Cheung, X. Wu, and D. Zhao, “Random walk graph Laplacian-based smoothness prior for soft decoding of JPEG images,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 509–524, 2017.
- [8] J. Sun, J. Sun, Z. Xu, and H.-Y. Shum, “Image super-resolution using gradient profile prior,” in CVPR. IEEE, 2008, pp. 1–8.
