Suppressing Model Overfitting for Image Super-Resolution Networks
Ruicheng Feng, Jinjin Gu, Yu Qiao, Chao Dong

TL;DR
This paper applies MixUp to super-resolution and proposes a data synthesis method with learned degradation, reducing overfitting in large image super-resolution models trained on limited data and improving generalization.
Contribution
It proposes a novel combination of MixUp and synthetic data generation with learned degradation to suppress overfitting in large super-resolution networks.
Findings
Achieved second place in NTIRE2019 Real SR Challenge.
Effectively reduces overfitting with limited training data.
Enhances model generalization in real-world scenarios.
Suppressing Model Overfitting for Image Super-Resolution Networks
Ruicheng Feng1, Jinjin Gu2, Yu Qiao1,3, Chao Dong1
1ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab,
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2The School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
3The Chinese University of Hong Kong
{rc.feng, yu.qiao, chao.dong}@siat.ac.cn, [email protected]
Abstract
Large deep networks have demonstrated competitive performance in single image super-resolution (SISR) when a huge volume of data is involved. However, in real-world scenarios, because accessible training pairs are limited, large models exhibit undesirable behaviors such as overfitting and memorization. To suppress model overfitting and further enjoy the merits of large model capacity, we thoroughly investigate generic approaches for supplying additional training data pairs. In particular, we introduce a simple learning principle, MixUp [43], to train networks on interpolations of sample pairs, which encourages networks to support linear behavior in between training samples. In addition, we propose a data synthesis method with learned degradation, enabling models to use extra high-quality images with higher content diversity. This strategy proves successful in reducing data bias. By combining these components, MixUp and synthetic training data, large models can be trained without overfitting on very limited data samples and achieve satisfactory generalization performance. Our method won second place in the NTIRE2019 Real SR Challenge.
1 Introduction
Since the seminal work on employing convolutional neural networks (CNNs) for single image super-resolution (SISR) [11, 12], a steadily growing stream of deep-learning-based methods with different network architectures [13, 21, 24, 22, 38, 18, 46, 45, 3] and training strategies [41, 35, 5, 16, 31] has been proposed, achieving substantial progress in state-of-the-art performance. These methods are usually trained and tested using thousands of high-quality images, so overfitting is rarely observed when training models with such abundant image pairs. These image pairs are usually generated by pre-defined downsampling methods, such as bicubic interpolation. Beyond those pre-defined degraders, recent works [7, 6, 44] use real captured low-/high-resolution image pairs to train SR models under realistic application settings. However, the amount of such data is often limited (e.g., only image pairs in NTIRE19 Real SR Challenge [1]) because of the high cost of collecting and preprocessing the data. This leads to a severe overfitting problem for recent deep SR networks. Specifically, the network tends to memorize the training images and generalizes poorly to the test set. For instance, as shown in Figure 1, the generalization performance of large models trained on a small dataset quickly deteriorates (see the lower curves). The overfitting problem has largely limited the use of advanced SR methods in real-world applications.
As an important issue, overfitting has attracted increasing research interest in high-level vision tasks, such as image classification [10, 15, 20, 8, 40] and visual tracking [9, 14]. However, overfitting in low-level tasks has received relatively little attention. Because low-level and high-level tasks have different characteristics, most existing methods that suit high-level tasks cannot be directly applied to low-level tasks. For example, some network regularization methods, such as weight decay and dropout, do not work effectively for low-level networks. In addition, some popular regularization techniques such as label smoothing are infeasible for low-level tasks because they only work with one-hot labels. In the low-level vision community, only a limited set of augmentation methods (e.g., random crop, rotation and flipping) has been investigated, which is far from sufficient for real-world applications.
In this paper, we study the overfitting problem for SR. First, we adopt a simple yet effective data augmentation method called MixUp [43] in SR. MixUp uses convex combinations of samples rather than the samples themselves to train the SR model. It regularizes neural networks to support simple linear behavior in between training samples, and leads to better generalization performance (see the orange curve in Figure 1). Second, we propose a data synthesis approach with a learned degradation mapping. Concretely, we first use deep networks to learn the degradation mapping, and then synthesize new training samples using extra high-quality images. This synthesis strategy reduces the bias of the data by introducing content diversity into the training set (see the green curve in Figure 1). The SR models trained with the synthetic data are expected to generalize better to image contents that do not exist in the original small dataset. By combining the above components, MixUp and synthetic training data, we are able to suppress model overfitting in SR with very limited training samples. Extensive experiments show that MixUp, data synthesis, or both together can suppress model overfitting and encourage better generalization (see the upper curves in Figure 1).
We summarize our contributions as follows: (1) We introduce the MixUp technique into SR for data augmentation. Experiments demonstrate that MixUp significantly reduces overfitting. (2) We propose a new data synthesis method to suppress model overfitting in SR. It uses the learned degradation mapping to synthesize more training pairs from additional high-quality images. (3) With the proposed data augmentation and data synthesis methods, we construct a network of a general U-Net shape [33] which encourages better generalization ability and achieves satisfactory performance without overfitting. Our method won second place in the NTIRE 2019 Real SR Challenge.
2 Related Work
Image super-resolution. Recently, learning-based methods have achieved dramatic advantages over model-based methods. Since the seminal exploration of deep learning for the SR task [11, 12], various approaches based on deep neural networks have dominated single image SR. Dong et al. [13] propose to use a deeper network with the low-resolution image as input to learn the SR mapping. Kim et al. [21] propose VDSR, a very deep network with residual learning, and show the performance improvement obtained by using deep networks. Ledig et al. [25] introduce residual blocks into the SR network and propose SRResNet, which makes it possible to train deeper networks. Lim et al. [26] further expand the network size and improve the residual block by removing the batch normalization layers. Zhang et al. [45] propose a deep network with dense connections, and Wang et al. [41] propose to use residual-in-residual dense blocks to improve the training stability and network size. Zhang et al. [46] propose residual channel attention blocks and indicate that deeper networks may achieve better performance more easily than wider networks. As can be seen, most recent successful SR methods employ very deep networks with a large number of parameters, which leads to a high risk of overfitting.
Data augmentation. Training on examples that are similar to, but different from, the training data is known as data augmentation [36]. The most common data augmentation methods are basic image processing operations, e.g., random scale, random crop, horizontal/vertical flip and image affine transformation. In addition to these basic operations, Zhong et al. [47] propose to augment data by randomly erasing part of the image. Inoue [20] proposes to synthesize a new sample from one image by overlaying another image randomly chosen from the training data. Zhang et al. [43] propose to synthesize new samples using linear combinations of training samples. DeVries et al. [10] improve the regularization of networks by masking out square regions of training images. Geirhos et al. [15] reduce bias toward textures by introducing stylized image data for training. Cubuk et al. [8] present AutoAugment to learn the best augmentation policies from data. Besides, generative adversarial networks (GANs) have also been used to generate additional data [29, 27, 48, 4, 37, 32]. Most of the existing data augmentation methods are proposed and studied for high-level tasks, and few works study the effects of different data augmentation methods on low-level tasks such as SR.
NTIRE 2019 Real Super-Resolution Challenge. This work was initially developed to participate in the NTIRE2019 Real Super-Resolution Challenge [1]. The challenge aims to offer an opportunity for academic and industrial attendees to focus on super-resolution applications in real-world scenarios. In the challenge, a novel dataset of real LR images with real HR references, where the LR images have the same size as their HR counterparts, is provided to participants. These images are collected in natural environments, both indoor and outdoor. Unlike most SISR tasks [12, 26], which use pre-defined degraders, images from this dataset are captured by DSLR cameras, and therefore facilitate research for real-world applications.
However, due to the small volume of data pairs, models suffer from a severe overfitting problem. Hence, mechanisms for training large models without overfitting are required to deal with this challenge. We submitted our models and show that our method is able to suppress model overfitting in SR. Our method successfully reconstructs HR images from severely degraded real LR images without the unpleasant artifacts related to overfitting. Our approach won second place in the challenge.
3 Methodology
In this section we show the overfitting problem in SR and present our proposed methods. Sec. 3.1 describes how SR networks overfit on the training dataset from the NTIRE 2019 Real SR Challenge. Sec. 3.2 then formulates the overfitting issue and data augmentation. Sec. 3.3 and 3.4 introduce the data augmentation method with MixUp and the data synthesis method with learned degradation, respectively. Finally, Sec. 3.5 describes the network architecture.
3.1 Overfitting in Super Resolution
In this challenge, a new dataset of real LR and HR paired images (RealSR), with the spatial resolution no smaller than , is publicly available. This dataset contains only images for training (see Sec. 4.2 for details). Due to the limited diversity and amount of training data, large models exhibit undesirable overfitting behaviors even when using straightforward data augmentation techniques (e.g. random crop, rotation, flipping). For instance, a well-trained model generalizes poorly to the test set and tends to generate unpleasant artifacts on test images.
To start off with the right intuitions, Figure 2 illustrates the impact of data volume and model complexity, evaluated on the validation set. The validation set consists of images covering contents that do not exist in the training set. In the first setting, we construct a sufficiently large network (with M parameters) and train the network with different sizes of data, starting with the first sub-images (from about images) and increasing gradually to all sub-images (cover images). In Figure 2(a), we can observe that while all models quickly overfit to the training set, increasing the amount of training data leads to better performance in the training phase. In the second setting, we use the whole training set to train models of different sizes, ranging from M to M. Figure 2(b) shows that larger models do not necessarily achieve higher PSNR values at the early stage and suffer from severe overfitting if training continues. In contrast, the overfitting problem is less severe for small models. This example conveys the central message: overfitting in SR is partially due to the mismatch between data volume and model complexity. To enjoy the merits of large models, we present two methods that remedy this discrepancy by supplying additional training pairs.
3.2 Problem Formulation
To facilitate the discussion, we first formulate the overfitting problem and data augmentation. Let $\mathcal{X}$, $\mathcal{Y}$ be the LR images and their HR counterparts on the true data space $\mathcal{D}_t$, where true data refer to image pairs with the desired degradation function, which can be either pre-defined kernels or unknown real degradations. For each $(x, y) \in \mathcal{D}_t$, we have $x = f(y)$, where $f: \mathcal{Y} \to \mathcal{X}$ is the degradation function mapping $\mathcal{Y}$ onto $\mathcal{X}$. In the SISR task, given an observation set $\mathcal{D}_o \subset \mathcal{D}_t$ as the training set, our goal is to find an inverse mapping function $g: \mathcal{X} \to \mathcal{Y}$ by optimizing a well-defined loss function $\mathcal{L}$:
$$g^{*} = \arg\min_{g} \; \mathbb{E}_{(x, y) \in \mathcal{D}_o} \, \mathcal{L}\big(g(x), y\big). \tag{1}$$
The major risk of this framework is that $g^{*}$ may be biased, leading to poor generalization ability on unobserved data points. This problem is especially severe when the observations are insufficient to cover the true data manifold.
The most widely used technique to reduce such a risk is data augmentation. Specifically, from the perspective of data augmentation, an additional set $\mathcal{D}_a$, which lies beyond the training set $\mathcal{D}_o$ but is believed to lie inside the true data manifold $\mathcal{D}_t$, is introduced for training. In SISR, $\mathcal{D}_a$ can be obtained, for example, by rotating each data pair in $\mathcal{D}_o$. We hypothesize that for each $(x, y) \in \mathcal{D}_a$, we have $x = f(y)$, indicating that data pairs in the observation set and those in the augmentation set follow the same degradation mapping.
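As a minimal sketch of how such an augmentation set can be built from flips and rotations, the snippet below applies the same geometric transform to both images of a pair so that they stay aligned. The function names `dihedral_transforms` and `augment_pair` are hypothetical, and the images are assumed to be NumPy arrays of shape (H, W, C) with LR and HR at the same resolution, as in the RealSR setting described in this paper.

```python
import numpy as np

def dihedral_transforms():
    """Return the 8 flip/rotation transforms of an image grid as callables."""
    def make(k, flip):
        def transform(img):
            out = np.rot90(img, k, axes=(0, 1))          # rotate by k * 90 degrees
            return np.flip(out, axis=1) if flip else out  # optional horizontal flip
        return transform
    return [make(k, flip) for flip in (False, True) for k in range(4)]

def augment_pair(lr, hr, rng):
    """Apply one randomly chosen transform to BOTH images of an aligned pair."""
    transforms = dihedral_transforms()
    t = transforms[rng.integers(len(transforms))]
    return t(lr), t(hr)

rng = np.random.default_rng(0)
hr = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
lr = 0.5 * hr                      # toy stand-in for a degraded image
lr_aug, hr_aug = augment_pair(lr, hr, rng)
```

Using one transform for both images is what keeps the augmented pair consistent with the original pixel-wise correspondence; whether the true degradation is actually invariant to rotation is the hypothesis stated above, not something the code checks.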
3.3 Data Augmentation with MixUp
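We consider a simple yet effective data augmentation method, MixUp [43]. In MixUp, each time we randomly draw two samples $(x_i, y_i)$ and $(x_j, y_j)$ from the set $\mathcal{D}_o$. Then we form a new sample $(\tilde{x}, \tilde{y})$ by a linear interpolation of these two samples:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$
where $\lambda \in [0, 1]$ is a random number drawn from a beta distribution $\mathrm{Beta}(\alpha, \alpha)$.
In super-resolution, we can assume that the degradation function is a linear mapping, which can be formulated as $x = Hy + n$, where $H$ is the downsampling matrix and $n$ is the noise. If $\lambda$ and $H$ are determined, we have
$$\tilde{x} = \lambda (H y_i + n_i) + (1 - \lambda)(H y_j + n_j) = H\big(\lambda y_i + (1 - \lambda) y_j\big) + \tilde{n} = H\tilde{y} + \tilde{n},$$
where $\tilde{n} = \lambda n_i + (1 - \lambda) n_j$ is again a noise term, a convex combination of draws from the same distribution as $n$. This property also holds when $n$ is signal-dependent. This indicates that although the MixUp-augmented data pairs have unnatural visual effects, they follow the same degradation model as the true data and can be used to learn the inverse mapping $g$.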
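The linearity argument can be checked numerically. The sketch below is an illustration under stated assumptions: it uses 1-D signals, an average-pooling matrix as a stand-in for the downsampling operator (called `H` here), additive Gaussian noise, and an arbitrary Beta parameter of 1.0. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_hr, scale = 16, 2                 # hypothetical signal length and downsampling factor
n_lr = n_hr // scale

# H: average-pooling matrix, an assumed stand-in for a linear downsampler.
H = np.zeros((n_lr, n_hr))
for r in range(n_lr):
    H[r, r * scale:(r + 1) * scale] = 1.0 / scale

y_i, y_j = rng.random(n_hr), rng.random(n_hr)        # two HR signals
n_i, n_j = 0.01 * rng.standard_normal((2, n_lr))     # two independent noise draws
x_i, x_j = H @ y_i + n_i, H @ y_j + n_j              # their LR observations

lam = rng.beta(1.0, 1.0)            # mixing coefficient; alpha = 1.0 is arbitrary
x_mix = lam * x_i + (1 - lam) * x_j
y_mix = lam * y_i + (1 - lam) * y_j
n_mix = lam * n_i + (1 - lam) * n_j

# Mixing the LR observations equals degrading the mixed HR signal plus mixed noise.
residual = np.max(np.abs(x_mix - (H @ y_mix + n_mix)))
print(f"max |x_mix - (H y_mix + n_mix)| = {residual:.2e}")
```

One caveat worth keeping in mind: for independent zero-mean noise draws with variance σ², the mixed noise has variance (λ² + (1 − λ)²)σ², so it has the same form as the original noise but a scale that is at most σ².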
Moreover, MixUp provides a linear neighbourhood of the real data, making the learned inverse mapping more robust. With MixUp, we can easily obtain many times more data pairs to train the network. As illustrated in Figure 3(b), the observation set is a subset of the MixUp-augmented dataset, and the latter has greater cardinality.
Experiments in Sec. 4.3 show that this simple augmentation method can simultaneously suppress overfitting and improve performance.
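A minimal sketch of MixUp on a batch of aligned LR/HR pairs follows. It is not the authors' code: the function name `mixup_pairs` is hypothetical, and drawing a single mixing coefficient per batch and pairing samples through a random permutation are common implementation shortcuts assumed here rather than details given in the paper.

```python
import numpy as np

def mixup_pairs(lr_batch, hr_batch, alpha, rng):
    """Mix each pair in a batch with a randomly chosen partner pair.

    lr_batch, hr_batch: arrays of shape (N, H, W, C) holding aligned pairs.
    alpha: parameter of the Beta(alpha, alpha) distribution (a tuning choice).
    """
    n = lr_batch.shape[0]
    lam = rng.beta(alpha, alpha)        # one coefficient for the whole batch (assumption)
    perm = rng.permutation(n)           # partner index for every sample
    # The SAME coefficient and partner are used for the LR and HR images.
    lr_mix = lam * lr_batch + (1.0 - lam) * lr_batch[perm]
    hr_mix = lam * hr_batch + (1.0 - lam) * hr_batch[perm]
    return lr_mix, hr_mix, lam

rng = np.random.default_rng(0)
hr = rng.random((4, 8, 8, 3))
lr = 0.5 * hr                           # toy linear degradation
lr_mix, hr_mix, lam = mixup_pairs(lr, hr, alpha=1.0, rng=rng)
```

Because the toy degradation here is linear, the mixed LR batch is still exactly the degraded version of the mixed HR batch, which mirrors the argument made in this section.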
3.4 Data Synthesis with Learned Degradation
Beyond MixUp, we also investigate another strategy to provide more training examples: data synthesis via a learned degradation process. As depicted in Figure 4, given an observation set comprising images with finite content diversity, there might be a risk of biased sampling from the true data distribution. Formally, let $P_o$ and $P_t$ be the observed and true data distributions, respectively. For some training pairs with biased sampling, $P_o$ could diverge far from $P_t$. In the extreme, suppose that there is an imbalanced training set with purely text images; then it is unlikely that models trained with such a dataset generalize well to other contents (e.g., human faces, natural scenery, animals, etc.). In practice, a small set is usually both imbalanced and noisy, which increases the risk of overfitting.
To bridge the gap between $P_o$ and $P_t$, we propose a data synthesis technique to provide training pairs with higher diversity. As illustrated in Figure 4, given a high-quality diverse HR dataset (e.g. DIV2K [2], Flickr2K [39], etc.) as $\mathcal{Y}_s$, the corresponding LR image set $\mathcal{X}_s$ is not accessible since the true degradation $f$ is unknown. Due to nuisance factors, including blur (e.g. motion or defocus), compression artifacts, color and sensor noise, etc., it is usually impractical to effectively model the true image degradation in real-world scenarios. Rather than attempting to model a complicated image degradation process, we propose to use a neural network model, denoted as $\hat{f}$, to learn the degradation on the finite observation set $\mathcal{D}_o$.
With a well-optimized $\hat{f}$, we can obtain estimated LR images $\hat{\mathcal{X}}_s$, where for each $y \in \mathcal{Y}_s$ we have $\hat{x} = \hat{f}(y)$ with $\hat{x} \in \hat{\mathcal{X}}_s$. As $\hat{f}$ is an approximation of $f$, we expect that for each $y \in \mathcal{Y}_s$, the LR counterparts $\hat{x} = \hat{f}(y)$ and $x = f(y)$ should not diverge too far. We will refer to the set $\mathcal{D}_s = \{(\hat{x}, y)\}$ as the synthetic dataset. With the extra data pairs, we turn Eqn. 1 into
$$g^{*} = \arg\min_{g} \; \mathbb{E}_{(x, y) \in \mathcal{D}_o \cup \mathcal{D}_s} \, \mathcal{L}\big(g(x), y\big). \tag{2}$$
When training the SR network $g$, we treat the synthetic data as additional training data and mix them with the original real data. Both networks $\hat{f}$ and $g$ have the same architecture (see Sec. 3.5). The main difference is that $\hat{f}$ takes the HR image as input and generates its LR counterpart, while $g$ models the inverse mapping. The overall pipeline is shown in Figure 4.
This approach is mainly inspired by Back-Translation [34, 30] in Neural Machine Translation. In the context of super-resolution, [5] proposes to use a GAN to simulate image degradation and shares a similar motivation. The fundamental differences between this paper and [5] are two-fold: 1) we do not add any generative adversarial component to our PSNR-oriented models; 2) we train both networks with paired image data.
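The pipeline of this section (fit a degradation model on the observed pairs, apply it to extra HR images, then merge the synthetic pairs with the real ones) can be sketched on toy data. To keep the example small and runnable, a 3x3 linear filter fitted by least squares is swapped in for the degradation network, which is a deliberate simplification and not the method of the paper. All function names and the blur kernel are hypothetical, and images are single-channel arrays with LR and HR at the same size.

```python
import numpy as np

def patches3x3(img):
    """Stack the 3x3 neighbourhood of every interior pixel as rows of length 9."""
    h, w = img.shape
    cols = [img[dy:h - 2 + dy, dx:w - 2 + dx].ravel()
            for dy in range(3) for dx in range(3)]
    return np.stack(cols, axis=1)

def fit_degradation(pairs):
    """Least-squares fit of a 3x3 filter mapping HR to LR (stand-in for a network)."""
    A = np.concatenate([patches3x3(hr) for _, hr in pairs])
    b = np.concatenate([lr[1:-1, 1:-1].ravel() for lr, _ in pairs])
    kernel, *_ = np.linalg.lstsq(A, b, rcond=None)
    return kernel

def apply_degradation(kernel, hr):
    """Synthesize an LR image of the same size; borders use edge padding."""
    padded = np.pad(hr, 1, mode="edge")
    return (patches3x3(padded) @ kernel).reshape(hr.shape)

rng = np.random.default_rng(0)

# Hidden "true" degradation, used only to simulate the small observed set.
true_kernel = np.array([1, 2, 1, 2, 4, 2, 1, 2, 1], dtype=float) / 16.0
observed = [(apply_degradation(true_kernel, hr), hr)
            for hr in rng.random((4, 16, 16))]

kernel_hat = fit_degradation(observed)          # step 1: learn the degradation
extra_hr = list(rng.random((10, 16, 16)))       # step 2: extra, more diverse HR images
synthetic = [(apply_degradation(kernel_hat, hr), hr) for hr in extra_hr]
train_set = observed + synthetic                # step 3: real plus synthetic pairs
```

In this noise-free toy case the fitted filter matches the hidden one almost exactly; with a real, unknown degradation the learned model is only an approximation, which is why the section above asks that the synthetic and true LR images "not diverge too far".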
3.5 Network Architecture
As illustrated in Figure 5, the proposed network has a U-Net structure and consists of cascading blocks, each of which has Residual Channel Attention Blocks (RCABs). The spatial resolution of features is decreased times using convolution layers with stride , and then it is increased twice via pixel shuffle layers. The basic building block is RCAB proposed in RCAN [45], and the main difference between our model and RCAN is the global network topology. Specifically, motivated by CARN [3], we use both local and global cascading modules to fully utilize hierarchical feature information derived from multiple blocks. The outputs of RCAB are cascaded into higher layers, followed by a single convolution layer, all of which serve as cascading blocks. Similarly, global cascading modules have the same topology, where the unit blocks are replaced by cascading blocks. To reduce computational cost, the main branch network works at resolution.
4 Experiments
4.1 Technical Details
For all experiments, we implement our models with the PyTorch [28] framework and train them using NVIDIA Titan Xp GPUs. The mini-batch size is set to 16 and the spatial size of cropped patch is . For initialization, the weights are randomly drawn from zero-mean Gaussian distributions as described in [19]. For optimization, we use Adam [23] with , and . The learning rate is initialized as and then halved every iterations. We train all models for a total of iterations. We use the $L_1$ loss instead of $L_2$, as suggested in [26]. We empirically set for MixUp. The SR results are evaluated with PSNR and SSIM [42] in RGB space. For all convergence curves plotted in this paper, we calculate the average PSNR value on the central patch of each image in the validation set.
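Two of the mechanics mentioned above, PSNR on a central patch and a learning rate that is halved at fixed intervals, can be sketched as follows. The helper names and every numeric value in the usage lines (patch size, base learning rate, step length) are hypothetical placeholders, not the settings used in the paper.

```python
import numpy as np

def center_crop(img, size):
    """Return the central size x size patch of an (H, W, ...) image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB, computed over all channels."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def step_lr(base_lr, iteration, step):
    """Learning rate that is halved every `step` iterations."""
    return base_lr * 0.5 ** (iteration // step)

# Usage with placeholder values.
target = np.full((64, 64, 3), 100, dtype=np.uint8)
pred = target + 1                                   # every pixel off by one level
score = psnr(center_crop(pred, 32), center_crop(target, 32))
lr_now = step_lr(base_lr=1e-4, iteration=250, step=100)
```

Converting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.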
4.2 Dataset
We mainly train our models on the new Real-SR dataset, denoted as the RealSR dataset below. The default splits of the RealSR dataset consist of training images, validation images and test images. Evaluation of the trained models is performed on the validation images since the test images are not publicly available. As described in Sec. 3.4, we also include the widely used DIV2K dataset [2] as additional training data, since these images cover diverse contents, including objects, environments, animals, natural scenery, etc. Following [26], we use training images as the training set.
To prepare training data, we first crop the HR images into a set of sub-images with a stride for the DIV2K dataset. Similarly, we crop HR images into sub-images of size and stride for the RealSR dataset. In this manner we have in total and sub-images from the RealSR and DIV2K datasets, respectively. To fully utilize the dataset, training images are augmented with random horizontal/vertical flips and rotations. During training, a patch of size is randomly cropped from a sub-image.
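The two-stage cropping described above (fixed sub-images with a stride, then a random training patch per sub-image) might look like the sketch below. The function names and all sizes are hypothetical, and discarding windows that do not fit inside the image is an assumption about border handling.

```python
import numpy as np

def crop_sub_images(img, size, stride):
    """Slide a size x size window over img; windows that do not fit are dropped."""
    h, w = img.shape[:2]
    return [img[top:top + size, left:left + size]
            for top in range(0, h - size + 1, stride)
            for left in range(0, w - size + 1, stride)]

def random_patch_pair(lr_sub, hr_sub, patch, rng):
    """Crop the same random window from an aligned LR/HR sub-image pair."""
    h, w = hr_sub.shape[:2]
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    return (lr_sub[top:top + patch, left:left + patch],
            hr_sub[top:top + patch, left:left + patch])

rng = np.random.default_rng(0)
hr_img = rng.random((10, 12, 3))
lr_img = 0.5 * hr_img                       # toy aligned LR image of the same size
hr_subs = crop_sub_images(hr_img, size=4, stride=2)
lr_subs = crop_sub_images(lr_img, size=4, stride=2)
lr_patch, hr_patch = random_patch_pair(lr_subs[0], hr_subs[0], patch=2, rng=rng)
```

Cropping the LR and HR images with identical windows is only valid because both have the same spatial size in this setting.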
4.3 Experiments on MixUp
In this section we study the effect of MixUp on different types of dataset. Different from Sec. 4.4, we only use sub-images from RealSR dataset as training set. As described in Sec. 3.3, MixUp serves as a regularization on data manifold. To verify the effectiveness of this regularization on various types of degradation, we study three settings by generating LR from HR images as follows:
- Real LR images from the RealSR training set.
- Bicubic downsample HR images with a factor and then upsample to the original resolution.
- Bicubic downsample HR images with a factor and then upsample to the original resolution, with realistic noise [17] added to the LR images.
Similarly, the corresponding validation set is constructed in the same manner for each setting. We denote the LR images as , and , which have the same ground truth . On all three datasets we train models with and without MixUp to investigate the effects of MixUp.
It can be observed from Figure 6 that after the first learning rate decay (K), the validation performance of models trained without MixUp quickly deteriorates due to overfitting, while models trained with MixUp keep the same validation accuracy until termination. In the super-resolution task, MixUp significantly reduces overfitting and makes training more robust.
4.4 Experiments on Data Synthesis
In this section, we mainly use sub-image pairs from the RealSR dataset as the observation set and HR sub-images from the DIV2K dataset for data synthesis. We first train the degradation model with training sub-image pairs, and the training settings are the same as those for . The model converges at K iterations. We aim to provide a systematic analysis of SR networks trained on different synthetic datasets to build a clearer picture of the progressive effect of incremental amounts of synthetic data on the generalization ability.
To validate the assumption that the observation set is a biased sample, we evaluate how the validation error varies with increasing volumes of synthetic data (i.e., higher diversity). Specifically, models are built using a base observation set combined with the augmentation set that starts with no sub-images and grows incrementally to all sub-images. Note that the experimental settings degenerate to a baseline scenario without any regularization when contains no sub-image.
According to the results shown in Figure 7, adding synthetic data delays and reduces overfitting on the training set. As expected, adding more and more synthetic data to the training set encourages better generalization. The best combination comprises sub-images ( from and from ), which achieves a PSNR of dB, dB better than the baseline model.
4.5 Comparison with the State of the Art
To further investigate overfitting on limited data, we include both light-weight networks (e.g., FSRCNN [13], CARN [3]) and larger networks (e.g., RCAN [45], RRDB [41]) in our comparison. We reimplement these state-of-the-art methods on the RealSR dataset. Note that most of the existing methods operate at low resolution and upsample feature maps at the very end of the network. Therefore, we simply modify the models by downsampling LR images with a stride in the first convolution layer, which is consistent with our U-Net architecture. Throughout the experiments, we find that existing large models can easily overfit to the training set, and therefore we study early-stopped versions of those models to provide a stronger comparison. In contrast, early stopping is not necessary for light-weight networks and our method. We stress that the early stopping strategy does not solve the overfitting problem (see also Sec. 3.1), as both training error and validation error remain high. With early stopping, a large model will underfit and fail to make full use of its capacity. Specifically, an early-stopped large model tends to restore blurry images, while an overfitted version generates sharp images with unpleasant artifacts. Following [26], a self-ensemble strategy is also applied to further improve generalization performance, and the self-ensemble version is denoted with “*”.
Table 1 lists the quantitative results (PSNR / SSIM) on RealSR validation set. These results provide two insights: (1) both MixUp and data synthesis can significantly suppress overfitting on limited training data. (2) MixUp and data synthesis are not mutually exclusive, as one can additionally apply MixUp technique on the additional synthetic data to further improve the final performance.
In Figure 9, we show visual comparisons on state-of-the-art networks and our model. For image “cam2_08”, we observe that most of the compared methods cannot recover the lines of text and would suffer from blurring artifacts. In contrast, our model can alleviate the blurring artifacts better and recover more details. Similar observations are shown in images “cam2_07” and “cam1_06”.
5 Discussion
In this section we further discuss the effectiveness of data synthesis. With a sufficiently large dataset comprising high-quality HR images, one question that remains unanswered is how the quality of the generated LR images affects generalization ability. Our investigation involves applying various degradation types to HR images from the DIV2K training set, while the RealSR dataset remains unchanged. LR images are produced with three different degradation processes:
- Add white Gaussian noise with to HR images.
- Bicubic downsample HR images with a factor and then upsample to the original resolution.
- Construct a network to learn the degradation.
The corresponding data pairs constitute a synthetic dataset, and we will refer to these augmentation sets as , and . Convergence curves of models trained on the different types of augmentation set are shown in Figure 8. We see that the use of synthetic data reduces the overfitting problem compared with the baseline. In addition, LR images from , and are completely different from each other. The best generalization is reached by the model trained with , indicating that, among the investigated degradation types, the learned mapping function is the most “similar” to the unknown true degradation . One can also investigate the sensitivity of SR networks to different kinds of degradation models, which is left to future work.
6 Conclusion
In this paper, we propose two simple yet effective methods to reduce the overfitting problem in SR networks. Our method won second place in the NTIRE2019 Real SR Challenge. In particular, we introduce the MixUp technique to encourage networks trained with limited data to generalize well. In addition, data synthesis with learned degradation is employed to train models using extra high-quality images with higher content diversity. This strategy proves successful in reducing data bias. By combining both techniques, large models can be trained without overfitting and achieve satisfactory generalization performance. Since the proposed approach is network-independent, it is expected to be easily applied to other network architectures and image restoration tasks. Future work will explore the effectiveness of our approach in more settings.
Acknowledgements. This work is partially supported by National Key Research and Development Program of China (2016YFC1400704), Shenzhen Research Program (JCYJ20170818164704758, JCYJ20150925163005055, CXB201104220032A), and Joint Lab of CAS-HK.
References
- [1] NTIRE workshop and challenges @ CVPR 2019. https://competitions.codalab.org/competitions/21439#learn_the_details. Accessed: 2019-04-12.
- [2] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
- [3] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252–268, 2018.
- [4] Antreas Antoniou, Amos J. Storkey, and Harrison A. Edwards. Data augmentation generative adversarial networks. CoRR, abs/1711.04340, 2018.
- [5] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To learn image super-resolution, use a GAN to learn how to do image degradation first. In Proceedings of the European Conference on Computer Vision (ECCV), pages 185–200, 2018.
- [6] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. arXiv preprint arXiv:1904.00523, 2019.
- [7] Chang Chen, Zhiwei Xiong, Xinmei Tian, Zheng-Jun Zha, and Feng Wu. Camera lens super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- [8] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
