LSR: A Light-Weight Super-Resolution Method

Wei Wang; Xuejing Lei; Yueru Chen; Ming-Sui Lee; C.-C. Jay Kuo

arXiv:2302.13596·eess.IV·February 28, 2023

TL;DR

This paper introduces LSR, a lightweight super-resolution method designed for mobile devices, which predicts residual images using a self-supervised framework without heavy deep networks, achieving good quality with low complexity.

Contribution

The paper presents a novel lightweight super-resolution approach that combines unsupervised and supervised learning modules, avoiding end-to-end deep networks for efficiency.

Findings

1. LSR achieves higher PSNR/SSIM than classical methods.
2. It has low computational complexity suitable for mobile platforms.
3. It offers better visual quality compared to exemplar-based methods.

Tables

Table 1: Statistics and parameter settings for easy and hard data.

Data Type	Easy	Hard
Ratio	56%	44%
Patch variance Range	$\leq$ 180	$\geq$ 180
Pixel initial MSE(ILR, HR)	21.78	158.05
Representation Types	[1, 3]	$R T_{h}$
Cluster Number	1	8
XGBoost Regressor Tree Number	50	500
XGBoost Regressor Max Depth	6	6
Fusion	No	Yes ( $F U_{h}$ )

Table 2: Average PSNR/SSIM with different settings of representation types (RT) for hard data tested on Set5 and Set14.

PSNR / SSIM	$R T_{h}$ =[1]	$R T_{h}$ =[2]	$R T_{h}$ =[3]
Set5	35.24 / 0.9523	36.12 / 0.9576	36.16 / 0.9578
Set14	31.32 / 0.9074	31.94 / 0.9133	31.96 / 0.9134
PSNR / SSIM	$R T_{h}$ =[4]	$R T_{h}$ =[5]	$R T_{h}$ =[1,2,3,4,5]
Set5	36.11 / 0.9573	36.26 / 0.9583	36.34 / 0.9586
Set14	31.91 / 0.9130	32.04 / 0.9141	32.11 / 0.9147

Table 3: Average PSNR/SSIM with different fusion schemes for hard data tested on Set5 and Set14.

PSNR/SSIM	$F U_{h}$ =1	$F U_{h}$ =2	$F U_{h}$ =3	$F U_{h}$ =4
Set5	36.33 / 0.9588	36.49 / 0.9592	36.57 / 0.9595	36.60 / 0.9597
Set14	32.12 / 0.9150	32.25 / 0.9156	32.30 / 0.9159	32.32 / 0.9161

Table 5: Comparison of computational complexity (FLOPs per pixel) and model sizes of five SR methods.

Complexity	FLOPs / pixel	Model Size
A+[5]	15.7K (4X)	1.06M (18.6X)
SRCNN[2]	114K (30X)	57.3K (1X)
VDSR[9]	1.33M (347X)	665K (11.6X)
LSR (Ours), V1	9.28K (2.42X)	774K (13.51X)
LSR (Ours), V2	3.83K (1X)	770K (13.45X)

Table 6: Calculation of FLOPs ($F$), FLOPs per pixel ($F_{p}$), and model size ($M$) for A+.

ILR Feature Extraction (IFE)
Filter Type	$C_{i}$	$K_{h}$	$K_{w}$	$H_{o}$	$W_{o}$	$C_{o}$	$F$	$F_{p}$	$M$
$D_{w}^{1}$	1	1	3	344	228	1	0.39M	5.00	3
$D_{h}^{1}$	1	3	1	344	228	1	0.39M	5.00	3
$D_{w}^{2}$	1	1	5	344	228	1	0.71M	9.00	5
$D_{h}^{2}$	1	5	1	344	228	1	0.71M	9.00	5
IFE Sub-total							2.20M	28.00	16

Table 7: Calculation of FLOPs ($F$), FLOPs per pixel ($F_{p}$), and model size ($M$) for SRCNN.

Steps	$C_{i}$	$K_{h}$	$K_{w}$	$H_{o}$	$W_{o}$	$C_{o}$	$F$	$F_{p}$	$M$
conv1	1	9	9	344	228	64	0.81B	10368	5248
conv2	64	5	5	344	228	32	8.03B	102400	51232
conv3	32	5	5	344	228	1	0.13B	1600	801
Total							8.97B	114368	57281

Table 8: Calculation of FLOPs ($F$), FLOPs per pixel ($F_{p}$), and model size ($M$) for VDSR.

Steps	$C_{i}$	$K_{h}$	$K_{w}$	$H_{o}$	$W_{o}$	$C_{o}$	$N_{l}$	$F$	$F_{p}$	$M$
conv1	1	3	3	344	228	64	1	90.35M	1152	576
conv2 - 19	64	3	3	344	228	64	18	104.09B	1327104	663552
conv20	64	3	3	344	228	1	1	90.35M	1152	576
post-process				344	228	1	1	78.43k	1	0
Total								104.29B	1329409	664704

Table 9: Calculation of FLOPs per pixel ($F_{p}$) and model size ($M$) for LSR in each module.

Module 1: Unsupervised Representation Learning (URL)
$RT$	$C_{i}$	$K_{h}$	$K_{w}$	$C_{o}$	$N_{type}$	$F_{p}$	$M$
Type 1, Spatial	1	1	1	1	0	0	0
Type 2, Central Saab	1	5	5	25	1	1225	625
Type 2, Central Saab	1	7	7	49	1	4753	2401
Type 3, Ringwise Saab	1	3	3	9	1	153	81
Type 4, Haar & PCA	1	2	2	4	2	56	32
Type 5, Laws & PCA	1	3	3	9	2	306	162

Table 10: Calculation of FLOPs and model size for LSR.

V1 Summary
Complexity	$F_{p}$		$M$
Data Type	Easy	Hard	Easy	Hard
URL Sub-total	153	12986	81	3301
SFL Sub-total	0	0	105	374
SDL Sub-total	300	7522	9500	760256
Post-process	1	1	0	0
Sub-Total	454	20509	9686	763931
$w$	0.56	0.44	1	1
Total	9278		773617

Equations

R_{t}^{i} = \frac{N_{L,t}^{i} R_{L,t}^{i} + N_{R,t}^{i} R_{R,t}^{i}}{N},

R_{op}^{i} = \min_{t \in T} R_{t}^{i}.

2 \times C_{i} \times K_{h} \times K_{w} - 1

F_{p} = \max((3 \times N_{fc} - 1) \times N_{c}, 0) \times f,

M = N_{fc} \times N_{c}.

F_{p} = d_{M} \times N_{tree} \times f,

N_{leaf} = 2^{d_{M}}, \quad N_{parent} = \sum_{d=0}^{d_{M}-1} 2^{d} = 2^{d_{M}} - 1,

M = (2 \times N_{parent} + N_{leaf}) \times N_{tree} \times N_{c}.

Taxonomy

Topics: Advanced Image Processing Techniques · Photoacoustic and Ultrasonic Imaging · Image Processing Techniques and Applications

Full text

Abstract

A light-weight super-resolution (LSR) method from a single image targeting mobile applications is proposed in this work. LSR predicts the residual image between the interpolated low-resolution (ILR) and high-resolution (HR) images using a self-supervised framework. To lower the computational complexity, LSR does not adopt the end-to-end optimization deep networks. It consists of three modules: 1) generation of a pool of rich and diversified representations in the neighborhood of a target pixel via unsupervised learning, 2) selecting a subset from the representation pool that is most relevant to the underlying super-resolution task automatically via supervised learning, 3) predicting the residual of the target pixel via regression. LSR has low computational complexity and reasonable model size so that it can be implemented on mobile/edge platforms conveniently. Besides, it offers better visual quality than classical exemplar-based methods in terms of PSNR/SSIM measures.

Index Terms— Super-resolution, Mobile Computing, Green Learning

1 INTRODUCTION

Single image super-resolution (SISR) [1] is an intensively studied topic in image processing. It aims at recovering a high-resolution (HR) image from its low-resolution (LR) counterpart. SISR finds wide real-world applications such as remote sensing, medical imaging, and biometric identification. Besides, it attracts attention due to its connection with other tasks (e.g., image registration, compression, and synthesis).

SISR is an ill-posed problem since multiple HR patches can map to the same LR patch. To solve this one-to-many mapping problem, SISR is typically formulated as a regularized optimization problem or a generative problem with supervised learning. For the former, one may impose priors to regularize the ill-posed problem, yet the performance improvement is limited. For the latter, there are two main approaches: exemplar-based (or dictionary-based) methods and deep-learning (DL) methods.

DL-based super-resolution methods have dominated the field since 2015 [2] and have been intensively studied in the last eight years. They offer better HR images in terms of PSNR/SSIM quality metrics at the cost of more network parameters and higher computational complexity. One of the main application areas of SR techniques is mobile platforms and consumer electronics (e.g., smart TVs). DL-based SR solutions cannot be easily implemented on such resource-constrained computational platforms due to cost considerations.

To address this problem, a light-weight super-resolution (LSR) method is proposed in this work. LSR predicts the residual image between the interpolated low-resolution (ILR) and HR images using a self-supervised learning paradigm. LSR does not adopt the end-to-end optimization deep networks. Instead, it consists of three cascaded modules. First, it creates a pool of rich and diversified representations in the neighborhood of a target pixel via unsupervised learning. Second, it selects a subset from the representation pool that is most relevant to the underlying super-resolution task automatically via supervised learning. Third, it predicts the residual of the target pixel based on the selected features through regression via classical machine learning such as the XGBoost regressor. LSR offers visual quality that is better than exemplar-based methods and comparable with the entry-level DL-based SR solution, SRCNN, in terms of PSNR/SSIM measures.

It is worthwhile to highlight the value of this work. Our main contributions lie in the low computational complexity of the proposed LSR method. As presented in the experimental section, there are two versions of the LSR method, namely, LSR V1 and LSR V2. Both have around 770K parameters. We use the number of floating-point operations per pixel (FLOPs/pixel) to measure the complexity in inference. LSR V1 and LSR V2 demand 9.28K and 3.83K FLOPs/pixel, respectively. For complexity benchmarking, we choose two well-known and representative DL-based SR solutions, i.e., SRCNN and VDSR. SRCNN and VDSR demand 114K and 1.33M FLOPs/pixel, respectively. The LSR method has a clear advantage over DL-based solutions when deployed on mobile/edge platforms.

The rest of the paper is organized as follows. Related work is reviewed in Sec. 2. The proposed LSR method is presented in Sec. 3. Experimental results are shown in Sec. 4. Finally, concluding remarks and future research directions are discussed in Sec. 5.

2 REVIEW OF RELATED WORK

Exemplar-based Methods. Image patches are viewed as examples of local regions. Patches are partitioned and represented in the form of dictionary atoms. Finally, proper mappings from LR-to-HR patches are developed inside each partition. Examples include [3, 4, 5, 6, 7]. Sometimes, priors are leveraged for patch partitioning [8]. Since the dictionary size can be expanded flexibly, they are non-parametric methods. There are limitations with exemplar-based methods. First, the LR-to-HR mapping is based on hand-crafted features. There is no clear guideline in patch sizes for mapping learning. Second, the training of the LR-to-HR mapping is time-consuming [4, 5]. Last, the quality of their enhanced SR images is inferior to that achieved by modern DL-based methods.

DL-based Methods. The application of DL to the SISR problem can be traced back to SRCNN in 2015 [2]. Substantial advances have been made along this direction in the last eight years, e.g., [9, 10, 11, 12, 13, 14, 15, 16, 17]. In earlier years, the focus was on achieving higher performance (namely, better PSNR and SSIM) [2, 9, 10, 11, 12, 13]. Research on efficiency has been considered in recent years, e.g., [14, 15]. Other SISR-related problems have also been explored, e.g., unknown degradation kernels [16] and magnification with a non-integer factor [17]. Although DL-based methods offer significant performance breakthroughs, it is a major challenge to apply them to practical SR problems on mobile/edge devices due to their heavy computational and memory costs. Besides, they lack mathematical transparency.

Green Learning. Green learning [18] is an emerging learning paradigm emphasizing lower computational complexities and smaller model sizes. It has a modular design that consists of three cascaded modules: 1) unsupervised representation learning, 2) supervised feature learning, and 3) supervised decision learning. For unsupervised representation learning, Kuo et al. interpreted the convolution operations in convolutional neural networks (CNNs) as joint spatial-spectral signal transforms in [19, 20, 21] and proposed two one-stage data-driven transforms, the Saak transform [20] and the Saab transform [21]. To achieve multi-stage signal transforms, Kuo developed a successive-subspace-learning (SSL) strategy in [20, 21]. For supervised feature learning, the problem of selecting the most discriminant (or relevant) features from the pool of rich representations for some classification (or regression) tasks based on user’s labels is examined in [22]. Green learning has been successfully applied to many applications. Our LSR method follows the same pipeline as elaborated below.

3 PROPOSED LSR METHOD

For self-supervised SR, LR images are obtained from HR images via bicubic down-sampling in training and test image sets. Following the standard pipeline, we only focus on the luminance (or Y) component. As a pre-processing step, a Lanczos interpolation is applied to LR images to yield interpolated LR (ILR) images whose resolution is the same as HR images. To regularize the ill-posedness of the problem, LSR uses the neighborhood of a target pixel to predict its residue, which is the difference between HR and ILR images at the pixel. Thus, the input and output to the proposed LSR system are an ILR patch (of size $15\times 15$ ) and the residual value of its center pixel, respectively.

It is desirable to divide pixels into two classes, easy and hard. Pixels in smooth regions are easy samples. Their residual values are small since the interpolation can predict their values quite well. Furthermore, their residuals can be predicted using a simple model of lower complexity. Pixels in complicated regions such as edges and textures are hard samples. Their residual values are larger, and a more complicated model is required. To exploit this property, we develop a simple mechanism to partition pixels based on the variance of their neighborhood. A pixel whose neighborhood has a smaller variance is an easy one. Otherwise, it is a hard one. We focus on hard samples for the rest of this section. The same idea applies to easy samples, but the processing can be greatly simplified.
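The partition rule can be sketched as follows. This is a minimal illustration, assuming the threshold of 180 listed in the "Patch variance Range" row of Table 1 and the population variance of the neighborhood; the function name `classify_pixel` and the choice to treat a variance exactly equal to the threshold as hard are assumptions made for the example, not details given in the paper.

```python
from statistics import pvariance

# Threshold taken from the "Patch variance Range" row of Table 1.
VARIANCE_THRESHOLD = 180.0


def classify_pixel(neighborhood, threshold=VARIANCE_THRESHOLD):
    """Label a target pixel 'easy' or 'hard' from its neighborhood.

    `neighborhood` is a flat sequence of luminance values around the pixel.
    Assigning the boundary value to the hard class is an assumption.
    """
    return "easy" if pvariance(neighborhood) < threshold else "hard"


smooth = [100, 101, 99, 100, 100, 102, 98, 100, 100]  # flat region
edge = [20, 20, 20, 200, 200, 200, 20, 200, 20]       # strong edge
print(classify_pixel(smooth), classify_pixel(edge))
```

In this toy input the flat neighborhood has a variance near 1 and the edge neighborhood a variance of 8000, so they fall on opposite sides of the threshold.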

3.1 Module 1: Unsupervised Representation Learning

The objective of the first module is to generate a rich and diversified set of representations. Five types of representations are collected with some justifications.

• Type 1: spatial representations. A patch of size $15\times 15$ provides 225 pixel values as representations of Type 1.

• Type 2: central Saab representations. Since pixels closer to the central pixel are more important than distant pixels, we consider two windows of sizes $5\times 5$ and $7\times 7$, as shown in Fig. 1, and apply the Saab transform to pixels within each window to yield the central Saab representations. A window of size $n\times n$ yields $n^{2}$ Saab coefficients as the representations of Type 2. (Footnote 1: A Saab transform of size $n\times n$ has one fixed kernel of equal weight $n^{-1/2}$ and $(n^{2}-1)$ AC kernels that are derived by principal component analysis (PCA). We refer to [21] for more details about the Saab transform.)

• Type 3: ring-wise Saab representations. Ring-shaped neighborhoods are introduced to complement the center-shaped neighborhoods, as shown in Fig. 2. We apply one-stage Saab transforms with $3\times 3$ blocks at stride 1 on $r_{3}[0,5]$ and at stride 3 on outer ring regions, leading to one DC coefficient and 8 AC coefficients per block. These 9-channel responses are decoupled due to PCA.

• Type 4: Haar filtering followed by channel-wise PCA representations. For a neighborhood of size $2\times 2$ , the Haar filterbank yields four responses and each response is treated as one channel. Then, PCA is applied to each channel. The Type 4 representation is derived from the original Haar responses and the PCA coefficients.

• Type 5: Laws filtering followed by channel-wise PCA representations. For a neighborhood of size $3\times 3$ , Laws’ filterbank [23] yields nine responses and each response is treated as one channel. Then, PCA is applied to each channel. The Type 5 representation is derived from the original Laws responses and the PCA coefficients.

3.2 Module 2: Supervised Feature Learning

The total number of representations obtained from Module 1 is around 1500. A subset of the most relevant representations can be selected based on training data to feed into the regressor. This work adopts a mechanism called the relevant feature test (RFT) [22] to achieve this objective. Since RFT is relatively new, it is briefly reviewed below. Let $[f^{i}_{min},f^{i}_{max}]$ denote the value range of the $i^{th}$ representation, and partition the samples by a certain value $t$ ( $f^{i}_{min}\leq t\leq f^{i}_{max}$ ) in the $i^{th}$ representation into two non-overlapping subsets, denoted by $S^{i}_{L}$ and $S^{i}_{R}$ . Let $y^{i}_{L}$ and $y^{i}_{R}$ be the mean of target values in $S^{i}_{L}$ and $S^{i}_{R}$ . They are used as the estimated regression values of all samples in $S^{i}_{L}$ and $S^{i}_{R}$ , respectively. The RFT loss is defined as the sum of the estimated regression MSEs of $S^{i}_{L}$ and $S^{i}_{R}$ . Mathematically, it is in the form of

$$R_{t}^{i} = \frac{N_{L,t}^{i} R_{L,t}^{i} + N_{R,t}^{i} R_{R,t}^{i}}{N}, \tag{1}$$

where $N^{i}_{L,t}$ , $N^{i}_{R,t}$ , $R^{i}_{L,t}$ , and $R^{i}_{R,t}$ are the sample numbers and the estimated regression MSEs in subsets $S^{i}_{L}$ and $S^{i}_{R}$ , respectively, and $N=N^{i}_{L,t}+N^{i}_{R,t}$ . The RFT loss function of the $i^{th}$ representation is defined as the optimized estimated regression MSE over the set, $T$ , of all candidate partition points, i.e.,

$$R_{op}^{i} = \min_{t \in T} R_{t}^{i}. \tag{2}$$

The lower the RFT loss, the better the representation. We compute the RFT loss values of all representations and sort them in ascending order to yield an RFT loss curve. The elbow point is considered to select a subset of representations with lower RFT loss. This set defines the relevant features to be fed into a regressor in Module 3.
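The RFT loss of a single representation can be sketched as below. This is an illustrative implementation of the weighted-MSE definition above, not the authors' code: the function name `rft_loss`, the use of uniformly spaced candidate partition points for the set $T$, and the default of 16 bins are assumptions. The elbow-point selection on the sorted loss curve is not shown.

```python
def _mse(values):
    """Mean squared error of `values` around their own mean (0 for an empty set)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def rft_loss(feature, target, num_bins=16):
    """Minimum weighted regression MSE over candidate partition points.

    `feature` holds the values of one representation over all samples and
    `target` the corresponding residuals. Candidate points are uniformly
    spaced between the feature minimum and maximum (an assumption).
    """
    n = len(feature)
    lo, hi = min(feature), max(feature)
    if lo == hi:
        return _mse(list(target))  # constant feature: no split is possible
    best = float("inf")
    for k in range(1, num_bins):
        t = lo + (hi - lo) * k / num_bins
        left = [y for x, y in zip(feature, target) if x <= t]
        right = [y for x, y in zip(feature, target) if x > t]
        loss = (len(left) * _mse(left) + len(right) * _mse(right)) / n
        best = min(best, loss)
    return best


x = [0, 1, 2, 3, 4, 5, 6, 7]
relevant = [0, 0, 0, 0, 10, 10, 10, 10]    # target follows the feature
irrelevant = [0, 10, 0, 10, 0, 10, 0, 10]  # target ignores the feature
print(rft_loss(x, relevant), rft_loss(x, irrelevant))
```

A feature whose values separate the targets cleanly reaches a loss of zero at the right partition point, while an uninformative feature keeps a loss close to the overall target variance.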

3.3 Module 3: Supervised Decision Learning

In the training process, data augmentation is performed (via 90-degree rotations and flipping) to enlarge the training sample size, and the following two options, "clustering" and "prediction fusion", are considered. Clustering. Perform K-means clustering on ILR patches based on their HOG features and then train an XGBoost regressor in each cluster using the features selected in Module 2. Prediction Fusion. Augment each ILR patch multiple times, perform the regression on each augmented copy, and take the average of all prediction results as the final predicted residual value.
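Prediction fusion can be sketched as follows. The paper does not state which augmented copies are used for fusion by 2 or by 4, so the order chosen here (original, flipped, rotated, rotated and flipped) is an assumption, as are the helper names; `predict` stands in for the trained per-cluster regressor and is replaced by a toy function in the example.

```python
def rotate90(patch):
    """Rotate a square patch (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def flip(patch):
    """Mirror a patch horizontally."""
    return [row[::-1] for row in patch]


def fused_prediction(patch, predict, f=2):
    """Average `predict` over `f` augmented copies of `patch` (1 <= f <= 4)."""
    if not 1 <= f <= 4:
        raise ValueError("this sketch supports fusion by 1 to 4 copies")
    rotated = rotate90(patch)
    # Assumed augmentation order; the paper only states rotations and flipping.
    candidates = [patch, flip(patch), rotated, flip(rotated)][:f]
    return sum(predict(p) for p in candidates) / f


def toy_predict(p):
    """Toy stand-in for the regressor: returns the top-left value of the patch."""
    return p[0][0]


patch = [[1, 2], [3, 4]]
print(fused_prediction(patch, toy_predict, f=2), fused_prediction(patch, toy_predict, f=4))
```

The toy predictor is deliberately orientation-sensitive so that the averaging over augmented copies is visible in the result.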

4 EXPERIMENTS

4.1 Experimental Setup

The SR experiments are conducted with a scaling factor of 2, and the BSD200 dataset [24] is adopted to train the proposed LSR model. The tests are performed on four datasets: Set5 [25], Set14 [26], BSD100 [24], and Urban100 [6]. Following the standard routine, we only process the $Y$ channel for super-resolution. Our model uses $15\times 15$ ILR patches to predict the residual value of the center pixel of the patch. An ILR patch of size $16\times 16$ is also obtained from the $15\times 15$ ILR patch by Lanczos interpolation for HOG feature extraction. Table 1 shows the statistics and parameter settings for easy and hard data samples, where the notations $RT_{h}$ and $FU_{h}$ represent the representation types and fusion schemes for hard samples.

Since the initial MSE of easy data is small, we adopt a simple procedure to predict the residual values of easy data. For hard data, several different settings are compared. First, the PSNR/SSIM performance of the individual representation types and of the union of all five types is shown in Table 2. Representation types 2-5 outperform representation type 1 by a clear margin, while the union of all five types gives the best performance. In the experiments below, the RFT is applied to the union of all five representation types to maintain high performance. However, one can choose a single representation type, such as type 5 alone, to lower the computational complexity.

Next, we compare the performance of four different decision schemes: 1) $FU_{h}=1$ : without clustering and prediction fusion, 2) $FU_{h}=2$ : with clustering but no prediction fusion, 3) $FU_{h}=3$ : with both clustering and prediction fusion (fusion by 2), and 4) $FU_{h}=4$ : with both clustering and prediction fusion (fusion by 4). The results are shown in Table 3, which confirm the effectiveness of clustering and prediction fusion.

4.2 Quality Performance Comparison

Table 4 presents the quality comparison of two versions of the proposed LSR method (V1: $RT_{h}$ =[1,2,3,4,5], $FU_{h}$ =3, and V2: $RT_{h}$ =[5], $FU_{h}$ =3) and three light-weight SR methods (SelfExSR [6], A+ [5], and SRCNN [2]). Note that SRCNN is the simplest DL-based method. We do not include advanced DL-based solutions in the table since their model sizes and computational complexity are too high to be used on mobile/edge platforms. The table shows that LSR V1 and LSR V2 achieve the best SSIM performance among all benchmarking methods, while their PSNR performance is close to that of SRCNN.

We also show four test SR images in Fig. 3 for visual comparison. LSR V1 and SRCNN offer better visual quality with sharper edges and textures than SelfExSR and A+. Although the visual quality of LSR and SRCNN is comparable, their complexity is quite different, as presented in the following subsection.

4.3 Complexity and Model Size Comparison

The computational complexity is measured in terms of floating-point operations (FLOPs) per pixel in inference, and the model size in terms of the number of model parameters. Three benchmarking methods are compared to the proposed LSR in Table 5. As one of the best non-DL-based methods, A+ (1024-atom dictionary version) shows a reasonable FLOPs value but a large model size. SRCNN (9-5-5 version) has a very small model size, while its FLOPs value is large. As a medium-size DL-based method, VDSR shows a very large FLOPs value. Although the model size of LSR V1 is comparable with that of VDSR, its FLOPs per pixel is only 0.70% of that of VDSR. Besides, while achieving similar PSNR/SSIM, the FLOPs per pixel of LSR V1 is only 8.11% of that of SRCNN. LSR V2 further reduces the FLOPs per pixel to 3.35% of that of SRCNN. LSR thus shows very low computational complexity in inference, and its model size is also acceptable for mobile devices.
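The quoted percentages follow directly from the FLOPs-per-pixel figures. The short check below recomputes them, assuming the unrounded totals from the appendix tables (114368 for SRCNN, 1329409 for VDSR, 9278 for LSR V1) and the rounded Table 5 entry of 3.83K for LSR V2; the dictionary and function names are illustrative.

```python
# FLOPs per pixel: totals from Tables 7, 8 and 10, plus the rounded
# LSR V2 entry of Table 5 (3.83K), since no unrounded V2 total is listed.
FLOPS_PER_PIXEL = {
    "SRCNN": 114368,
    "VDSR": 1329409,
    "LSR V1": 9278,
    "LSR V2": 3830,
}


def percent_of(method, baseline, table=FLOPS_PER_PIXEL):
    """FLOPs per pixel of `method` as a percentage of `baseline`, to 2 decimals."""
    return round(100.0 * table[method] / table[baseline], 2)


print(percent_of("LSR V1", "VDSR"))   # LSR V1 relative to VDSR
print(percent_of("LSR V1", "SRCNN"))  # LSR V1 relative to SRCNN
print(percent_of("LSR V2", "SRCNN"))  # LSR V2 relative to SRCNN
```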

5 CONCLUSION AND FUTURE WORK

A light-weight SR method, called LSR, was proposed in this work. It offers good visual quality that is comparable with that of SRCNN, which is an entry-level DL-based method, but at a significantly lower computational complexity (i.e., 8.11% in terms of FLOPs per pixel by V1, even 3.35% by V2). Besides, we presented a wide range of design choices that can lower the computational cost even more with slight quality degradation (see Table 2 and Table 3).

There are several topics worth future research. First, we would like to generalize our current method to other scale factors ( $\times 3$ , $\times 4$ , etc.). Second, it is desirable to boost the visual quality further while keeping the computational complexity low, which may be achieved by more effective ensemble learning. Third, it is critical to develop a real-time video SR solution. The main challenge in moving from image-based to video-based SR is the preservation of temporal smoothness.

6 APPENDICES

The detailed calculations of the computational complexity in inference (in terms of FLOPs) and the model size (in terms of the number of model parameters) for the methods in Table 5 are presented in this section. The image “woman.png” of resolution $344\times 228$ in the Set5 test dataset is used as an example. All calculations are based on the original code published by the authors of each paper. “FLOPs per pixel” in each step is obtained by dividing the FLOPs of the whole image by the number of pixels in the final predicted HR image. FLOPs, FLOPs per pixel, and model sizes are denoted by $F$ , $F_{p}$ and $M$ , respectively.

Interpolation from low-resolution images to the size of the high-resolution images is commonly adopted as a pre-processing step in all algorithms under consideration. Since the interpolation process is usually fast and requires no learned model, our complexity computation below does not include this procedure. Besides, different algorithms have different strategies for handling image borders. For a fair comparison, we assume that all algorithms generate feature maps and predict the HR image based on the ILR image.

6.1 FLOPs and Model Size of Typical Operations

There are several typical procedures involved in various SR algorithms. The calculation of $F$ and $M$ on these procedures is discussed below.

Pixel-wise operation. For a single pixel-wise operation (addition or multiplication) on a set of $N$ images (or patches) with height $H$ , width $W$ , and depth (number of channels) $C$ , we have $F=H\times W\times C\times N$ .

Matrix Multiplication. For a matrix multiplication between a $T_{h}\times T_{w}$ transform matrix and $T_{w}\times N$ sample matrix for $N$ samples, we have

$$F = (2 \times T_{w} - 1) \times T_{h} \times N, \tag{3}$$

$$M = T_{h} \times T_{w}, \tag{4}$$

with $T_{w}$ multiplications and ( $T_{w}-1$ ) additions for each element in the $T_{h}\times N$ output matrix.

3D Convolution or Filtering. For the convolution operations that generate 2D feature maps by 3D convolution kernels, we use $C_{i}$ to denote the number of channels of the input image, and $K_{h}$ and $K_{w}$ to represent height and width of the convolution kernel, respectively. To generate one spatial feature response in one feature map, we need

$$2 \times C_{i} \times K_{h} \times K_{w} - 1$$

operations, including $C_{i}\times K_{h}\times K_{w}$ multiplications and $C_{i}\times K_{h}\times K_{w}-1$ additions. The bias-adding operation demands one addition operation for each spatial point at a feature map. Thus, $F$ and $M$ for a 3D convolution operation without bias on an image can be computed as

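$$F = (2 \times C_{i} \times K_{h} \times K_{w} - 1) \times H_{o} \times W_{o} \times C_{o}, \tag{5}$$

$$M = C_{i} \times K_{h} \times K_{w} \times C_{o}, \tag{6}$$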

where $H_{o}$ and $W_{o}$ are the height and width of the feature maps, respectively, and $C_{o}$ is the number of feature maps. If there exists a bias term, we have

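$$F = 2 \times C_{i} \times K_{h} \times K_{w} \times H_{o} \times W_{o} \times C_{o}, \tag{7}$$

$$M = (C_{i} \times K_{h} \times K_{w} + 1) \times C_{o}. \tag{8}$$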
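As a cross-check, the convolution counts described above can be evaluated numerically. The sketch below counts $2\times C_{i}\times K_{h}\times K_{w}-1$ operations per response, plus one addition per response when a bias is present, and stores one extra parameter per output channel for the bias; this parameter count is our reading of the text, checked against the SRCNN rows of Table 7 and the first A+ filter of Table 6. The function name `conv_cost` is illustrative.

```python
def conv_cost(c_in, k_h, k_w, h_out, w_out, c_out, bias=True):
    """Return (FLOPs, FLOPs per pixel, parameter count) of one convolution layer."""
    extra = 1 if bias else 0
    per_response = 2 * c_in * k_h * k_w - 1 + extra   # operations per output value
    flops = per_response * h_out * w_out * c_out
    flops_per_pixel = per_response * c_out            # FLOPs divided by h_out * w_out
    params = (c_in * k_h * k_w + extra) * c_out
    return flops, flops_per_pixel, params


# SRCNN (9-5-5) on the 344 x 228 example image, layer settings from Table 7.
srcnn_layers = [(1, 9, 9, 64), (64, 5, 5, 32), (32, 5, 5, 1)]
costs = [conv_cost(c_i, k_h, k_w, 344, 228, c_o) for c_i, k_h, k_w, c_o in srcnn_layers]

total_flops = sum(c[0] for c in costs)
total_fp = sum(c[1] for c in costs)
total_params = sum(c[2] for c in costs)
print(total_flops, total_fp, total_params)

# First-order derivative filter of A+ (no bias), first row of Table 6.
print(conv_cost(1, 1, 3, 344, 228, 1, bias=False))
```

The per-layer values agree with the $F_{p}$ and $M$ columns of Table 7, and the totals with its last row.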

6.2 A+

As an exemplar-based SISR algorithm, A+ uses $6\times 6$ ILR patches (with an overlapping width of 2) to predict the corresponding $6\times 6$ residue patches, and averages the values in patch overlapping regions. A+ mainly contains three sequential procedures: 1) ILR feature extraction, 2) residue patch prediction, and 3) HR image prediction. The calculation of $F$ , $F_{p}$ and $M$ for A+ is given in Table 6. Based on the example image, 18480 $6\times 6$ patches are formed in the entire process.

ILR Feature Extraction. Four feature maps are generated using the first and the second order derivative filters ( $D^{1}$ , $D^{2}$ ) along the image height ( $D^{1}_{h}$ , $D^{2}_{h}$ ) and width ( $D^{1}_{w}$ , $D^{2}_{w}$ ), respectively.

Residue Patch Prediction. For a $6\times 6$ ILR patch, ILR raw features of 144 dimensions are formed by taking the corresponding $6\times 6$ region in the four feature maps and concatenating them. Afterward, three steps are executed to generate the residue patches: 1) reduce the ILR features from 144 dimensions to 28 dimensions, 2) calculate the distance to 1024 ILR dictionary atoms to identify the closest atom for each ILR patch, and 3) predict the residue patch values by the regressor associated with the closest ILR dictionary atom for each ILR patch. All three steps are implemented by 2D matrix multiplication. Then, the raw regression predictions (36-D) are reshaped into $6\times 6$ patches to form the eventual residue patch prediction. In the third step of this procedure, "Regression Prediction", $F$ is calculated for the closest regressor only, while $M$ includes the regressors associated with all 1024 ILR dictionary atoms.

HR Image Prediction. Predicted HR patches are obtained by adding corresponding ILR values to the predicted residue patches. Then, they are used to reconstruct the complete predicted HR image one by one. Two $344\times 228$ all-zero matrices are generated, with one for pixel value accumulation, and the other for counting pixel coverage times from different patches. The final HR image prediction is obtained by the division of the accumulation matrix by the counting matrix. All steps in this procedure are pixel-wise operations.
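The accumulate-and-count reconstruction described above can be sketched as follows. The function name `reconstruct`, the `(top, left, patch)` input layout, and the handling of pixels covered by no patch (set to zero here) are assumptions made for illustration; the example uses $2\times 2$ patches on a tiny image rather than the $6\times 6$ patches of A+.

```python
def reconstruct(height, width, patches, patch_size=6):
    """Average overlapping patch predictions into one image.

    `patches` is an iterable of (top, left, patch) tuples, where `patch` is a
    patch_size x patch_size list of rows of predicted HR values.
    """
    acc = [[0.0] * width for _ in range(height)]  # pixel value accumulation
    cnt = [[0] * width for _ in range(height)]    # pixel coverage count
    for top, left, patch in patches:
        for r in range(patch_size):
            for c in range(patch_size):
                acc[top + r][left + c] += patch[r][c]
                cnt[top + r][left + c] += 1
    # Pixel-wise division; uncovered pixels are left at zero (an assumption).
    return [
        [acc[r][c] / cnt[r][c] if cnt[r][c] else 0.0 for c in range(width)]
        for r in range(height)
    ]


# Two 2 x 2 patches overlapping in the middle column of a 2 x 3 image.
patches = [(0, 0, [[1, 1], [1, 1]]), (0, 1, [[3, 3], [3, 3]])]
print(reconstruct(2, 3, patches, patch_size=2))
```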

6.3 SRCNN

SRCNN is a DL-based method that applies three convolution layers with bias to ILR images to generate predicted HR images. The calculation of $F$ , $F_{p}$ and $M$ is shown in Table 7.

6.4 VDSR

VDSR is a 20-layer DL-based method with a $3\times 3$ kernel in each layer. VDSR utilizes ILR images to predict residue images, and the final predicted HR images are obtained by adding the predicted residues to the ILR images. The calculation of $F$ , $F_{p}$ and $M$ is shown in Table 8, where $N_{l}$ denotes the number of layers.

6.5 LSR

Since inference samples are partitioned into easy and hard samples, we compute the FLOPs per pixel, $F_{p}$ , as the weighted sum over easy and hard samples, where the weights are determined by the ratio of easy and hard samples in representative images. The model size includes the model parameters of both partitions. The $F_{p}$ and $M$ calculations for each module are provided in Table 9. The complexity calculations for V1 and V2 are summarized in Table 10.
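The weighted sum can be reproduced from the V1 summary figures (Table 10). The sketch below assumes the easy/hard ratios of 56%/44% from Table 1 and the per-partition sub-totals of the V1 summary; the function name `lsr_totals` is illustrative.

```python
def lsr_totals(fp_easy, fp_hard, m_easy, m_hard, easy_ratio):
    """Weighted FLOPs per pixel and total model size over both partitions."""
    fp_total = easy_ratio * fp_easy + (1.0 - easy_ratio) * fp_hard
    m_total = m_easy + m_hard  # both partitions must be stored
    return fp_total, m_total


# Sub-totals of the V1 summary: F_p and M for easy and hard samples.
fp_total, m_total = lsr_totals(454, 20509, 9686, 763931, easy_ratio=0.56)
print(round(fp_total), m_total)
```

The FLOPs are averaged because each pixel passes through only one of the two partitions, whereas the parameters of both partitions count toward the model size.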

Module 1: Unsupervised Representation Learning (URL). Being slightly different from the 3D convolution operation without bias (eq. 5) using 3D kernels or filters, LSR uses channel-wise 2D filters. $RT$ in Table 9 denotes the representation type(s). $F_{p}$ and $M$ of LSR in Module 1 for a certain representation type are obtained by

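$$F_{p} = (2 \times C_{i} \times K_{h} \times K_{w} - 1) \times C_{o} \times N_{type} \times f, \tag{9}$$

$$M = C_{i} \times K_{h} \times K_{w} \times C_{o} \times N_{type}, \tag{10}$$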

where $N_{type}$ counts the number of filter sets (original filters and PCA filters) applied for a certain representation type, which is mainly relevant for the Type 4 and Type 5 representations. $N_{type}=0$ means that no filtering operation is needed and no filter parameters need to be stored. $N_{type}=1$ means that the normal transform operations and the corresponding filters are required. $N_{type}=2$ means that both the normal transform and the channel-wise PCA transform are involved. We use $f$ to denote the number of inference augmentations for fusion. The sibling candidate samples of one inference sample undergo the complete prediction process. Thus, $F_{p}$ for one inference sample needs to include the factor $f$ .
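The Module 1 entries can be checked numerically. The per-type formulas used below, $F_{p}=(2C_{i}K_{h}K_{w}-1)\,C_{o}\,N_{type}\,f$ and $M=C_{i}K_{h}K_{w}\,C_{o}\,N_{type}$, are reconstructed from the rows of Table 9 rather than quoted from the paper, and the function name `url_cost` is illustrative. With $f=2$ (fusion by 2, as in V1), the sum over all rows reproduces the hard-sample URL sub-total of the V1 summary.

```python
def url_cost(c_in, k_h, k_w, c_out, n_type, f=1):
    """FLOPs per pixel and parameter count of one representation type.

    Channel-wise 2D filters without bias; formulas reconstructed from Table 9.
    """
    f_p = (2 * c_in * k_h * k_w - 1) * c_out * n_type * f
    m = c_in * k_h * k_w * c_out * n_type
    return f_p, m


# Rows of Table 9: (C_i, K_h, K_w, C_o, N_type)
TABLE_9 = {
    "Type 1, Spatial": (1, 1, 1, 1, 0),
    "Type 2, Central Saab 5x5": (1, 5, 5, 25, 1),
    "Type 2, Central Saab 7x7": (1, 7, 7, 49, 1),
    "Type 3, Ringwise Saab": (1, 3, 3, 9, 1),
    "Type 4, Haar & PCA": (1, 2, 2, 4, 2),
    "Type 5, Laws & PCA": (1, 3, 3, 9, 2),
}

per_type = {name: url_cost(*row) for name, row in TABLE_9.items()}
hard_fp = sum(url_cost(*row, f=2)[0] for row in TABLE_9.values())
hard_m = sum(m for _, m in per_type.values())
print(per_type)
print(hard_fp, hard_m)
```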

Module 2: Supervised Feature Learning (SFL). $F$ in this module is zero due to absence of mathematical operations. The model stores the representation indices which are selected as regression features. Thus, $M$ is determined by the number of features for regression ( $N_{fr}$ ) selected from representation pool.

Module 3: Supervised Decision Learning (SDL). There are three steps for inference samples in this module: cluster prediction, regressor prediction, and prediction fusion.

• Cluster Prediction. Denote the number of clustering features by $N_{fc}$ , and the number of clusters by $N_{c}$ . For one sample, the FLOPs consumed in its cluster label prediction come from the calculation of the L2 distance to all $N_{c}$ cluster centroids, which can be simplified as the first term in eq. (11),

$$F_{p} = \max((3 \times N_{fc} - 1) \times N_{c}, 0) \times f, \tag{11}$$

including $N_{fc}$ subtractions, $N_{fc}$ multiplications, and $(N_{fc}-1)$ additions with respect to each cluster centroid. The maximum operation in eq. (11) generalizes the calculation to easy data, which has no clustering procedure. $F_{p}$ for each inference sample involves multiplication by $f$ , the number of sibling candidate samples. $M$ contains only the cluster centroids of the clustering procedure. It is equal to

$$M = N_{fc} \times N_{c}. \tag{12}$$

• Regressor Prediction. LSR learns an XGBoost regressor in each cluster for prediction. The upper bound on the FLOPs of one sample prediction by an XGBoost regressor with $N_{tree}$ boosting trees and maximum depth $d_{M}$ is $d_{M}\times N_{tree}$ , where $d_{M}$ is the FLOPs value for one boosting tree, since one sample traverses at most $d_{M}$ nodes until it arrives at a leaf node, and each node performs only one operation. Similar to the clustering procedure, the FLOPs for one inference sample also need to be multiplied by the fusion number $f$ . Thus, we have

$$F_{p} = d_{M} \times N_{tree} \times f. \tag{13}$$

For a complete binary decision tree with depth $d_{M}$ , the numbers of leaf nodes ( $N_{leaf}$ ) and parent nodes ( $N_{parent}$ ) are calculated by

$$N_{leaf} = 2^{d_{M}}, \quad N_{parent} = \sum_{d=0}^{d_{M}-1} 2^{d} = 2^{d_{M}} - 1, \tag{14}$$

and its parameter number is $(2\times N_{parent}+N_{leaf})$ , with parent nodes storing feature index and partition threshold, and leaf nodes storing the prediction weight. Thus, the value of $M$ for all regressors is bounded by

$$M = (2 \times N_{parent} + N_{leaf}) \times N_{tree} \times N_{c}. \tag{15}$$

• Prediction Fusion. Hard data need $(f-1)$ additional addition operations and one division operation to average the predictions from sibling samples.
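The three steps of Module 3 can be combined into one cost estimate. The sketch below follows the cluster-prediction, regressor-prediction, and fusion counts described above; the function name `sdl_cost` is illustrative. The tree numbers, depths, and cluster numbers come from Table 1, and $f=2$ corresponds to fusion by 2. The value $N_{fc}=32$ is not stated in the text: it is an assumption, chosen because it reproduces the hard-sample SDL sub-total of the V1 summary, and $N_{fc}=0$ is used for easy data, which has no clustering.

```python
def sdl_cost(n_fc, n_c, n_tree, d_max, f):
    """Upper-bound FLOPs per pixel and parameter count of Module 3."""
    # Cluster prediction: L2 distance to every centroid, for f sibling samples.
    cluster_fp = max((3 * n_fc - 1) * n_c, 0) * f
    cluster_m = n_fc * n_c
    # Regressor prediction: at most d_max node visits per tree.
    tree_fp = d_max * n_tree * f
    n_leaf = 2 ** d_max
    n_parent = 2 ** d_max - 1
    tree_m = (2 * n_parent + n_leaf) * n_tree * n_c
    # Prediction fusion: (f - 1) additions and one division when f > 1.
    fusion_fp = f if f > 1 else 0
    return cluster_fp + tree_fp + fusion_fp, cluster_m + tree_m


hard = sdl_cost(n_fc=32, n_c=8, n_tree=500, d_max=6, f=2)  # n_fc is an assumption
easy = sdl_cost(n_fc=0, n_c=1, n_tree=50, d_max=6, f=1)
print(hard, easy)
```

For easy data the term $(3\times N_{fc}-1)\times N_{c}$ is negative, which is where the maximum with zero takes effect.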

The raw regression results are residue pixel values. We need a pixel-wise post-processing step that adds a residual to an ILR value to yield the final HR prediction. The difference between the V1 and V2 models lies in the URL representation preparation and the SFL feature selection.

Bibliography

[1] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical models and image processing, vol. 53, no. 3, pp. 231–239, 1991.
[2] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2015.
[3] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., vol. 1. IEEE, 2004, pp. I–I.
[4] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang, “Coupled dictionary training for image super-resolution,” IEEE transactions on image processing, vol. 21, no. 8, pp. 3467–3478, 2012.
[5] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Asian conference on computer vision. Springer, 2014, pp. 111–126.
[6] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5197–5206.
[7] S. Schulter, C. Leistner, and H. Bischof, “Fast and accurate image upscaling with super-resolution forests,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
[8] P. Sandeep and T. Jacob, “Single image super-resolution using a joint gmm method,” IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4233–4244, 2016.