Adaptive Transform Domain Image Super-resolution Via Orthogonally Regularized Deep Networks

Tiantong Guo, Hojjat S. Mousavi, and Vishal Monga

arXiv:1904.10082·cs.CV·September 4, 2019

TL;DR

This paper introduces a novel deep learning approach for image super-resolution that operates in the transform domain using an orthogonally regularized, trainable DCT layer, achieving state-of-the-art results with fewer parameters.

Contribution

It proposes a new DCT-based deep super-resolution network with an orthogonality constraint, improving efficiency and performance over existing CNN methods.

Findings

01

Achieves state-of-the-art super-resolution quality.

02

Requires fewer parameters than traditional CNN approaches.

03

Performs well with limited training data.

Abstract

Deep learning methods, in particular, trained Convolutional Neural Networks (CNN) have recently been shown to produce compelling results for single image Super-Resolution (SR). Invariably, a CNN is learned to map the Low Resolution (LR) image to its corresponding High Resolution (HR) version in the spatial domain. We propose a novel network structure for learning the SR mapping function in an image transform domain, specifically the Discrete Cosine Transform (DCT). As the first contribution, we show that DCT can be integrated into the network structure as a Convolutional DCT (CDCT) layer. With the CDCT layer, we construct the DCT Deep SR (DCT-DSR) network. We further extend the DCT-DSR to allow the CDCT layer to become trainable (i.e., optimizable). Because this layer represents an image transform, we enforce pairwise orthogonality constraints and newly formulated complexity order…

Tables (7)

Table I. Average PSNR, SSIM and IFC results on Set14 with scale factor 3 for different filter size and filter number setups.

Filter sizes	Number of filters	PSNR	SSIM	IFC
$n_{1} = 5$, $n_{i} = 3$, $i \in \{2, \dots, 15\}$	$m_{i} = 32$, $i \in \{1, \dots, 14\}$	29.95	0.8322	4.43
$n_{i} = 5$, $i \in \{1, \dots, 15\}$	$m_{i} = 32$, $i \in \{1, \dots, 14\}$	29.87	0.8316	4.39
$n_{i} = 3$, $i \in \{1, \dots, 15\}$	$m_{i} = 32$, $i \in \{1, \dots, 14\}$	29.92	0.8318	4.42
$n_{1} = 5$, $n_{i} = 3$, $i \in \{2, \dots, 15\}$	$m_{i} = 64$, $i \in \{1, \dots, 14\}$	30.26	0.8380	4.55
$n_{i} = 5$, $i \in \{1, \dots, 15\}$	$m_{i} = 64$, $i \in \{1, \dots, 14\}$	29.96	0.8298	4.45
$n_{i} = 3$, $i \in \{1, \dots, 15\}$	$m_{i} = 64$, $i \in \{1, \dots, 14\}$	30.08	0.8301	4.47

Table II. Different variants of ORDSR. OC stands for Orthogonality Constraint, CC stands for Complexity order Constraint. ✓ means the layer is learnable or the constraint is in place during learning.

Notation	CDCT layer learnable	OC	CC
ORDSR	✓	✓	✓
DSR-OC	✓	✓	-
DSR-CC	✓	-	✓
DSR-UC	✓	-	-
DCT-DSR	-	-	-

Table III. Performance of the method variants defined in Table II. The table shows average (PSNR/SSIM/IFC) results on Set14 with scale factor 3.

Training Data Used (%)	$100\%$	$10\%$
ORDSR	30.26/0.8380/4.55	29.67/0.8265/4.30
DSR-OC	30.18/0.8314/4.42	29.28/0.8197/4.01
DSR-CC	30.02/0.8295/4.41	29.03/0.8131/3.78
DSR-UC	29.51/0.8240/4.09	28.21/0.8059/3.46
DCT-DSR	29.86/0.8337/4.39	29.10/0.8143/3.87
ORDSR-RI	29.56/0.8280/4.17	28.76/0.8074/3.52

Table IV. PSNR comparisons over Set5, Set14, BSD100, and Urban100.

Dataset	Scale	Bicubic	ScSR [9]	A+ [68]	SelfEx [32]	SCN [69]	SRCNN [13]	FSRCNN [17]	VDSR [21]	DWSR [22]	RDN [70]	EDSR [71]	DCT-DSR (proposed)	ORDSR (proposed)	ORDSR+ (proposed)
Set5	x2	33.64	35.78	36.55	36.50	36.58	36.66	36.94	37.52	37.55	37.93	37.62	37.50	38.08	38.12
Set5	x3	30.39	31.34	32.58	32.62	32.61	32.75	33.06	33.66	33.69	34.19	33.72	33.75	34.31	34.37
Set5	x4	28.42	29.07	30.27	30.32	30.41	30.48	30.55	31.35	31.98	32.54	32.08	32.05	32.62	32.65
Set14	x2	30.22	31.64	32.29	32.24	32.35	32.42	32.54	33.02	33.10	33.89	33.56	33.08	34.06	34.09
Set14	x3	27.53	28.19	29.13	29.16	29.16	29.28	29.37	29.75	29.77	30.13	30.01	29.86	30.26	30.30
Set14	x4	25.99	26.40	27.33	27.40	27.39	27.40	27.50	28.01	29.98	28.68	27.97	28.03	28.81	28.84
BSD100	x2	29.55	30.77	31.21	31.18	31.26	31.36	31.66	31.85	31.83	32.09	31.97	31.72	32.24	32.27
BSD100	x3	27.21	27.72	28.18	28.30	28.58	28.20	28.52	28.82	28.87	29.28	29.21	28.93	29.41	29.43
BSD100	x4	25.96	26.61	26.82	26.84	26.88	26.84	26.92	27.23	27.29	27.64	27.32	27.35	27.80	27.84
Urban100	x2	26.66	28.26	29.20	29.54	29.52	29.50	29.87	30.76	30.81	31.32	31.14	30.88	31.59	31.61
Urban100	x3	24.46	25.34	26.03	25.69	25.56	26.24	26.35	27.14	27.07	28.07	27.96	27.08	28.18	28.19
Urban100	x4	23.14	24.02	24.32	24.78	25.13	24.52	24.61	25.15	25.19	25.63	25.07	25.17	25.76	25.79

Table V. SSIM comparisons over Set5, Set14, BSD100, and Urban100. Higher SSIM score (max=1) corresponds to greater structural similarity.

Dataset	Scale	Bicubic	ScSR [9]	A+ [68]	SelfEx [32]	SCN [69]	SRCNN [13]	FSRCNN [17]	VDSR [21]	DWSR [22]	RDN [70]	EDSR [71]	DCT-DSR (proposed)	ORDSR (proposed)	ORDSR+ (proposed)
Set5	x2	0.9292	0.9485	0.9544	0.9538	0.9540	0.9542	0.9558	0.9586	0.9577	0.9590	0.9587	0.9573	0.9599	0.9602
Set5	x3	0.8678	0.8869	0.9088	0.9092	0.9080	0.9090	0.9140	0.9212	0.9214	0.9213	0.9218	0.9220	0.9226	0.9229
Set5	x4	0.8101	0.8263	0.8605	0.8640	0.8630	0.8628	0.8657	0.8820	0.8843	0.9015	0.8923	0.8850	0.9060	0.9063
Set14	x2	0.8683	0.8940	0.9055	0.9032	0.9050	0.9063	0.9088	0.9102	0.9104	0.9138	0.9133	0.9091	0.9184	0.9187
Set14	x3	0.7737	0.7977	0.8188	0.8196	0.8180	0.8209	0.8242	0.8294	0.8315	0.8349	0.8352	0.8337	0.8380	0.8383
Set14	x4	0.7023	0.7218	0.7489	0.7518	0.7510	0.7503	0.7535	0.7662	0.7665	0.7792	0.7668	0.7680	0.7823	0.7826
BSD100	x2	0.8425	0.8744	0.8864	0.8855	0.8850	0.8879	0.8920	0.8960	0.8947	0.8979	0.8975	0.8954	0.8984	0.8986
BSD100	x3	0.7382	0.7647	0.7836	0.7778	0.7910	0.7863	0.7897	0.7976	0.7980	0.8007	0.8011	0.7992	0.8045	0.8048
BSD100	x4	0.6672	0.6983	0.7087	0.7106	0.7110	0.7101	0.7201	0.7238	0.7243	0.7316	0.7276	0.7285	0.7363	0.7367
Urban100	x2	0.8408	0.8828	0.8938	0.8967	0.8970	0.8946	0.9010	0.9140	0.9127	0.9170	0.9157	0.9136	0.9181	0.9183
Urban100	x3	0.7349	0.7827	0.7973	0.7864	0.8016	0.7989	0.7512	0.8272	0.8265	0.8354	0.8269	0.8193	0.8381	0.8384
Urban100	x4	0.6573	0.7024	0.7186	0.7374	0.7260	0.7221	0.7270	0.7524	0.7591	0.7755	0.7582	0.7608	0.7787	0.7789

Table VI. IFC comparisons over Set5, Set14, BSD100, and Urban100. Higher IFC score indicates better alignment of natural scene statistics.

Dataset	Scale	Bicubic	ScSR [9]	A+ [68]	SelfEx [32]	SCN [69]	SRCNN [13]	FSRCNN [17]	VDSR [21]	DWSR [22]	RDN [70]	EDSR [71]	DCT-DSR (proposed)	ORDSR (proposed)	ORDSR+ (proposed)
Set5	x2	5.72	6.94	8.48	7.35	7.36	8.05	8.06	8.76	8.69	8.80	8.77	8.56	8.82	8.83
Set5	x3	3.45	3.98	4.84	4.05	4.32	4.58	4.56	4.85	4.47	4.74	4.79	4.87	4.96	4.98
Set5	x4	2.28	2.57	3.26	3.12	2.91	3.01	2.76	3.36	3.31	3.81	3.66	3.78	4.02	4.03
Set14	x2	5.74	6.83	7.35	7.05	7.08	6.68	7.47	7.53	7.40	7.61	7.58	7.49	7.67	7.69
Set14	x3	3.33	3.75	4.26	4.12	4.00	3.81	4.24	4.33	4.31	4.38	4.43	4.39	4.55	4.57
Set14	x4	2.18	2.46	2.94	2.32	2.65	2.50	2.55	2.80	2.97	3.15	3.06	3.11	3.20	3.23
BSD100	x2	5.26	6.20	7.15	6.84	6.50	6.09	7.01	7.16	7.14	7.21	7.19	7.22	7.29	7.30
BSD100	x3	2.98	3.14	3.23	3.80	3.46	3.52	3.71	3.83	3.84	3.96	3.85	3.87	4.07	4.09
BSD100	x4	1.91	2.22	2.51	2.44	2.30	2.18	2.32	2.62	2.57	2.70	2.53	2.65	2.89	2.90
Urban100	x2	5.72	6.98	8.02	7.96	7.32	6.66	8.13	8.27	8.30	8.34	8.31	8.35	8.42	8.45
Urban100	x3	3.42	3.16	3.78	3.55	3.32	4.01	4.43	4.63	4.71	4.92	4.85	4.82	5.03	5.06
Urban100	x4	2.27	2.75	3.16	3.21	2.86	2.63	3.02	3.40	3.39	3.40	3.42	3.36	3.45	3.47

Table VII. Image quality metric (PSNR/SSIM/IFC) comparisons over Set5 and Set14. 10% of the training images are used.

Dataset	Scale	FSRCNN [17]	VDSR [21]	EDSR [71]	DCT-DSR (proposed)	ORDSR (proposed)
Set5	x2	36.18/0.9409/6.38	36.82/0.9515/7.10	36.75/0.9502/7.01	36.98/0.9533/7.23	37.24/0.9542/7.95
Set5	x3	32.23/0.9097/3.85	32.84/0.9187/4.01	32.76/0.9123/3.92	32.87/0.9192/4.03	33.42/0.9201/4.16
Set5	x4	29.87/0.8678/2.31	30.57/0.8763/2.98	30.35/0.8725/2.76	30.76/0.8772/2.97	31.14/0.8792/3.02
Set14	x2	32.29/0.8920/6.26	32.35/0.8986/6.53	32.32/0.8952/6.38	32.56/0.8979/6.62	32.98/0.9001/6.96
Set14	x3	28.60/0.8089/3.69	29.03/0.8119/3.80	28.78/0.8102/3.72	29.10/0.8143/3.87	29.67/0.8265/4.30
Set14	x4	26.82/0.7298/2.23	27.25/0.7420/2.50	27.26/0.7416/2.53	27.35/0.7482/2.62	27.89/0.7532/3.17

Equations

$$X_{m,n}(k_{1},k_{2})=\sum_{n_{2}=0}^{N-1}\sum_{n_{1}=0}^{N-1}\mathbf{x}_{m,n}(n_{1},n_{2})\times\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}(n_{1},n_{2})$$

$$\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}(n_{1},n_{2})=C_{k_{1},k_{2}}\cos\left[\frac{\pi}{N}\left(n_{1}+\frac{1}{2}\right)k_{1}\right]\cos\left[\frac{\pi}{N}\left(n_{2}+\frac{1}{2}\right)k_{2}\right]$$

$$\mathbf{x}_{m,n}(n_{1},n_{2})=\sum_{k_{2}=0}^{N-1}\sum_{k_{1}=0}^{N-1}X_{m,n}(k_{1},k_{2})\times\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}(n_{1},n_{2})$$

$$<\mathbf{w}^{\text{dct}}_{k_{1},k_{2}},\mathbf{w}^{\text{dct}}_{l_{1},l_{2}}>=\begin{cases}1, & \text{if } k_{1}=l_{1} \text{ and } k_{2}=l_{2}\\ 0, & \text{otherwise}\end{cases}$$

$$f_{i}=\mathbf{w}_{i}*\mathbf{x},\quad\forall i\in\{1,\ldots,64\}$$

$$z_{l}=\max(a_{l}*W_{l}+b_{l},0)$$

$$\hat{f}_{high}=\max(z_{n-1}*W_{D}+b_{D},0)+f_{high}$$

$$\mathbf{\hat{y}}=\sum_{i=1}^{64}\mathbf{w}_{i}*g_{s}(f_{i})$$

$$\forall i\neq j,\quad\|vec(\mathbf{w}_{i})^{T}vec(\mathbf{w}_{j})\|_{2}^{2}$$

$$\|var(\mathbf{w}_{t})-var(\mathbf{w}^{\text{dct}}_{t})\|_{2}^{2}=0$$

$$var(\mathbf{w})=\frac{1}{N^{2}-1}\sum_{m}\left(w^{m}-\frac{1}{N^{2}}\sum_{n}w^{n}\right)^{2}$$

$$L(\Theta,B)=\underbrace{\frac{1}{2}\|F(\mathbf{x})-\mathbf{y}\|_{2}^{2}}_{\text{MSE loss}}+\underbrace{\sigma\frac{1}{2}\sum_{l}\sum_{m}^{m_{l}}\|W_{l_{m}}\|_{2}^{2}}_{\text{weight decay}}+\underbrace{\gamma\frac{1}{2}\sum_{(i,j),\,i\neq j}\|vec(\mathbf{w}_{i})^{T}vec(\mathbf{w}_{j})\|_{2}^{2}}_{\text{orthogonality constraint}}+\underbrace{\lambda\frac{1}{2}\sum_{t}\|var(\mathbf{w}_{t})-var(\mathbf{w}^{\text{dct}}_{t})\|_{2}^{2}}_{\text{complexity order constraint}}$$

$$\Theta,B=\arg\min_{\Theta,B}L(\Theta,B)$$

$$\frac{\partial L}{\partial W_{l}},\quad\frac{\partial L}{\partial\mathbf{w}_{i}}$$

$$\frac{\partial L}{\partial W_{l}^{a}}=-<(\mathbf{\hat{y}}-\mathbf{y}),\frac{\partial\mathbf{y}}{\partial W_{l}^{a}}>_{F}+\sigma<W_{l},\frac{\partial W_{l}}{\partial W_{l}^{a}}>_{F}$$

$$\frac{\partial L}{\partial w_{i}^{a}}=-<(\mathbf{\hat{y}}-\mathbf{y}),\frac{\partial\mathbf{y}}{\partial w_{i}^{a}}>_{F}+\underbrace{\gamma\sum_{j}\left(vec(\mathbf{w}_{i})^{T}vec(\mathbf{w}_{j})\right)w_{j}^{a}}_{\text{gradient of orthogonality constraint w.r.t. } w_{i}^{a}}+\underbrace{\lambda\frac{\partial var(\mathbf{w}_{i})}{\partial w_{i}^{a}}\left(var(\mathbf{w}_{i})-var(\mathbf{w}^{\text{dct}}_{i})\right)}_{\text{gradient of complexity order constraint w.r.t. } w_{i}^{a}}$$

$$\frac{\partial var(\mathbf{w}_{i})}{\partial w_{i}^{a}}=\frac{2}{N^{2}(N^{2}-1)}\left[N^{2}w_{i}^{a}-\sum_{n}w_{i}^{n}-\sum_{m}\left(w_{i}^{m}-\frac{1}{N^{2}}\sum_{n}w_{i}^{n}\right)\right]$$

$$L(\Theta^{cnn},B)=\underbrace{\frac{1}{2}\|F(\mathbf{x})-\mathbf{y}\|_{2}^{2}}_{\text{MSE loss}}+\underbrace{\sigma\frac{1}{2}\sum_{l}\|W_{l}\|_{2}^{2}}_{\text{weight decay}}$$

$$g_{s}(X_{i}):=\bar{X}_{i}(p,q)=\begin{cases}\frac{1}{(N/S)^{2}}X_{i}(k,l), & \text{if } p=k\times S \text{ and } q=l\times S\\ 0, & \text{otherwise}\end{cases}$$

$$\mathbf{x}^{cdct}_{m,n}(n_{1},n_{2})=\sum_{i=1}^{N\times N}\sum_{p=0}^{N-1}\sum_{q=0}^{N-1}\bar{X}_{i}(m\times N+n_{1}-p,\;n\times N+n_{2}-q)\times\mathbf{w}_{i}(p,q)$$

$$\mathbf{x}^{cdct}_{m,n}(n_{1},n_{2})=\sum_{k=-N/2S}^{N/2S}\sum_{l=-N/2S}^{N/2S}\bar{\mathbf{x}}^{cdct}_{m-k,n-l}(n_{1}-k\times S,\;n_{2}-l\times S)$$

$$\mathbf{x}^{cdct}_{m,n}(n_{1},n_{2})=(N/S)^{2}\sum_{i}^{N\times N}\frac{1}{(N/S)^{2}}X_{i}(m,n)\times\mathbf{w}_{i}(n_{1},n_{2})$$
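The two regularizers that appear in the loss above (the orthogonality constraint and the complexity order constraint) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the sum over $i\neq j$ is taken over ordered pairs, and $var(\cdot)$ is read as the unbiased sample variance of the $N^{2}$ filter entries, which is one reading of the variance formula listed above. The DCT filters use the standard orthonormal DCT-II scaling.

```python
import numpy as np

def dct_filters(N=8):
    """Return (N*N, N, N) orthonormal 2-D DCT-II filters in row-major order."""
    n = np.arange(N)
    # Standard orthonormal scaling: sqrt(1/N) for k = 0, sqrt(2/N) otherwise.
    c = np.where(n == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    atoms = c[:, None] * np.cos(np.pi / N * (n[None, :] + 0.5) * n[:, None])
    return np.einsum("ka,lb->klab", atoms, atoms).reshape(N * N, N, N)

def orthogonality_penalty(w):
    """Sum over ordered pairs i != j of (vec(w_i)^T vec(w_j))^2."""
    flat = w.reshape(w.shape[0], -1)
    gram = flat @ flat.T
    off_diagonal = gram - np.diag(np.diag(gram))
    return np.sum(off_diagonal ** 2)

def filter_variance(w):
    """Unbiased sample variance of each filter's entries (1 / (N^2 - 1) scaling)."""
    return w.reshape(w.shape[0], -1).var(axis=1, ddof=1)

def complexity_penalty(w, w_dct):
    """Sum over filters of (var(w_t) - var(w_t^dct))^2."""
    return np.sum((filter_variance(w) - filter_variance(w_dct)) ** 2)

w_dct = dct_filters(8)
rng = np.random.default_rng(0)
w = w_dct + 0.05 * rng.standard_normal(w_dct.shape)  # hypothetical "learned" filters

print(orthogonality_penalty(w_dct), complexity_penalty(w_dct, w_dct))
print(orthogonality_penalty(w), complexity_penalty(w, w_dct))
```

Both penalties vanish (up to floating-point error) at the DCT initialization and become positive once the filters drift away from it, which is the behaviour the regularized loss relies on.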

Full text

Adaptive Transform Domain Image Super-resolution Via Orthogonally Regularized Deep Networks

Tiantong Guo, Hojjat S. Mousavi, and Vishal Monga

T. Guo, H. Mousavi and V. Monga are with the Department of Electrical Engineering, The Pennsylvania State University, University Park, PA, 16802. E-mail: see http://signal.ee.psu.edu

Manuscript received in August 2018, revised in February and April 2019.

Abstract

Deep learning methods, in particular trained Convolutional Neural Networks (CNNs), have recently been shown to produce compelling results for single image Super-Resolution (SR). Invariably, a CNN is learned to map the Low Resolution (LR) image to its corresponding High Resolution (HR) version in the spatial domain. We propose a novel network structure for learning the SR mapping function in an image transform domain, specifically the Discrete Cosine Transform (DCT). As the first contribution, we show that DCT can be integrated into the network structure as a Convolutional DCT (CDCT) layer. With the CDCT layer, we construct the DCT Deep SR (DCT-DSR) network. We further extend the DCT-DSR to allow the CDCT layer to become trainable (i.e., optimizable). Because this layer represents an image transform, we enforce pairwise orthogonality constraints and newly formulated complexity order constraints on the individual basis functions/filters. This Orthogonally Regularized Deep SR network (ORDSR) simplifies the SR task by taking advantage of the image transform domain while adapting the design of the transform basis to the training image set. Experimental results show ORDSR achieves state-of-the-art SR image quality with fewer parameters than most of the deep CNN methods. A particular success of ORDSR is in overcoming the artifacts introduced by bicubic interpolation. A key burden of deep SR has been identified as the requirement of generous training LR and HR image pairs; ORDSR exhibits a much more graceful degradation as training size is reduced, with significant benefits in the regime of limited training. Analysis of memory and computation requirements confirms that ORDSR can allow for a more efficient network with faster inference.

Index Terms:

Deep learning, super-resolution, image transform domain, orthogonality constraint, complexity constraint.

I Introduction

Image Super-Resolution (SR) has emerged as one of the most significant ill-posed image processing and vision problems due to a variety of applications in civilian domains as well as in law enforcement [1]. With an increase in the number of mobile cameras and devices, enhancing resolution via a fast, memory efficient process is highly desirable.

SR problems are divided into multi-image SR [2, 3, 4, 5] and Single Image SR (SISR) according to the number of images required. Multi-image SR methods exploit geometric diversity in a set of LR images (of the same scene) to enhance resolution. The performance of these methods is limited by the number of LR images available and the success of geometric alignment/transformation methods that model the differences in the LR image set [3].

SISR has been of more recent interest and has been addressed largely by dictionary-based and sparsity constrained learning methods and more recently via deep learning algorithms. A typical learning/example based SR approach employs two dictionaries of HR/LR images/patches [6, 7, 8, 9, 10]. These dictionaries are often learned with sparse-coding methods to reconstruct the SR results. Many of these methods require handcrafted dictionary features which are not readily available [11]. Section II-A discusses these methods in detail.

Recently, deep learning methods have been shown to produce compelling state-of-the-art SR results across a variety of different image collections [12]. One of the earliest deep SR methods was SRCNN [13], which has been extended to train multiple coupled networks [14, 15, 16, 17]. Other variants include [18], which uses self-similar patches to explore the self-example based SR idea. Progressive [19] and recursive networks [20] also generate improved results with the help of diversified training data such as NTIRE [12]. These spatial domain mappings were boosted by global and local bypass structures as introduced in residual learning [21]. A key benefit of residual network structures is that they significantly reduce the training burden of the deep CNN, which is still constructed in the spatial domain.

Motivation: A recent trend in deep SR is to map the LR image to the HR image in a transform domain, such as the Discrete Fourier Transform (DFT) or Discrete Wavelet Transform (DWT) [12, 22, 23]. These methods show improved results by exploiting the ability of an image transform to separate coarse and fine details of an image, thereby simplifying the SR task. Specifically, the DWT has been extensively explored for the SR problem in traditional model-based frameworks [24, 25, 26, 27] and more recently also in deep networks [22].

We propose and develop a new adaptable transform domain deep SR method. Our starting point is the image DCT domain, in particular recognizing that the differences between a given LR-HR image pair manifest as change in high-frequency information while they typically share the same low-frequency signature (see analysis in Section III).

The contributions of this paper are as follows:

1. We propose a novel network structure that addresses the SR problem in an image transform domain: the transform as well as its inverse are part of the network, providing an end-to-end SR mapping.

2. We build a new convolutional DCT (CDCT) layer integrating the DCT procedure into the Deep SR network (DCT-DSR); as a key extension we generalize the CDCT to a transform layer allowing its filters to be trainable, so that we can optimize the image transform specifically for the image SR task.

3. We add a pairwise orthogonality constraint on the newly introduced ‘transform layer’ to allow for efficient forward and inverse transform computations. This Orthogonally Regularized Deep SR network (ORDSR) simplifies the SR task by taking advantage of the image transform domain while adapting the design of the transform basis to the training image set.

4. Inspired by the structure of the DCT bases, which exhibit an increase in spatial complexity with index, we enforce a newly formulated complexity order constraint, which encourages the complexity of each learned basis to be close to its DCT counterpart.

5. A key burden of deep SR has been identified as the requirement of generous training LR and HR image pairs; ORDSR shows a much more graceful degradation as training size is reduced, with compelling improvements in the regime of limited training.

To the best of our knowledge, ORDSR is the first approach that allows optimization of basis functions for transform domain image SR within a deep learning framework.

A preliminary version of this work has appeared as a short conference article [28]. This manuscript significantly extends the 4-page conference article: First, the complexity order constraint is introduced in this work for the first time and the network structure is modified for better performance. Second, more analytical descriptions are added to explain the formulation and the training procedure of the new regularized deep network. Third, more comprehensive experiments are reported over the short conference article. This includes detailed comparisons against state-of-the-art deep learning based SR methods and the impact of network configuration on performance, including discussions about DCT-DSR. Fourth, we extend the test image sets from Set5 [29] and Set14 [30] to additionally include BSD100 [31] and Urban100 [32], each containing 100 test images. Fifth, a crucial new investigation is reported w.r.t varying training size(s) and ORDSR shows graceful degradation against a reduction in the number of training LR-HR pairs. Finally, an analysis of memory and computation is included to demonstrate the efficiency of ORDSR against competing alternatives.

This paper is organized as follows: Section II reviews related literature; Section III presents a new DCT domain Deep SR network (DCT-DSR) and extensions to a regularized network that allows the ‘transform layer’ to be trainable (ORDSR). Section IV provides experimental validation on benchmark datasets in both abundant and limited training scenarios. Section V concludes the paper with thoughts for future work.

II Related Work

II-A Single Image Super-Resolution

In the literature on learning-based methods for SR, sparse-coding methods have been shown to be particularly effective [9, 33]. These techniques employ two dictionaries containing example LR and HR images/image patches. The goal is then to represent an LR image (or patch) in terms of its sparse code obtained via an LR dictionary. An HR image is obtained by applying the same sparse code to the dictionary of HR image patches. Several extensions of sparsity based SR have been developed, including [30, 6, 10]. The focus of these methods has been to design/learn more suitable dictionaries and to find the optimal sparse representations of image patches, often by using suitable prior structure on the dictionary/sparse code [34]. In addition to sparse-coding based methods, self-example based methods have demonstrated success by exploring the self-similarity of the patches from the input image itself [8, 35, 36].

II-B SR With Image Transform Domain

When an image is decomposed into its different frequency components by an image transform, it is well acknowledged that the visual gaps which need to be filled between the LR and HR images lie within the high-frequency components of the image [1]. Producing SR results from LR input essentially becomes a problem of recovering the high-frequency components of the image based on the LR input, whose high-frequency details are missing. Transform domain methods can enable an alternate image representation where the SR mapping may be simpler and hence learned easily and accurately. The wavelet transform has been a popular choice [25, 24, 27, 37, 38, 26, 39] for traditional image SR. Recently, [22] developed a CNN to reconstruct wavelet coefficients of the HR image, yielding significant practical improvements.

II-C Deep Learning for Image Super-Resolution

Beyond conventional SR methods, the computational abilities brought on by Graphics Processing Units (GPUs) have recently boosted research on deep CNNs for SR. These methods have quickly become the new state-of-the-art performance standard [21, 16]. Deep learning SR methods can further be divided into two classes: methods that focus on maintaining strong fidelity against the ground truth HR image and those that encourage perceptually motivated and visually attractive results. A key example of the latter is Generative Adversarial Networks (GANs) [40] such as [41], which develops a photo-realistic styled SR method by sampling an HR image patch from an estimated distribution of natural image patches. GAN based methods provide visually pleasing results but pay less attention to maintaining pixel-value fidelity to the original HR data [42], which makes them unsuitable for certain practical settings. Our proposed work is consistent with the majority of the literature [12], where the goal is to recover the HR image and training is based on minimizing the difference between network estimated SR and ground truth HR images.

Specifically for SR, Dong et al. introduced an SR CNN with three layers that outperformed previous sparse-coding based methods by a considerable margin and set the tone of using CNN for SR problems [13, 16]. SRCNN can be viewed as a non-linear mapping function between the input LR image and the target HR image. It takes the input as a whole and uses different filters convolving with the input image to generate different feature representations which later on are convolved with following neural layers for higher level representations. It has shown promising experimental performance and great flexibility with different neural network configurations. Since then, different efforts have been made to boost CNN performance by introducing deeper structures [17], utilizing residual bypasses by adding the input directly to the output of the CNN [21], creating different branches of networks to handle specific features [12], etc.

Fully Connected Networks have shown a considerable improvement in SR performance by combining ideas from sparse-coding [14]. Prior information [43] has shown recent promise in the deep learning framework: [44] uses a face prior to help compose human face SR images, while [45] uses structural feature priors to guide the network towards recovering detailed features.

Another combination of image transform domain and CNN was proposed recently [23]. Li et al. convert an input image into the Fourier domain and feed the DFT coefficients to the CNN. Since convolution of the filters and the image in the spatial domain is equivalent to multiplication of the image's and filters' corresponding Fourier coefficients in the Fourier domain, they claim the operations of a CNN now become element-wise multiplications, which speeds up the training and inference of the network. Experimentally, the performance of this work is not on par with the state of the art. Another limitation is the requirement of pre- and post-processing steps to compute the DFT/IDFT.

Our work seeks to advance deep SR by developing an adaptable transform domain method (which we refer to as DCT-DSR). Analytically, we aim to exploit the full potential of image transforms and hence enable their explicit optimization (learning) via a new network structure and a regularized cost function (we refer to this method as ORDSR). Experimentally, our focus is on efficiency in the network: in the sense of memory, computation and the ability to succeed even in limited training regimes, which are inherent to domains such as radar and medical imaging [46, 47, 48, 49].

III Orthogonally Regularized Deep SR

We first briefly review the DCT, IDCT and the SR problem with DCT. Then, we describe the ORDSR network structure while detailing the training and inference procedures.

This paper uses the following notation: $*$ denotes the convolution operation; $vec(\cdot)$ denotes the vectorization operation, which converts a matrix into a column vector; $<\cdot,\cdot>_{F}$ denotes the real valued Frobenius inner product.

III-A DCT, IDCT and Super-resolution

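An image $\mathbf{x}(n_{1},n_{2})$ of size $H\times W$ can be decomposed into $H/N\times W/N$ blocks of size $N\times N$. (Footnote 1: We assume $H$ and $W$ are multiples of $N$ for simplicity of notation.) For the $(m,n)^{th}$ block, the DCT coefficients are computed as:

$$X_{m,n}(k_{1},k_{2})=\sum_{n_{2}=0}^{N-1}\sum_{n_{1}=0}^{N-1}\mathbf{x}_{m,n}(n_{1},n_{2})\times\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}(n_{1},n_{2})$$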

where $k_{1},k_{2}\in\{0,\ldots,N-1\}$ , and $\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}(n_{1},n_{2})$ is the DCT basis function, specifically DCT-II basis, defined as:

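$$\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}(n_{1},n_{2})=C_{k_{1},k_{2}}\cos\left[\frac{\pi}{N}\left(n_{1}+\frac{1}{2}\right)k_{1}\right]\cos\left[\frac{\pi}{N}\left(n_{2}+\frac{1}{2}\right)k_{2}\right]$$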

where $C_{k_{1},k_{2}}=\frac{\sqrt{2-\delta_{k_{1}}}\sqrt{2-\delta_{k_{2}}}}{N}$ and $\delta_{k}=1$ if $k=0$, $\delta_{k}=0$ otherwise, so that every basis function has unit norm. For $N=8$, there are $8\times 8$ DCT bases and each basis $\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}$ is of size $8\times 8$, as shown in Fig. 1.

Corresponding to the DCT, the inverse DCT (IDCT) for the $(m,n)^{th}$ block is computed as:

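$$\mathbf{x}_{m,n}(n_{1},n_{2})=\sum_{k_{2}=0}^{N-1}\sum_{k_{1}=0}^{N-1}X_{m,n}(k_{1},k_{2})\times\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}(n_{1},n_{2})$$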

Note that classical DCT is typically performed on $N\times N$ blocks of the original image [50].

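Pairwise orthogonality of basis functions. The basis functions $\{\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}\}_{k_{1},k_{2}=0,0}^{N-1,N-1}$, each in $\mathbb{R}^{N\times N}$, are pairwise orthogonal, forming an orthogonal basis family:

$$<\mathbf{w}^{\text{dct}}_{k_{1},k_{2}},\mathbf{w}^{\text{dct}}_{l_{1},l_{2}}>=\begin{cases}1, & \text{if } k_{1}=l_{1} \text{ and } k_{2}=l_{2}\\ 0, & \text{otherwise}\end{cases}$$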

where $<\mathbf{w}^{\text{dct}}_{k_{1},k_{2}},\mathbf{w}^{\text{dct}}_{l_{1},l_{2}}>$ denotes the inner product of two basis functions.
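The orthogonality property can be checked numerically. The following is a minimal NumPy sketch, assuming the standard orthonormal DCT-II scaling ($\sqrt{1/N}$ for $k=0$, $\sqrt{2/N}$ otherwise, per dimension); the function name `dct_basis` is hypothetical and not taken from the paper.

```python
import numpy as np

def dct_basis(N=8):
    """Return B of shape (N, N, N, N), where B[k1, k2] is the N x N basis w_{k1,k2}."""
    n = np.arange(N)
    k = np.arange(N)
    # 1-D orthonormal DCT-II atoms: row k, column n.
    c = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    atoms = c[:, None] * np.cos(np.pi / N * (n[None, :] + 0.5) * k[:, None])
    # Separable 2-D basis: B[k1, k2, n1, n2] = atoms[k1, n1] * atoms[k2, n2].
    return np.einsum("ka,lb->klab", atoms, atoms)

B = dct_basis(8)
flat = B.reshape(64, 64)   # each row is vec(w_{k1,k2})
gram = flat @ flat.T       # all pairwise inner products
print(np.allclose(gram, np.eye(64)))
```

The Gram matrix of the vectorized bases equals the identity, i.e. the inner product is 1 for identical index pairs and 0 otherwise.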

We now develop a reorganization of the DCT coefficients and their computation, which we show in Section III-B helps facilitate the implementation of DCT within a CNN.

Zig-zag reorder. We treat DCT basis functions as filters and reorganize them in a zig-zag order as shown in Fig. 2.

The zig-zag function maps $\{\mathbf{w}^{\text{dct}}_{k_{1},k_{2}}\}_{k_{1},k_{2}=0,0}^{N-1,N-1}$ to $\{\mathbf{w}^{\text{dct}}_{i}\}_{i=1}^{N\times N}$. This reordering is similar to that used in the baseline JPEG compression procedure [51].

Complexity order. After the zig-zag reordering, the complexity of $\mathbf{w}^{\text{dct}}_{i}$ increases with the index $i$; i.e., the lower end (smaller $i$) of $\{\mathbf{w}^{\text{dct}}_{i}\}_{i=1}^{N\times N}$ corresponds to low-frequency filters, while the higher end (larger $i$) represents the high-frequency ones.
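As an illustration, a zig-zag index sequence can be generated by walking the anti-diagonals of the $N\times N$ frequency grid. The sketch below follows the JPEG traversal convention; the function name `zigzag_indices` is hypothetical, and the exact traversal direction in the paper's Fig. 2 may be mirrored relative to this one.

```python
def zigzag_indices(N=8):
    """Return the (k1, k2) frequency pairs of an N x N grid in JPEG-style zig-zag order."""
    order = []
    for s in range(2 * N - 1):  # s = k1 + k2 indexes the anti-diagonal
        rows = range(max(0, s - N + 1), min(s, N - 1) + 1)
        diagonal = [(r, s - r) for r in rows]  # row index increasing
        if s % 2 == 0:
            diagonal.reverse()  # even diagonals are walked with the row index decreasing
        order.extend(diagonal)
    return order

order = zigzag_indices(8)
print(order[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

Because the sum $k_{1}+k_{2}$ never decreases along this sequence, filters with a small index $i$ are the low-frequency ones and filters with a large index are the high-frequency ones.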

Given an HR image and its bicubic enlarged LR version (Footnote 2: An image is downsampled by a factor $c$ to generate the LR version, which is enlarged to its original size using bicubic interpolation.), we can plot the average coefficient values generated by the reordered DCT filters $\{\mathbf{w}^{\text{dct}}_{i}\}_{i=1}^{N\times N}$, as shown in Fig. 2. In the plot in Fig. 2, the difference between the coefficients increases with the (frequency) index. This suggests that the HR image and the LR image share the same low-frequency spectra, while they differ in high-frequency content. Hence, in the DCT domain, SR becomes the problem of recovering the high-frequency DCT coefficients of the HR image from the corresponding LR ones. This insight is explicitly incorporated into the proposed ORDSR network by focusing on reconstructing the high-frequency spectra – see Fig. 3.

III-B Network Structure

Let us denote the bicubic-enlarged LR image as $\mathbf{x}$ ; the enlargement is treated as preprocessing of the actual low-resolution input image. The LR image $\mathbf{x}$ therefore has the same size $W\times H$ as the desired HR image $\mathbf{y}$ . The ORDSR network takes $\mathbf{x}$ and produces a resolution-enhanced version of $\mathbf{x}$ that is as similar as possible to $\mathbf{y}$ . The network’s output is denoted as $\mathbf{\hat{y}}$ . We treat the effect that the network has on the input as a nonlinear function: $F(\mathbf{x})=\mathbf{\hat{y}}$ .

The ORDSR consists of three major operations:

1. DCT cube representation. The input image $\mathbf{x}$ passes through a special layer called the Convolutional DCT (CDCT) layer. The outputs of the CDCT layer are the DCT coefficients of $\mathbf{x}$ , which are referred to as the DCT cube.

2. Non-linear mapping. The DCT cube is fed into a $D$ -layer CNN for detail restoration. The CNN serves as a non-linear mapping function that uses the parameters learned in the training phase to restore the missing high-frequency details of the inputs. In particular, ORDSR adopts a residual bypass structure [52, 21, 22] for faster convergence. The DCT cube is also divided into two parts, which consist of the low-frequency and high-frequency spectra respectively.

3. IDCT reconstruction. The output of the $D$ -layer CNN and the low-frequency parts from the input are appended together to form a DCT cube for the SR image. The SR DCT cube is passed through the CDCT layer again (with the same filters) to reconstruct the SR image by performing transpose convolution (i.e. IDCT).

The overall network structure is shown in Fig. 3. Next we provide a detailed description of each of the three operations mentioned above.

III-B1 DCT Cube Representation

To integrate the DCT analysis within a deep network framework, we construct a convolutional DCT (CDCT) layer.

Initialization. The CDCT layer is initialized using the DCT bases $\{\mathbf{w}^{\text{dct}}_{i}\}_{i=1}^{N\times N}$ . For $N=8$ , there are $64$ filters $\{\mathbf{w}_{i}\}_{i=1}^{64}$ of size $8\times 8$ in the CDCT layer such that the complexity (high-frequency content) increases with the filter index. We set $N=8$ for ORDSR, and from now on we take $64$ as the number of filters in the CDCT layer.

The CDCT layer performs frequency analysis differently than traditional DCT. Unlike classical DCT that produces $8\times 8$ block-wise DCT coefficients, the CDCT layer produces $64$ frequency maps $\{\mathbf{f}_{i}\}_{i=1}^{64}$ for the whole image by convolving $\{\mathbf{w}_{i}\}_{i=1}^{64}$ with the input image $\mathbf{x}$ as in Eq. (5) with a stride of $S$ where $*$ is the convolution operation. Note that this stride size has a significant role in the efficiency of ORDSR as analyzed in Section IV-E1.

$$\mathbf{f}_{i}=\mathbf{w}_{i}*\mathbf{x},\qquad i=1,\dots,64\tag{5}$$

These maps, $\{\mathbf{f}_{i}\}_{i=1}^{64}$ , form a cube called DCT cube. The DCT cube is essentially a reorganized version of classical block-wise DCT coefficients of the whole image as proved in the following Proposition.

Proposition 1

Eq. (5), which performs a convolution of the input image $\mathbf{x}$ with the CDCT layer filters $\mathbf{w}_{i}$ , generates a reorganized version of the DCT coefficients of the image and is equivalent to the DCT.

Proof: See Appendix A.

As $i$ increases, $\mathbf{f}_{i}$ corresponds to higher frequency components of the whole image. Thus, we divide the DCT cube into two parts by a threshold $T$ , namely low-frequency spectral maps $\mathbf{f}_{\text{low}}=\{\mathbf{f}_{i}\}^{T}_{i=1}$ and high-frequency spectral maps $\mathbf{f}_{\text{high}}=\{\mathbf{f}_{i}\}^{64}_{i=T+1}$ , as shown in Fig. 4. Because ORDSR uses an unconventional stride $S=2$ (the standard ORDSR setup; for details see Section IV-A), the computational requirement is reduced – see Section IV-E1 for details.
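The claim of Proposition 1 can be checked numerically in the non-overlapping case $S=N$. The sketch below is a minimal illustration, not the authors' implementation: it assumes the CNN convention in which "convolution" is a sliding inner product (cross-correlation), uses no border padding, and keeps the filters in row-major rather than zig-zag order for brevity. The names `dct_matrix`, `dct_filters` and `cdct_forward` are hypothetical.

```python
import numpy as np

def dct_matrix(N=8):
    """1-D orthonormal DCT-II matrix C, so that C @ block @ C.T is the 2-D DCT."""
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)
    return C

def dct_filters(N=8):
    """N*N basis filters of shape (N*N, N, N), row-major (k1, k2) order."""
    C = dct_matrix(N)
    return np.einsum("ka,lb->klab", C, C).reshape(N * N, N, N)

def cdct_forward(x, filters, stride):
    """Strided 'valid' correlation of image x (H, W) with filters (K, N, N).

    Returns the cube of K frequency maps, shape (K, H_out, W_out).
    """
    K, N, _ = filters.shape
    # All N x N windows of x; keep every `stride`-th window in each direction.
    windows = np.lib.stride_tricks.sliding_window_view(x, (N, N))[::stride, ::stride]
    return np.einsum("hwab,kab->khw", windows, filters)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
cube = cdct_forward(x, dct_filters(8), stride=8)   # shape (64, 2, 2)

# Classical block-wise DCT of the top-left 8 x 8 block, for comparison.
C = dct_matrix(8)
block_dct = C @ x[:8, :8] @ C.T
print(np.allclose(cube[:, 0, 0].reshape(8, 8), block_dct))

# With the ORDSR stride S = 2 the windows overlap and the maps get denser.
print(cdct_forward(rng.standard_normal((40, 40)), dct_filters(8), stride=2).shape)
```

Each spatial location of the cube holds the 64 DCT coefficients of one image window, which is the reorganization the proposition refers to.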

III-B2 Non-linear Mapping

The mapping of LR to HR high-frequency components is accomplished via a CNN consisting of $D$ convolutional layers (see Fig. 3). Each layer performs a similar operation on its input $\mathbf{a}_{l}$ , given by:

$$\mathbf{z}_{l}=\max(\mathbf{W}_{l}*\mathbf{a}_{l}+\mathbf{b}_{l},0)\tag{6}$$

where $\mathbf{z}_{l}$ is the output of the $l^{th}$ layer, $\mathbf{W}_{l}$ and $\mathbf{b}_{l}$ are the weights and bias of the $l^{th}$ layer. $\mathbf{W}_{l}$ is a representative notation of $m_{l}$ filters in $l^{th}$ layer; each has a dimension of $c_{l}\times n_{l}\times n_{l}$ . $\mathbf{b}_{l}$ is an $m_{l}$ dimensional bias vector. As is shown in Eq. (6), the convolutional layer takes the input $\mathbf{a}_{l}$ and applies $m_{l}$ convolutions on the input. This results in $m_{l}$ output representation maps. Then the output is processed by the ReLU operation $\max(\cdot,0)$ [53].

For $l=1$ , Eq. (6) represents the processing of the input layer of the CNN, i.e. $\mathbf{a}_{1}=\{\mathbf{f}_{\text{low}},\mathbf{f}_{\text{high}}\}$ . $\mathbf{W}_{1}$ is a representative notation of $m_{1}$ filters in layer $1$ , where each of the filters has size $64\times n_{1}\times n_{1}$ .

For $l=2,...,D-1$ , Eq. (6) represents processing of the center layers, which takes the output from the previous layer $\mathbf{z}_{l-1}$ as its input $\mathbf{a}_{l}=\mathbf{z}_{l-1}$ . These layers have identical structure. $\mathbf{W}_{l}$ is a representative notation of $m_{l}$ filters in $l^{th}$ layer where each of the filters has the size of $c_{l}\times n_{l}\times n_{l}$ , which are specified in Section IV. Note that for CNNs, the number of the channels of each filter is equal to the number of filters of the previous layer, i.e. $c_{l}=m_{l-1}$ .

[TABLE]

For $l=D$ , Eq. (7) computes the output layer of the CNN, which produces the restored $\mathbf{\hat{f}}_{\text{high}}$ . The output layer $\mathbf{W}_{D}$ is a representative notation of $(64-T)$ filters, where each of the filters has a size of $c_{D}\times n_{D}\times n_{D}$ . The output layer generates $(64-T)$ detail maps. The input $\mathbf{f}_{\text{high}}$ is added to the network output through a residual structure. Our choice of a residual structure is inspired by studies [52, 21, 22] which demonstrate that predicting the difference, or residual, is typically a much simpler task from an optimization standpoint. The $\mathbf{\hat{f}}_{\text{high}}$ serves as the final output of the $D$ -layer CNN.

Collectively, let us denote the parameter sets for the CNN as $(\mathbf{\Theta}^{\text{cnn}},\mathbf{B})$ , where $\mathbf{\Theta}^{\text{cnn}}=\{\mathbf{W}_{l}\}_{l=1}^{D}$ and $\mathbf{B}=\{\mathbf{b}_{l}\}_{l=1}^{D}$ . Then we denote the collective parameter sets of the ORDSR as $(\mathbf{\Theta},\mathbf{B})$ , where $\mathbf{\Theta}=\{\mathbf{\Theta}^{\text{cnn}},\{\mathbf{w}_{i}\}_{i=1}^{64}\}$ , which includes the filters from the CDCT (or transform) layer.

III-B3 IDCT Reconstruction

Based on the restored transform coefficients $\mathbf{\hat{f}}_{\text{high}}$ from the $D$ -layer CNN, we can generate the SR results. First, we append the $\mathbf{\hat{f}}_{\text{high}}$ to the $\mathbf{f}_{\text{low}}$ which are the low-frequency components of the input LR image as defined in Section III-B1. This generates an SR DCT cube $\mathbf{f}_{\text{SR}}=\{\mathbf{f}_{\text{low}},\mathbf{\hat{f}}_{\text{high}}\}$ with $64$ spectra.

By transpose convolving the CDCT layer filters $\{\mathbf{w}_{i}\}_{i=1}^{64}$ with the SR DCT cube $\mathbf{f}_{\text{SR}}$ , the network output $\hat{\mathbf{y}}$ is generated (some literature [54, 55] refers to this procedure as deconvolution, fractionally strided convolution or backward convolution in neural network setups). This procedure can be viewed as a convolution of $\mathbf{w}_{i}$ with an $S$ zero-padded $\mathbf{f}_{i}\in\mathbf{f}_{\text{SR}}$ :

[TABLE]

where $*$ is the convolution operation and $g_{s}(\cdot)$ is an $S$ zero-padding function detailed in the supplementary document [56] (recall that $S$ is the stride used in the DCT cube calculation). Note that, combined with the zero-padding function, the convolution between $\mathbf{w}_{i}$ and $g_{s}(\mathbf{f}_{i})$ can be viewed as a transposed convolution between $\mathbf{w}_{i}$ and $\mathbf{f}_{i}$ .

Proposition 2

Eq. (8) with $\mathbf{f}_{\text{SR}}$ as input produces a spatial image, which is equivalent to the IDCT.

Proof: See Appendix B.

To summarize Sections III-B1 and III-B3, the CDCT layer can produce a DCT cube from an input image by performing convolution. Conversely, an image can be generated from a DCT cube by performing transpose convolution, which essentially is the IDCT. As shown in Fig. 3, the CDCT layer constructs a bridge between the image transform domain and the image spatial domain.
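The forward/inverse pairing can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: orthonormal DCT-II filters in row-major order, no border padding, and hypothetical helper names (`dct_filters`, `cdct_forward`, `cdct_transpose`). It shows exact reconstruction for the non-overlapping stride $S=N$, and it also shows that for $S<N$ a plain transposed convolution sums the overlapping windows, so interior pixels come back scaled by the overlap count $(N/S)^2$. How the authors account for this scaling is not stated in this section.

```python
import numpy as np

def dct_filters(N=8):
    """N*N orthonormal 2-D DCT-II basis filters, shape (N*N, N, N)."""
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)
    return np.einsum("ka,lb->klab", C, C).reshape(N * N, N, N)

def cdct_forward(x, filters, stride):
    """Strided 'valid' correlation: image (H, W) -> cube (K, H_out, W_out)."""
    N = filters.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, (N, N))[::stride, ::stride]
    return np.einsum("hwab,kab->khw", windows, filters)

def cdct_transpose(cube, filters, stride):
    """Transposed convolution: cube (K, H_out, W_out) -> image."""
    K, N, _ = filters.shape
    _, h_out, w_out = cube.shape
    y = np.zeros(((h_out - 1) * stride + N, (w_out - 1) * stride + N))
    for i in range(h_out):
        for j in range(w_out):
            # Window synthesized from the K coefficients stored at location (i, j).
            patch = np.einsum("k,kab->ab", cube[:, i, j], filters)
            y[i * stride:i * stride + N, j * stride:j * stride + N] += patch
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
W = dct_filters(8)

# Non-overlapping case (S = N): exact block-wise IDCT.
y8 = cdct_transpose(cdct_forward(x, W, 8), W, 8)
print(np.allclose(y8, x))

# Overlapping case (S = 2): interior pixels are covered by (8 / 2)**2 = 16 windows.
y2 = cdct_transpose(cdct_forward(x, W, 2), W, 2)
print(np.allclose(y2[6:10, 6:10], 16 * x[6:10, 6:10]))
```

The exact reconstruction relies on the filters being orthonormal, which is the property the constraints of the next Section are meant to protect once the filters become trainable.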

Beyond enabling SR in the DCT domain, we show next that the basis filters of the CDCT layer can be trainable, i.e. optimizable. This opens a door towards finding customized and data-adaptive basis filters for the SR task. The optimization of the CDCT/transform layer filters must however be constrained to yield improved results; this is detailed in the next Section. (After training, the filters in the CDCT layer are newly learned filters that perform a forward and inverse transform which is data-adaptive and no longer the DCT. For ease of exposition, we continue to refer to this layer as the CDCT layer and to its output as the DCT cube. The terms ‘transform layer’ and ‘CDCT layer’ are interchangeable in this paper, and the context makes it clear whether the said transform is the DCT or is based on optimized filters/basis functions.)

III-C Desired Transform Constraints

While transform domain mappings can enhance SR, an image transform (viz. the proposed CDCT layer) must obey certain properties. We pose two key constraints:

1. a pairwise orthogonality constraint on filters/basis functions of the CDCT layer to guarantee reconstruction via the transpose convolution based inverse, and

2. preservation of the complexity of the basis in terms of its order.

Orthogonality constraint. The aforementioned CDCT layer can, in fact, be learned and adapted to a given training image dataset. Pairwise orthogonality constraints can be captured by a regularization term given by

[TABLE]

where $i,j\in\{1,..,64\}$ and $vec(\cdot)$ is the vectorization operation which converts the matrix into a column vector.

This term is added to the network’s total cost function – see Eq. (12). As suggested in Eq. (4), any pair of distinct filters in the CDCT layer should ideally have an inner product that evaluates to zero.
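A pairwise orthogonality penalty of this kind can be sketched as below. The squared inner product used here is an assumption made for illustration; the exact form of the regularizer is the one given in Eq. (9), and the function name `orthogonality_penalty` is hypothetical.

```python
import numpy as np

def orthogonality_penalty(filters):
    """Sum over distinct filter pairs (i < j) of <vec(w_i), vec(w_j)>**2.

    ASSUMPTION: the squared inner product is an illustrative choice,
    not necessarily the form used in Eq. (9).
    """
    W = filters.reshape(filters.shape[0], -1)    # row i is vec(w_i)
    gram = W @ W.T                               # gram[i, j] = <vec(w_i), vec(w_j)>
    off_diag = gram[np.triu_indices_from(gram, k=1)]
    return float(np.sum(off_diag ** 2))

# Two 1 x 2 filters with inner product 1, so the penalty is 1.
toy = np.array([[[1.0, 0.0]], [[1.0, 1.0]]])
print(orthogonality_penalty(toy))

# An orthonormal family (rows of an orthogonal matrix) has zero penalty.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
print(orthogonality_penalty(Q.reshape(64, 8, 8)))
```

The penalty is zero exactly when all off-diagonal entries of the Gram matrix vanish, which is the condition of Eq. (4) applied to the learned filters.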

Complexity order constraint. Because we are essentially designing a frequency domain mapping, it is desirable to preserve the order of complexity of the DCT basis. To enforce this, we introduce a new regularization term:

[TABLE]

where $t\in\{1,...,64\}$ , $\mathbf{w}_{t}$ are the filters in CDCT/transform layer and $\mathbf{w}_{t}^{\text{dct}}$ is the corresponding DCT basis function/filter (as defined in Section III-B1). The variance of a filter $\mathbf{w}\in\mathbb{R}^{N\times N}$ is given by Bessel’s correction version [57]:

[TABLE]

where $N=8$ , $\mathbf{w}^{m}$ and $\mathbf{w}^{n}$ denote an arbitrary scalar entry in filter $\mathbf{w}$ . $\sum_{m}\mathbf{w}^{m}$ and $\sum_{n}\mathbf{w}^{n}$ denote the summation of all the elements inside $\mathbf{w}$ . That is, we encourage the variance of the optimized filters to be close to that of their DCT counterparts.
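The variance with Bessel's correction divides the sum of squared deviations by $N^{2}-1$ rather than $N^{2}$. A minimal sketch follows; the absolute-difference penalty is an assumption made for illustration (the exact form is the one in Eq. (10)), and the function names are hypothetical.

```python
import numpy as np

def filter_variance(w):
    """Sample variance of all entries of filter w, with Bessel's correction."""
    w = np.asarray(w, dtype=float)
    n = w.size                      # N * N entries for an N x N filter
    mean = w.sum() / n
    return float(((w - mean) ** 2).sum() / (n - 1))

def complexity_penalty(filters, reference):
    """Sum over t of |var(w_t) - var(w_t^dct)|.

    ASSUMPTION: the absolute difference is an illustrative choice,
    not necessarily the form used in Eq. (10).
    """
    return float(sum(abs(filter_variance(w) - filter_variance(r))
                     for w, r in zip(filters, reference)))

rng = np.random.default_rng(0)
reference = rng.standard_normal((64, 8, 8))   # stand-in for the DCT filters
learned = reference + 0.1 * rng.standard_normal((64, 8, 8))

print(np.isclose(filter_variance(reference[0]), np.var(reference[0], ddof=1)))
print(complexity_penalty(reference, reference))   # identical filters: no penalty
print(complexity_penalty(learned, reference) > 0)
```

`np.var(w, ddof=1)` computes the same quantity directly; the explicit version is shown only to mirror the definition.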

III-D Training and Inference: Regularized Optimization

To train ORDSR we minimize a cost function that captures the functionality of the network while maintaining the properties that the CDCT layer needs to satisfy. The SR inference procedure is then described step by step.

III-D1 ORDSR Training

The ORDSR network is trained by minimizing the following regularized loss function:

[TABLE]

The cost function has four parts: Mean Square Error (MSE) loss, weight decay, orthogonality constraint and complexity order constraint. In these cost terms, MSE loss captures the similarity between the SR results $F(\mathbf{x})$ and the ground truth $\mathbf{y}$ . Weight decay constraints are leveraged from the literature to prevent over-fitting [58]. $\mathbf{W}_{l_{m}}\in\mathbf{\Theta}^{\text{cnn}}$ is the $m$ -th weight of the CNN layer $l$ where there are $m_{l}$ filters in total. $\sum_{l}(\cdot)$ applies the weight decay term to each of the weights of the CNN and sums them together.

Positive trade-off parameters $\gamma$ and $\lambda$ control the balance between the constraints and other cost terms. $\sum_{(i,j)}(\cdot)$ applies the orthogonality constraint to every distinct filter pair in the CDCT/transform layer and then sums the results. Similarly, $\sum_{t}(\cdot)$ applies the complexity order constraint to each pair of optimized and reference (DCT) filters and sums the total.

The ORDSR is trained by using a back-propagation procedure that minimizes:

[TABLE]

Specifically, Eq. (13) is minimized using a stochastic gradient descent method [59]. At iteration $t$ , the CNN and the CDCT layer are updated as: $\mathbf{\Theta}^{t+1}=\mathbf{\Theta}^{t}-\eta\nabla_{\mathbf{\Theta}}\mathbf{L}$ , where $\eta$ denotes the learning rate. As $\mathbf{\Theta}=\{\mathbf{\Theta}^{\text{cnn}},\{\mathbf{w}_{i}\}_{i=1}^{64}\}$ , and $\mathbf{\Theta}^{\text{cnn}}=\{\mathbf{W}_{l}\}_{l=1}^{D}$ , the following gradients are to be computed (the update rules and the gradients for the bias terms are similar and are included in the supplementary document [56]):

[TABLE]

where $\mathbf{W}_{l}$ denotes one of the filters at $l^{th}$ layer of the CNN, representatively, and $\mathbf{w}_{i}$ denotes the $i^{th}$ filter in the CDCT layer. The equation for computing the gradient of an arbitrary entry within filter $\mathbf{W}_{l}$ in layer $l\in\{1,...,D\}$ is given by:

[TABLE]

where $\mathbf{W}^{a}_{l}$ denotes an arbitrary scalar entry within the representative filter $\mathbf{W}_{l}$ , and $<\cdot,\cdot>_{F}$ denotes the real-valued Frobenius inner product (for two real-valued matrices $\mathbf{A}$ and $\mathbf{B}$ with the same dimensions, $<\mathbf{A},\mathbf{B}>_{F}:=\sum_{i,j}A_{i,j}B_{i,j}$ , where $i,j$ index the entries). In Eq. (14), $\frac{\partial\mathbf{y}}{\partial\mathbf{W}^{a}_{l}}$ is computed by following the standard backpropagation rule for each layer $l$ [59]. For the CDCT filter $\mathbf{w}_{i}$ , the gradient w.r.t. an arbitrary scalar entry $\mathbf{w}_{i}^{a}$ is given by:

[TABLE]

where $\frac{\partial\mathbf{y}}{\partial\mathbf{w}^{a}_{l}}$ is computed following the standard backpropagation rule. $\frac{\partial var(\mathbf{w}_{i})}{\partial\mathbf{w}_{i}^{a}}$ is the partial derivative of $var(\mathbf{w}_{i})$ w.r.t $\mathbf{w}^{a}_{i}$ given by:

[TABLE]

where $\mathbf{w}_{i}^{a}$ , $\mathbf{w}_{i}^{m}$ , and $\mathbf{w}_{i}^{n}$ denote an arbitrary scalar entry in CDCT filter $\mathbf{w}_{i}$ . $\sum_{a}\mathbf{w}_{i}^{a}$ , $\sum_{m}\mathbf{w}_{i}^{m}$ , and $\sum_{n}\mathbf{w}_{i}^{n}$ denote the summation of all the elements inside $\mathbf{w}_{i}$ . Detailed notations and derivations of Eq. (14), (15), and (16) can be found in the supplementary document [56].

The CDCT layer is initialized with the DCT filters as described in Section III-B1 and the $D$ -layer CNN is initialized using the Xavier method [60]. We use the well-known Adam stochastic gradient descent optimizer [58] during the training procedure. We adopt gradient clipping and a step-wise learning rate decay for faster training. The specific choice of numerical optimization parameters is provided in Section IV-B.

III-D2 DCT-DSR Training

Note that, without optimizing the CDCT layer filters ( $\mathbf{w}_{i}\notin\mathbf{\Theta}$ ), the ORDSR is simplified to a baseline residual network performing SR in the DCT domain using a fixed CDCT layer. We call this network DCT-Deep SR (DCT-DSR). The DCT-DSR is trained by minimizing the following regularized loss function:

[TABLE]

Experiments in Section IV demonstrate the effectiveness of using the DCT transform domain for image SR. Moreover, they further emphasize that optimizing the transform layer basis functions, with the CDCT layer coefficients being learnable, can significantly improve image SR performance.

III-D3 Inference

Fig. 3 shows the inference procedure of the ORDSR network with $N=8$ . For an input LR image $\mathbf{x}$ , the goal of ORDSR is to generate its SR version $\mathbf{\hat{y}}$ as follows:

1. The input LR image $\mathbf{x}$ is convolved with the CDCT layer, producing a DCT cube $\{\mathbf{f}_{i}\}_{i=1}^{64}$ as in Eq. (5).

2. The DCT cube of $\mathbf{x}$ is divided into $\mathbf{f}_{\text{low}}$ and $\mathbf{f}_{\text{high}}$ corresponding to low and high-frequency spectra using a threshold $T$ . The exact separation process is described in Section III-B2.

3. A $D$ -layer CNN takes the DCT cube $\{\mathbf{f}_{\text{low}},\mathbf{f}_{\text{high}}\}$ as input and recovers the missing high-frequency information using a residual network structure, generating $\hat{\mathbf{f}}_{\text{high}}$ .

4. The $\hat{\mathbf{f}}_{\text{high}}$ is appended to $\mathbf{f}_{\text{low}}$ forming the SR-DCT cube $\mathbf{f}_{\text{SR}}$ . As the $\mathbf{f}_{\text{low}}$ is unchanged between $\mathbf{x}$ and its corresponding HR image, only $\mathbf{f}_{\text{high}}$ needs to be modified for generating $\mathbf{\hat{y}}$ .

5. The SR-DCT cube $\mathbf{f}_{\text{SR}}$ is transpose convolved with the filters in the CDCT/transform layer (to perform the IDCT/inverse transform) generating $\mathbf{\hat{y}}$ .

In Step 1, the CDCT layer uses an unconventional stride $S=2$ , which reduces the spatial size of the feature maps by a factor of $4$ . This gives ORDSR a substantial advantage in inference speed and memory requirements compared to most state-of-the-art methods that operate in the spatial domain. Steps 1 and 5 are performed in the image spatial domain while Steps 2-4 are in the image transform domain, where the CDCT layer serves as a ‘bridge’ between the two image domains by performing DCT/IDCT.

IV Experiments

IV-A Training and Test Data

The widely used 291 images dataset [61] is used for training. The images are augmented using three methods:

1. Rotating the images by $\{45^{\circ}$ , $90^{\circ}$ , $135^{\circ}$ , $180^{\circ}$ , $225^{\circ}$ , $270^{\circ}$ , $315^{\circ}\}$ ;

2. Horizontal and vertical flip;

3. Scaling by factors of $\{0.7,0.8,0.9\}$ .

The augmented images are treated as HR images and are then down-sampled by a factor of $c$ . The down-sampled images are then enlarged using bicubic interpolation by the same factor $c$ to form the LR training images. Note that the HR image is cropped so that its width and height are multiples of $c$ . All the LR/HR images are further cropped into $40\times 40$ pixel sub-images with 10 pixels of overlap for training. During the test phase, several standard data sets are used. Specifically, Set5 [29], Set14 [30], BSD100 [31] and Urban100 [32] are used to evaluate ORDSR (test code and networks are available at http://signal.ee.psu.edu/ORDSR.html; detailed training schemes are included in the supplementary document). The metrics used for image quality assessment are PSNR, SSIM [62] and Information Fidelity Criterion (IFC) [63]. Note that while a few published methods work with larger datasets such as DIV2K [12], ImageNet [64], or MS-COCO [65], our choice of the 291 images dataset [61] is for consistency and fairness of comparison against a large body of competing methods that all employ this dataset.
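The cropping steps can be sketched as follows. The sketch assumes that "10 pixels of overlap" means neighbouring $40\times 40$ sub-images are taken on a grid with a step of $30$ pixels; the down-sampling and bicubic enlargement are omitted, and the function names are hypothetical.

```python
import numpy as np

def crop_to_multiple(img, c):
    """Crop an image so that its height and width are multiples of c."""
    h, w = img.shape[:2]
    return img[: h - h % c, : w - w % c]

def extract_patches(img, size=40, overlap=10):
    """Cut an image into size x size sub-images on a regular grid.

    ASSUMPTION: neighbouring sub-images share `overlap` pixels,
    i.e. the grid step is size - overlap.
    """
    step = size - overlap
    h, w = img.shape[:2]
    return np.stack([img[r:r + size, col:col + size]
                     for r in range(0, h - size + 1, step)
                     for col in range(0, w - size + 1, step)])

hr = np.random.default_rng(0).random((101, 130))    # stand-in luminance image
hr = crop_to_multiple(hr, c=3)                      # 99 x 129 for scale factor 3
patches = extract_patches(hr)
print(hr.shape, patches.shape)
```

The same grid would be applied to the LR image so that LR/HR sub-images stay aligned.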

Both training and test phases of ORDSR and DCT-DSR only utilize the luminance channel information of the input images to be consistent with literature [9, 66, 21]. Chrominance channels Cb and Cr are directly enlarged by bicubic interpolation from LR images. These enlarged chrominance channels are combined with SR luminance channel to produce color SR results. Both training and test are conducted on an NVIDIA Titan X GPU (12GB) with the Tensorflow package [67].

IV-B Network Setup

In the training phase, the momentum and gradient clip are set to $0.9$ and $0.5$ respectively. The learning rate is initialized to $10^{-4}$ and updated every 30 epochs with a $25\%$ decrease. The network is first initialized using the non-learnable DCT bases and random Xavier [60] initialization for the CDCT layer and CNN layers, respectively. This forms the DCT-DSR network, which is trained for $80$ epochs, only optimizing the CNN layers. Then the CDCT layer is included in the trainable parameter set, i.e. $\mathbf{w}_{i}\in\mathbf{\Theta}$ , and the Orthogonality and Complexity Order constraints are enforced, forming ORDSR. ORDSR is then trained for $80$ epochs. The stride $S$ is set to $2$ to eliminate block effects as well as to reduce the memory and computational requirements (see Section IV-E1). Unless stated otherwise, the standard configuration of ORDSR is as follows: $\gamma=3.5$ , $\lambda=0.75$ , $D=15$ , $T=4$ (for $c=3$ ; for other scale factors see Section IV-C1), $m_{i}=64$ where $i\in\{1,...,14\}$ and $n_{1}=5$ , $n_{i}=3$ , where $i\in\{2,...,15\}$ . All hyper-parameters are determined using cross-validation. During training, $128$ training patch pairs of size $40\times 40$ are randomly extracted for each batch.
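The learning rate schedule can be written as a one-line function. The sketch assumes that "a $25\%$ decrease" means multiplying the current rate by $0.75$ every 30 epochs; the function name `learning_rate` is hypothetical.

```python
import math

def learning_rate(epoch, base_lr=1e-4, decay=0.25, step=30):
    """Step schedule: the rate drops by `decay` (relative) every `step` epochs.

    ASSUMPTION: the decrease is multiplicative, i.e. lr <- lr * (1 - decay).
    """
    return base_lr * (1.0 - decay) ** (epoch // step)

# Rates at the start of each stage of an 80-epoch run.
for epoch in (0, 30, 60):
    print(epoch, learning_rate(epoch))
```

Under this reading an 80-epoch run uses three distinct rates.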

IV-C Impact of ORDSR Network Parameters

IV-C1 Threshold T on DCT Cube

This threshold separates the DCT cube into two parts as described in Section III-A, and ORDSR focuses on restoring the high-frequency details $\hat{\mathbf{f}}_{\text{high}}$ . Fig. 6 shows the effect of varying $T$ on the PSNR of the SR results. A small $T$ implies that a smaller fraction of the DCT cube ( $\mathbf{f}_{\text{low}}$ ) is directly copied to the SR-DCT cube. Setting $T=0$ means that ORDSR exploits and maps all the frequency component maps. However, Fig. 6 reveals that, for $T<5$ , decreasing the threshold does not, for all practical purposes, affect SR image quality. This confirms that the low-frequency spectra of the LR and HR images are indeed shared. Further, for a scale factor of $c=2$ , $T$ is found to be $5$ and for $c=4$ we select $T=3$ . For a smaller scale factor, more low-frequency coefficients are preserved during the downsampling, hence a bigger $T$ is suitable.

IV-C2 Number of filters and filter size in the CNN

In ORDSR, filters fall into two categories: the CDCT layer filters and, collectively, $\mathbf{\Theta}^{\text{cnn}}$ of the CNN. For the CDCT layer, the size of the filters is predefined by the DCT basis. In this study, the DCT basis used has a filter size of $8\times 8$ . The number of filters is likewise determined by the filter size, i.e. $8\times 8=64$ .

As has been shown in the past in many CNN based SR methods [66, 17], the filter size and the number of filters influence the performance of the CNN. In ORDSR, the CNN uses a residual bypass structure as in [22, 21]. Though identical layer setups have shown effectiveness [12], some structural changes are necessary for ORDSR. From Section III-B2, the output layer of the CNN always has $m_{D}=(64-T)$ filters since it needs to preserve the number of spectra. Apart from this fixed parameter, Table I reports some configurations and the corresponding results of ORDSR obtained by changing the number and the size of the filters in the CNN.

As is apparent from Table I, ORDSR generally benefits from an increase in the number of filters. For the filter size, Table I shows that ORDSR benefits from the first layer having slightly bigger filters. This indicates that increasing the input receptive field can help the CNN generate a better representation of the input for later use. Also, smaller filters for layers in the center of the CNN produce more favorable results.

IV-C3 Number of Layers in the CNN

Going deeper is a tempting thing to do. For many a problem domain though, diminishing returns have been reported before with an increase in the number of layers [58]. For ORDSR, we observe a similar trend beyond $D=15$ . Figure 7 reveals that ORDSR can outperform VDSR [21] with $D=15$ layers.

One advantage of not going too deep is the benefit from a memory and computational standpoint. ORDSR’s merits in this regard are elaborated upon in Section IV-E1. It is also worth noting that, at $D=20$ , the $D$ -layer CNN in ORDSR and VDSR have a similar structure, and ORDSR still outperforms VDSR thanks to domain-inspired regularization.

IV-D Ablation Study: Benefits of Orthogonality and Complexity Order Constraints

To fully investigate the effects of the proposed constraints, we now introduce different variants of our proposed method. Table II illustrates the 5 different versions of our method and covers the cases of whether the transform/CDCT layer is learned or fixed and which constraints (if any) are active. DCT-DSR is the precursor to ORDSR where the CDCT layer is fixed with DCT filters and hence non-trainable. For DCT-DSR, the same network setup/parameters are used as for ORDSR except that $\mathbf{w}_{i}\notin\mathbf{\Theta}$ . Hence, for DCT-DSR, only the $D$ -layer CNN in Fig. 3 is learned. Besides the variants in Table II, we also add a method where the CDCT layer is randomly initialized before optimization (as opposed to being initialized with DCT basis filters) and still learned with both constraints in place. We name it ORDSR-RI (Orthogonally Regularized Deep SR with Random Initialization). Fig. 5 visualizes the filters that comprise the CDCT/transform layer in DCT-DSR, ORDSR, and ORDSR-RI. Note that the filters in ORDSR-RI are less interpretable than those in DCT-DSR and ORDSR as the transform layer is initialized randomly.

Table III shows the performance of the aforementioned networks. Performance is evaluated both in abundant (using $100\%$ of the training data) and limited training scenarios (using $10\%$ of the training data). As is shown in Table III, in both training scenarios, ORDSR gives the best performance followed by DSR-OC. Comparing the performance between DSR-OC, -CC, and -UC, the orthogonality constraints have the strongest influence while the complexity constraints help to boost the performance in the limited training scenarios.

Comparing DCT-DSR with ORDSR results, we can observe that making the CDCT layer learnable while enforcing domain specific constraints is critical to performance improvement. Comparing ORDSR-RI with other variants, it is clear that DCT based initialization combined with enforcing one or more constraints outperforms ORDSR-RI. Meanwhile, ORDSR-RI is better than the variant where no constraints are used when optimizing this layer (DSR-UC) – underscoring the value of using powerful domain specific constraints.

IV-E Comparison Against State-of-the-Art SR Methods

In this Section, we compare ORDSR with representative state-of-the-art methods: both sparse-coding and deep learning based methods. Our experiments are partitioned into two scenarios – abundant and limited training. In these tests, ORDSR outperforms the state-of-the-art methods and the gains are particularly pronounced when training is limited. We also demonstrate the efficiency of the ORDSR by analyzing the network size and memory requirements.

We select well known methods from model-based, sparse-coding and recently developed deep learning based methods:

1. ScSR [9]: the most representative sparse-coding based SR method. ScSR constructs LR/HR image patch dictionaries with a shared sparse code representation of a given LR/HR image pair.

2. A+ [68]: a revised version of anchored neighborhood regression SR [72].

3. SelfEx [32]: a model-based method that exploits the self-similarity within the image itself.

4. SCN [69]: CNN based method with a sparse prior.

5. SRCNN [13]: the most widely used CNN based SR method (the CNN consists of 3 layers).

6. FSRCNN [17]: an enhanced version of SRCNN with a deeper structure and a transpose convolution layer.

7. VDSR [21]: CNN that utilizes a residual structure with a network depth of $20$ layers.

8. DWSR [22]: CNN that utilizes a residual structure in the DWT domain.

9. RDN [70]: residual dense network that extracts abundant local features.

10. EDSR [71]: the winning entry of the NTIRE contest held at CVPR 2017 [12], which utilizes $32$ residual blocks and output branches to handle different scale factors. Each residual block contains $2$ convolutional layers and each convolutional layer has $256$ filters.

11. DCT-DSR: as described in Section IV-D.

As is standard practice [16], to create LR test images the known HR images are down-sampled and the inputs to the network are created using bicubic interpolation (except for EDSR [71], which uses a deconvolutional layer at the end to enlarge the image to the desired size). For fairness, the comparison is focused on end-to-end deep SR models, with all the deep learning models being (re)trained with the 291 image training dataset described in Section IV-A. The $D$ -layer CNN employed in ORDSR has a residual network structure, which is extendable to progressive and recursive models [19, 20, 73]. This is beyond the scope of this paper and a topic for future study.

IV-E1 Network Size and Memory Requirements

Using the CDCT layer with an unconventional stride gives ORDSR a substantial advantage: faster training and testing with lower memory requirements. A typical test image lena.bmp of size $512\times 512$ , format float32, takes 257KB of disk space. Feeding the test image through VDSR, each layer produces $64$ activation maps and each feature map has the same size as the input image. Assuming that a perfect memory release/recycle mechanism is in place (i.e., at any given time, only one layer’s activation maps are stored), VDSR requires a minimum memory of: $\text{\it 257KB}\times 64\approx\text{\it 16MB}$ .

Feeding the same test image through ORDSR’s CDCT layer using a stride of $S=2$ (results for strides $S=3,4,5,8$ can be found in the supplementary document [56]), as shown in Fig. 4, reduces the input image width and height both by a factor of $2$ . At any given time, ORDSR requires a minimum memory of $\text{\it 257KB}/2/2\times 64\approx\text{\it 4MB}$ , which is around a quarter of what VDSR requires. In other words, ORDSR needs about a quarter of the memory that VDSR needs for activation maps during inference. For a typical mobile camera image, which usually takes 5MB to 10MB (a standard smart-phone photo takes about 8MB), using ORDSR can save 240MB to 480MB.
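The memory estimates above follow from a single product, reproduced here as a small worked example. It uses the same simplifying assumption as the text (only one layer's $64$ activation maps are held at a time); the function name `activation_memory` is hypothetical.

```python
def activation_memory(image_size, n_maps=64, stride=1):
    """Memory for one layer's activation maps, in the unit of image_size.

    A stride of S shrinks width and height by S each, i.e. the area by S**2.
    """
    return image_size * n_maps / stride ** 2

# 512 x 512 test image stored in 257 KB.
vdsr_kb = activation_memory(257)               # 16448 KB, about 16 MB
ordsr_kb = activation_memory(257, stride=2)    # 4112 KB, about 4 MB
print(vdsr_kb, ordsr_kb)

# Savings for 5 MB and 10 MB mobile camera images, in MB.
for size_mb in (5, 10):
    saving = activation_memory(size_mb) - activation_memory(size_mb, stride=2)
    print(size_mb, saving)
```

The savings of 240MB and 480MB quoted in the text correspond to the two loop iterations.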

VDSR as reported in [21] uses two $3\times 3\times 64$ and eighteen $3\times 3\times 64\times 64$ convolutional layers. EDSR [71] has $32$ residual blocks where each has $2$ convolutional layers with $256$ filters in each layer. On the other hand, ORDSR in its most common realization uses: one $8\times 8\times 64$ CDCT layer, one $5\times 5\times 64\times 64$ , thirteen $3\times 3\times 64\times 64$ , and one $3\times 3\times 64\times 60$ convolutional layers, which combine to produce about $44\text{\it K}$ fewer parameters than VDSR, and about $\mathbf{95\%}$ fewer parameters than EDSR (for the detailed computation, see the supplementary document [56]). During training, EDSR generates network snapshots of about $160\text{\it MB}$ , while ORDSR only uses $7\text{\it MB}$ with the Tensorflow [67] API. As shown in Fig. 9, ORDSR achieves better performance than the other deep learning based methods while using fewer parameters.
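The gap of about $44\text{\it K}$ parameters can be reproduced directly from the layer shapes listed above. The count below covers filter weights only; leaving out the bias terms is an assumption of this sketch, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases excluded)."""
    return kernel * kernel * c_in * c_out

# VDSR: input layer (1 -> 64), eighteen 64 -> 64 layers, output layer (64 -> 1).
vdsr = conv_params(3, 1, 64) + 18 * conv_params(3, 64, 64) + conv_params(3, 64, 1)

# ORDSR: 8x8 CDCT layer (1 -> 64), one 5x5 layer, thirteen 3x3 layers,
# and a 3x3 output layer with 64 - T = 60 filters.
ordsr = (conv_params(8, 1, 64) + conv_params(5, 64, 64)
         + 13 * conv_params(3, 64, 64) + conv_params(3, 64, 60))

print(vdsr, ordsr, vdsr - ordsr)   # 664704 620288 44416
```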

Since ORDSR produces smaller activation maps and has a network with fewer parameters, ORDSR requires less computation and trains faster than VDSR. Using the computational resources mentioned in Section IV-A, VDSR takes $0.12sec$ to train on one batch (each batch is of size $128\times 40\times 40\times 1$ for both VDSR and ORDSR), while ORDSR takes $0.043sec$ per batch, which is about $2.7$ times faster.

IV-E2 Evaluation with Abundant Training

With $100\%$ of the training images used for training, Fig. 8 and Fig. 10 display example test images in detail. Note that in Fig. 8, deep learning based methods generate better results than sparse-coding based methods. The enlarged parts show the antennae of the monarch. ORDSR produces more defined edges and a smoother background around the antennae than competing methods. Tables IV, V and VI report the PSNR, SSIM and IFC results of ORDSR and other methods, respectively. Out of the $10$ methods reported in Tables IV, V and VI, VDSR, EDSR, DCT-DSR, and ORDSR produce better results than the rest. Note that DCT-DSR can produce results comparable to EDSR and better than VDSR by utilizing the DCT image transform domain. ORDSR further improves the performance by optimizing the transform basis. Overall, ORDSR produces the best results while using $44\text{\it K}$ fewer parameters than VDSR and only $\mathbf{5\%}$ of the parameters of EDSR.

We adopt the geometric self-ensemble strategy (similar to [74]) to enhance the SR results. At test time, the input image $\mathbf{x}$ is flipped and rotated to generate $8$ augmented versions $\mathbf{x}_{i}=T_{i}(\mathbf{x})$ , where $T_{i}$ is one of $8$ transformations including the identity (the transformations combine a vertical flip with $\{90^{\circ},180^{\circ},270^{\circ}\}$ rotations; together with the identity, this gives $8$ versions of the input image). The corresponding SR outputs $\{\hat{\mathbf{y}}_{i}\}_{i=1}^{8}$ of ORDSR are then flipped and rotated back using the inverse transformations $T_{i}^{-1}(\cdot)$ . The final SR result is computed as $\hat{\mathbf{y}}=\frac{1}{8}\sum_{i=1}^{8}T_{i}^{-1}(\hat{\mathbf{y}}_{i})$ . We mark results obtained with this method as ORDSR+. As shown in Tables IV, V and VI, this augmentation strategy mildly improves the SR results.
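The self-ensemble procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `model` stands for any callable that maps a 2D LR array to a 2D SR array, and the function names `dihedral_transforms` and `self_ensemble` are hypothetical.

```python
import numpy as np

def dihedral_transforms():
    """Return 8 (forward, inverse) pairs: optional vertical flip, then a
    rotation by 0, 90, 180, or 270 degrees. The first pair is the identity."""
    pairs = []
    for flip in (False, True):
        for k in range(4):
            def fwd(x, flip=flip, k=k):
                y = np.flipud(x) if flip else x
                return np.rot90(y, k)

            def inv(y, flip=flip, k=k):
                # Undo the rotation first, then the flip.
                x = np.rot90(y, -k)
                return np.flipud(x) if flip else x

            pairs.append((fwd, inv))
    return pairs

def self_ensemble(model, x):
    """Average the back-transformed outputs of `model` over the 8 views."""
    outputs = [inv(model(fwd(x))) for fwd, inv in dihedral_transforms()]
    return np.mean(outputs, axis=0)

if __name__ == "__main__":
    # Toy stand-in for an SR network: 2x nearest-neighbour upscaling.
    toy_model = lambda im: np.kron(im, np.ones((2, 2)))
    lr = np.arange(12, dtype=float).reshape(3, 4)
    print(self_ensemble(toy_model, lr).shape)
```

For a model that is exactly equivariant to flips and rotations (such as the toy upscaler above) the ensemble returns the same output as a single pass; any gain for a trained network comes from averaging out its orientation-dependent errors.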

Fig. 10 illustrates the merits of ORDSR in overcoming artifacts introduced by bicubic interpolation and in adding more detail to the SR results. In the original image of Fig. 10, there is no connection between the strips. However, after downsampling and rescaling with bicubic interpolation, artifacts appear as ‘new spurious blocks’ that connect these strips diagonally. Note that even state-of-the-art deep learning methods are unable to overcome this. As shown in Fig. 10, for all competing methods including VDSR and EDSR, this ‘fake edge’ is present and sometimes even enhanced. In contrast, by virtue of operating in a carefully optimized transform domain, ORDSR exploits inter-frequency spectral information and nearly eliminates these artifacts. Fig. 11 shows the SR results of the deep learning based methods for scale factor $4$ . As shown, ORDSR produces more detail in the SR image and achieves higher numerical scores.

IV-E3 The Limited Training Scenario

For many real-world applications, such as medical and radar image SR [46, 47, 48, 49], abundant training data is usually not available. We focus on two cases, in which $10\%$ and $35\%$ of the training image set employed in Section IV-E2 are used. To eliminate selection bias, several random selections were made and averaged results are presented.

We focus on five methods: FSRCNN, EDSR, VDSR, DCT-DSR, and ORDSR, since they are shown to be the most competitive. Figs. 12(a)-12(c) show the PSNR, SSIM and IFC measures plotted against the percentage of training data used for these five methods. Note further that in Section IV-E2 we compared methods exactly as they are reported in their respective articles. Here, to observe and isolate the effects of training size, for fairness each network is employed with the same number of layers ( $10$ ), where EDSR is realized with $10$ residual blocks. The plots in Figs. 12(a)-12(c) are for a scaling factor of $3$ on the Set14 test set. Two major trends emerge: (a) ORDSR offers a more graceful degradation with respect to a decrease in the number of training samples and compelling improvements when training is limited, and (b) DCT-DSR produces results somewhat comparable to VDSR and EDSR with complementary merits in low vs. high training regimes. The competitive performance of DCT-DSR shows the value of transform domain deep SR. Finally, the gains of ORDSR over DCT-DSR, particularly when training is low ( $10\%$ , $35\%$ cases), emphasize the value of regularization in improving results.

Fig. 13 shows the limited training scenario SR results visually for the barbara.bmp image. Compared to FSRCNN, EDSR, and VDSR, ORDSR generates better visual results as well as higher numerical scores in both the $10\%$ and $35\%$ cases. Table VII provides further validation for the $10\%$ training case with scale factors varying from $2$ to $4$ over the test sets Set5 and Set14, using $10$ -layer setups for all five methods. Quite clearly, ORDSR outperforms the competition, often by a fairly significant margin.

ORDSR does better in limited training for two reasons: (1) the SR mapping is simplified in the transform (e.g., DCT) domain, and hence even with limited training the network can better approximate the non-linear mapping between LR and HR transform coefficients than methods based on spatial domain mappings, and (2) the orthogonality and complexity regularizers play a crucial role in imparting the desired structure to the transform/CDCT layer filters. As is readily apparent in Fig. 12, ORDSR is indeed the best, while DCT-DSR is the second best with $10\%$ training because DCT-DSR also shares the two benefits mentioned above. Note also from Figs. 12(a)-12(c) that DSR-UC, which optimizes the transform/CDCT layer without any constraints, does poorly in low training while still being competitive in the $100\%$ training case. Orthogonality is a crucial property for guaranteed forward and inverse transforms in our design; DSR-UC naturally leans towards more orthogonal transform filters when driven by abundant training, but this property is significantly lost when training is limited.

In Fig. 12, both abundant and limited training sizes are relative to the size of the training set, which contains about $291$ images.

V Conclusion

We develop a novel network structure to tackle the SR problem in an image transform domain. We start with DCT as the choice of image transform by proposing methods that integrate it into the network structure as a convolutional DCT (CDCT) or transform layer. We then evolve the resulting network, DCT-DSR, into a regularized deep network that allows for constrained optimization of the basis filters that comprise the transform layer. Because orthogonality constraints are central to the transform, we call our method Orthogonally Regularized Deep Super-Resolution (ORDSR). ORDSR is subsequently shown to outperform state-of-the-art SR methods, particularly when training imagery (LR and HR image pairs) is limited.

In future research, other image transform domains such as DFT and DWT can be investigated for deep SR as presented in this work. This will require explicit integration of the transform within the network structure as well as the design of new specialized constraints on the transform basis to arrive at new meaningful bases that are Fourier- or wavelet-like.

Appendix A Proposition 1: Sketch of the Proof

We need to show that convolving the CDCT layer filters with the input image generates the DCT coefficients, but in a zig-zag reordered form. We first define a zig-zag mapping function $Zig(\cdot)$ that maps the 2D index of a matrix entry to a 1D index following Fig. 2: $Zig(k_{1},k_{2})=i$ , where $(k_{1},k_{2})\in[0,N-1]\times[0,N-1]$ and $i\in[1,N\times N]$ (in the appendices, $[a,b]$ denotes a discrete interval). Its inverse is $Zig^{-1}(i)=(k_{1},k_{2})$ .
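The mapping $Zig(\cdot)$ and its inverse can be tabulated directly. The sketch below assumes the conventional JPEG-style zig-zag scan (anti-diagonals traversed in alternating directions, starting at the DC term); Fig. 2 is not reproduced here, so the exact traversal direction is an assumption, and the function names are hypothetical.

```python
def zigzag_order(N):
    """List of (k1, k2) pairs in zig-zag scan order for an N x N block.

    ASSUMPTION: JPEG-style scan, i.e. (0,0), (0,1), (1,0), (2,0), (1,1), ...
    """
    coords = [(k1, k2) for k1 in range(N) for k2 in range(N)]

    def key(c):
        k1, k2 = c
        d = k1 + k2  # index of the anti-diagonal
        # Odd diagonals run with k1 increasing, even ones with k1 decreasing.
        return (d, k1 if d % 2 else -k1)

    return sorted(coords, key=key)

def make_zig(N):
    """Return (zig, zig_inv) with 1-based i, matching i in [1, N*N]."""
    order = zigzag_order(N)
    table = {c: i + 1 for i, c in enumerate(order)}
    zig = lambda k1, k2: table[(k1, k2)]
    zig_inv = lambda i: order[i - 1]
    return zig, zig_inv

if __name__ == "__main__":
    zig, zig_inv = make_zig(8)
    print([zig_inv(i) for i in range(1, 7)])
```

With this convention the DC coefficient $(0,0)$ maps to $i=1$ and the highest-frequency coefficient $(N-1,N-1)$ maps to $i=N\times N$ .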

The proof contains two cases: one in which the stride size is equal to the DCT block size ( $S=N$ ) and one in which the stride size is less than the DCT block size ( $S<N$ ). We list here the key steps of the proof for each case.

Case 1 $(S=N)$ : For the DCT, we convolve the image $\mathbf{x}\in\mathbb{R}^{W\times H}$ with the DCT basis filters $\{\mathbf{w}_{i}\}_{i=1}^{N\times N}$ . For $(m,n)\in[1,H/N]\times[1,W/N]$ , $\mathbf{X}^{\text{dct}}_{m,n}(k_{1},k_{2})$ is the $(m,n)^{th}$ block of DCT coefficients, indexed by $(k_{1},k_{2})$ . For the CDCT layer, convolving $\mathbf{x}$ with $\mathbf{w}_{i}$ gives $\mathbf{X}_{i}^{\text{cdct}}:=\mathbf{x}*\mathbf{w}_{i}$ , where $\mathbf{X}_{i}^{\text{cdct}}\in\mathbb{R}^{\frac{W}{N}\times\frac{H}{N}}$ .

We then prove the proposition by mapping the DCT coefficients generated by the DCT basis to the convolutional results using the zig-zag function. For a fixed $(m,n)$ , the values $\mathbf{X}^{\text{cdct}}_{i}(m,n)$ , $i\in[1,N^{2}]$ , form a vector in $\mathbb{R}^{N^{2}\times 1}$ that can be zig-zag re-indexed into a block $\mathbf{X}^{\text{cdct}}_{m,n}\in\mathbb{R}^{N\times N}$ . We then show that, for a fixed $(m,n)$ , $\mathbf{X}^{\text{cdct}}_{i}(m,n)=\mathbf{X}^{\text{dct}}_{m,n}(k_{1},k_{2})$ , where $i=Zig(k_{1},k_{2})$ and $i\in[1,N^{2}]$ . The detailed index mapping can be found in Sec. III-A of the supplementary document [56].

Case 2 $(S<N)$ : There is an overlap of $(N-S)$ pixels for both the DCT and the CDCT layer. The overlap creates reorganized DCT coefficients based on overlapped inputs. As in Case 1, for a fixed $(m,n)$ , the key steps of the proof are unchanged (see Sec. III-B of [56]).
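The claim of Proposition 1 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the proof in [56]: it uses the orthonormal DCT-II, treats the layer's "convolution" as the cross-correlation used in CNN frameworks (no kernel flip), and indexes the output channels by $(k_{1},k_{2})$ instead of $i=Zig(k_{1},k_{2})$ . The function names are hypothetical.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix C with C[k, n] = a_k cos(pi (2n+1) k / (2N))."""
    idx = np.arange(N)
    C = np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * N))
    C *= np.sqrt(2.0 / N)
    C[0] /= np.sqrt(2.0)
    return C

def dct_filters(N):
    """N*N basis filters: filters[k1, k2] is the N x N image outer(C[k1], C[k2])."""
    C = dct_matrix(N)
    return np.einsum("ka,lb->klab", C, C)

def cdct_forward(x, N, S):
    """Strided cross-correlation of x with every DCT basis filter.

    Returns an array of shape (N, N, rows, cols): one feature map per
    frequency (k1, k2). With S == N the patches are the usual DCT blocks.
    """
    filters = dct_filters(N)
    rows = range(0, x.shape[0] - N + 1, S)
    cols = range(0, x.shape[1] - N + 1, S)
    out = np.empty((N, N, len(rows), len(cols)))
    for m, r in enumerate(rows):
        for n, c in enumerate(cols):
            patch = x[r:r + N, c:c + N]
            out[:, :, m, n] = np.einsum("klab,ab->kl", filters, patch)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 24))
    N = 8
    C = dct_matrix(N)
    maps = cdct_forward(x, N, S=N)
    block_dct = C @ x[:N, :N] @ C.T  # 2D DCT of the top-left block
    print(np.allclose(maps[:, :, 0, 0], block_dct))
```

Reading the $N^{2}$ feature-map values at a fixed location $(m,n)$ therefore recovers the DCT coefficients of block $(m,n)$ , up to the channel ordering given by $Zig(\cdot)$ .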

Appendix B Proposition 2: Sketch of the Proof

Similar to Appendix A, we need to show that the CDCT layer and the IDCT result in the same spatial image. We define a zero-padding function $g_{s}(\cdot)$ . For a given location $(p,q)\in[1,W]\times[1,H]$ :

[TABLE]

where $k\in[1,\frac{W}{S}],l\in[1,\frac{H}{S}]$ . Note that with $S=N$ , the term $\frac{1}{(N/S)^{2}}=1$ means that $g_{s}(\cdot)$ keeps the $\mathbf{X}_{i}$ value as given and does not apply reweighting. This is not the case for $S<N$ . We denote $g_{s}(\mathbf{X}_{i}):=\bar{\mathbf{X}}_{i}\in\mathbb{R}^{W\times H}$ , which is a zero-padded version of $\mathbf{X}_{i}$ . Note that the transposed convolution between $\mathbf{w}_{i}$ and $\mathbf{X}_{i}$ is the convolution between $\mathbf{w}_{i}$ and $g_{s}(\mathbf{X}_{i})$ .

Case 1 $(S=N)$ : For a fixed block $(m,n)$ and $k_{1},k_{2},n_{1},n_{2}\in[1,N]$ , the IDCT computed with the DCT basis, $\mathbf{x}^{\text{dct}}_{m,n}(n_{1},n_{2})$ , is the recovered image as given in Eq. (3). On the other hand, the IDCT using the CDCT layer is computed as:

[TABLE]

where $\bar{\mathbf{X}}_{i}(\cdot,\cdot)\neq 0$ , while $n_{1}-p$ and $n_{2}-q$ are multiples of $N$ (based on the definition of $\bar{\mathbf{X}}$ from Eq. (18)). We then use $Zig^{-1}$ to reorder the indices to show that $\mathbf{x}^{\text{dct}}_{m,n}(n_{1},n_{2})=\mathbf{x}^{\text{cdct}}_{m,n}(n_{1},n_{2})$ in Sec. IV-A of [56].

Case 2 $(S<N)$ : With an overlap of $N-S$ pixels, the aforementioned padding function remains the same, with $k\in[1,\frac{W}{S}],l\in[1,\frac{H}{S}]$ . For a fixed block $(m,n)$ and $k_{1},k_{2},n_{1},n_{2}\in[1,N]$ , the IDCT by the CDCT layer is now a version of (19) weighted by $\frac{1}{(N/S)^{2}}$ . When constructing the final output, at any given block $(m,n)$ for location $(n_{1},n_{2})$ ,

[TABLE]

with the repeated elements in the summation as detailed in Sec. IV-A of [56], the overlapping effect is canceled by

[TABLE]

We then use $Zig^{-1}$ to reorder the indices to show that $\mathbf{x}^{\text{dct}}_{m,n}(n_{1},n_{2})=\mathbf{x}^{\text{cdct}}_{m,n}(n_{1},n_{2})$ in Sec. IV-B of [56].
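The role of the weight $\frac{1}{(N/S)^{2}}$ can be illustrated numerically. The sketch below is not the derivation in [56]: it assumes the orthonormal DCT-II, a stride $S$ that divides $N$ , and realizes the transposed convolution as an overlap-add of per-patch inverse DCTs. With $S<N$ , exact recovery is checked only for interior pixels, which are covered by exactly $(N/S)^{2}$ patches; border handling is outside the scope of this sketch. The function names are hypothetical.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix, so that C @ C.T is the identity."""
    idx = np.arange(N)
    C = np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * N))
    C *= np.sqrt(2.0 / N)
    C[0] /= np.sqrt(2.0)
    return C

def cdct_forward(x, N, S):
    """2D DCT of every N x N patch taken with stride S."""
    C = dct_matrix(N)
    rows = range(0, x.shape[0] - N + 1, S)
    cols = range(0, x.shape[1] - N + 1, S)
    coeffs = np.empty((len(rows), len(cols), N, N))
    for m, r in enumerate(rows):
        for n, c in enumerate(cols):
            coeffs[m, n] = C @ x[r:r + N, c:c + N] @ C.T
    return coeffs

def cdct_inverse(coeffs, N, S, shape):
    """Overlap-add of per-patch inverse DCTs, scaled by 1 / (N/S)^2."""
    C = dct_matrix(N)
    out = np.zeros(shape)
    for m in range(coeffs.shape[0]):
        for n in range(coeffs.shape[1]):
            r, c = m * S, n * S
            out[r:r + N, c:c + N] += C.T @ coeffs[m, n] @ C
    return out / (N / S) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 16))
    N = 8
    # S = N: the weight is 1 and the whole image is recovered.
    rec = cdct_inverse(cdct_forward(x, N, N), N, N, x.shape)
    print(np.allclose(rec, x))
    # S < N: interior pixels are summed (N/S)^2 times, which the weight cancels.
    S = 4
    rec = cdct_inverse(cdct_forward(x, N, S), N, S, x.shape)
    b = N - S
    print(np.allclose(rec[b:-b, b:-b], x[b:-b, b:-b]))
```

The second check makes the cancellation concrete: each interior pixel appears in $(N/S)^{2}$ overlapping patches, so the unweighted overlap-add would return $(N/S)^{2}$ times the original value.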

Bibliography

[1] S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 21–36, 2003.
[2] S. Farsiu, M. D. Robinson, M. Elad, and P. Milanfar, “Fast and robust multiframe super resolution,” IEEE Trans. on Image Processing, vol. 13, no. 10, pp. 1327–1344, 2004.
[3] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, “Advances and challenges in super-resolution,” Int. Journal of Imaging Systems and Technology, vol. 14, no. 2, pp. 47–57, 2004.
[4] Q. Yuan, L. Zhang, and H. Shen, “Multiframe super-resolution employing a spatially weighted total variation model,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 3, pp. 379–392, 2012.
[5] X. Li, Y. Hu, X. Gao, D. Tao, and B. Ning, “A multi-frame image super-resolution method,” Signal Processing, vol. 90, no. 2, pp. 405–414, 2010.
[6] S. Mallat and G. Yu, “Super-resolution with sparse mixing estimators,” IEEE Trans. on Image Processing, vol. 19, no. 11, pp. 2889–2900, 2010.
[7] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in IEEE Conf. on Computer Vision and Pattern Recognition, 2004, vol. 1, pp. 1275–1282.
[8] D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a single image,” in IEEE Int. Conf. on Computer Vision, 2009, pp. 349–356.