Proximal Splitting Networks for Image Restoration

Raied Aljadaany; Dipan K. Pal; Marios Savvides

arXiv:1903.07154·cs.CV·March 19, 2019


TL;DR

This paper introduces a novel deep learning framework for image restoration that models proximal operators as trainable convolutional networks, enabling efficient and high-fidelity recovery in tasks like denoising and super-resolution.

Contribution

It presents a new approach combining proximal splitting with neural networks, allowing iteration-specific tuning and reducing the number of iterations needed for high-quality image restoration.

Findings

01

Achieves state-of-the-art results on image denoising benchmarks.

02

Outperforms existing methods in image super-resolution tasks.

03

Reduces the number of iterations by an order of magnitude while maintaining performance.

Abstract

Image restoration problems are typically ill-posed, requiring the design of suitable priors. These priors are typically hand-designed and are fully instantiated throughout the process. In this paper, we introduce a novel framework for handling inverse problems related to image restoration based on elements from the half quadratic splitting method and proximal operators. Modeling the proximal operator as a convolutional network, we define an implicit prior on the image space as a function class during training. This is in contrast to the common practice in the literature of keeping the prior fixed and fully instantiated even during training. Further, we allow this proximal operator to be tuned differently for each iteration, which greatly increases modeling capacity and allows us to reduce the number of iterations by an order of magnitude compared to other approaches. Our final…

Tables (5)

Table 1: Denoising PSNR test results of several algorithms on BSD68 with noise levels of $\sigma=\{15,25,50\}$. Bold numbers denote the highest performing model, whereas italics denote the second highest. PSN outperforms the previous state of the art when the noise level is known and matches it when it is unknown.

$σ$	BM3D [11]	WNNM [16]	EPLL [52]	MLP [8]	CSF [40]	TNRD [10]	DnCNN [51]	PSN-K (Ours)	PSN-U (Ours)
15	31.07	31.37	31.21	—	31.24	31.42	31.61	31.70	31.60
25	28.57	28.83	28.68	28.96	28.74	28.92	29.16	29.27	29.17
50	25.62	25.87	25.67	26.03	—	25.97	26.23	26.32	26.30

Table 2: Denoising PSNR results of several algorithms on Set12 with noise levels of $\sigma=\{15,25,50\}$. Bold numbers denote the highest performing model, whereas italics denote the second highest. PSN outperforms the state of the art for many images.

	C.Man	House	Pepp	Starf.	Fly	Airpl.	Parrot	Lena	Barb.	Boat	Man	Couple
						$\sigma = 15$
BM3D [11]	31.91	34.93	32.69	31.14	31.85	31.07	31.37	34.26	33.10	32.13	31.92	32.10
CSF [40]	31.95	34.39	32.85	31.55	32.33	31.33	31.37	34.06	31.92	32.01	32.08	31.98
EPLL [52]	31.85	34.17	32.64	31.13	32.10	31.19	31.42	33.93	31.38	31.93	32.00	31.93
WNNM [16]	32.17	35.13	32.99	31.82	32.71	31.39	31.62	34.27	33.60	32.27	32.11	32.17
TNRD [10]	32.19	34.53	33.04	31.75	32.56	31.46	31.63	34.24	32.13	32.14	32.23	32.11
DnCNN [51]	32.10	34.93	33.15	32.02	32.94	31.56	31.63	34.56	32.09	32.35	32.41	32.41
PSN-K (Ours)	32.58	35.04	33.23	32.17	33.11	31.75	31.89	34.62	32.64	32.52	32.39	32.43
PSN-U (Ours)	32.04	35.03	33.21	31.94	32.93	31.61	31.62	34.56	32.49	32.41	32.37	32.43
						$\sigma = 25$
BM3D [11]	29.47	32.99	30.29	28.57	29.32	28.49	28.97	32.03	30.73	29.88	29.59	29.70
CSF [40]	29.51	32.41	30.32	28.87	29.69	28.80	28.91	31.87	28.99	29.75	29.68	29.50
EPLL [52]	29.21	32.14	30.12	28.48	29.35	28.66	28.96	31.58	28.53	29.64	29.57	29.46
WNNM [16]	29.63	33.22	30.55	29.09	29.98	28.81	29.13	32.24	31.28	29.98	29.74	29.80
TNRD [10]	29.72	32.53	30.57	29.09	29.85	28.88	29.18	32.00	29.41	29.91	29.87	29.71
DnCNN [51]	29.94	33.05	30.84	29.34	30.25	29.09	29.35	32.42	29.69	30.20	30.09	30.10
PSN-K (Ours)	30.28	33.26	31.01	29.57	30.30	29.28	29.38	32.57	30.17	30.31	30.10	30.18
PSN-U (Ours)	29.79	33.23	30.90	29.30	30.17	29.06	29.25	32.45	29.94	30.25	30.05	30.12
						$\sigma = 50$
BM3D [11]	26.13	29.69	26.68	25.04	25.82	25.10	25.90	29.05	27.22	26.78	26.81	26.46
MLP [8]	26.37	29.64	26.68	25.43	26.26	25.56	26.12	29.32	25.24	27.03	27.07	26.67
WNNM [16]	26.45	30.33	26.95	25.44	26.32	25.42	26.14	29.25	27.79	26.97	26.95	26.64
TNRD [10]	26.62	29.48	27.10	25.42	26.31	25.59	26.16	28.93	25.70	26.94	26.98	26.50
DnCNN [51]	27.03	30.02	27.39	25.72	26.83	25.89	26.48	29.38	26.38	27.23	27.23	26.09
PSN-K (Ours)	27.10	30.34	27.40	25.84	26.92	25.90	26.56	29.54	26.45	27.20	27.21	27.09
PSN-U (Ours)	27.21	30.21	27.53	25.63	26.93	25.89	26.62	29.54	26.56	27.27	27.23	27.04

Table 3: The PSNR and SSIM results of several algorithms on four image super resolution benchmarks. PSN outperforms all previous algorithms significantly and consistently (except for the 3X case on Urban100). PSN also outperforms the works of [45, 42, 20, 12, 48] on all four benchmarks, whose specific results we present in the supplementary due to space constraints.

Algorithm	Scale	SET5 (PSNR / SSIM)	SET14 (PSNR / SSIM)	BSDS100 (PSNR / SSIM)	URBAN100 (PSNR / SSIM)
Bicubic	2X	33.69 / 0.931	30.25 / 0.870	29.57 / 0.844	26.89 / 0.841
FSRCNN [13]	2X	37.05 / 0.956	32.66 / 0.909	31.53 / 0.892	29.88 / 0.902
DRCN [25]	2X	37.63 / 0.959	33.06 / 0.912	31.85 / 0.895	30.76 / 0.914
LapSRN [28]	2X	37.52 / 0.959	33.08 / 0.913	31.80 / 0.895	30.41 / 0.910
DRRN [44]	2X	37.74 / 0.959	33.23 / 0.914	32.05 / 0.897	31.23 / 0.919
VDSR [24]	2X	37.53 / 0.959	33.05 / 0.913	31.90 / 0.896	30.77 / 0.914
PSN (Ours)	2X	38.09 / 0.960	33.68 / 0.919	32.33 / 0.901	31.97 / 0.921
Bicubic	3X	30.41 / 0.869	27.55 / 0.775	27.22 / 0.741	24.47 / 0.737
FSRCNN [13]	3X	33.18 / 0.914	29.37 / 0.824	28.53 / 0.791	26.43 / 0.808
DRCN [25]	3X	33.83 / 0.922	29.77 / 0.832	28.80 / 0.797	27.15 / 0.828
LapSRN [28]	3X	33.82 / 0.922	29.87 / 0.832	28.82 / 0.798	27.07 / 0.828
DRRN [44]	3X	34.03 / 0.924	29.96 / 0.835	28.95 / 0.800	27.53 / 0.764
VDSR [24]	3X	33.67 / 0.921	29.78 / 0.832	28.83 / 0.799	27.14 / 0.829
PSN (Ours)	3X	34.56 / 0.927	30.14 / 0.845	29.26 / 0.809	27.43 / 0.757
Bicubic	4X	28.43 / 0.811	26.01 / 0.704	25.97 / 0.670	23.15 / 0.660
FSRCNN [13]	4X	30.72 / 0.866	27.61 / 0.755	26.98 / 0.715	24.62 / 0.728
DRCN [25]	4X	31.54 / 0.884	28.03 / 0.768	27.24 / 0.725	25.14 / 0.752
LapSRN [28]	4X	31.54 / 0.885	28.19 / 0.772	27.32 / 0.727	25.21 / 0.756
DRRN [44]	4X	31.68 / 0.888	28.21 / 0.772	27.38 / 0.728	25.44 / 0.764
VDSR [24]	4X	31.35 / 0.883	28.02 / 0.768	27.29 / 0.726	25.18 / 0.754
PSN (Ours)	4X	32.36 / 0.896	28.40 / 0.786	27.73 / 0.742	25.63 / 0.768

Table 4: Run time in seconds for three image sizes.

Method	BM3D [11]	WNNM [16]	EPLL [52]	MLP [8]	CSF [40]	TNRD [10]	DnCNN [51]	PSN-K	PSN-U
$256 \times 256$	0.65	203.1	25.4	1.42	2.11	0.010	0.016	0.017	0.018
$512 \times 512$	2.85	773.2	45.5	5.51	5.67	0.032	0.060	0.072	0.081
$1024 \times 1024$	11.89	2536.4	422.1	19.4	40.8	0.116	0.235	0.345	0.378

Table 5: The PSNR and SSIM results of several algorithms for image super-resolution.

Algorithm	Scale	SET5 (PSNR / SSIM)	SET14 (PSNR / SSIM)	BSDS100 (PSNR / SSIM)	URBAN100 (PSNR / SSIM)
Bicubic	2X	33.69 / 0.931	30.25 / 0.870	29.57 / 0.844	26.89 / 0.841
A+ [45]	2X	36.60 / 0.955	32.32 / 0.906	31.24 / 0.887	29.25 / 0.895
RFL [42]	2X	36.59 / 0.954	32.29 / 0.905	31.18 / 0.885	29.14 / 0.891
SelfExSR [20]	2X	36.60 / 0.955	32.24 / 0.904	31.20 / 0.887	29.55 / 0.898
SRCNN [12]	2X	36.72 / 0.955	32.51 / 0.908	31.38 / 0.889	29.53 / 0.896
FSRCNN [13]	2X	37.05 / 0.956	32.66 / 0.909	31.53 / 0.892	29.88 / 0.902
SCN [48]	2X	36.58 / 0.954	32.35 / 0.905	31.26 / 0.885	29.52 / 0.897
DRCN [25]	2X	37.63 / 0.959	33.06 / 0.912	31.85 / 0.895	30.76 / 0.914
LapSRN [28]	2X	37.52 / 0.959	33.08 / 0.913	31.80 / 0.895	30.41 / 0.910
DRRN [44]	2X	37.74 / 0.959	33.23 / 0.914	32.05 / 0.897	31.23 / 0.919
VDSR [24]	2X	37.53 / 0.959	33.05 / 0.913	31.90 / 0.896	30.77 / 0.914
PSN (Ours)	2X	38.09 / 0.960	33.68 / 0.919	32.33 / 0.901	31.97 / 0.921
Bicubic	3X	30.41 / 0.869	27.55 / 0.775	27.22 / 0.741	24.47 / 0.737
A+ [45]	3X	32.62 / 0.909	29.15 / 0.820	28.31 / 0.785	26.05 / 0.799
RFL [42]	3X	32.47 / 0.906	29.07 / 0.818	28.23 / 0.782	25.88 / 0.792
SelfExSR [20]	3X	32.66 / 0.910	29.18 / 0.821	28.30 / 0.786	26.45 / 0.810
SRCNN [12]	3X	32.78 / 0.909	29.32 / 0.823	28.42 / 0.788	26.25 / 0.801
FSRCNN [13]	3X	33.18 / 0.914	29.37 / 0.824	28.53 / 0.791	26.43 / 0.808
SCN [48]	3X	32.62 / 0.908	29.16 / 0.818	28.33 / 0.783	26.21 / 0.801
DRCN [25]	3X	33.83 / 0.922	29.77 / 0.832	28.80 / 0.797	27.15 / 0.828
LapSRN [28]	3X	33.82 / 0.922	29.87 / 0.832	28.82 / 0.798	27.07 / 0.828
DRRN [44]	3X	34.03 / 0.924	29.96 / 0.835	28.95 / 0.800	27.53 / 0.764
VDSR [24]	3X	33.67 / 0.921	29.78 / 0.832	28.83 / 0.799	27.14 / 0.829
PSN (Ours)	3X	34.56 / 0.927	30.14 / 0.845	29.26 / 0.809	27.43 / 0.757
Bicubic	4X	28.43 / 0.811	26.01 / 0.704	25.97 / 0.670	23.15 / 0.660
A+ [45]	4X	30.32 / 0.860	27.34 / 0.751	26.83 / 0.711	24.34 / 0.721
RFL [42]	4X	30.17 / 0.855	27.24 / 0.747	26.76 / 0.708	24.20 / 0.712
SelfExSR [20]	4X	30.34 / 0.862	27.41 / 0.753	26.84 / 0.713	24.83 / 0.740
SRCNN [12]	4X	30.50 / 0.863	27.52 / 0.753	26.91 / 0.712	24.53 / 0.725
FSRCNN [13]	4X	30.72 / 0.866	27.61 / 0.755	26.98 / 0.715	24.62 / 0.728
SCN [48]	4X	30.41 / 0.863	27.39 / 0.751	26.88 / 0.711	24.52 / 0.726
DRCN [25]	4X	31.54 / 0.884	28.03 / 0.768	27.24 / 0.725	25.14 / 0.752
LapSRN [28]	4X	31.54 / 0.885	28.19 / 0.772	27.32 / 0.727	25.21 / 0.756
DRRN [44]	4X	31.68 / 0.888	28.21 / 0.772	27.38 / 0.728	25.44 / 0.764
VDSR [24]	4X	31.35 / 0.883	28.02 / 0.768	27.29 / 0.726	25.18 / 0.754
PSN (Ours)	4X	32.36 / 0.896	28.40 / 0.786	27.73 / 0.742	25.63 / 0.768

Equations

$$y = k * x + \epsilon \tag{1}$$

$$x^{*} = \arg\min_{x} \|y - k * x\|_{2}^{2} + g(x) \tag{2}$$

$$\mathrm{prox}_{h,\beta}(x) = \arg\min_{z} \beta\|z - x\|_{2}^{2} + h(z) \tag{3}$$

$$\mathrm{prox}_{h,\beta}(x) \approx x - 2\beta^{-1}\nabla h(x) \tag{4}$$

$$x^{*} = \arg\min_{x} f(x) + g(x) \tag{5}$$

$$x^{*}, v^{*} = \arg\min_{x,v} f(x) + g(v), \quad \text{s.t.} \quad v = x \tag{6}$$

$$x^{*}, v^{*} = \arg\min_{x,v} f(x) + g(v) + \beta\|v - x\|_{2}^{2} \tag{7}$$

$$x_{t} = \mathrm{prox}_{f,\beta}(v_{t}), \qquad v_{t} = \mathrm{prox}_{g,\beta}(x_{t-1}) \tag{8}$$

$$x_{t} = v_{t} - 2\beta^{-1}[K^{T}(K v_{t} - y)] \tag{9}$$

$$v_{t} = \mathrm{prox}_{g,\beta}(x_{t-1}), \quad x_{t} = v_{t} - 2\beta^{-1}[K^{T}(K v_{t} - y)] \quad \forall t = 1, \dots, S \tag{10}$$

$$v_{t} = \Gamma_{\Theta}^{t}(x_{t-1}) \tag{11}$$

$$\min_{\Theta} L(x_{gt}, x_{S}) \quad \text{s.t.} \quad v_{t} = \Gamma_{\Theta}^{t}(x_{t-1}), \quad x_{t} = v_{t} - 2\beta^{-1}[K^{T}(K v_{t} - y)] \quad \forall t = 1, \dots, S \tag{12}$$

$$\min_{\Theta} \|x_{gt} - x_{1}\|_{2}^{2} \quad \text{s.t.} \quad v_{1} = \Gamma_{\Theta}^{1}(x_{0}), \quad x_{1} = v_{1} + K^{T} y$$

$$\min_{\Theta, D, C} \sum_{i}^{S} \|M_{i} x_{gt} - x_{i}\|_{2}^{2}$$

$$v_{t} = \Gamma_{\Theta}^{t}(D_{t}^{T} x_{t-1} + C_{t} M_{t} K^{T} y)$$

$$x_{t} = v_{t} - 2\beta^{-1}[M_{t} K^{T}(K v_{t} - y)], \quad t = 1, \dots,$$

$$\min_{\Theta, D^{T}, C} \sum_{i}^{S} \|M_{t} x_{gt} - x_{i}\|_{2}^{2}$$

$$v_{t} = \Gamma_{\Theta}^{t}(D_{t}^{T} x_{t-1} + C_{t} M_{t} y)$$

$$x_{t} = v_{t} + M_{t} y, \quad t = 1, \dots, S.$$

$$\mathrm{prox}_{h,\beta}(x) = \arg\min_{z} \beta\|z - x\|_{2}^{2} + h(x) + \nabla h(x)^{T}(z - x) + \frac{1}{2}(z - x)^{T}\nabla^{2} h(x)(z - x)$$

$$\mathrm{prox}_{h,\beta}(x) = x - [\nabla^{2} h(x) + \beta I/2]^{-1}\nabla h(x)$$

$$\mathrm{prox}_{h,\beta}(x) \approx x - 2\beta^{-1}\nabla h(x)$$


Full text

Proximal Splitting Networks for Image Restoration

Raied Aljadaany, Dipan K. Pal, Marios Savvides

Department of Electrical and Computer Engg.

Carnegie Mellon University

Pittsburgh, PA 15213

{raljadaa, dipanp, marioss}@andrew.cmu.edu

Abstract

Image restoration problems are typically ill-posed, requiring the design of suitable priors. These priors are typically hand-designed and are fully instantiated throughout the process. In this paper, we introduce a novel framework for handling inverse problems related to image restoration based on elements from the half quadratic splitting method and proximal operators. Modeling the proximal operator as a convolutional network, we define an implicit prior on the image space as a function class during training. This is in contrast to the common practice in the literature of keeping the prior fixed and fully instantiated even during training. Further, we allow this proximal operator to be tuned differently for each iteration, which greatly increases modeling capacity and allows us to reduce the number of iterations by an order of magnitude compared to other approaches. Our final network is an end-to-end one whose run time matches the previous fastest algorithms while outperforming them in recovery fidelity on two image restoration tasks. Indeed, we find our approach achieves state-of-the-art results on benchmarks in image denoising and image super resolution while recovering more complex and finer details.

1 Introduction

Single image restoration aims to reconstruct a clear image from corrupted measurements. Assume a corrupted image $y$ can be generated by convolving a clear image $x$ with a known linear space-invariant blur kernel $k$. This can be written as:

$$y = k * x + \epsilon \tag{1}$$

where $\epsilon$ is additive zero-mean white Gaussian noise and $*$ is the convolution operation. The problem of recovering the clean image is an ill-posed inverse problem. One approach to solving it is to assume some prior (or a set of priors) on the image space. Thus, the clean image can be approximated by solving the following optimization problem

$$x^{*} = \arg\min_{x} \|y - k * x\|_{2}^{2} + g(x) \tag{2}$$

where $\|\cdot\|_{2}$ is the $l_{2}$ norm and $g$ is an operator that defines some prior (e.g., the $l_{1}$ norm is used to promote sparsity). A good prior is important to recover a feasible and high-quality solution. Indeed, priors are common in signal and image processing tasks such as inverse problems [29, 26], and these communities have spent considerable effort in hand designing suitable priors for signals [2, 39, 43].
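The degradation model $y = k * x + \epsilon$ can be simulated in a few lines. The following is a minimal NumPy sketch, not the authors' code: it assumes circular (wrap-around) boundary handling so that the convolution can be done with an FFT, and the helper names `circular_convolve` and `degrade` are hypothetical.

```python
import numpy as np

def circular_convolve(x, k):
    """Circularly convolve image x with a small kernel k (assumed boundary model)."""
    k_padded = np.zeros_like(x, dtype=float)
    kh, kw = k.shape
    k_padded[:kh, :kw] = k
    # Shift the kernel so that its centre sits at pixel (0, 0).
    k_padded = np.roll(k_padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k_padded)))

def degrade(x, k, sigma, rng):
    """Simulate y = k * x + eps with zero-mean white Gaussian noise of std sigma."""
    return circular_convolve(x, k) + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 255.0, size=(32, 32))  # stand-in for a clean image
delta = np.zeros((3, 3))
delta[1, 1] = 1.0                            # identity kernel: pure denoising
box = np.full((3, 3), 1.0 / 9.0)             # simple blur kernel

y_noisy = degrade(x, delta, sigma=25.0, rng=rng)
y_blurred = degrade(x, box, sigma=0.0, rng=rng)
```

With the identity kernel the model reduces to additive noise only, which is the denoising setting; any other kernel adds blur before the noise.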

In this work, we build a framework where the image prior is a function class during training, rather than a specific instantiation. The parameters of this function are learned, and the learned function then acts as a fully instantiated prior during testing. This allows for a far more flexible prior which is learned and tuned according to the data. This is somewhat in contrast to the concept of a prior in the general machine learning setting, where it is usually a specific function (rather than a function class). In our application to image restoration, the prior function class is defined to be a deep convolutional network. Such networks have exceptional capacity for modelling complex functions while capturing natural structures in images such as spatial reciprocity through convolutions. A large function class for the prior allows the optimization to model richer statistics within the image space, which leads to better reconstruction performance (as we find in our experiments).

Our reconstruction network takes only two inputs, the corrupted image and the kernel, and reconstructs the clear image with a single forward pass. The network architecture is designed following a recovery algorithm based on the half quadratic splitting method involving the proximal operator (see Section 3). Proximal operators have been successfully applied in many image processing tasks (e.g. [32]). They are powerful, work under more general conditions (e.g., they do not require differentiability), and are simple to implement and study. Theoretically, the reconstruction process is driven primarily by the half-quadratic splitting method with no trainable parameters. The only need for training arises when the proximal operators in this architecture are modelled using deep networks. This training only helps the network to learn parameters such that the overall pipeline is effective. Our overall framework is flexible and can be applied to almost any image based inverse problem, though in this work we focus on image denoising and image super-resolution.

Contributions. We propose a novel framework for image restoration tasks called Proximal Splitting Networks (PSN). Our network architecture is theoretically motivated through the half quadratic splitting method and the proximal operator. Further, we model the proximal operators as deep convolutional networks, which results in more flexible signal priors tuned to the data while requiring an order of magnitude fewer iterations than some previous works. Finally, we demonstrate state-of-the-art results on two image restoration tasks, namely image denoising and image super resolution, on multiple standard benchmarks.

2 Prior Art

Priors in image restoration. The design of priors for solving inverse problems has enjoyed a rich history. Linear transforms have been proposed as priors, which also assume properties such as low energy or sparsity [19, 5, 1]. However, it has been shown that these approaches fail when the solution space invoked by the assumed prior does not contain good approximations of the real data [14]. There have also been other signal processing methods such as BM3D [11], and those leveraging total variation [47] and dictionary learning [39, 14, 46] techniques, that have been successful in several of these tasks. Proximal operators [34] and half quadratic splitting [47] methods have been useful in a few image recovery algorithms such as [5]. These methods assume hand-designed priors approximated via a careful choice of the norm. Although this has provided much success, they are limited in the expressive capacity of the prior, which ultimately limits the quality of the solution. In our approach, we learn the prior from a function class for our problem during the optimization. Thus our algorithm utilizes a more expressive prior that is informed by data.

Deep learning approaches and our generalization. Deep learning approaches have proven successful at modelling high level image statistics and thus have excelled at image restoration problems. Some example applications include blind de-convolution [31], super-resolution [24] and de-noising [51]. Though these methods generalize well, the relation between the architecture and the prior used remains unclear. In this work, however, the network is clearly motivated by a combination of the proximal operator and half quadratic splitting methods. Further, we show in the supplementary that our network is a generalization of the approaches in [24, 51].

Deep learning approaches which learn the prior. It is worth mentioning that several approaches have used a proximal gradient descent algorithm [30], the ADMM algorithm [37] or a gradient descent method [7] to recover an image where the prior is computed via a deep learning network. Although these approaches perform well with respect to reconstruction performance, they inherit important limitations in terms of computational efficiency. Proximal gradient descent, ADMM and gradient descent, being first order iterative methods with linear or sub-linear convergence rates, typically require many tens of iterations to converge. Each iteration consists of a forward pass through the network, which emerges as a considerable bottleneck. Our approach addresses this problem by allowing different proximal operators to be learned at every 'iteration'. This increases the modelling capacity of the overall network and allows for far fewer iterations (an order of magnitude fewer in our case).

Deep learning structure based on theoretical approaches. There are several approaches that employ CNNs for image restoration [40, 23, 10] where the structure of the network is derived from a theoretical model for image recovery. In [40], the authors proposed the cascade of shrinkage fields (CSF) for image restoration. CSF can be seen as a ConvNet whose architecture is a cascade of Gaussian conditional random fields [41]. [10] proposed trainable nonlinear reaction diffusion (TNRD), which is a ConvNet whose structure is based on nonlinear diffusion models [36]. In [23], the authors proposed GradNet, applied to noise-blind deblurring. The architecture of GradNet is motivated by the Majorization-Minimization (MM) algorithm [21]. However, these approaches assume that the prior term is derived from or approximated by a Gaussian mixture model [52] which is represented by a ConvNet. Our method is free of this assumption.

3 Proximal Splitting in Deep Networks

Our main goal is the design of a feed forward neural network for image restoration. For the architecture, we take inspiration from two tools in optimization. The first is the proximal operator, which allows the solution to a problem to be part of some predefined solution space. The second is the half quadratic splitting technique, which allows a sum of objective functions to be solved in alternating sequence using proximal operators. We briefly describe these two components and then utilize them to design our system architecture.

3.1 Proximal Operator

Let $h:R^{n}\rightarrow R$ be a function. The proximal operator of the function $h$ with the parameter $\beta$ is defined as

$$\mathrm{prox}_{h,\beta}(x) = \arg\min_{z} \beta\|z - x\|_{2}^{2} + h(z) \tag{3}$$

If the function $h(x)$ is strongly convex and twice differentiable with respect to $x$, and $\beta$ is large, the proximal operator of $h$ converges to a gradient descent step (a proof of this known result is presented in the supplementary). In this case the proximal operator can be approximated as:

$$\mathrm{prox}_{h,\beta}(x) \approx x - 2\beta^{-1}\nabla h(x) \tag{4}$$

3.2 Half Quadratic Splitting

Now note that the image recovery optimization problem (Eq. 2) can be rewritten as:

$$x^{*} = \arg\min_{x} f(x) + g(x) \tag{5}$$

where $f$ is the data fidelity term and $g$ is a function that represents the prior. Depending on this prior $g(x)$, Eq. 5 might be hard to optimize, especially when the prior function is not convex. The half quadratic splitting method [47] restructures this problem (Eq. 5) into a constrained optimization problem by introducing an auxiliary variable $v$. Under this approach, the optimization problem in Eq. 5 is reformulated as:

$$x^{*}, v^{*} = \arg\min_{x,v} f(x) + g(v), \quad \text{s.t.} \quad v = x \tag{6}$$

The next step is to replace the equality constraint with a quadratic penalty term in the objective:

$$x^{*}, v^{*} = \arg\min_{x,v} f(x) + g(v) + \beta\|v - x\|_{2}^{2} \tag{7}$$

where $\beta$ is a penalty parameter. As $\beta$ approaches infinity, the solution of Eq. 7 is equivalent to that of Eq. 5, and can be found iteratively by fixing one variable, updating the other, and vice versa. By using the proximal operator, these updating steps become

$$x_{t} = \mathrm{prox}_{f,\beta}(v_{t}), \qquad v_{t} = \mathrm{prox}_{g,\beta}(x_{t-1}) \tag{8}$$

When the image $x$ is fixed, the optimum $v$ can be found through the proximal operator involving $g(z)$ and $\beta$. Clearly, this depends on the prior, which is the function $g$. For instance, if $g(z)$ is the $l_{1}$ norm, the prox operator is a soft-thresholding operator which forces the signal to be sparse [5]. However, for real-world image data, the optimal class of functions for $g$ is not known, which by extension makes the prox operator sub-optimal for recovery. In the following subsection, we propose an approach to optimize the prox operator within a predefined search space.
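To make the $l_{1}$ case concrete, here is a small sketch of the soft-thresholding prox. It assumes the definition $\arg\min_{z} \beta\|z - x\|_{2}^{2} + h(z)$ with $h(z) = \lambda\|z\|_{1}$, for which the per-coordinate threshold works out to $\lambda/(2\beta)$. The weight `lam` and the brute-force checker are illustrative additions, not part of the paper.

```python
import numpy as np

def prox_l1(x, beta, lam=1.0):
    """Closed-form argmin_z beta*||z - x||_2^2 + lam*||z||_1 (soft thresholding)."""
    threshold = lam / (2.0 * beta)
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def prox_brute_force(x_scalar, beta, lam=1.0):
    """Grid search over z for one scalar input, used only as a sanity check."""
    z = np.linspace(-5.0, 5.0, 200001)
    objective = beta * (z - x_scalar) ** 2 + lam * np.abs(z)
    return z[np.argmin(objective)]

x = np.array([-2.0, -0.1, 0.0, 0.3, 1.5])
v = prox_l1(x, beta=2.0)  # threshold is 1 / (2 * 2) = 0.25
```

Entries whose magnitude is below the threshold are set to zero and the rest shrink towards zero, which is how this prior promotes sparsity.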

As a final note, recall that since the added noise is assumed to be Gaussian, $f$ is the Euclidean distance between the corrupted image and the clean image convolved with a kernel. Thus, $f(x)=\frac{1}{2}\|{y}-{k}*{x}\|^{2}_{2}$, which is convex and twice differentiable. This allows the updating step in Eq. 8 for $x$ to be approximated via gradient descent, using the approximation of the proximal operator from Eq. 4:

$$x_{t} = v_{t} - 2\beta^{-1}[K^{T}(K v_{t} - y)] \tag{9}$$

where $K$ is the matrix form of the convolution operation with $k$ and $K^{T}$ is its transpose.
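The data-fidelity update $x_{t} = v_{t} - 2\beta^{-1}[K^{T}(K v_{t} - y)]$ can be sketched without ever forming the matrix $K$. The code below is an illustration under the assumption of circular boundary conditions, where $K$ becomes a pointwise multiplication in the Fourier domain and $K^{T}$ the multiplication by the conjugate spectrum. All function names are hypothetical.

```python
import numpy as np

def kernel_spectrum(k, shape):
    """FFT of kernel k, zero-padded to `shape` and centred at pixel (0, 0)."""
    k_padded = np.zeros(shape)
    kh, kw = k.shape
    k_padded[:kh, :kw] = k
    k_padded = np.roll(k_padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(k_padded)

def apply_K(x, k_hat):
    """K x: circular convolution with the kernel."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * k_hat))

def apply_KT(x, k_hat):
    """K^T x: for a real kernel, the transpose uses the conjugate spectrum."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(k_hat)))

def data_step(v, y, k_hat, beta):
    """x_t = v_t - 2 * beta^{-1} * K^T (K v_t - y)."""
    return v - (2.0 / beta) * apply_KT(apply_K(v, k_hat) - y, k_hat)

rng = np.random.default_rng(1)
v_t = rng.standard_normal((16, 16))
y = rng.standard_normal((16, 16))
kernel = np.array([[0.0, 0.2, 0.0], [0.1, 0.4, 0.3], [0.0, 0.0, 0.0]])
x_t = data_step(v_t, y, kernel_spectrum(kernel, v_t.shape), beta=8.0)
```

For denoising, where $k$ is a delta and $K$ is the identity, the step with $\beta = 8$ reduces to the blend $x_{t} = 0.75\,v_{t} + 0.25\,y$, which pulls the network output back towards the observation.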

3.3 Proximal Splitting Networks

We now develop the core optimization problem which will then yield the Proximal Splitting Network architecture. Our main approach for image recovery is to use the half quadratic splitting method, which alternately updates the image $x$ and an auxiliary variable $v$ as in Eq. 8. Thus, for $S$ iterations the optimization procedure becomes

$$v_{t} = \mathrm{prox}_{g,\beta}(x_{t-1}), \quad x_{t} = v_{t} - 2\beta^{-1}[K^{T}(K v_{t} - y)] \quad \forall t = 1, \dots, S \tag{10}$$

Note that the update for $v_{t}$ still contains a proximal operator depending on the prior $g$. There have been studies such as [30], where the authors replace the proximal operator with a deep denoising network (DnCNN) [51]. Similarly, the authors in [18] use BM3D or the NLM denoiser rather than the proximal operator to update the value of an image. It is also important to note that these studies used these denoisers in an iterative fashion, i.e. the same proximal operator with the same parameters was used through multiple iterations. Considering that the number of iterations in these studies was high (about 30 for both [30] and [18]) and that every iteration requires a forward pass through a deep network, these methods incur a large computational bottleneck.

Although these methods work well, there is much to gain from defining a more flexible proximal operator in two ways. First, defining a larger solution space for the proximal operator allows the algorithm to choose more fitting operators. Second, allowing the proximal operator networks at different stages (iterations) to maintain separate weights allows each operator to be tuned to the statistics of the estimated image at that stage. Due to the larger modelling capacity, this also allows us to keep the number of iterations or stages very small in comparison (3 in our experiments, which is an order of magnitude fewer than previous studies [30, 18]). Keeping these in mind, we choose the model for the proximal operator in our formulation to be a deep convolutional network, which introduces desirable inductive biases. These biases themselves act as our 'prior' while providing the optimization a large enough function search space to choose from. The rest of the prior (i.e. the actual parameters of the convolutional network) is tuned according to the data. Under this modification, the update step for $v_{t}$ becomes

$$v_{t} = \Gamma_{\Theta}^{t}(x_{t-1}) \tag{11}$$

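where $\Gamma^{t}_{\Theta}$ is a convolutional network for the $t^{th}$ iteration. Note that for every iteration, there is a separate such network. Defining the proximal operator (and the image prior) to be different for every iteration, the final optimization problem becomes

$$\min_{\Theta} L(x_{gt}, x_{S}) \quad \text{s.t.} \quad v_{t} = \Gamma_{\Theta}^{t}(x_{t-1}), \quad x_{t} = v_{t} - 2\beta^{-1}[K^{T}(K v_{t} - y)] \quad \forall t = 1, \dots, S \tag{12}$$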

where $x_{gt}$ is the ground truth clean image, $x_{S}$ is the final estimated image, $S$ is the number of stages (iterations) and $x_{0}$ is the initial input image. Note that the minimization in this formulation is only over $\Theta$, i.e. the parameters of the set of proximal networks $\Gamma^{t}_{\Theta}\ \forall t$. The loss function here can be any suitable function, though we minimize the Euclidean error for this study assuming Gaussian noise.

It is important to note a subtle point regarding the recovery framework. The minimization in Eq. 12 only tunes the networks $\Gamma_{\Theta}$ towards the desired task based on the data. However, the core algorithm for reconstruction is still based on Eq. 10, i.e. iterations of the half quadratic splitting based reconstruction. It is also useful to observe the interplay between the objective function and the constraints. The first constraint and the loss objective in Eq. 12 work to project the recovered image onto the image space, while the second constraint pushes the recovered image to be as close as possible to the corrupted input image. A single iteration over these constraints according to half quadratic splitting, together with the proximal network $\Gamma_{\Theta}$, results in what we call the Proximal Block, as shown in Fig. 3. The Proximal Block is the fundamental component from which the overall network is built (as we describe soon).
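The unrolled recursion, with a separate operator per stage, can be sketched as a short loop. This is only an illustration for the denoising case ($K$ = identity): the stage functions built by `make_stage` are hypothetical smoothing filters standing in for the trained convolutional networks $\Gamma^{t}_{\Theta}$, and no training is performed here.

```python
import numpy as np

def make_stage(weight):
    """Hypothetical stand-in for Gamma^t: blend each pixel with its 4-neighbour mean."""
    def gamma(x):
        neighbours = (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
                      np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 4.0
        return (1.0 - weight) * x + weight * neighbours
    return gamma

def psn_forward(y, stages, beta=8.0):
    """Unrolled iterations for denoising (K = identity):
    v_t = Gamma^t(x_{t-1}),  x_t = v_t - 2/beta * (v_t - y)."""
    x = y  # x_0: for denoising, the input is the noisy image itself
    for gamma_t in stages:
        v = gamma_t(x)                   # stage-specific learned prox (stand-in)
        x = v - (2.0 / beta) * (v - y)   # data-fidelity step
    return x

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
stages = [make_stage(w) for w in (0.9, 0.6, 0.3)]  # S = 3, one operator per stage
restored = psn_forward(noisy, stages)
```

Each pass through the loop body corresponds to one Proximal Block; giving every stage its own operator is what distinguishes this from reusing a single denoiser across iterations.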

Multi-scale Proximal Splitting Network. Multi-scale decomposition has been widely applied in many applications, such as edge-aware filtering [35], image blending [9] and semantic segmentation [15]. Multi-scale architecture extensions have also emerged as a standard technique to further improve the performance of deep learning approaches to image recovery tasks such as image de-convolution [31] and image super-resolution [28]. We find that multi-scaling is useful and incorporate it into the Proximal Splitting Network algorithm. These approaches usually require that the output of each intermediate scale stage be the cleaned/processed image at that scale. Complying with this, multi-scale PSN networks are designed such that the intermediate outputs form a Gaussian pyramid of cleaned images. For better performance, we apply reconstruction fidelity loss functions at each level of the pyramid. This also helps provide stronger gradients for the entire network pipeline, especially the first few layers, which are typically harder to train.
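A per-level fidelity loss of the form $\sum_{i} \|M_{i} x_{gt} - x_{i}\|_{2}^{2}$ can be sketched as follows. Two assumptions are made for illustration: $M_{i}$ is taken to be repeated 2x2 average pooling (the paper uses bicubic down sampling), and the stage outputs are ordered from coarsest to finest. The function names are hypothetical.

```python
import numpy as np

def downsample2(x):
    """Halve the resolution by 2x2 average pooling (stand-in for bicubic down sampling)."""
    h, w = x.shape
    cropped = x[:h - h % 2, :w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def pyramid(x, levels):
    """Return versions of x ordered from coarsest to finest (assumed ordering)."""
    out = [x]
    for _ in range(levels - 1):
        out.append(downsample2(out[-1]))
    return out[::-1]

def multiscale_loss(x_gt, outputs):
    """Sum of squared errors between each stage output and its pyramid target."""
    targets = pyramid(x_gt, len(outputs))
    return sum(float(np.sum((t - o) ** 2)) for t, o in zip(targets, outputs))

rng = np.random.default_rng(0)
x_gt = rng.uniform(0.0, 1.0, size=(16, 16))
perfect_outputs = pyramid(x_gt, 3)  # shapes (4, 4), (8, 8), (16, 16)
loss = multiscale_loss(x_gt, perfect_outputs)
```

Because every level contributes its own error term, early stages receive a direct training signal instead of relying only on gradients propagated back from the final output.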

Proximal Splitting Network Architecture for Image Restoration. Finally, we implement Eq. 12 to arrive at the PSN architecture, which stacks multiple Proximal Blocks as shown in Fig. 4. The number of Proximal Blocks (from Fig. 3) equals the number of stages or iterations of the half-quadratic splitting method (Eq. 12), which we set to 3, i.e. $S=3$. Recall that this is an order of magnitude fewer than some previous works [30, 18]. In Fig. 4, the input image is the corrupted image convolved with $K^{T}$ (e.g., the input is the noisy image for image denoising, and the image up-sampled via bicubic interpolation for image super-resolution in the experiments). Down sampling is performed by bicubic down sampling and up sampling by a de-convolution layer [33]. Through a preliminary grid search, we find that $\beta=8$ works satisfactorily.

4 Empirical Evaluation on Image Restoration

We evaluate our proposed approach against state-of-the-art algorithms on standard benchmarks for the tasks of image denoising and image super resolution. For training we use Adam [27] for 50 epochs with a batch size of 128 for all models. Run times of the evaluated PSN networks are on par with the fastest algorithms, while their accuracy exceeds the previous state of the art (run times are provided in the supplementary).

4.1 Image De-noising

Our first task is image denoising where, given a noisy image (with a known or unknown level of noise), the task is to output a noiseless version of the image. Image denoising is considered a special case of Eq. 1 where $k$ is a delta function with no shift.

Experiment: We train on 400 images of size $180\times 180$ from the Berkeley Segmentation Dataset (BSD) [3]. We set the patch size to $64\times 64$ and crop about one million random patches for training. We train four models as described in [51]. Three of these models are trained on images with three different levels of Gaussian noise, i.e., $\sigma$ = 15, 25 and 50. We refer to these models as PSN-K (Proximal Split Net-Known noise level). The fourth model is trained for blind Gaussian denoising, where no level of $\sigma$ is assumed. For blind Gaussian denoising, we train a single model and set the range of the noise level in the training images to be $\sigma\in$ [0, 60]. We refer to this model as PSN-U (Unknown noise level). We test on two well-known datasets that do not overlap with the training data: the Berkeley Segmentation Dataset (BSD68) [38], containing a total of 68 images, and Set12 [11], with 12 images. We compare our approach with several state-of-the-art methods such as BM3D [11], WNNM [16], TNRD [10], EPLL [52], DnCNN [51], MLP [8] and CSF [40].
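
The training-pair construction described above can be sketched as follows. This is an assumption-laden illustration: `sample_training_pair` is a hypothetical helper, pixel values are taken to lie in [0, 255], and in the blind setting $\sigma$ is drawn uniformly from [0, 60], since the text gives only the range and not the sampling distribution.

```python
import random


def sample_training_pair(image, patch=64, sigma=None, sigma_max=60.0, rng=random):
    """Crop a random patch and corrupt it with additive Gaussian noise.

    image : 2-D list of pixel values (assumed range [0, 255])
    sigma : fixed noise level (PSN-K style); None draws a level uniformly
            from [0, sigma_max] (PSN-U style, uniform sampling is an assumption)
    Returns (noisy_patch, clean_patch, sigma_used).
    """
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - patch)    # randint is inclusive on both ends
    left = rng.randint(0, w - patch)
    clean = [row[left:left + patch] for row in image[top:top + patch]]
    s = rng.uniform(0.0, sigma_max) if sigma is None else sigma
    noisy = [[v + rng.gauss(0.0, s) for v in row] for row in clean]
    return noisy, clean, s


# A 180x180 placeholder image; real training would load BSD images instead.
image = [[128.0] * 180 for _ in range(180)]
noisy, clean, s = sample_training_pair(image, sigma=25.0)   # known noise level
noisy_u, clean_u, s_u = sample_training_pair(image)         # blind setting
```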

Results: Table 1 reports the test PSNR results on BSD68. We observe that PSN-K outperforms all other algorithms to obtain a new state-of-the-art on BSD68. The noise-blind version (PSN-U) very closely matches the previous state-of-the-art and outperforms it for $\sigma=25,50$. Table 2 shows the test PSNRs for Set12. We find that for most images, PSN-K achieves a new state-of-the-art. The noise-blind model PSN-U also beats the previous state-of-the-art on many images, and in some cases even PSN-K. PSN-U performs particularly well at high levels of noise, i.e. $\sigma=50$. Fig. 1 and Fig. 5 present some qualitative results illustrating the high level of detail PSN recovers. More results are presented in the supplementary.
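
For reference, PSNR, the metric reported in Tables 1 and 2, can be computed as in the sketch below. The function name `psnr` and the peak value of 255 (8-bit images) are assumptions; the evaluation code used for the paper is not shown in the text.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    n = len(reference) * len(reference[0])
    mse = sum((r - e) ** 2
              for row_r, row_e in zip(reference, estimate)
              for r, e in zip(row_r, row_e)) / n
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.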

4.2 Image Super-Resolution

Our second task aims to reconstruct a high-resolution image from a single low-resolution image. Image super-resolution is a special case of Eq. 1 where $k$ is a bicubic down-sampling filter with no added noise.

Experiment: For training, we use the DIV2K dataset, which consists of 800 training images (2K resolution). The dataset is augmented with random horizontal flips and $90^{\circ}$ rotations. We set the high-resolution patch size to $128\times 128$. The low-resolution patches are generated via bicubic down-sampling of the high-resolution patches. We train one model for each of three scales, i.e., 2X, 3X and 4X. We test our algorithm on four benchmark datasets: Set5 [6], Set14 [50], BSDS100 [4] and URBAN100 [20]. We compare our approach with several state-of-the-art methods such as A+ [45], RFL [42], SelfExSR [20], SRCNN [12], FSRCNN [13], SCN [48], DRCN [25], LapSRN [28] and VDSR [24] in terms of the PSNR and SSIM metrics as in [24].
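
The flip-and-rotate augmentation mentioned above can be sketched on a plain list-of-rows patch as follows. The helper names (`hflip`, `rot90`, `augment`) are hypothetical, and the flip probability of 0.5 and the uniform choice among the four rotations are assumptions that the text does not specify.

```python
import random


def hflip(img):
    """Mirror a 2-D patch left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate a 2-D patch by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def augment(img, rng=random):
    """Random horizontal flip followed by 0, 1, 2 or 3 quarter turns."""
    out = [list(row) for row in img]
    if rng.random() < 0.5:          # assumed flip probability
        out = hflip(out)
    for _ in range(rng.randrange(4)):
        out = rot90(out)
    return out
```

Since the low-resolution patches are generated from the high-resolution ones, applying the augmentation to the high-resolution patch before down-sampling keeps each training pair aligned.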

Results: From Table 5, it is clear that PSN achieves state-of-the-art results in terms of both PSNR and SSIM on all four benchmarks and at all scales by a significant margin. This demonstrates the efficacy of the algorithm when applied to the image super-resolution problem. Fig. 2 and Fig. 5 present some qualitative results. Notice that PSN recovers complex structures more clearly. More results are presented in the supplementary.

4.3 Conclusion

We proposed a theoretically motivated novel deep architecture for image recovery, inspired by the half-quadratic splitting algorithm and utilizing the proximal operator. Extensive experiments in image denoising and image super-resolution demonstrated that the proposed Proximal Splitting Network is effective and achieves a new state-of-the-art on both tasks. Furthermore, the proposed framework is flexible and can potentially be applied to other tasks such as image in-painting and compressed sensing, which are left for future work.

5 Appendix: Algorithms as Special Cases of the Proximal Splitting Network Optimization Problem

VDSR [24] as a special case. We now describe the relationship of Proximal Splitting Networks to some other deep learning approaches. The authors in [24] present a single-image super-resolution method called VDSR, which uses a very deep convolutional network with residual learning. We find that VDSR is a special case of our formulation and can be modelled by modifying Eq. 12 (from the paper). We set $S=1$, $K^{T}$ to be the bi-cubic up-sampling filter and $\beta$ to be 2, where $y$ is the low-resolution image and $x^{0}$ is the low-resolution image up-sampled via bi-cubic interpolation. Further, if the last convolution layer of $\Gamma^{1}_{\Theta}$ is a filter that can be represented by a matrix with weights equivalent to $(I+K^{T}K)^{-1}$ and the loss objective is the $l_{2}$ loss, we obtain the following optimization problem:

[TABLE]

Here, $\Gamma^{1}_{\Theta}$ is modelled as a deep CNN consisting of 20 layers. This is exactly the formulation of the VDSR method.

DnCNN [51] as a special case. Similarly, the Denoising Convolutional Neural Network (DnCNN [51]) can also be modelled as a special case of the PSN optimization problem (Eq. 12 from the paper). In that work, the authors propose a CNN for image denoising that can recover images even when the noise level is unknown. DnCNN can be represented by the formula in Eq. 13 if $\Gamma^{1}_{\Theta}$ is a deep convolutional neural network, $x^{0}$ is the noisy image and $K^{T}$ is the identity matrix. The formulation then describes the architecture of DnCNN.

Thus, we find that some previous approaches can be modelled as special cases of our formulation (Eq. 12 from the paper). Our approach not only generalizes these methods theoretically, but also outperforms them in practice on two image restoration tasks.

Furthermore, the deep multi-scale convolutional neural network for dynamic scene deblurring [31] is another special case of our approach. That work proposes a blind deblurring method based on a CNN. The proposed network is a multi-scale convolutional neural network that recovers sharp images where the blur is caused by several motion filters. To show that this approach is a special case of our method, we first need to present the formula for combining PSN with a multi-scale architecture. The optimization formula of this combination is:

[TABLE]

Here, $D^{T}$ is a de-convolution filter [33]. $M_{t}$ is a sub-sampling matrix that reduces the size of the vector it multiplies: $M_{S}$ is the identity matrix, and $M_{i}$ down-samples by a factor of $2^{S-i}$.
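
As a small illustration of the sub-sampling factors just described, the sketch below tabulates $2^{S-i}$ for each stage and applies one simple instance of a sub-sampling operator. Both helper names are hypothetical, and plain striding is an assumption used for brevity; the architecture section states that down-sampling is bicubic.

```python
def scale_factors(S):
    """Map stage index i (1..S) to its down-sampling factor 2**(S - i)."""
    return {i: 2 ** (S - i) for i in range(1, S + 1)}


def subsample(img, factor):
    """Keep every `factor`-th row and column (plain striding, an assumption)."""
    return [row[::factor] for row in img[::factor]]


factors = scale_factors(3)   # the last stage has factor 1, i.e. M_S is the identity
```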

To show that the approach proposed in [31] is a special case of PSN, we need to adjust the values of $\beta$ and $K^{T}$ in Eq. 14, since the filter $k$ is unknown. Thus, it can be written as:

[TABLE]

Optimizing this function is exactly equivalent to the approach in [31] when $\Gamma^{t}_{\Theta}$ is a deep residual network [17].

By applying the same methodology that we used in the previous three cases, we can show that our approach also generalizes the methods of [28, 49].

6 Proof of Eq. 4

In Eq. 3 (from the paper), $h(z)$ can be approximated by its second-order Taylor expansion since it is twice differentiable. Thus, the optimization problem in Eq. 3 (from the paper) becomes

[TABLE]

Eq. 16 is convex. Therefore, its minimizer can be found by taking the first derivative and setting it to zero. The result is:

[TABLE]

When $\beta$ is large, the proximal operator of the function $h$ can be approximated by

[TABLE]

7 Complexity

Table 4 shows the run times of different methods for denoising. The input images have three different sizes ($256\times 256$, $512\times 512$ and $1024\times 1024$). We see that the two versions of the PSN network, PSN-K and PSN-U, are among the fastest algorithms, taking less than 0.1 seconds for images smaller than $512\times 512$. Though PSN is slightly slower than TNRD [10] and DnCNN [51], it still processes a $1024\times 1024$ image in under 0.5 seconds.

8 Complete Tabular result for Super-resolution

Table 5 shows the full version of Table 3 from the main paper. This is the complete result for the super-resolution experiments. We find that PSN still achieves state-of-the-art results on most benchmarks and settings.

Bibliography


[1] A. Altintaç, E. E. Altshuler, J. B. Andersen, M. Ando, E. Arvas, R. Raird, L. A. Baker, B. B. Balslcy, W. L. Ecklund, D. A. Bathker, et al. 1988 index IEEE Transactions on Antennas and Propagation.
[2] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies. Image coding using wavelet transform. IEEE Transactions on Image Processing, 1(2):205–220, 1992.
[3] P. Arbelaez, C. Fowlkes, and D. Martin. The Berkeley segmentation dataset and benchmark. See http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds, 2007.
[4] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[6] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
[7] S. A. Bigdeli, M. Zwicker, P. Favaro, and M. Jin. Deep mean-shift priors for image restoration. In Advances in Neural Information Processing Systems, pages 763–772, 2017.
[8] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2392–2399. IEEE, 2012.