GAN-based Projector for Faster Recovery with Convergence Guarantees in Linear Inverse Problems

Ankit Raj; Yuqi Li; Yoram Bresler

arXiv:1902.09698·cs.LG·October 25, 2019

TL;DR

This paper introduces a GAN-based projector for linear inverse problems that accelerates recovery by 60-80 times, guarantees convergence under certain conditions, and reduces measurement requirements by 5-10 times, applicable across various tasks.

Contribution

It proposes a network-based projector for PGD that speeds up GAN-based recovery, with theoretical convergence guarantees and a method for designing measurement matrices, applicable to multiple inverse problems.

Findings

1. Achieves 60-80x faster recovery than previous GAN methods.

2. Requires 5-10x fewer measurements for similar accuracy.

3. Provides convergence guarantees under moderate conditioning.

Abstract

A Generative Adversarial Network (GAN) with generator $G$ trained to model the prior of images has been shown to perform better than sparsity-based regularizers in ill-posed inverse problems. Here, we propose a new method of deploying a GAN-based prior to solve linear inverse problems using projected gradient descent (PGD). Our method learns a network-based projector for use in the PGD algorithm, eliminating expensive computation of the Jacobian of $G$. Experiments show that our approach provides a speed-up of $60\text{-}80\times$ over earlier GAN-based recovery methods along with better accuracy. Our main theoretical result is that if the measurement matrix is moderately conditioned on the manifold range($G$) and the projector is $\delta$-approximate, then the algorithm is guaranteed to reach $O(\delta)$ reconstruction error in $O(\log(1/\delta))$ steps in the low noise regime.…

Tables

Table 1: Comparison of execution time (in seconds) of recovery algorithms on the CelebA dataset. The relative speedup of our NPGD over the CSGM algorithm of Bora et al. is shown in parentheses.

$m$	CSGM³	PGD-GAN	NPGD
200	5.8	66	0.09 (64x)
500	6.6	60	0.10 (66x)
1000	8.0	63	0.11 (72x)
2000	11.2	61	0.14 (80x)

³ Run time includes 2 initializations, as implemented by the authors, for CelebA. The same number of initializations for CelebA (and 10 for MNIST) has been used to produce results in figures 2, 7, 8, and 9. Our NPGD algorithm uses only one, deterministic initialization, $x_{0} = A^{T} y$.

Equations

y = A x + v, \quad A \in \mathbb{R}^{m \times n}, \quad v \sim \mathcal{N}(0, \sigma^{2} I)

\hat{x}_{MLE}

L(\theta)

+ \mathbb{E}_{z, \nu} \big[ \lambda \| G_{\theta}^{\dagger}(G(z) + \nu) - z \|^{2} \big]

\alpha \| x_{1} - x_{2} \|^{2} \leq \| A (x_{1} - x_{2}) \|^{2} \leq \beta \| x_{1} - x_{2} \|^{2}

\| x - G(G^{\dagger}(x)) \|^{2} \leq \min_{z \in \mathbb{R}^{k}} \| x - G(z) \|^{2} + \delta

\mathcal{S}(G) = \Big\{ \frac{x_{1} - x_{2}}{\| x_{1} - x_{2} \|} : x_{1}, x_{2} \in R(G) \Big\}

z_{1}, z_{2} \sim \mathcal{N}(0, I_{k}), \quad s = \frac{G(z_{1}) - G(z_{2})}{\| G(z_{1}) - G(z_{2}) \|} \sim \Pi_{S}

\min_{A \in \mathbb{R}^{m \times n}} \frac{\beta}{\alpha} = \min_{A \in \mathbb{R}^{m \times n}} \frac{\max_{s \in \mathcal{S}(G)} \| A s \|^{2}}{\min_{s \in \mathcal{S}(G)} \| A s \|^{2}}

\leq \min_{A A^{T} = I_{m}} \frac{1}{\min_{s \in \mathcal{S}(G)} \| A s \|^{2}} = \Big( \max_{A A^{T} = I_{m}} \min_{s \in \mathcal{S}(G)} \| A s \|^{2} \Big)^{-1}

A = \arg\max_{A A^{T} = I_{m}} \mathbb{E}_{s \sim \Pi_{S}} \big[ \| A s \|^{2} \big] \approx \arg\max_{A A^{T} = I_{m}} \frac{1}{M} \sum_{j=1}^{M} \| A s_{j} \|^{2}

A^{*} = \arg\max_{A} \| A D \|_{F}^{2} \quad \text{s.t.} \quad A A^{T} = I_{m}

\| w_{t} - x_{t+1} \|^{2} = \| w_{t} - G(G^{\dagger}(w_{t})) \|^{2} \leq \| x^{*} - w_{t} \|^{2} + \delta

w_{t} = x_{t} - \eta A^{T} (A x_{t} - y) = x_{t} - \eta A^{T} A (x_{t} - x^{*})

\| x_{t+1} - x_{t} \|^{2} - 2 \eta \langle x_{t+1} - x_{t}, A^{T} A (x^{*} - x_{t}) \rangle \leq \| x^{*} - x_{t} \|^{2} - 2 \eta \| A (x^{*} - x_{t}) \|^{2} + \delta

2 \langle x_{t} - x_{t+1}, A^{T} A (x^{*} - x_{t}) \rangle

\leq \frac{1}{\eta} \| x^{*} - x_{t} \|^{2} - 2 f(x_{t}) - \frac{1}{\eta} \| x_{t+1} - x_{t} \|^{2} + \frac{\delta}{\eta}

\leq \Big( \frac{1}{\eta \alpha} - 2 \Big) f(x_{t}) - \frac{1}{\eta} \| x_{t+1} - x_{t} \|^{2} + \frac{\delta}{\eta}

\leq \Big( \frac{1}{\eta \alpha} - 2 \Big) f(x_{t}) - \frac{1}{\eta \beta} \| A x_{t+1} - A x_{t} \|^{2} + \frac{\delta}{\eta}

2 \langle x_{t} - x_{t+1}, A^{T} A (x^{*} - x_{t}) \rangle

= \| A x^{*} - A x_{t+1} \|^{2} - \| A x^{*} - A x_{t} \|^{2} - \| A x_{t+1} - A x_{t} \|^{2}

= f(x_{t+1}) - f(x_{t}) - \| A x_{t+1} - A x_{t} \|^{2}

f(x_{t+1}) \leq \Big( \frac{1}{\eta \alpha} - 1 \Big) f(x_{t}) + \Big( 1 - \frac{1}{\eta \beta} \Big) \| A x_{t+1} - A x_{t} \|_{2}^{2} + \frac{\delta}{\eta}

f(x_{t+1}) \leq \Big( \frac{\beta}{\alpha} - 1 \Big) f(x_{t}) + \beta \delta

f(x_{n})

= (\kappa - 1)^{n} f(x_{0}) + \frac{\beta (1 - (\kappa - 1)^{n})}{2 - \kappa} \delta

\| x_{n} - x^{*} \|^{2} \leq \frac{\| A x_{n} - A x^{*} \|^{2}}{\alpha} = \frac{f(x_{n})}{\alpha}

\leq (\kappa - 1)^{n} \frac{f(x_{0})}{\alpha} + \frac{\beta (1 - (\kappa - 1)^{n})}{\alpha (2 - \kappa)} \delta

\leq (\kappa - 1)^{n} \frac{f(x_{0})}{\alpha} + \frac{\delta}{2/\kappa - 1} \leq \Big( C + \frac{1}{2/\kappa - 1} \Big) \delta

\| x^{*} - x_{\infty} \|^{2} \leq \frac{\delta}{2/\kappa - 1} = \frac{\delta}{2 \alpha / \beta - 1}

Full text

GAN-based Projector for Faster Recovery with Convergence Guarantees in Linear Inverse Problems

Ankit Raj*, Yuqi Li*, Yoram Bresler

University of Illinois at Urbana-Champaign, USA

{ankitr3, yuqil3, ybresler}@illinois.edu

* Equal contribution. Ankit Raj and Yoram Bresler's research work was supported in part by the National Science Foundation under Grant IIS 14-47879. Yuqi Li and Yoram Bresler's research work was supported in part by Sandia National Laboratories under Grant ID: AE056, IP: 00371547.

Abstract

A Generative Adversarial Network (GAN) with generator $G$ trained to model the prior of images has been shown to perform better than sparsity-based regularizers in ill-posed inverse problems. Here, we propose a new method of deploying a GAN-based prior to solve linear inverse problems using projected gradient descent (PGD). Our method learns a network-based projector for use in the PGD algorithm, eliminating expensive computation of the Jacobian of $G$. Experiments show that our approach provides a speed-up of $60\text{-}80\times$ over earlier GAN-based recovery methods along with better accuracy. Our main theoretical result is that if the measurement matrix is moderately conditioned on the manifold range($G$) and the projector is $\delta$-approximate, then the algorithm is guaranteed to reach $O(\delta)$ reconstruction error in $O(\log(1/\delta))$ steps in the low noise regime. Additionally, we propose a fast method to design such measurement matrices for a given $G$. Extensive experiments demonstrate the efficacy of this method by requiring $5\text{-}10\times$ fewer measurements than random Gaussian measurement matrices for comparable recovery performance. Because the learning of the GAN and projector is decoupled from the measurement operator, our GAN-based projector and recovery algorithm are applicable without retraining to all linear inverse problems, as confirmed by experiments on compressed sensing, super-resolution, and inpainting.

1 Introduction

Many applications, such as computational imaging and remote sensing, fall within the compressive sensing (CS) paradigm. CS [9, 5] refers to projecting a high-dimensional, sparse or sparsifiable signal $x\in\mathbb{R}^{n}$ to a lower-dimensional measurement $y\in\mathbb{R}^{m}$, $m\ll n$, using a small set of linear, non-adaptive frames. The noisy measurement model is:

$$y = Ax + v, \quad A\in\mathbb{R}^{m\times n}, \quad v\sim\mathcal{N}(0,\sigma^{2}I) \tag{1}$$

where the measurement matrix $A$ is often a random matrix. In this work, we are interested in the problem of recovering the unknown natural signal $x$ , from the compressed measurement $y$ , given the measurement matrix $A$ . Traditionally, for signal priors, natural images are considered sparse in some fixed or learnable basis [11, 8, 36, 22, 7, 38, 10, 21].
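As an illustration of the measurement model, the following NumPy sketch simulates $y = Ax + v$ for a random Gaussian $A$. The entry variance $1/m$ follows the set-up of Section 5.1.1; the helper name `measure`, the noise level, and the random stand-in for an image are assumptions made only for this example.

```python
import numpy as np

def measure(x, m, sigma, rng):
    """Return (A, y) with y = A x + v, A_ij ~ N(0, 1/m), v ~ N(0, sigma^2 I)."""
    n = x.shape[0]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))  # std = sqrt(1/m)
    v = rng.normal(0.0, sigma, size=m)                   # measurement noise
    return A, A @ x + v

rng = np.random.default_rng(0)
x = rng.normal(size=784)                 # stand-in for a flattened 28x28 image
A, y = measure(x, m=100, sigma=0.01, rng=rng)
print(A.shape, y.shape)
```

With $m \ll n$ the system is underdetermined, which is why a prior on $x$ is needed for recovery.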

Instead of the sparse prior commonly adopted in the CS literature, we turn to a learned prior. Neural network-based inverse problem solvers have been explored recently [14, 35, 31, 1, 12, 15, 25, 32, 22, 37, 26]. However, [1, 12, 15, 25] use information about the measurement matrix $A$ while training the network. Their algorithms are therefore limited to a particular set-up for a specific inverse problem and usually cannot solve other problems without retraining. Another line of work [28, 29] jointly optimizes the measurement matrix and the recovery algorithm, again resulting in an algorithm limited to a particular inverse problem and measurement matrix. In contrast, in this paper the network is trained independently of $A$ and generalizes across different inverse problems. This aspect is shared by two other neural-network-based solvers [35, 31]. However, they model the image prior only implicitly by training a denoiser or a proximal map, and perhaps for this reason appear to require a massive quantity of training samples. Importantly, very little is known about why and when they perform well: even if the learned proximal map is assumed to be exact, there is no theoretical convergence guarantee or bound on the recovery error.

In this work, we leverage the success of generative adversarial network (GAN) [13, 6, 42, 39, 3, 20] in modeling the distribution of data. Indeed, GAN-based priors for natural images have been successfully employed to solve linear inverse problems [24, 4, 33]. However, in [24], the operator $A$ is integrated into training the GAN, limiting it to a particular inverse problem. We therefore focus on the recent papers [4, 33] closest to our work, for extensive comparisons.

Bora et al. [4] do not have a guarantee on the convergence of their algorithm for solving the non-convex optimization problem, and it requires several random initializations. Similarly, in [33], the inner loop uses a gradient descent algorithm to solve a non-convex optimization problem with no guarantee of convergence to a global optimum. Furthermore, the conditions imposed in [33] on the random Gaussian measurement matrix for convergence of their outer iterative loop are unnecessarily stringent and cannot be achieved with a moderate number of measurements. Importantly, both these methods require expensive computation of the Jacobian $\nabla_{z}G$ of the differentiable generator $G$ with respect to the latent input $z$. Since computing $\nabla_{z}G$ involves back-propagation through $G$ at every iteration, these reconstruction algorithms are computationally expensive and are slow even when implemented on a GPU.

We propose a GAN-based projection network to solve compressed sensing recovery problems using projected gradient descent (PGD). We are able to reconstruct the image even with $61\times$ compression ratio (i.e., with less than $1.6\%$ of a full measurement set) using a random Gaussian measurement matrix. The proposed approach provides superior recovery accuracy over existing methods, simultaneously with a $60\text{-}80\times$ speed-up, making the algorithm useful for practical applications. We also provide theoretical results on the convergence of the reconstruction error, given that the measurement matrix $A$ satisfies certain conditions when restricted to the range $R(G)$ of the generator. We complement the theory by proposing a method to design a measurement matrix that satisfies these sufficient conditions for guaranteed convergence. We assess these sufficient conditions for both the random Gaussian measurement matrix and the designed matrix for a given dataset. Both our analysis and experiments show that with the designed matrix, $5\text{-}10\times$ fewer measurements suffice for robust recovery. Because the training of the GAN and projector is decoupled from the measurement operator, we demonstrate that other linear inverse problems like super-resolution and inpainting can also be solved using our algorithm without retraining.

2 Problem Formulation

Let $x^{*}\in\mathbb{R}^{n}$ denote a ground truth image, $A$ a fixed measurement matrix, and $y=Ax^{*}+v\in\mathbb{R}^{m}$ the noisy measurement, with noise $v\sim\mathcal{N}(0,\sigma^{2}I)$ . We assume that the ground truth images lie in a non-convex set $S=R(G)$ , the range of generator $G$ . The maximum likelihood estimator (MLE) of $x^{*}$ , $\hat{x}_{MLE}$ , can be formulated as follows:

$$\hat{x}_{MLE} = \operatorname*{arg\,min}_{x\in R(G)}\|y-Ax\|^{2} \tag{2}$$

Bora et al. [4] (whose algorithm we denote by CSGM) solve the optimization problem $\hat{z}=\operatorname*{arg\,min}_{z\in\mathbb{R}^{k}}\|y-AG(z)\|^{2}+\lambda\|z\|^{2}$ in the latent space ($z$), and set $\hat{x}=G(\hat{z})$. Their gradient descent algorithm often gets stuck at local optima. Since the problem is non-convex, the reconstruction is strongly dependent on the initialization of $z$ and requires several random initializations to converge to a good point. To resolve this problem, Shah and Hegde [33] proposed a projected gradient descent (PGD)-based method (which we call PGD-GAN) to solve (2), shown in fig. 2(a). They perform gradient descent in the ambient ($x$) space and project the updated term onto $R(G)$. This projection involves solving another non-convex minimization problem (shown in the second box in fig. 2(a)) using the Adam optimizer [17] for 100 iterations from a random initialization. No convergence result is given for this iterative algorithm that performs the non-linear projection, and the convergence analysis for the PGD-GAN algorithm [33] holds only if one assumes that the inner loop succeeds in finding the optimum projection.

Our main idea in this paper is to replace this iterative scheme in the inner-loop with a learning-based approach, as it often performs better and does not fall into local optima [42]. Another important benefit is that both earlier approaches require expensive computation of the Jacobian of $G$ , which is eliminated in the proposed approach.

3 Proposed Method

In this section, we introduce our methodology and architecture to train a projector using a pre-trained generator $G$ and how we use this projector to obtain the optimizer in (2).

3.1 Inner-Loop-Free Scheme

We show that by carefully designing a network architecture with a suitable training strategy, we can train a projector onto $R(G)$, the range of the generator $G$, thereby removing the inner loop required in the earlier approach. The resulting iterative updates of our network-based PGD (NPGD) algorithm are shown in fig. 2(b). This approach eliminates the need to solve the non-convex optimization problem in the inner loop, which depends on initialization and requires several restarts. Furthermore, our method provides a significant speed-up by a factor of $60\text{-}80\times$ on the CelebA dataset (see Table 1) for two major reasons: (i) since there is no inner loop, the total number of iterations required for convergence is significantly reduced; (ii) it does not require computing $\nabla_{z}G$, i.e., the Jacobian of the generator with respect to its input $z$. In the earlier methods, this expensive operation repeats back-propagation through the network $T_{out}\times\#_{restarts}$ times (for [4]) or $T_{out}\times T_{in}$ times (for [33]), where $\#_{restarts}$, $T_{out}$, and $T_{in}$ are the numbers of restarts, outer iterations, and inner iterations, respectively.

3.2 Generator-based Projector

A GAN consists of two networks, a generator and a discriminator, which follow an adversarial training strategy to learn the data distribution. A well-trained generator $G:\mathbb{R}^{k}\rightarrow R(G)\subset\mathbb{R}^{n}$, $k\ll n$, takes in a random latent variable $z\sim\mathcal{N}(0,I_{k})$ and produces sharp-looking images imitating the training data distribution in $\mathbb{R}^{n}$. The goal is to train a network that projects an image $x\in\mathbb{R}^{n}$ onto $R(G)$. A projector $P_{S}$ onto a set $S$ should satisfy two main properties: (i) idempotence: for any point $x$, $P_{S}(P_{S}(x))=P_{S}(x)$; (ii) least distance: for a point $\tilde{x}$, $P_{S}(\tilde{x})={\arg\min}_{x\in S}\|x-\tilde{x}\|^{2}$. Figure 3 shows the network structure we used to train a projector using a GAN. We define the multi-task loss to be:

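$$L(\theta) = \mathbb{E}_{z,\nu}\big[\|G(G^{\dagger}_{\theta}(G(z)+\nu))-G(z)\|^{2}\big] + \mathbb{E}_{z,\nu}\big[\lambda\|G^{\dagger}_{\theta}(G(z)+\nu)-z\|^{2}\big] \tag{3}$$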

where $G$ is a generator obtained from the GAN trained on a particular dataset. The operator $G^{\dagger}_{\theta}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{k}$, parameterized by $\theta$, approximates a non-linear least-squares pseudo-inverse of $G$, and $\nu\sim\mathcal{N}(0,I_{n})$ is noise added to the generator's output for different $z\sim\mathcal{N}(0,I_{k})$, so that the projector network $P_{G}=GG^{\dagger}_{\theta}$ is trained on points outside $R(G)$ and learns to project them onto $R(G)$. The objective function consists of two parts. The first is similar to the standard encoder-decoder framework; however, the loss is minimized only over $\theta$, the parameters of $G^{\dagger}$, while the parameters of $G$ (obtained by standard GAN training) are kept fixed. This ensures that $R(G)$ does not change and that $P_{G}=GG^{\dagger}$ is a mapping onto $R(G)$. The second part keeps $G^{\dagger}(G(z))$ close to the true $z$ used to generate the training image $G(z)$. This second term can be considered a regularizer for training the projector, with $\lambda$ the regularization constant.
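To make the two-part objective concrete, here is a minimal NumPy sketch that evaluates it on a batch. It assumes the reconstruction term compares $G(G^{\dagger}_{\theta}(G(z)+\nu))$ against the clean point $G(z)$, and it substitutes a hypothetical linear generator `W` with its exact pseudo-inverse for the trained networks, so it illustrates the loss only, not the training procedure.

```python
import numpy as np

def projector_loss(G, G_dagger, z, nu, lam):
    """Batch estimate of the two-term loss.

    z: (batch, k) latent codes; nu: (batch, n) perturbations added to G(z).
    """
    clean = G(z)                                       # points on R(G)
    z_hat = G_dagger(clean + nu)                       # encode the perturbed points
    recon = np.sum((G(z_hat) - clean) ** 2, axis=1)    # projection (encoder-decoder) term
    latent = np.sum((z_hat - z) ** 2, axis=1)          # latent regularizer
    return float(np.mean(recon + lam * latent))

rng = np.random.default_rng(0)
k, n = 4, 32
W = rng.normal(size=(n, k))            # hypothetical linear "generator" G(z) = W z
W_pinv = np.linalg.pinv(W)             # its exact least-squares pseudo-inverse
G = lambda z: z @ W.T
G_dagger = lambda x: x @ W_pinv.T

z = rng.normal(size=(256, k))
loss_clean = projector_loss(G, G_dagger, z, np.zeros((256, n)), lam=0.1)
loss_noisy = projector_loss(G, G_dagger, z, rng.normal(size=(256, n)), lam=0.1)
print(loss_clean, loss_noisy)
```

In this toy case the loss vanishes (up to rounding) without noise, because the pseudo-inverse recovers $z$ exactly, and becomes positive once $\nu$ pushes the inputs off $R(G)$.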

4 Theoretical Study

4.1 Convergence Analysis

Let $f(x)=\|Ax-y\|_{2}^{2}$ denote the loss function of projected gradient descent. Algorithm (1) describes the proposed network-based projected gradient descent (NPGD) to solve equation (2).
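The NPGD iteration can be sketched in a few lines: a gradient step $w_{t}=x_{t}-\eta A^{T}(Ax_{t}-y)$ followed by the projection $x_{t+1}=G(G^{\dagger}(w_{t}))$, starting from $x_{0}=A^{T}y$. The sketch below is a toy illustration under stated assumptions: the projector is passed in as a generic callable, and the demonstration replaces the learned networks with a hypothetical linear generator `W` whose exact projector is available in closed form.

```python
import numpy as np

def npgd(A, y, project, eta, num_iters):
    """Projected gradient descent with a generic projector callable."""
    x = A.T @ y                                   # deterministic init x_0 = A^T y
    for _ in range(num_iters):
        w = x - eta * A.T @ (A @ x - y)           # gradient step (factor 2 absorbed in eta)
        x = project(w)                            # x_{t+1} = G(G^dagger(w_t))
    return x

rng = np.random.default_rng(0)
n, m, k = 400, 100, 5
W = rng.normal(size=(n, k))                       # hypothetical linear generator G(z) = W z
W_pinv = np.linalg.pinv(W)
project = lambda w: W @ (W_pinv @ w)              # exact projector onto R(G) in this toy case

A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
x_true = W @ rng.normal(size=k)
y = A @ x_true                                    # noiseless measurements

Q, _ = np.linalg.qr(W)                            # orthonormal basis of R(G)
beta = np.linalg.svd(A @ Q, compute_uv=False)[0] ** 2   # largest ||A s||^2 over unit s in R(G)
x_hat = npgd(A, y, project, eta=1.0 / beta, num_iters=100)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(rel_err)
```

Because the toy projector is exact and linear, the error contracts geometrically here; with a learned, $\delta$-approximate projector the behavior is governed by the theorem that follows.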

Definition 1 (Restricted Eigenvalue Constraint (REC))

Let $S\subset\mathbb{R}^{n}$ . For some parameters $0<\alpha<\beta$ , matrix $A\in\mathbb{R}^{m\times n}$ is said to satisfy the $REC(S,\alpha,\beta)$ if the following holds for all $x_{1},x_{2}\in S$ .

$$\alpha\|x_{1}-x_{2}\|^{2}\leq\|A(x_{1}-x_{2})\|^{2}\leq\beta\|x_{1}-x_{2}\|^{2} \tag{4}$$

Definition 2 (Approximate Projection using GAN)

A concatenated network $G(G^{\dagger}(\cdot)):\mathbb{R}^{n}\rightarrow R(G)$ is a $\delta$ -approximate projector, if the following holds for all $x\in\mathbb{R}^{n}$ :

$$\|x-G(G^{\dagger}(x))\|^{2}\leq\min_{z\in\mathbb{R}^{k}}\|x-G(z)\|^{2}+\delta \tag{5}$$

Theorem 1 provides upper bounds on the cost function and reconstruction error of our NPGD algorithm after $n$ iterations.

Theorem 1

Let matrix $A\in\mathbb{R}^{m\times n}$ satisfy the $REC(S,\alpha,\beta)$ with $\beta/\alpha<2$ , and let the concatenated network $G(G^{\dagger}(\cdot))$ be a $\delta$ -approximate projector. Then for every $x^{*}\in R(G)$ and measurement $y=Ax^{*}$ , executing algorithm 1 with step size $\eta=1/\beta$ , will yield $f(x_{n})\leq(\frac{\beta}{\alpha}-1)^{n}f(x_{0})+\frac{\beta\delta}{2-\beta/\alpha}$ . Furthermore, the algorithm achieves $\|x_{n}-x^{*}\|^{2}\leq\big{(}C+\frac{1}{2\alpha/\beta-1}\big{)}\delta$ after $\frac{1}{2-\beta/\alpha}\log\big{(}\frac{f(x_{0})}{C\alpha\delta}\big{)}$ steps. When $n\rightarrow\infty$ , $\|x^{*}-x_{\infty}\|^{2}\leq\frac{\delta}{2\alpha/\beta-1}$ .

Proof 1

Please refer to the appendix.

Theorem 1 shows that the ratio $\beta/\alpha$ is an important factor. This ratio largely determines the speed of linear ("geometric") convergence, as well as the reconstruction error $\|x^{*}-x_{\infty}\|^{2}$ at convergence. We would like the ratio $\beta/\alpha$ to be as close to 1 as possible, and we need $\beta/\alpha<2$ for the convergence guarantee to apply. It has been shown in [2] that a random matrix $A$ with orthonormal rows will satisfy this condition with high probability for $m$ roughly linear in the dimension $k$, with log factors dependent on the properties of the manifold, in this case $R(G)$. However, as we demonstrate later (see figure 4), a random matrix often will not satisfy the desired condition $\beta/\alpha<2$ for small or moderate $m$. To extend into such regimes, we next propose a fast heuristic method to find a relatively good measurement matrix for an image set $S$, given a fixed $m$.
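To see how $\beta/\alpha$ drives the bounds of Theorem 1, the short sketch below evaluates them numerically. The function name and all input values ($\alpha$, $\beta$, $\delta$, $f(x_{0})$, $C$) are hypothetical and chosen only for illustration.

```python
import math

def theorem1_bounds(alpha, beta, delta, f0, n, C=1.0):
    """Evaluate the bounds stated in Theorem 1 for step size eta = 1/beta."""
    kappa = beta / alpha
    if not kappa < 2:
        raise ValueError("Theorem 1 requires beta/alpha < 2")
    # Bound on the cost f(x_n) after n iterations.
    cost_bound = (kappa - 1) ** n * f0 + beta * delta / (2 - kappa)
    # Number of steps after which ||x_n - x*||^2 <= (C + 1/(2 alpha/beta - 1)) delta.
    steps = max(0, math.ceil(math.log(f0 / (C * alpha * delta)) / (2 - kappa)))
    # Asymptotic bound on ||x* - x_inf||^2.
    limit_error = delta / (2 * alpha / beta - 1)
    return cost_bound, steps, limit_error

cost_bound, steps, limit_error = theorem1_bounds(
    alpha=0.8, beta=1.2, delta=1e-3, f0=1.0, n=20)
print(cost_bound, steps, limit_error)
```

As $\beta/\alpha$ approaches 2, the contraction factor $\beta/\alpha-1$ approaches 1 and the denominator $2\alpha/\beta-1$ approaches 0, so both the number of steps and the final error bound blow up.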

4.2 Generator-based Measurement Matrix Design

There have been a few attempts to optimize the measurement matrix based on the specific data distribution. Hegde et al.[16] find a deterministic measurement matrix that satisfies $REC(S,1-\delta_{S},1+\delta_{S})$ for a given finite set $S$ of size $|S|$ , but their time complexity is $O(n^{3}+|S|^{2}n^{2})$ . Because the secant set $S$ (defined later) would be of cardinality $|S|=O(M^{2})$ for a training set of size $M$ , with $M\gg n$ , the time complexity would be infeasible even for fairly small $n$ -pixel images. Furthermore, the final number of required measurements $m$ , which is determined by the algorithm, depends on the isometry constant $\delta_{S}$ , and cannot be specified in advance. Kvinge et al.[18] introduced a heuristic iterative algorithm to find a measurement matrix with orthonormal rows that satisfies the REC with small $\beta/\alpha$ ratio, but their time complexity is $O\left(n^{5}\right)$ and the space complexity is $O(n^{3})$ , which is infeasible for a high-dimensional image dataset. Instead, our method, based on sampling from the secant set, has time complexity $O(Mn^{2}+n^{3})$ , and space complexity $O(n^{2})$ , where $M$ is a tiny fraction of $|S|$ .

Definition 3 (Secant Set)

The normalized secant set of $G$ is defined as follows:

$$\mathcal{S}(G)=\Big\{\frac{x_{1}-x_{2}}{\|x_{1}-x_{2}\|}:x_{1},x_{2}\in R(G)\Big\}$$

and the associated distribution is denoted as $\Pi_{S}$ , where

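$$z_{1},z_{2}\sim\mathcal{N}(0,I_{k}),\quad s=\frac{G(z_{1})-G(z_{2})}{\|G(z_{1})-G(z_{2})\|}\sim\Pi_{S} \tag{6}$$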

Given $\mathcal{S}(G)$ , the optimization over A is as follows:

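$$\min_{A\in\mathbb{R}^{m\times n}}\frac{\beta}{\alpha}=\min_{A\in\mathbb{R}^{m\times n}}\frac{\max_{s\in\mathcal{S}(G)}\|As\|^{2}}{\min_{s\in\mathcal{S}(G)}\|As\|^{2}}\leq\min_{AA^{T}=I_{m}}\frac{1}{\min_{s\in\mathcal{S}(G)}\|As\|^{2}}=\Big(\max_{AA^{T}=I_{m}}\min_{s\in\mathcal{S}(G)}\|As\|^{2}\Big)^{-1} \tag{7}$$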

The inequality is due to an additional constraint on $A:AA^{T}=I_{m}$ . This results in the largest singular value of $A$ being 1 and hence the numerator term, $\max_{s\in\mathcal{S}(G)}\left\|As\right\|^{2}$ , is at most 1. As the minimization in (7) requires iterating through the set $S$ , we use the expected value over $s\sim\Pi_{S}$ as a surrogate objective

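$$A=\operatorname*{arg\,max}_{AA^{T}=I_{m}}\mathbb{E}_{s\sim\Pi_{S}}\big[\|As\|^{2}\big]\approx\operatorname*{arg\,max}_{AA^{T}=I_{m}}\frac{1}{M}\sum_{j=1}^{M}\|As_{j}\|^{2} \tag{8}$$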

The last approximation replaces the surrogate objective by its empirical estimate obtained by sampling $M\gg n$ secants $(s_{j})_{j=1}^{M}$ according to $\Pi_{S}$ . For $m$ and $M$ large enough, this designed measurement matrix would satisfy the condition $\beta/\alpha<2$ for most of the secants in $R(G)$ . Constructing an $n\times M$ matrix $D=[s_{1}|s_{2}|\dots|s_{M}]$ , (8) reduces to:

$$A^{*}=\operatorname*{arg\,max}_{A}\|AD\|_{F}^{2}\quad\text{s.t.}\quad AA^{T}=I_{m} \tag{9}$$

The optimal $A^{*}$ in (9) has rows equal to the $m$ leading eigenvectors of $DD^{T}$. We compute $DD^{T}=\sum_{j=1}^{M}s_{j}s_{j}^{T}$ and its eigenvalue decomposition at time complexity $O(Mn^{2}+n^{3})$ and space complexity $O(n^{2})$.
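The design procedure can be sketched in NumPy as follows: sample normalized secants from the generator, accumulate $DD^{T}$, and take its $m$ leading eigenvectors as the rows of $A$. The toy generator and all dimensions below are hypothetical stand-ins, not the networks or sizes used in the experiments.

```python
import numpy as np

def sample_secants(G, k, M, rng):
    """Draw M normalized secants s = (G(z1) - G(z2)) / ||G(z1) - G(z2)||."""
    z1 = rng.normal(size=(M, k))
    z2 = rng.normal(size=(M, k))
    diff = G(z1) - G(z2)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)   # shape (M, n)

def design_measurement_matrix(secants, m):
    """Rows of A are the m leading eigenvectors of D D^T, D = [s_1 | ... | s_M]."""
    gram = secants.T @ secants                 # D D^T, an n x n matrix, O(M n^2)
    _, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order, O(n^3)
    return eigvecs[:, -m:][:, ::-1].T          # m x n with orthonormal rows

rng = np.random.default_rng(0)
n, k, m, M = 64, 4, 8, 2000
W1, W2 = rng.normal(size=(k, 16)), rng.normal(size=(16, n))
G = lambda z: np.tanh(z @ W1) @ W2             # hypothetical toy generator

S = sample_secants(G, k, M, rng)
A_designed = design_measurement_matrix(S, m)
A_random = np.linalg.qr(rng.normal(size=(n, m)))[0].T   # random matrix with orthonormal rows

energy = lambda A: float(np.mean(np.sum((S @ A.T) ** 2, axis=1)))  # (1/M) sum ||A s_j||^2
print(energy(A_designed), energy(A_random))
```

Note that the surrogate maximizes the average of $\|As\|^{2}$ over the sampled secants; it does not by itself bound the minimum, which is why the REC condition is checked empirically afterwards.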

Our approach to the design of $A$ is related to one of the steps described by [18], however by using the sampling-based estimates per (6) and (8) rather than the secant set for the entire training set, we reduce the computational cost by orders of magnitude to a modest level.

4.2.1 REC Histogram for $A$

We analyze the $REC$ conditions by plotting, in figure 4, the histogram of $\|As\|$ values for different measurement matrices $A\in\mathbb{R}^{m\times n}$, where $s\in S$, the secant set of the samples from $G$ trained on the MNIST dataset. The left column shows the histograms for the random and the $G$-based designed matrix. For random $A$, the spread of $\|As\|$ is clearly wider for few measurements $m$, resulting in $\beta/\alpha\not<2$. For the designed $A$, the histogram is more concentrated. Even with as few as $m=20$ measurements (for MNIST), the designed $A$ satisfies the sufficient condition $\beta/\alpha<2$ for convergence of the PGD algorithm, thus ensuring stable recovery. The middle column shows the histograms corresponding to the downsampling $A$ that takes the spatial averages of $f\times f$ pixel values, $f=2,3,4,5$, to generate low-resolution images. The right column shows the histograms for the inpainting $A$ that masks out a centered square of various sizes. As expected, the spread increases with more difficult recovery problems. However, for each inverse problem (defined by a matrix $A$), the ratio $\beta/\alpha$ can be estimated for, e.g., 99.9% of the samples, providing, in combination with Theorem 1, an explicit quantitative guarantee.
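One way to turn such a histogram into numbers is to take quantiles of $\|As\|^{2}$ over sampled secants as empirical estimates of $\alpha$ and $\beta$. The sketch below does this under two assumptions of ours: the "99.9% of the samples" is read as a central quantile interval, and the secants come from a hypothetical low-dimensional linear signal set rather than a trained generator.

```python
import numpy as np

def rec_constants(A, secants, coverage=0.999):
    """Empirical (alpha, beta) covering a central fraction of the unit secants."""
    energy = np.sum((secants @ A.T) ** 2, axis=1)       # ||A s||^2 per secant
    tail = (1.0 - coverage) / 2.0
    alpha, beta = np.quantile(energy, [tail, 1.0 - tail])
    return float(alpha), float(beta)

rng = np.random.default_rng(0)
n, k, M = 784, 10, 5000
basis = rng.normal(size=(k, n))                         # hypothetical signal set
diff = rng.normal(size=(M, k)) @ basis
secants = diff / np.linalg.norm(diff, axis=1, keepdims=True)

for m in (20, 200):
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))  # random Gaussian A
    alpha, beta = rec_constants(A, secants)
    print(m, round(beta / alpha, 2), beta / alpha < 2)
```

The estimated ratio can then be plugged into Theorem 1 to check whether the sufficient condition $\beta/\alpha<2$ holds for the covered fraction of secants.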

5 Experiments

Network Architecture: We implement two GAN architectures: (i) a deep convolutional GAN (DCGAN) [30] for MNIST and CelebA, and (ii) a self-attention GAN (SAGAN) [41] for the LSUN church-outdoor dataset. DCGAN builds on multiple convolution, transposed convolution, and ReLU layers, and uses batch normalization and dropout for better generalization, whereas SAGAN combines convolutions with self-attention mechanisms in both the generator and the discriminator, allowing for long-range dependency modeling to generate images with high-resolution details. For DCGAN, we used the standard adversarial loss, whereas for SAGAN, we minimized the hinge version of the adversarial loss [27]. The architecture of the model $G^{\dagger}$ is similar to that of the discriminator $D$ in the GAN and differs only in the final layer, where we add a fully-connected layer with output size equal to the latent dimension $k$. For training $G^{\dagger}$, we used the architecture shown in Fig. 3 and the multi-task loss defined in Section 3.2, while keeping the pre-trained $G$ fixed. We found that $\lambda=0.1$ in this loss gave the best performance. The noise $\nu$ used for perturbing the training images $G(z)$ follows $\mathcal{N}(0,\sigma^{2}I)$. We observed that training with a small $\sigma$ results in a projector close to the identity operator, which projects only nearby points onto $R(G)$, whereas for large $\sigma$ the projector violates idempotence. We empirically set $\sigma=1$. We then obtain a projection network $P_{G}=GG^{\dagger}$ that approximately projects images lying outside $R(G)$ onto $R(G)$. We empirically pick the latent dimension $k=100$.

The MNIST dataset [19] consists of $28\times 28$ greyscale images of digits with $50,000$ training and $10,000$ test samples. We pre-train the GAN, consisting of $4$ transposed convolution layers in $G$ and $4$ convolution layers in the discriminator $D$, using images rescaled to $[-1,1]$. We use $z\sim\mathcal{N}(0,I_{k})$ as the input of $G$. The GAN is trained using the Adam optimizer with learning rate $0.0001$ and mini-batch size $128$ for $40$ epochs. For training the pseudo-inverse of $G$, i.e., $G^{\dagger}$, we minimize the multi-task loss of Section 3.2 using samples generated from $G(z)$, with the same hyper-parameters used for the GAN.

The CelebA dataset [23] consists of more than $200,000$ celebrity images. We use the aligned and cropped version, in which each image is preprocessed to a size of $64\times 64\times 3$ and scaled to $[-1,1]$. We randomly pick $160,000$ images for training the GAN. Images from the $40,000$ held-out set are used for evaluation. The GAN consists of $5$ transposed convolution layers in $G$ and $5$ convolution layers in $D$. The GAN is trained for $35$ epochs using the Adam optimizer with learning rate $0.00015$ and mini-batch size $128$. $G^{\dagger}$ is trained in the same way as for the MNIST dataset.

The LSUN church-outdoor dataset [40] consists of more than $126,000$ cropped and aligned images of size $64\times 64\times 3$ scaled to $[-1,1]$. DCGAN generates high-resolution details using spatially local points in lower-resolution feature maps, whereas in SAGAN, details can be generated using information from many feature locations, making it a natural choice for a diverse dataset such as LSUN. The SAGAN consists of $4$ transposed convolution layers and $2$ self-attention modules at different scales in $G$, and $4$ convolution layers and $2$ self-attention modules in $D$. Each self-attention module consists of 3 convolution layers, and the modules are added at the 3rd and 4th layers of the two networks. SAGAN uses conditional batch normalization in $G$ and projection in $D$. Spectral normalization is used for the layers in both $G$ and $D$. We use the Adam optimizer with $\beta_{1}=0$ and $\beta_{2}=0.9$, learning rate $0.0001$, and mini-batch size $64$ for the GAN training. $G^{\dagger}$, which uses a self-attention mechanism similar to that of $D$, is trained on the multi-task loss of Section 3.2 using the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$, learning rate $0.001$, and mini-batch size $64$ for $100$ epochs.

We compare the performance of our algorithm on MNIST and CelebA with other GAN-prior solvers ([4, 33]) and sparsity-based methods, Lasso with discrete cosine transform (DCT) basis [34] and total variation minimization method (TVAL3) [21] for linear inverse problems, namely compressed sensing (CS), super-resolution and inpainting. For CS, we extensively evaluate the reconstruction performance with the random Gaussian and designed measurement matrices. Furthermore, we demonstrate the recovery of LSUN church-outdoor dataset images using the proposed method for the different problems in Fig. 5.

5.1 Compressed Sensing

5.1.1 Recovery with random Gaussian matrix

In this set-up, we use the same measurement matrix $A$ as [4, 33], i.e., $A_{i,j}\sim\mathcal{N}(0,1/m)$, where $m$ is the number of measurements. For MNIST, the measurement matrix is $A\in\mathbb{R}^{m\times 784}$ with $m=20,50,100,200$, whereas for CelebA, $A\in\mathbb{R}^{m\times 12288}$ with $m=200,500,1000,2000$. Figure 2 shows the recovery results for MNIST images from the test set. Our NPGD algorithm performs better than others and avoids local optima. Figure 7 shows the reconstruction of eight test images from CelebA. Our algorithm outperforms the other three methods visually, as it is able to preserve detailed facial features such as sunglasses and hair, and has accurate color tones. Figures 8(a) and 8(c) provide a quantitative comparison for MNIST and CelebA, respectively.
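For reference, the compression ratios implied by these dimensions can be computed directly from the image sizes and the values of $m$ listed above. This is plain arithmetic; the helper name `compression` is ours.

```python
def compression(n, ms):
    """Return (m, n/m, percentage of a full measurement set) for each m."""
    return [(m, n / m, 100.0 * m / n) for m in ms]

mnist = compression(28 * 28, [20, 50, 100, 200])            # n = 784
celeba = compression(64 * 64 * 3, [200, 500, 1000, 2000])   # n = 12288

for name, rows in (("MNIST", mnist), ("CelebA", celeba)):
    for m, ratio, pct in rows:
        print(f"{name}: m={m:5d}  n/m={ratio:6.1f}x  {pct:5.2f}% of n")
```

For CelebA with $m=200$ this gives a ratio of about $61\times$, i.e., roughly $1.6\%$ of a full measurement set, matching the figure quoted in the introduction.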

5.1.2 Recovery with the designed matrix

In this set-up, we use the $G$-based designed $A$ described in Section 4.2. We observe that recovery with the designed $A$ is possible with far fewer measurements $m$. This corroborates our assessment based on Figure 4 that, for smaller $m$, the designed matrix satisfies the desired REC condition with high probability for most of the secants. Figures 8(a) and 8(c) show that our algorithm consistently outperforms other approaches in terms of reconstruction error and structural similarity index (SSIM) for a random $A$. Furthermore, with the designed $A$, we obtain performance on par with the random matrix using $5\text{-}10\times$ smaller $m$. Figures 8(b) and 8(d) show the images recovered by our algorithm with the designed and a random $A$ for different $m$. Clearly, recovery with the random $A$ requires a much larger $m$ than the designed one to achieve similar performance.

5.2 Super-resolution

Super-resolution refers to recovering a high-resolution image from a single low-resolution image, often modeled as a blurred and downsampled version of the original. This is a special case of our linear measurement framework. We simulate blurring and downsampling by taking spatial averages of $f\times f$ blocks of pixel values (in RGB color space), where $f$ is the downsampling ratio. This corresponds to blurring with an $f\times f$ box impulse response, followed by downsampling. We test our algorithm with $f=2,3,4$, corresponding to $4\times$, $9\times$, and $16\times$ smaller image sizes, respectively. We note that for higher $f$, the measurement matrix $A$ may not satisfy the desired $REC(S,\alpha,\beta)$ with $\frac{\beta}{\alpha}<2$ (see Figure 4) required for convergence of our algorithm and, consequently, our theorem might not be applicable. Results for MNIST in Figures 9(a)-9(c) show that recovery performance indeed degrades with increasing $f$; however, our NPGD algorithm gives better reconstructions than Bora et al. [4].
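The box-blur-and-downsample operator can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: it assumes the image height and width are divisible by $f$ (how remainders are handled is not stated in the text), and the function name `box_downsample` is hypothetical.

```python
import numpy as np

def box_downsample(image, f):
    """Average non-overlapping f x f blocks of an (H, W, C) image."""
    h, w, c = image.shape
    if h % f or w % f:
        # Assumption of this sketch: no padding or cropping is attempted.
        raise ValueError("image height and width must be divisible by f")
    # Split each spatial axis into (block index, offset within block),
    # then average over the two offset axes.
    return image.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.random((12, 12, 3))      # toy RGB image; 12 is divisible by 2, 3 and 4
y2 = box_downsample(x, 2)        # 4x fewer pixels
y3 = box_downsample(x, 3)        # 9x fewer pixels
y4 = box_downsample(x, 4)        # 16x fewer pixels
```

Because every output pixel is a fixed weighted sum of input pixels, the operation is linear in $x$ and can be written as $y=Ax$ for a sparse matrix $A$ whose rows each hold $f^{2}$ entries equal to $1/f^{2}$.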

5.3 Inpainting

Inpainting refers to recovering the entire image from a partly occluded version. In this case, $y$ is an image with masked regions and $A$ is the linear operation applying a pixel-wise mask to the original image $x$. Again, this is a special case of linear measurements, where each measurement corresponds to an observed pixel. For experiments on the MNIST dataset, we apply a centered square mask of size $6,10,14$. Recovery results in Figures 10(a)-10(c) show that our method consistently outperforms [4] and recovers almost perfectly for mask sizes less than $10$. The results align with the $REC$ histogram for inpainting (Figure 4), which shows that for larger mask sizes, the desired $REC$ condition for guaranteed convergence may not be satisfied.
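A small NumPy sketch of the masking operator follows. It assumes that "size" means the side length of the square in pixels and that occluded pixels are set to zero in $y$; both are assumptions of this illustration, and the function name `centered_square_mask` is hypothetical.

```python
import numpy as np

def centered_square_mask(height, width, size):
    """Boolean mask: True marks an observed pixel, False an occluded one."""
    mask = np.ones((height, width), dtype=bool)
    top = (height - size) // 2
    left = (width - size) // 2
    mask[top:top + size, left:left + size] = False
    return mask

rng = np.random.default_rng(0)
x = rng.random((28, 28))             # stand-in for an MNIST image
observed_counts = {}
for size in (6, 10, 14):             # the mask sizes quoted in the text
    mask = centered_square_mask(28, 28, size)
    y = x * mask                     # A x: occluded pixels are zeroed
    observed_counts[size] = int(mask.sum())
```

In matrix form, $A$ is a diagonal 0/1 matrix acting on the flattened image; dropping its zero rows gives the equivalent view in which each measurement is one observed pixel.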

5.4 Comparison of Run-time for Recovery

Table 1 compares the run times of our network-based algorithm NPGD and other recovery algorithms. We record the average run time to recover a single image from its compressed sensing measurements, averaged over 10 different images. All three algorithms were run on the same workstation with an i7-4770K CPU, 32GB of RAM, and a GeForce Titan X GPU.
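For orientation, the structure of a projected gradient loop with a pluggable projector can be sketched as below. This is a toy sketch under stated assumptions, not the implementation that was timed: the learned network projector is replaced by an exact projector onto a random linear subspace, the measurements are noiseless, and the names `pgd_with_projector`, `project`, and `U` are hypothetical. The step size $\eta=1/\beta$ follows the choice used in the proof of Theorem 1, with $\beta$ computed here as the largest restricted eigenvalue on the toy subspace.

```python
import numpy as np

def pgd_with_projector(A, y, project, x0, eta, num_iters):
    """Projected gradient descent for y = A x with a supplied projector."""
    x = x0
    for _ in range(num_iters):
        w = x - eta * (A.T @ (A @ x - y))   # gradient step on the data-fit term
        x = project(w)                       # a single projector call per iteration
    return x

rng = np.random.default_rng(0)
n, k, m = 100, 5, 50
U, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal basis of a toy "range"
project = lambda v: U @ (U.T @ v)              # exact linear projector (stand-in only)

A = rng.normal(scale=np.sqrt(1.0 / m), size=(m, n))
x_true = U @ rng.normal(size=k)                # ground truth lies in the toy range
y = A @ x_true                                 # noiseless measurements

beta = np.linalg.norm(A @ U, ord=2) ** 2       # largest eigenvalue of (AU)^T (AU)
x_hat = pgd_with_projector(A, y, project, np.zeros(n), 1.0 / beta, 300)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

Each iteration costs two matrix-vector products plus one projector evaluation, with no inner optimization loop.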

5.5 Analysis: Error in Projector

Figure 11 illustrates the idempotence error of the projector for different $k$. Three different categories of images are tested, namely, MNIST training samples, MNIST test samples, and samples $G(z)$ generated using the pre-trained $G$. We use clean images from the three sources and plot the relative idempotence error $\|x-P_{G}(x)\|^{2}/\|x\|^{2}$. The error decreases with increasing $k$ and saturates around $k=100$. The idempotence errors for MNIST training and test samples are very close, indicating negligible generalization error. On the other hand, samples generated by $G(z)$ give much lower errors, which indicates representation error in the GAN. Thus, we expect that a more flexible generator (deeper network) will lead to a better projector on the actual dataset and hence improve performance.
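The relative idempotence error is straightforward to compute for any projector exposed as a function. The sketch below is illustrative only: it uses a linear subspace projector as a stand-in for $P_{G}$, so the numbers it produces say nothing about the learned projector, and the names `relative_idempotence_error`, `project`, and `U` are hypothetical.

```python
import numpy as np

def relative_idempotence_error(x, project):
    """Return ||x - P(x)||^2 / ||x||^2 for a single flattened image x."""
    residual = x - project(x)
    return float(residual @ residual) / float(x @ x)

rng = np.random.default_rng(0)
n, k = 784, 20
U, _ = np.linalg.qr(rng.normal(size=(n, k)))   # toy stand-in for range(G)
project = lambda v: U @ (U.T @ v)              # exact projector onto that subspace

in_range = U @ rng.normal(size=k)              # analogue of a generated sample G(z)
off_range = rng.normal(size=n)                 # analogue of a signal G cannot represent

err_in = relative_idempotence_error(in_range, project)
err_off = relative_idempotence_error(off_range, project)
```

In this toy setting, the in-range error is at the level of floating-point round-off while the off-range error is close to 1, mirroring the qualitative gap between generated samples and real images described above.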

6 Conclusion

In this work, we propose a GAN-based projection network for faster recovery in linear inverse problems. Our method demonstrates superior performance and also provides a speed-up of $60\text{-}80\times$ over existing GAN-based methods by eliminating the expensive computation of the Jacobian matrix at every iteration. We provide a theoretical bound on the reconstruction error for a moderately conditioned measurement matrix. To help design such a matrix for compressed sensing, we propose a method that enables recovery using $5\text{-}10\times$ fewer measurements than a random Gaussian matrix. Our experiments on compressed sensing, super-resolution, and inpainting demonstrate that generic linear inverse problems can be solved with the proposed method without requiring retraining. In the future, deriving a bound for the projection error $\delta$ and an associated performance guarantee is an interesting direction.

Appendix A: Proof of Theorem 1

By the assumption of $\delta$ -approximate projection,

[TABLE]

where from the gradient update step, we have

[TABLE]

Substituting $w_{t}$ into (10) yields

[TABLE]

Rearranging the terms we have

[TABLE]

where the last two inequalities follow from $REC(S,\alpha,\beta)$ . Now the LHS can be rewritten as:

[TABLE]

Combining (11) and (12), and rearranging the terms, we have:

[TABLE]

and since $\eta=1/\beta$ ,

[TABLE]

For simplicity, we substitute $\kappa=\beta/\alpha$ in the following:

[TABLE]

For convergence, we require $1\leq\kappa=\beta/\alpha<2$. When $n$ reaches $\frac{1}{2-\kappa}\log\left(\frac{f(x_{0})}{C\alpha\delta}\right)$, we have

[TABLE]

Finally, when $n\rightarrow\infty$ , we have $\left(\kappa-1\right)^{n}\frac{f(x_{0})}{\alpha}\rightarrow 0$

[TABLE]
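The iteration count stated in the proof can be checked with the elementary bound $1-u\leq e^{-u}$. The following is a sketch, assuming the natural logarithm and a transient term of the form $(\kappa-1)^{n}\frac{f(x_{0})}{\alpha}$ as in the limit statement above, with $1\leq\kappa<2$:

$$(\kappa-1)^{n}=\bigl(1-(2-\kappa)\bigr)^{n}\leq e^{-(2-\kappa)n}.$$

Hence, for $n\geq\frac{1}{2-\kappa}\log\left(\frac{f(x_{0})}{C\alpha\delta}\right)$,

$$(\kappa-1)^{n}\,\frac{f(x_{0})}{\alpha}\leq\frac{C\alpha\delta}{f(x_{0})}\cdot\frac{f(x_{0})}{\alpha}=C\delta,$$

so after this many steps the transient term is no larger than $C\delta$, the same order as the error floor set by the $\delta$-approximate projector. This matches the $O(\log(1/\delta))$ step count quoted in the abstract.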

Bibliography

[1] Jonas Adler and Ozan Öktem. Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems, 33(12):124007, 2017.
[2] Richard G Baraniuk and Michael B Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, 2009.
[3] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[4] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
[5] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
[6] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
[7] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. BM3D image denoising with shape-adaptive principal component analysis. In SPARS'09-Signal Processing with Adaptive Sparse Structured Representations, 2009.
[8] Weisheng Dong, Lei Zhang, Guangming Shi, and Xiaolin Wu. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing, 20(7):1838–1857, 2011.
