Deep Coupled-Representation Learning for Sparse Linear Inverse Problems with Side Information

Evaggelia Tsiligianni; Nikos Deligiannis

arXiv:1907.02511·cs.LG·January 8, 2020

TL;DR

This paper introduces a novel deep unfolding method that leverages side information from different modalities to improve the recovery of signals in linear inverse problems, achieving better performance with lower computational cost.

Contribution

It presents the first deep unfolding approach incorporating cross-modality side information for sparse linear inverse problems.

Findings

01 Outperforms single-modal deep learning methods without SI

02 Surpasses multimodal deep learning designs without unfolding

03 Achieves superior reconstruction quality with reduced computational complexity

Tables

Table I: Sparse approximation results (NMSE in dB).

	LISTA [14]	LeSITA	LeSITA	LeSITA	LeSITA	SITA	SITA
similarity	–	$\rho=25$	$\rho=20$	$\rho=15$	$\rho=10$	$\rho=25$	$\rho=15$
$T=3$	$-15.64$	$-21.10$	$-18.05$	$-15.54$	$-13.45$	$-2.25$	$-2.22$
$T=5$	$-21.92$	$-27.97$	$-24.67$	$-21.68$	$-18.85$	$-2.70$	$-2.65$
$T=7$	$-26.95$	$-33.54$	$-30.06$	$-26.84$	$-23.87$	$-3.00$	$-2.94$

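Table II: Reconstruction results on NIR images (PSNR in dB).

	Single-modal	Single-modal	Single-modal	Single-modal	Single-modal	Single-modal	Multi-modal	Multi-modal	Multi-modal	Multi-modal	Multi-modal	Multi-modal
	LISTA [14]	LISTA [14]	LAMP [17]	LAMP [17]	DL [12]	DL [12]	Multimodal DL	Multimodal DL	LeSITA ($\mathcal{L}_{2}^{\text{B}}$)	LeSITA ($\mathcal{L}_{2}^{\text{B}}$)	LeSITA ($\mathcal{L}_{2}^{\text{A}}$)	LeSITA ($\mathcal{L}_{2}^{\text{A}}$)
CS ratio	$0.5$	$0.25$	$0.5$	$0.25$	$0.5$	$0.25$	$0.5$	$0.25$	$0.5$	$0.25$	$0.5$	$0.25$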
country (0070)	$45.64$	$38.16$	$44.98$	$36.76$	$34.04$	$32.84$	$40.69$	$38.52$	$41.99$	$35.48$	$46.34$	$39.80$
field (0058)	$40.01$	$33.94$	$39.87$	$33.17$	$31.02$	$30.66$	$36.13$	$34.35$	$37.74$	$32.47$	$40.61$	$35.65$
forest (0058)	$37.82$	$31.69$	$37.69$	$30.80$	$28.50$	$28.28$	$33.91$	$32.05$	$35.49$	$29.97$	$38.54$	$34.03$
indoor (0056)	$37.18$	$32.05$	$37.05$	$31.17$	$29.08$	$28.85$	$33.84$	$32.42$	$35.19$	$30.72$	$37.90$	$34.93$
mountain (0055)	$54.33$	$53.53$	$56.13$	$51.96$	$51.12$	$45.26$	$54.62$	$53.20$	$55.45$	$52.25$	$56.79$	$53.74$
oldbuilding (0103)	$49.04$	$41.51$	$48.17$	$39.00$	$36.05$	$34.07$	$44.39$	$41.82$	$44.47$	$40.33$	$50.69$	$44.20$
street (0057)	$37.61$	$33.10$	$36.09$	$31.45$	$31.01$	$29.79$	$34.30$	$33.22$	$36.46$	$32.38$	$38.13$	$35.13$
urban (0102)	$40.00$	$32.88$	$38.78$	$31.95$	$29.55$	$29.22$	$35.11$	$33.28$	$36.47$	$31.69$	$39.69$	$35.05$
water (0083)	$46.79$	$42.50$	$47.10$	$41.25$	$38.43$	$35.47$	$44.21$	$42.69$	$45.24$	$41.35$	$47.69$	$43.58$
Average	$43.16$	$37.71$	$42.87$	$36.39$	$34.31$	$32.72$	$39.69$	$37.95$	$40.94$	$36.29$	$44.04$	$39.57$

Equations

$$y=\Phi x+e,\tag{1}$$

$$y=\Phi D_{x}\alpha+e,\tag{2}$$

$$\min_{\alpha}\frac{1}{2}\|\Phi D_{x}\alpha-y\|_{2}^{2}+\lambda\|\alpha\|_{1},\tag{3}$$

$$\min_{\alpha}f(\alpha)+\lambda g(\alpha),\tag{4}$$

$$\text{prox}_{\theta g}(u)=\arg\min_{v}\big\{\frac{1}{2}\|v-u\|_{2}^{2}+\theta g(v)\big\},\tag{5}$$

$$\alpha^{t}=\psi_{\theta}\big(\alpha^{t-1}-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}(F\alpha^{t-1}-y)\big),\quad\alpha^{0}=0,\tag{6}$$

$$\psi_{\theta}(u_{i})=\text{sign}(u_{i})(|u_{i}|-\theta)_{+},\quad i=1,\dots,k,\tag{7}$$

$$\alpha^{t}=\psi_{\theta}\big(S\alpha^{t-1}+Wy\big).\tag{8}$$

$$\min_{\alpha}\frac{1}{2}\|\Phi D_{x}\alpha-y\|_{2}^{2}+\lambda(\|\alpha\|_{1}+\|\alpha-w\|_{1}).\tag{9}$$

$$\xi_{\mu}(u)=\arg\min_{v}\big\{\frac{1}{2}\|v-u\|_{2}^{2}+\mu(\|v\|_{1}+\|v-w\|_{1})\big\},\tag{10}$$

$$\xi_{\mu}(u_{i})=\begin{cases}u_{i}+2\mu,&u_{i}<-2\mu,\\0,&-2\mu\leq u_{i}\leq 0,\\u_{i},&0<u_{i}<w_{i},\\w_{i},&w_{i}\leq u_{i}\leq w_{i}+2\mu,\\u_{i}-2\mu,&u_{i}>w_{i}+2\mu.\end{cases}\tag{11}$$

$$\xi_{\mu}(u_{i})=\begin{cases}u_{i}+2\mu,&u_{i}<w_{i}-2\mu,\\w_{i},&w_{i}-2\mu\leq u_{i}\leq w_{i},\\u_{i},&w_{i}<u_{i}<0,\\0,&0\leq u_{i}\leq 2\mu,\\u_{i}-2\mu,&u_{i}>2\mu.\end{cases}\tag{12}$$

$$\alpha^{t}=\xi_{\mu}\big(\alpha^{t-1}-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}(F\alpha^{t-1}-y)\big),\quad\alpha^{0}=0.\tag{13}$$

$$\alpha^{t}=\xi_{\mu}\big(Q\alpha^{t-1}+Ry\big).\tag{14}$$

$$\mathcal{L}=\sum_{j=1}^{J}\|\alpha_{(j)}-\hat{\alpha}_{(j)}\|_{2}^{2},\tag{15}$$

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{1}+\lambda_{2}\mathcal{L}_{2},\tag{16}$$

$$\xi_{\mu}(u)=\arg\min_{v}\big\{\frac{1}{2}\|v-u\|_{2}^{2}+\mu(\|v\|_{1}+\|v-w\|_{1})\big\}.\tag{10}$$

$$h(v)=\frac{1}{2}\|v-u\|_{2}^{2}+\mu(\|v\|_{1}+\|v-w\|_{1}).\tag{17}$$

$$h(v_{i})=\frac{1}{2}|v_{i}-u_{i}|^{2}+\mu(\|v_{i}\|_{1}+\|v_{i}-w_{i}\|_{1}).\tag{18}$$

$$h(v)=\frac{1}{2}(v-u)^{2}+\mu v+\mu(v-w).\tag{19}$$

$$\frac{\partial h(v)}{\partial v}=v-u+2\mu.\tag{20}$$

$$\xi_{\mu}(u)=u-2\mu,\quad u>w+2\mu.\tag{21}$$

$$h(v)=\frac{1}{2}(v-u)^{2}+\mu v+\mu(-v+w)=\frac{1}{2}(v-u)^{2}+\mu w.\tag{22}$$

$$\frac{\partial h(v)}{\partial v}=v-u.\tag{23}$$

$$\frac{\partial h(v)}{\partial v}=0\iff v=u.\tag{24}$$

$$\xi_{\mu}(u)=u,\quad 0<u<w.\tag{25}$$

$$h(v)=\frac{1}{2}(v-u)^{2}-\mu v-\mu(v-w).\tag{26}$$

$$\frac{\partial h(v)}{\partial v}=v-u-2\mu.\tag{27}$$

$$\frac{\partial h(v)}{\partial v}=0\iff v=u+2\mu.\tag{28}$$

$$\xi_{\mu}(u)=u+2\mu,\quad u<-2\mu.\tag{29}$$

Full text

Deep Coupled-Representation Learning for Sparse Linear Inverse Problems with Side Information

Evaggelia Tsiligianni and Nikos Deligiannis

Both authors are with the Department of Electronics and Informatics, Vrije Universiteit Brussel, Brussels, Belgium, and with imec, Kapeldreef 75, B-3001, Leuven, Belgium. Email: {etsiligi, ndeligia}@etrovub.be.

Abstract

In linear inverse problems, the goal is to recover a target signal from undersampled, incomplete or noisy linear measurements. Typically, the recovery relies on complex numerical optimization methods; recent approaches perform an unfolding of a numerical algorithm into a neural network form, resulting in a substantial reduction of the computational complexity. In this paper, we consider the recovery of a target signal with the aid of a correlated signal, the so-called side information (SI), and propose a deep unfolding model that incorporates SI. The proposed model is used to learn coupled representations of correlated signals from different modalities, enabling the recovery of multimodal data at a low computational cost. As such, our work introduces the first deep unfolding method with SI, which actually comes from a different modality. We apply our model to reconstruct near-infrared images from undersampled measurements given RGB images as SI. Experimental results demonstrate the superior performance of the proposed framework against single-modal deep learning methods that do not use SI, multimodal deep learning designs, and optimization algorithms.

I Introduction

Linear inverse problems arise in various signal processing domains such as computational imaging, remote sensing, seismology and astronomy, to name a few. These problems can be expressed by a linear equation of the form:

$$y=\Phi x+e,\tag{1}$$

where $x\in\mathbb{R}^{n}$ is the unknown signal, $\Phi\in\mathbb{R}^{m\times n}$ , $m\ll n$ , is a linear operator, and $y\in\mathbb{R}^{m}$ denotes the observations contaminated with noise $e\in\mathbb{R}^{m}$ . Sparsity is commonly used for the regularization of ill-posed inverse problems, leading to the so-called sparse approximation problem [1]. Compressed sensing (CS) [2] deals with the sparse recovery of linearly subsampled signals and falls in this category.

In several applications, besides the observations of the target signal, additional information from correlated signals is often available [3, 4, 5, 6, 7, 8, 9, 10]. In multimodal applications, combining information from multiple signals calls for methods that allow coupled signal representations, capturing the similarities between correlated data. To this end, coupled dictionary learning is a popular approach [8, 9, 10]; however, dictionary learning methods employ overcomplete dictionaries, resulting in computationally expensive sparse approximation problems.

Deep learning has gained a lot of momentum in solving inverse problems, often surpassing the performance of analytical approaches [11, 12, 13]. Nevertheless, neural networks have a complex structure and appear as “black boxes”; thus, understanding what the model has learned is an active research topic. Among the efforts to bridge the gap between analytical methods and deep learning is the work presented in [14], which introduced the idea of unfolding a numerical algorithm for sparse approximation into a neural network form. Several unfolding approaches [15, 16, 17] followed that of [14]. Although the primary motivation for deploying deep learning in inverse problems is the reduction of computational complexity, unfolding offers another significant benefit: the model architecture provides better insight into the inference procedure and enables the theoretical study of the network using results from sparse modelling [18, 19, 20, 15].

In this paper, we propose a deep unfolding model for the recovery of a signal with the aid of a correlated signal, the side information (SI). To the best of our knowledge, this is the first work in deep unfolding that incorporates SI. Our contribution is as follows: (i) Inspired by [14], we design a deep neural network that unfolds a proximal algorithm for sparse approximation with SI; we coin our model Learned Side Information Thresholding Algorithm (LeSITA). (ii) We use LeSITA in an autoencoder fashion to learn coupled representations of correlated signals from different modalities. (iii) We design a LeSITA-based reconstruction operator that utilizes learned SI provided by the autoencoder to enhance signal recovery.

We test our method in an example application, namely, multimodal reconstruction from CS measurements. Other inverse problems of the form (1) such as image super-resolution [21, 8] or image denoising [22] can benefit from the proposed approach. We compare our method with existing single-modal deep learning methods that do not use SI, multimodal deep learning designs, and optimization algorithms, showing its superior performance.

The paper is organized as follows. Section II provides the necessary background and reviews related work. The proposed framework is presented in Section III, followed by experimental results in Section IV. Conclusions are drawn in Section V.

II Background and Related Work

A common approach for solving problems of the form (1) with sparsity constraints is convex optimization [23]. Let us assume that the unknown $x\in\mathbb{R}^{n}$ has a sparse representation $\alpha\in\mathbb{R}^{k}$ with respect to a dictionary $D_{x}\in\mathbb{R}^{n\times k}$ , $n\leq k$ , that is, $x=D_{x}\alpha$ . Then, (1) takes the form

$$y=\Phi D_{x}\alpha+e,\tag{2}$$

and a solution can be obtained via the formulation of the $\ell_{1}$ minimization problem:

$$\min_{\alpha}\frac{1}{2}\|\Phi D_{x}\alpha-y\|_{2}^{2}+\lambda\|\alpha\|_{1},\tag{3}$$

where $\|\cdot\|_{1}$ denotes the $\ell_{1}$-norm ($\|\alpha\|_{1}=\sum_{i=1}^{k}|\alpha_{i}|$), which promotes sparse solutions, and $\lambda$ is a regularization parameter.

Numerical methods [1] proposed to solve (3) include pivoting algorithms, interior-point methods, gradient-based methods and approximate message passing (AMP) algorithms [24]. Among gradient-based methods, proximal methods are tailored to optimize an objective of the form

$$\min_{\alpha}f(\alpha)+\lambda g(\alpha),\tag{4}$$

where $f:\mathbb{R}^{k}\rightarrow\mathbb{R}$ is a convex differentiable function with a Lipschitz-continuous gradient, and $g:\mathbb{R}^{k}\rightarrow\mathbb{R}$ is convex and possibly nonsmooth [25, 26]. Their main step involves the proximal operator, defined for a function $g$ according to

$$\text{prox}_{\theta g}(u)=\arg\min_{v}\big\{\frac{1}{2}\|v-u\|_{2}^{2}+\theta g(v)\big\},\tag{5}$$

with $\theta=\frac{\lambda}{L}$ and $L>0$ an upper bound on the Lipschitz constant of $\nabla f$ . A popular proximal algorithm is the Iterative Soft Thresholding Algorithm (ISTA) [27, 28]. Let us set $F:=\Phi D_{x}$ , $F\in\mathbb{R}^{m\times k}$ in (3). At the $t$ -th iteration ISTA computes:

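$$\alpha^{t}=\psi_{\theta}\big(\alpha^{t-1}-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}(F\alpha^{t-1}-y)\big),\quad\alpha^{0}=0,\tag{6}$$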

where $\psi_{\theta}$ denotes the proximal operator [Figure 1(a)] expressed by the component-wise shrinkage function:

$$\psi_{\theta}(u_{i})=\text{sign}(u_{i})(|u_{i}|-\theta)_{+},\quad i=1,\dots,k,\tag{7}$$

with $u_{+}=\max\{u,0\}$ .
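The ISTA iteration (6) with the shrinkage function (7) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the problem sizes, the random seed, the value of `lam`, and the choice of $L$ as the squared spectral norm of $F$ are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(u, theta):
    """Component-wise shrinkage (7): sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lasso_objective(F, y, lam, alpha):
    """Value of the l1-regularised objective (3) with F = Phi @ D_x."""
    return 0.5 * np.sum((F @ alpha - y) ** 2) + lam * np.sum(np.abs(alpha))

def ista(F, y, lam, num_iters):
    """Run num_iters ISTA iterations (6), starting from alpha^0 = 0."""
    L = np.linalg.norm(F, 2) ** 2          # largest eigenvalue of F^T F
    alpha = np.zeros(F.shape[1])
    for _ in range(num_iters):
        gradient = F.T @ (F @ alpha - y)   # gradient of the quadratic term
        alpha = soft_threshold(alpha - gradient / L, lam / L)
    return alpha

# Toy problem; sizes, seed and lam are illustrative assumptions.
rng = np.random.default_rng(0)
F = rng.standard_normal((20, 50)) / np.sqrt(20)
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = F @ alpha_true
alpha_hat = ista(F, y, lam=0.05, num_iters=200)
```

With the step size $1/L$ used here, each iteration does not increase the objective (3), which is why many iterations may be needed before the estimate settles.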

To reduce the high computational cost of numerical algorithms, Gregor and LeCun [14] unfolded ISTA into a neural network referred to as LISTA. Specifically, by setting $S=I-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}F$, $W=\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}$, (6) results in

$$\alpha^{t}=\psi_{\theta}\big(S\alpha^{t-1}+Wy\big).\tag{8}$$

Considering a correspondence of every iteration with a neural network layer, a number of iterations of (8) can be implemented by a recurrent or feed forward neural network; $S$ , $W$ and $\theta$ are learnable parameters, and the proximal operator (7) acts as a nonlinear activation function. A fixed depth network allows the computation of sparse codes in a fixed amount of time. Similar unfolding methods were proposed in [15, 16, 17].
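The correspondence between layers and iterations can be checked numerically. The sketch below runs an untrained LISTA forward pass whose weights are set to $S=I-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}F$ and $W=\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}$, and compares it with the same number of ISTA iterations. In a trained network $S$, $W$ and $\theta$ would be learned; the sizes, seed and `lam` below are assumptions for illustration.

```python
import numpy as np

def soft_threshold(u, theta):
    """Shrinkage function (7), used as the activation of every layer."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lista_forward(y, S, W, theta, num_layers):
    """Apply num_layers layers of (8): alpha^t = psi_theta(S alpha^{t-1} + W y)."""
    alpha = np.zeros(S.shape[0])
    for _ in range(num_layers):
        alpha = soft_threshold(S @ alpha + W @ y, theta)
    return alpha

# Illustrative sizes and regularisation weight (assumptions).
rng = np.random.default_rng(1)
m, k, lam, T = 16, 32, 0.1, 7
F = rng.standard_normal((m, k)) / np.sqrt(m)
y = rng.standard_normal(m)

# Weights obtained from F, i.e. the network before any training.
L = np.linalg.norm(F, 2) ** 2
S = np.eye(k) - F.T @ F / L
W = F.T / L
alpha_lista = lista_forward(y, S, W, lam / L, T)

# Reference: T plain ISTA iterations (6).
alpha_ista = np.zeros(k)
for _ in range(T):
    alpha_ista = soft_threshold(alpha_ista - F.T @ (F @ alpha_ista - y) / L, lam / L)
```

At this initialisation the two outputs agree up to floating-point error; training then moves $S$, $W$ and $\theta$ away from these values so that a small, fixed number of layers gives a better code.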

III Proposed Framework

In this paper, we consider that, besides the observations of the target signal, we also have access to SI, that is, a signal $z$ correlated to the unknown $x$ . We assume that $x\in\mathbb{R}^{n}$ and $z\in\mathbb{R}^{d}$ have similar sparse representations $\alpha\in\mathbb{R}^{k}$ , $w\in\mathbb{R}^{k}$ , under dictionaries $D_{x}\in\mathbb{R}^{n\times k}$ , $D_{z}\in\mathbb{R}^{d\times k}$ , $n\leq k$ , $d\leq k$ , respectively. Specifically, we assume that $\alpha$ and $w$ are similar by means of the $\ell_{1}$ norm, that is, $\|\alpha-w\|_{1}$ is small. The condition holds for representations with partially common support and a number of similar nonzero coefficients; we refer to them as coupled sparse representations. Then, $\alpha$ can be obtained from the $\ell_{1}$ - $\ell_{1}$ minimization problem

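$$\min_{\alpha}\frac{1}{2}\|\Phi D_{x}\alpha-y\|_{2}^{2}+\lambda(\|\alpha\|_{1}+\|\alpha-w\|_{1}).\tag{9}$$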

Problem (9) has been theoretically studied in [29] and has been employed for the recovery of sequential signals in [3, 4, 5].

We can easily obtain coupled sparse representations of sequential signals that change slowly using the same sparsifying dictionary [3, 4, 5]. However, this is not the case in most multimodal applications, where, typically, finding coupled sparse representations involves dictionary learning and complex optimization methods [8, 9, 10]. In this work, we propose an efficient approach based on a novel multimodal deep unfolding model. The model is employed for learning coupled representations of the target signal and the SI (Section III-B), and for reconstruction with SI (Section III-C). Our approach is inspired by a proximal algorithm for the solution of (9).

III-A Sparse Approximation with SI via Deep Unfolding

Problem (9) is of the form (4) with $f(\alpha)=\frac{1}{2}\|F\alpha-y\|_{2}^{2}$ , $F:=\Phi D_{x}$ , $F\in\mathbb{R}^{m\times k}$ , and $g(\alpha)=\|\alpha\|_{1}+\|\alpha-w\|_{1}$ . The proximal operator for $g$ is defined by

$$\xi_{\mu}(u)=\arg\min_{v}\big\{\frac{1}{2}\|v-u\|_{2}^{2}+\mu(\|v\|_{1}+\|v-w\|_{1})\big\},\tag{10}$$

where $\mu=\frac{\lambda}{L}$ , and $L>0$ is an upper bound on the Lipschitz constant of $\nabla f$ . All terms in (10) are separable, thus, we can easily show that (see Appendix):

For $w_{i}\geq 0$ , $i=1,\dots,k$ :

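$$\xi_{\mu}(u_{i})=\begin{cases}u_{i}+2\mu,&u_{i}<-2\mu,\\0,&-2\mu\leq u_{i}\leq 0,\\u_{i},&0<u_{i}<w_{i},\\w_{i},&w_{i}\leq u_{i}\leq w_{i}+2\mu,\\u_{i}-2\mu,&u_{i}>w_{i}+2\mu.\end{cases}\tag{11}$$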

For $w_{i}<0$ , $i=1,\dots,k$ :

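$$\xi_{\mu}(u_{i})=\begin{cases}u_{i}+2\mu,&u_{i}<w_{i}-2\mu,\\w_{i},&w_{i}-2\mu\leq u_{i}\leq w_{i},\\u_{i},&w_{i}<u_{i}<0,\\0,&0\leq u_{i}\leq 2\mu,\\u_{i}-2\mu,&u_{i}>2\mu.\end{cases}\tag{12}$$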
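The two piecewise expressions for $\xi_{\mu}$ can be written as one scalar function by sorting the two kinks of the penalty, $0$ and $w_{i}$. The sketch below does this and compares the result with a brute-force grid minimisation of the objective in (10). The unified `lo`/`hi` form, the grid resolution and the test values are choices made here for illustration, not taken from the paper.

```python
import numpy as np

def xi(u, w, mu):
    """Scalar proximal operator of mu * (|v| + |v - w|), covering both signs of w."""
    lo, hi = min(0.0, w), max(0.0, w)      # the two kinks of the penalty
    if u < lo - 2 * mu:
        return u + 2 * mu
    if u <= lo:
        return lo
    if u < hi:
        return u
    if u <= hi + 2 * mu:
        return hi
    return u - 2 * mu

def prox_brute_force(u, w, mu, grid=np.linspace(-6.0, 6.0, 12001)):
    """Minimise 0.5 * (v - u)^2 + mu * (|v| + |v - w|) over a fine grid of v."""
    h = 0.5 * (grid - u) ** 2 + mu * (np.abs(grid) + np.abs(grid - w))
    return grid[np.argmin(h)]

# Largest disagreement over a range of inputs, for positive, negative and zero SI.
mu = 0.4
max_gap = max(
    abs(xi(u, w, mu) - prox_brute_force(u, w, mu))
    for w in (1.0, -1.0, 0.0)
    for u in np.linspace(-3.0, 3.0, 25)
)
```

The gap stays within the grid spacing of $10^{-3}$, which is consistent with the closed-form cases. Note the two flat regions of the operator: one at $0$ and one at $w_{i}$, so coefficients are pulled either towards zero or towards the SI value.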

Figure 1(b) depicts the graphical representation of the proximal operator given by (11). With $\nabla f(\alpha)=F^{\mkern-1.5mu\mathsf{T}}(F\alpha-y)$ , a proximal method for (9) takes the form

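$$\alpha^{t}=\xi_{\mu}\big(\alpha^{t-1}-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}(F\alpha^{t-1}-y)\big),\quad\alpha^{0}=0.\tag{13}$$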

We coin (13) Side-Information-driven iterative soft Thresholding Algorithm (SITA).

We unfold SITA into a neural network form by setting $Q=I-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}F$, $R=\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}$. Then (13) results in

$$\alpha^{t}=\xi_{\mu}\big(Q\alpha^{t-1}+Ry\big).\tag{14}$$

(14) has a similar expression to LISTA (8); however, the two algorithms involve different proximal operators (Figure 1). A fixed number of iterations of (14) can be implemented by a recurrent or feed forward neural network, with the proximal operator given by (11), (12) employed as a nonlinear activation function, which integrates the SI; $Q$ , $R$ and $\mu$ are learnable parameters. The network architecture is depicted in Figure 2.
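A forward pass through such a network can be sketched as follows. The example uses untrained weights $Q=I-\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}F$, $R=\frac{1}{L}F^{\mkern-1.5mu\mathsf{T}}$ and $\mu=\lambda/L$, so each layer coincides with one SITA iteration, and it records the value of objective (9) after every layer. The vectorised form of $\xi_{\mu}$, the sizes, the seed and the toy signals are assumptions for illustration.

```python
import numpy as np

def xi(u, w, mu):
    """Vectorised proximal operator of mu * (|v| + |v - w|), applied per component."""
    lo, hi = np.minimum(0.0, w), np.maximum(0.0, w)
    return np.where(u < lo - 2 * mu, u + 2 * mu,
           np.where(u <= lo, lo,
           np.where(u < hi, u,
           np.where(u <= hi + 2 * mu, hi, u - 2 * mu))))

def lesita_forward(y, w, Q, R, mu, num_layers):
    """Return the code after each layer of (14), starting from alpha^0 = 0."""
    alpha = np.zeros(Q.shape[0])
    codes = []
    for _ in range(num_layers):
        alpha = xi(Q @ alpha + R @ y, w, mu)   # the SI w enters through the activation
        codes.append(alpha)
    return codes

def objective(F, y, w, lam, alpha):
    """Value of the l1-l1 objective (9) with F = Phi @ D_x."""
    penalty = np.sum(np.abs(alpha)) + np.sum(np.abs(alpha - w))
    return 0.5 * np.sum((F @ alpha - y) ** 2) + lam * penalty

# Toy target code and a similar SI code (illustrative assumptions).
rng = np.random.default_rng(2)
m, k, lam, T = 16, 32, 0.1, 7
F = rng.standard_normal((m, k)) / np.sqrt(m)
alpha_true = np.zeros(k)
alpha_true[[2, 9, 20]] = [1.0, -1.5, 0.8]
w = np.zeros(k)
w[[2, 9, 25]] = [0.9, -1.2, 0.5]
y = F @ alpha_true

L = np.linalg.norm(F, 2) ** 2
Q = np.eye(k) - F.T @ F / L
R = F.T / L
codes = lesita_forward(y, w, Q, R, lam / L, T)
values = [objective(F, y, w, lam, a) for a in codes]
```

With these untrained weights the objective values are non-increasing from layer to layer, as expected from a proximal gradient method with step $1/L$; learning $Q$, $R$ and $\mu$ is what allows a few layers to reach an accuracy that SITA needs many iterations for.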

We can train the neural network using $J$ pairs of sparse codes $\{\alpha_{(j)},w_{(j)}\}_{j=1}^{J}$ corresponding to $J$ pairs of correlated signals $\{x_{(j)},z_{(j)}\}_{j=1}^{J}$ , and a loss function of the form:

$$\mathcal{L}=\sum_{j=1}^{J}\|\alpha_{(j)}-\hat{\alpha}_{(j)}\|_{2}^{2},\tag{15}$$

where $\hat{\alpha}_{(j)}$ is the output estimation. The learning results in a fast sparse approximation operator that directly maps the input observation vector $y$ to a sparse code $\alpha$ with the aid of the SI $w$ . We coin this operator Learned Side Information Thresholding Algorithm (LeSITA).

Being based on an optimization method, LeSITA can be theoretically analyzed (see [7, 18, 19, 20, 15]). We leave this analysis for future work.

III-B LeSITA Autoencoder for Coupled Representations

Instead of training using sparse codes, we can use LeSITA in an autoencoder fashion to learn coupled representations of $x$ , $z$ . By setting $\Phi$ equal to the identity matrix, (9) reduces to a sparse representation problem with SI. Then, (14) can compute a representation of $x$ according to $\alpha^{t}=\xi_{\mu}\big{(}Q\alpha^{t-1}+Rx\big{)}$ . The proposed autoencoder is depicted in Figure 3. The main branch accepts as input the target signal $x$ ( $y=x$ ). The core component is a LeSITA encoder, followed by a linear decoder performing reconstruction, i.e., $\hat{x}=D\alpha$ ; $D\in\mathbb{R}^{n\times k}$ is a trainable dictionary ( $D$ is not tied to any other weight). A second branch referred to as SINET acts as an SI encoder, performing a (possibly) nonlinear transformation of the SI. We employ LISTA (8) to incorporate sparse priors in the transformation, obtaining $w^{t}=\psi_{\theta}\big{(}Sw^{t-1}+Wz\big{)}$ , $w^{0}=0$ ; $\psi_{\theta}$ is given by (7), and $S$ , $W$ and $\theta$ are learnable parameters. The number of layers of LISTA and LeSITA may differ.

We use $J$ pairs of correlated signals $\{x_{(j)},z_{(j)}\}_{j=1}^{J}$ to train our autoencoder, and an objective function of the form:

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{1}+\lambda_{2}\mathcal{L}_{2},\tag{16}$$

where $\mathcal{L}_{1}$ is the reconstruction loss, $\mathcal{L}_{2}$ is a constraint on the latent representations, and $\lambda_{1}$ , $\lambda_{2}$ are appropriate weights. We use the $\ell_{2}$ norm as reconstruction loss, i.e., $\mathcal{L}_{1}=\sum_{j=1}^{J}\|x_{(j)}-\hat{x}_{(j)}\|_{2}^{2}$ , where $x_{(j)}$ is the $j$ -th sample of the target signal and $\hat{x}_{(j)}$ is the respective output estimation. We set $\mathcal{L}_{2}=\sum_{j=1}^{J}\|\alpha_{(j)}-w_{(j)}\|_{1}$ to promote coupled latent representations capturing the correlation between $x_{(j)}$ and $z_{(j)}$ .
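The autoencoder objective can be evaluated on a batch as sketched below: a LISTA branch (SINET) encodes the SI, a LeSITA branch encodes the target using the SI code, a linear decoder reconstructs the target, and the two loss terms are combined. The random weights, the batch and layer sizes, the number of layers and the parameter dictionary `p` are hypothetical; a real model would learn these weights by gradient descent in a deep learning framework.

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def xi(u, w, mu):
    """Vectorised proximal operator of mu * (|v| + |v - w|)."""
    lo, hi = np.minimum(0.0, w), np.maximum(0.0, w)
    return np.where(u < lo - 2 * mu, u + 2 * mu,
           np.where(u <= lo, lo,
           np.where(u < hi, u,
           np.where(u <= hi + 2 * mu, hi, u - 2 * mu))))

def autoencoder_objective(X, Z, p, lam1=0.5, lam2=0.5, T=3):
    """X: (J, n) targets, Z: (J, d) SI. Returns (lam1*L1 + lam2*L2, L1, L2)."""
    J, k = X.shape[0], p["Q"].shape[0]
    W_codes = np.zeros((J, k))
    for _ in range(T):                      # SINET branch: LISTA applied to z
        W_codes = soft_threshold(W_codes @ p["S"].T + Z @ p["W"].T, p["theta"])
    A = np.zeros((J, k))
    for _ in range(T):                      # main branch: LeSITA with y = x
        A = xi(A @ p["Q"].T + X @ p["R"].T, W_codes, p["mu"])
    X_hat = A @ p["D"].T                    # linear decoder
    loss_rec = np.sum((X - X_hat) ** 2)             # reconstruction loss L1
    loss_couple = np.sum(np.abs(A - W_codes))       # coupling loss L2
    return lam1 * loss_rec + lam2 * loss_couple, loss_rec, loss_couple

# Random stand-ins for the learnable parameters (assumptions).
rng = np.random.default_rng(3)
J, n, d, k = 8, 16, 16, 32
p = {
    "S": 0.1 * rng.standard_normal((k, k)), "W": 0.1 * rng.standard_normal((k, d)),
    "theta": 0.05,
    "Q": 0.1 * rng.standard_normal((k, k)), "R": 0.1 * rng.standard_normal((k, n)),
    "mu": 0.05,
    "D": 0.1 * rng.standard_normal((n, k)),
}
X = rng.standard_normal((J, n))
Z = X + 0.1 * rng.standard_normal((J, d))   # correlated SI; n == d in this toy case
total, loss_rec, loss_couple = autoencoder_objective(X, Z, p)
```

The coupling term only compares the two latent codes, so the SI branch is free to apply any transformation that makes its code resemble that of the target.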

III-C LeSITA for Reconstruction with SI

We propose a reconstruction operator that effectively utilizes SI for signal recovery, following the architecture of Figure 3. In the main branch, a LeSITA encoder computes a latent representation $\alpha$ of the observation vector $y$ obtained from (1), according to (14). A linear decoder performs reconstruction of the unknown signal, i.e., $\hat{x}=D\alpha$ ; $D\in\mathbb{R}^{n\times k}$ is a learnable dictionary. The role of the SINET branch is to enhance the encoding process by providing LeSITA with prior knowledge. In this task, the SINET is realized by a LISTA encoder, the weights of which are initialized with the SINET weights of the trained autoencoder (Sec. III-B). In this way, the LeSITA autoencoder is used to provide coupled sparse representations. The proposed model is trained using the $\ell_{2}$ loss function, $\mathcal{L}=\sum_{j=1}^{J}\|x_{(j)}-\hat{x}_{(j)}\|_{2}^{2}$ , with $x_{(j)}$ the $j$ -th sample of the target signal and $\hat{x}_{(j)}$ the respective model estimation.

IV Experimental results

A first set of experiments concerns the performance of the proposed LeSITA model (14) in sparse approximation using synthetic data. We generate $J=500$K pairs of sparse signals $\{\alpha_{(j)},w_{(j)}\}_{j=1}^{J}$ of length $k=256$ with $s=25$ nonzero coefficients drawn from a standard normal distribution. The sparsity level is kept fixed but the signals have varying support. The SI is generated such that $\alpha_{(j)}$ and $w_{(j)}$ share the same support $\mathcal{I}_{(j)}$ in a number of positions $\rho\leq s$, that is, $\mathcal{I}_{(j)}=\{i:w_{(j)}[i]\neq 0,\alpha_{(j)}[i]\neq 0\}$, $|\mathcal{I}_{(j)}|=\rho$, with $\alpha_{(j)}[i]$, $w_{(j)}[i]$ denoting the $i$-th coefficient of the respective signals. For $i\in\mathcal{I}_{(j)}$, we obtain $w_{(j)}[i]=\lvert\kappa\rvert\alpha_{(j)}[i]$, where $\kappa$ is drawn from a normal distribution; therefore, for $i\in\mathcal{I}_{(j)}$, the coefficients of $\alpha_{(j)}$ and $w_{(j)}$ are of the same sign; the rest are drawn from a standard normal distribution. We vary the values of $\rho$, i.e., $\rho=\{25,20,15,10\}$, to obtain different levels of similarity between $\alpha$ and $w$. A random Gaussian matrix $D_{x}\in\mathbb{R}^{128\times 256}$ is used as a sparsifying dictionary and $\Phi$ is set equal to the $128\times 128$ identity matrix. We use $5\%$ of the generated samples for validation and $10\%$ for testing.
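One way to generate a pair of codes with the described structure is sketched below. Several details are assumptions, since the text does not fix them: $\kappa$ is taken to be standard normal, the SI code is given exactly $s$ nonzeros, its non-shared nonzeros are placed outside the support of $\alpha$, and the dictionary is left unnormalised. The function name `make_pair` is hypothetical.

```python
import numpy as np

def make_pair(rng, k=256, s=25, rho=20):
    """Return (alpha, w): s-sparse codes of length k sharing rho support positions."""
    alpha = np.zeros(k)
    w = np.zeros(k)
    support = rng.choice(k, size=s, replace=False)
    alpha[support] = rng.standard_normal(s)
    # Shared positions: w = |kappa| * alpha, so the signs agree.
    shared = support[:rho]
    w[shared] = np.abs(rng.standard_normal(rho)) * alpha[shared]
    # Remaining nonzeros of w lie outside the support of alpha (assumption).
    free = np.setdiff1d(np.arange(k), support)
    extra = rng.choice(free, size=s - rho, replace=False)
    w[extra] = rng.standard_normal(s - rho)
    return alpha, w

rng = np.random.default_rng(0)
alpha, w = make_pair(rng, rho=20)

# Observation with Phi = I: y = D_x @ alpha, D_x a random Gaussian dictionary.
D_x = rng.standard_normal((128, 256))
y = D_x @ alpha
```

Setting `rho=25` makes the two supports identical, while `rho=10` leaves only 10 of the 25 positions in common.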

We design a LeSITA (14) and a LISTA (8) model to learn sparse codes of the target signal. Different instantiations of both models are realized with different numbers of layers, i.e., $T=\{3,5,7\}$. Average results are presented in Table I in terms of normalized mean square error (NMSE) in dB. When the involved signals are similar, i.e., $\rho=\{25,20\}$, LeSITA outperforms LISTA substantially. The SI has a negative effect on reconstruction when the supports differ in $40\%$ or more of the positions ($\rho\leq 15$). The results also show that deeper models deliver better accuracy. Moreover, Table I includes results for SITA (13) after $T=\{3,5,7\}$ iterations, for $\rho=\{25,15\}$. We also run (13) with the following stopping criteria: maximum number of iterations $T_{\max}=1000$, minimum error equal to the error delivered by LeSITA ($T=7$) for $\rho=\{25,20,15,10\}$. The respective average NMSE is $\{-32.35,-29.92,-26.88,-23.92\}$ dB corresponding to $\{688,375,305,308\}$ iterations (on average). The comparison shows the computational efficiency of LeSITA against SITA.
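For reference, NMSE in dB can be computed as below. The text does not spell out the exact normalisation, so this sketch assumes the common definition in which the total squared error is divided by the total energy of the reference codes.

```python
import numpy as np

def nmse_db(alpha, alpha_hat):
    """10 * log10(sum ||alpha - alpha_hat||^2 / sum ||alpha||^2) over all samples."""
    error_energy = np.sum((alpha - alpha_hat) ** 2)
    reference_energy = np.sum(alpha ** 2)
    return 10.0 * np.log10(error_energy / reference_energy)

# Two toy codes; the estimate is off by 10% in every coefficient.
alpha = np.array([[1.0, 0.0, -2.0], [0.0, 3.0, 0.0]])
value = nmse_db(alpha, 0.9 * alpha)   # approximately -20 dB
```

Under this definition, $-20$ dB means that the error energy is $1\%$ of the signal energy, and every further $10$ dB is another factor of ten.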

A second set of experiments involves real data from the EPFL dataset (https://ivrl.epfl.ch/supplementary_material/cvpr11/). The dataset contains spatially aligned pairs of near-infrared (NIR) and RGB images grouped in nine categories, e.g., “urban” and “forest”. Our goal is to reconstruct linearly subsampled NIR images (acquired as $y=\Phi x$, $\Phi\in\mathbb{R}^{m\times n}$, $m\ll n$) with the aid of RGB images. We convert the available images to grayscale and extract pairs of $16\times 16$ image patches ($n=256$), creating a dataset of $500$K samples. One image from each category is reserved for testing; in Table II, an image is identified by a code following the category name.

We design a LeSITA-based reconstruction operator in which the LeSITA and LISTA encoders each comprise $T=7$ layers, initialized with weights learned from a LeSITA autoencoder. The autoencoder model was initialized with a random Gaussian dictionary $D_{x}\in\mathbb{R}^{256\times 512}$ and trained using (16) with $\lambda_{1}=\lambda_{2}=0.5$. Besides $\mathcal{L}_{2}^{\text{A}}=\sum_{j=1}^{J}\|\alpha_{(j)}-w_{(j)}\|_{1}$, we also experiment with $\mathcal{L}_{2}^{\text{B}}=\sum_{j=1}^{J}\big(\|\alpha_{(j)}\|_{1}+\|w_{(j)}\|_{1}\big)$. For every testing image, we extract the central $256\times 256$ part and divide it into overlapping $16\times 16$ patches with a stride equal to $4$. We apply CS with different ratios ($m/n$) to NIR image patches.
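The test-time patch extraction and measurement step can be sketched as follows. The image here is synthetic, and the random Gaussian `Phi` is only a stand-in, since in the experiments the projection matrix is learned jointly with the reconstruction operator; the helper name `extract_patches` is hypothetical.

```python
import numpy as np

def extract_patches(image, size=16, stride=4):
    """Return all size x size patches taken with the given stride, one per row."""
    height, width = image.shape
    rows = range(0, height - size + 1, stride)
    cols = range(0, width - size + 1, stride)
    return np.stack([image[i:i + size, j:j + size].ravel() for i in rows for j in cols])

# Synthetic 256 x 256 crop standing in for the central part of a test image.
image = np.arange(256 * 256, dtype=float).reshape(256, 256)
patches = extract_patches(image)            # 61 * 61 = 3721 patches of length n = 256

# Compressed measurements y = Phi x for the two CS ratios m / n.
rng = np.random.default_rng(0)
n = patches.shape[1]
measurements = {}
for ratio in (0.5, 0.25):
    m = int(ratio * n)                      # m = 128 and m = 64
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random stand-in for learned Phi
    measurements[ratio] = patches @ Phi.T
```

With a stride of $4$, each $256\times 256$ crop yields $61\times 61=3721$ overlapping patches, so interior pixels are covered by several patches.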

We compare our reconstruction operator with (i) a LISTA-based [14] reconstruction operator with $T=7$ layers, (ii) a LAMP-based [17] reconstruction operator with $T=7$ layers, (iii) a deep learning (DL) model proposed in [12], and (iv) a multimodal DL model inspired by [30, 31]; note that [14], [17] and [12] do not use SI. The multimodal model consists of two encoding branches and a single decoding branch. The target and SI encodings are concatenated to obtain a shared latent representation, which is received by the decoder to estimate the target signal. Each encoding branch comprises three ReLU layers of dimension $512$. The decoding branch comprises one ReLU and one linear layer. In all experiments, the projection matrix $\Phi\in\mathbb{R}^{m\times 256}$ is jointly learned with the reconstruction operator (the model in [12] learns sparse ternary projections). Results presented in Table II in terms of peak signal-to-noise ratio (PSNR) show that LeSITA trained with $\mathcal{L}_{2}^{\text{A}}$ manages to capture the correlation between the target and the SI signals and outperforms all the other models on average.
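PSNR can be computed as in the short sketch below. The peak value of $255$ (8-bit grayscale) is an assumption, as the text does not state the intensity range used for Table II.

```python
import numpy as np

def psnr_db(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# A reconstruction that is off by one grey level everywhere has MSE = 1.
reference = np.zeros((4, 4))
estimate = np.ones((4, 4))
value = psnr_db(reference, estimate)   # equals 20 * log10(255)
```

The function is undefined for a perfect reconstruction (zero MSE), so callers may want to handle that case separately.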

V Conclusions and Future Work

We proposed a fast reconstruction operator for the recovery of an undersampled signal with the aid of SI. Our framework utilizes a novel deep learning model that produces coupled representations of correlated data, enabling the efficient use of the SI in the reconstruction of the target signal. Following design principles that rely on existing convex optimization methods allows the theoretical study of the proposed representation and reconstruction models, using sparse modelling and convex optimization theory. We will explore this research direction in our future work.

Appendix

The proximal operator for (9) has been defined in (10) as follows:

$$\xi_{\mu}(u)=\arg\min_{v}\big\{\frac{1}{2}\|v-u\|_{2}^{2}+\mu(\|v\|_{1}+\|v-w\|_{1})\big\}.\tag{10}$$

Let us set

$$h(v)=\frac{1}{2}\|v-u\|_{2}^{2}+\mu(\|v\|_{1}+\|v-w\|_{1}).\tag{17}$$

Considering that the minimization of $h(v)$ is separable, for the $i$ -th component of the vectors involved in (17), we obtain

$$h(v_{i})=\frac{1}{2}|v_{i}-u_{i}|^{2}+\mu(\|v_{i}\|_{1}+\|v_{i}-w_{i}\|_{1}).\tag{18}$$

Hereafter, we abuse the notation by omitting the index $i$ and denoting as $v$ , $u$ , $w$ the $i$ -th component of the corresponding vectors.

Let $w\geq 0$ . Then we consider the following five cases:

If $0<w<v$ then

$$h(v)=\frac{1}{2}(v-u)^{2}+\mu v+\mu(v-w).\tag{19}$$

The partial derivative with respect to $v$ is

$$\frac{\partial h(v)}{\partial v}=v-u+2\mu.\tag{20}$$

$h(v)$ is minimized at $\frac{\partial h(v)}{\partial v}=0$ , that is, $v=u-2\mu$ . For $v>w$ , we obtain $u>w+2\mu$ . Therefore,

$$\xi_{\mu}(u)=u-2\mu,\quad u>w+2\mu.\tag{21}$$

If $0<v<w$ , then

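$$h(v)=\frac{1}{2}(v-u)^{2}+\mu v+\mu(-v+w)=\frac{1}{2}(v-u)^{2}+\mu w.\tag{22}$$

$$\frac{\partial h(v)}{\partial v}=v-u.\tag{23}$$

$$\frac{\partial h(v)}{\partial v}=0\iff v=u.\tag{24}$$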

For $0<v<w$ , we obtain $0<u<w$ , thus,

$$\xi_{\mu}(u)=u,\quad 0<u<w.\tag{25}$$

If $v<0$ , then

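$$h(v)=\frac{1}{2}(v-u)^{2}-\mu v-\mu(v-w).\tag{26}$$

$$\frac{\partial h(v)}{\partial v}=v-u-2\mu.\tag{27}$$

$$\frac{\partial h(v)}{\partial v}=0\iff v=u+2\mu.\tag{28}$$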

For $v<0$ , we obtain $u+2\mu<0$ or $u<-2\mu$ , thus,

$$\xi_{\mu}(u)=u+2\mu,\quad u<-2\mu.\tag{29}$$

If $v=0$ , then

$$\frac{\partial h(v)}{\partial v}\Big|_{v=0}=-u+\mu\,\partial[|v|]-\mu,\tag{30}$$

where $\partial[\cdot]$ denotes the subgradient. Thus,

$$0\in\frac{\partial h(v)}{\partial v}\Big|_{v=0}\iff-2\mu\leq u\leq 0,\tag{31}$$

and the proximal operator is given by

$$\xi_{\mu}(u)=0,\quad-2\mu\leq u\leq 0.\tag{32}$$

If $v=w$ , then

$$\frac{\partial h(v)}{\partial v}\Big|_{v=w}=w-u+\mu+\mu\,\partial[|v-w|].\tag{33}$$

Thus,

$$0\in\frac{\partial h(v)}{\partial v}\Big|_{v=w}\iff w\leq u\leq w+2\mu,\tag{34}$$

and the proximal operator is given by

$$\xi_{\mu}(u)=w,\quad w\leq u\leq w+2\mu.\tag{35}$$

Therefore, for $w\geq 0$ , (21), (25), (29), (32), and (35) result in:

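$$\xi_{\mu}(u)=\begin{cases}u+2\mu,&u<-2\mu,\\0,&-2\mu\leq u\leq 0,\\u,&0<u<w,\\w,&w\leq u\leq w+2\mu,\\u-2\mu,&u>w+2\mu.\end{cases}\tag{36}$$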

Similarly, we calculate the proximal operator for $w<0$ .
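The $w<0$ case follows the same five-case argument. The sketch below is worked out here from the scalar objective $h(v)=\frac{1}{2}(v-u)^{2}+\mu(|v|+|v-w|)$ rather than quoted from the paper; it uses the same scalar notation and the subgradient notation $\partial[\cdot]$ introduced above, and its five results match the piecewise expression given for $w_{i}<0$ in Section III-A.

```latex
% Case w < 0: the kinks of h are at v = w and v = 0, with w < 0.
\begin{align*}
v<w:\quad & h(v)=\tfrac{1}{2}(v-u)^{2}-\mu v-\mu(v-w),
  && \tfrac{\partial h(v)}{\partial v}=v-u-2\mu=0 \;\Rightarrow\; \xi_{\mu}(u)=u+2\mu,\ u<w-2\mu,\\
v=w:\quad & 0\in w-u-\mu+\mu\,\partial[|v-w|]=[\,w-u-2\mu,\ w-u\,]
  && \Rightarrow\; \xi_{\mu}(u)=w,\ w-2\mu\leq u\leq w,\\
w<v<0:\quad & h(v)=\tfrac{1}{2}(v-u)^{2}-\mu v+\mu(v-w)=\tfrac{1}{2}(v-u)^{2}-\mu w,
  && \tfrac{\partial h(v)}{\partial v}=v-u=0 \;\Rightarrow\; \xi_{\mu}(u)=u,\ w<u<0,\\
v=0:\quad & 0\in -u+\mu+\mu\,\partial[|v|]=[\,-u,\ -u+2\mu\,]
  && \Rightarrow\; \xi_{\mu}(u)=0,\ 0\leq u\leq 2\mu,\\
v>0:\quad & h(v)=\tfrac{1}{2}(v-u)^{2}+\mu v+\mu(v-w),
  && \tfrac{\partial h(v)}{\partial v}=v-u+2\mu=0 \;\Rightarrow\; \xi_{\mu}(u)=u-2\mu,\ u>2\mu.
\end{align*}
```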

Bibliography

[1] J. A. Tropp and S. J. Wright, “Computational methods for sparse solution of linear inverse problems,” Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
[2] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[3] Y. Zhang, “On theory of compressive sensing via $\ell_{1}$ minimization: simple derivations and extensions,” Rice University, Tech. Rep., 2008.
[4] L. Weizman, Y. C. Eldar, and D. Ben Bashat, “Compressed sensing for longitudinal MRI: An adaptive-weighted approach,” Medical Physics, vol. 42, no. 9, pp. 5195–5208, 2015.
[5] J. F. C. Mota, N. Deligiannis, A. C. Sankaranarayanan, V. Cevher, and M. R. D. Rodrigues, “Dynamic sparse state estimation using $\ell_{1}-\ell_{1}$ minimization: Adaptive-rate measurement bounds, algorithms and applications,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 3332–3336.
[6] N. Vaswani and W. Lu, “Modified-CS: Modifying compressive sensing for problems with partially known support,” IEEE Transactions on Signal Processing, vol. 58, no. 9, pp. 4595–4607, 2010.
[7] A. Ma, Y. Zhou, C. Rush, D. Baron, and D. Needell, “An Approximate Message Passing Framework for Side Information,” IEEE Transactions on Signal Processing, vol. 67, no. 7, pp. 1875–1888, 2019.
[8] P. Song, J. F. Mota, N. Deligiannis, and M. R. Rodrigues, “Coupled dictionary learning for multimodal image super-resolution,” in 2016 IEEE Global Conference on Signal and Information Processing (Global SIP), 2016, pp. 162–166.