Deep Residual Autoencoders for Expectation Maximization-inspired   Dictionary Learning

Bahareh Tolooshams; Sourav Dey; Demba Ba

arXiv:1904.08827·cs.LG·October 20, 2020

Deep Residual Autoencoders for Expectation Maximization-inspired Dictionary Learning

Bahareh Tolooshams, Sourav Dey, Demba Ba

PDF

1 Repo

TL;DR

This paper presents CRsAE, a neural network architecture inspired by Expectation-Maximization for convolutional dictionary learning, demonstrating superior performance in image denoising and neural spike detection tasks.

Contribution

The introduction of CRsAE, a novel autoencoder architecture that unifies dictionary learning with neural networks and incorporates EM-inspired training for improved results.

Findings

01

Learns Gabor-like filters in image denoising.

02

Speeds up spike detection by 900x in neural data.

03

Outperforms conventional methods in accuracy and efficiency.

Abstract

We introduce a neural-network architecture, termed the constrained recurrent sparse autoencoder (CRsAE), that solves convolutional dictionary learning problems, thus establishing a link between dictionary learning and neural networks. Specifically, we leverage the interpretation of the alternating-minimization algorithm for dictionary learning as an approximate Expectation-Maximization algorithm to develop autoencoders that enable the simultaneous training of the dictionary and regularization parameter (ReLU bias). The forward pass of the encoder approximates the sufficient statistics of the E-step as the solution to a sparse coding problem, using an iterative proximal gradient algorithm called FISTA. The encoder can be interpreted either as a recurrent neural network or as a deep residual network, with two-sided ReLU non-linearities in both cases. The M-step is implemented via a…

Tables6

Table 1. TABLE I: Notation and conventions

Symbol	Description
$𝐇$	Matrix (upper-case bold letters)
$𝐡$	Vector (lower-case bold letters)
$h_{c} [i]$	$i^{th}$ element of the vector $𝐡_{c}$
$𝐳_{t}$	Vector $𝐳$ at the $t^{th}$ iteration of an iterative algorithm
$z_{t} [n]$	$n^{th}$ element of $𝐳_{t}$
$𝐇^{T}$	Transpose of $𝐇$
$𝐡^{T}$	Transpose of $𝐡$
$𝐲^{j}$	$j^{th}$ training example (window of data)
${‖ 𝐱 ‖}_{p}$	The $ℓ_{p}$ -norm of vector $𝐱$
$σ_{max} (𝐇)$	Maximum eigenvalue of the matrix $𝐇$
$𝐈$	The identity matrix
$𝕀$	The indicator function
$*$	Linear convolution
$i . i . d .$	Independent and identically distributed
$𝒩 (m, σ^{2})$	Gaussian distribution with mean $m$ and variance $σ^{2}$

Table 2. TABLE II: Comparison of classical dictionary learning, EM-based dictionary learning, and CRsAE.

	Sparse Coding ( $𝐱$ )	Dictionary Learning ( $𝐇$ )	Regularization ( $λ$ )
Dictionary learning	$\begin{matrix} 𝐱^{(l)} = \underset{𝐱}{\arg \min} \frac{1}{2 σ^{2}} {‖ 𝐲 - 𝐇^{(l - 1)} 𝐱 ‖}_{2}^{2} + λ {‖ 𝐱 ‖}_{1} \end{matrix}$	$\begin{matrix} 𝐇^{(l)} = \underset{{𝐡_{c}}_{c = 1}^{C}}{\arg \min} & \frac{1}{2} {‖ 𝐲 - {𝐇𝐱}^{(l)} ‖}_{2}^{2} \\ s.t. & {‖ 𝐡_{c} ‖}_{2} = 1 \end{matrix}$	Hyperparameter
EM	$\begin{matrix} 𝐱^{(l)} = \underset{𝐱}{\arg \max} \log P (𝐱 ∣ 𝐲, 𝐇^{(l - 1)}, λ^{(l - 1)}, σ^{2}) \end{matrix}$	$\begin{matrix} 𝐇^{(l)} = \underset{{𝐡_{c}}_{c = 1}^{C}}{\arg \min} & \frac{1}{2} {‖ 𝐲 - {𝐇𝐱}^{(l)} ‖}_{2}^{2} \\ s.t. & {‖ 𝐡_{c} ‖}_{2} = 1 \end{matrix}$	$\begin{matrix} λ^{(l)} = & \underset{λ}{\arg \min} λ ({‖ 𝐱^{(l)} ‖}_{1} + C δ) \\ - (N_{e} + r - 1) C \log λ \end{matrix}$
CRsAE	${CRsAE}_{enc.} (𝐲, 𝐡^{(l - 1)}, λ^{(l - 1)}, σ, L)$	$𝐡^{(l)}$ $-$ $= η_{𝐡} \nabla_{𝐡} ℒ_{H} (𝐲, 𝐡^{(l - 1)}, λ^{(l - 1)}, σ, L)$	$λ^{(l)}$ $-$ $= η_{λ} \nabla_{λ} ℒ_{λ} (𝐲, 𝐡^{(l - 1)}, λ^{(l - 1)}, σ, L)$

Table 3. TABLE III: Details of Datasets and Training Parameters for Experiments on 1D Neural Data.

	Simulated	Real
Length of data $T_{0}$ [min]	$17$	$4$
Firing rate of each neuron [Hz]	$30$	-
Sampling frequency $f_{s}$ [Hz]	$10, 000$	$10, 000$
Sparseness of code ${‖ 𝐱^{j} ‖}_{0}$	$3 \times$ $4$ $=$ $12$	-
$σ$	set based on SNR	0.03
# filters $C$	$4$	$2$
Filter size $K$	$18$	$35$
Length of each example $N$	$1, 000$	$60, 000$
# examples	$10, 100$	$24$
# training examples	$9, 000$	$63$ augmented
# validation examples	$1, 000$	$3$
# testing examples	$100$	-
# trainable parameters	$(18 \times$ $4)$ $+ 1$ $=$ $73$	$(35 \times$ $2)$ $+ 1$ $=$ $71$
FISTA iterations $T$	$180$	$600$
Batch size $B$	$1024$	$4$
$L$	$13.5$	$15$
$λ_{init}$	$\frac{\sqrt{2 \log (C \times N_{e})}}{σ}$	$\frac{\sqrt{2 \log (C \times N_{e})}}{σ}$
$δ$	$50$	$1, 000$
$r$	$δ λ_{init}$	$δ λ_{init}$
$𝐡^{init}$	-	k-means
$err (𝐡_{c}, 𝐡_{c}^{init})$	$- 3$ to $- 4$	-
$η_{𝐡}$	$10^{- 5}$ to $10^{- 1}$	$10^{- 5}$ to $10^{- 1}$
$η_{λ}$	$1$ to $5$	$1$ to $5$

Table 4. TABLE IV: Details of Speed Analysis

		CRsAE	Sporco	CBP
Dictionary Learning	runtime	69.27 s	$319.52$ s
Dictionary Learning	iterations	10	$89$
Spike Sorting	runtime	0.93 s		$17$ hours

Table 5. TABLE V: Details of Datasets and Training Parameters for Image Experiments.

# filters $C$	tied	untied
	$64$	$3$ $\times$ $64$ $=$ $192$
Filter size $K$	$7$ $\times$ $7$
Strides	$5$
Patch size	$128$ $\times$ $128$
# training examples	$5, 717$ VOC
# testing examples	$9$
# trainable parameters	EM or LS	hyp	EM or LS-untied
	$3, 200$	$3, 136$	$9, 472$
Encoder layers $T$	$30$
Batch size $B$	$1$
$L$	$10$
$λ_{init}$	EM	hyp	LS
	20	0.15	0.015
$𝐡^{init}$	Random Gaussian
$η_{𝐡}$	$0.01$
$η_{λ}$	EM	hyp	LS
	$0.1$	-	$0.01$
$η_{𝐡}$ decay	$0.7$
$η_{𝐡}$ step	$10$ epochs
$η_{λ}$ decay	EM	hyp	LS
	-	-	$0.7$
$η_{λ}$ step	EM	hyp	LS
	-	-	$10$ epochs

Table 6. TABLE VI: Denoising Performance of CRsAE on test images quantified by PSNR. The number of trainable parameters for each network is shown in the last column. LCSC results are from [ 14 ] .

	Lena	Barbara	Boat	Couple	Fgpr	Hill	House	Man	Peppers	# Parameters
CRsAE- $λ_{EM}$ -shallow	29.74	26.19	28.89	28.69	26.69	28.50	29.76	28.51	28.39	3,200
CRsAE- $λ_{hyp}$	30.17	27.57	29.21	28.81	27.08	28.35	30.36	28.56	29.41	3,136
CRsAE- $λ_{LS}$	30.22	27.65	29.39	29.17	27.23	28.84	30.23	28.98	29.51	3,200
CRsAE- $λ_{EM}$	30.56	28.10	29.61	29.35	27.42	29.04	30.58	29.21	29.83	3,200
CRsAE- $λ_{LS}$ -untied	31.44	28.72	30.17	29.95	27.46	29.45	31.63	29.67	30.67	9,472
CRsAE- $λ_{EM}$ -untied	31.83	28.92	30.41	30.20	27.82	29.67	32.11	29.89	30.95	9,472
CRsAE- $λ_{EM}$ -untied-msssim	32.11	28.93	30.46	30.26	27.73	29.66	32.48	29.92	31.14	9,472
LCSC	32.11	28.91	30.30	30.14	27.44	30.23	32.55	30.29	30.65	9,472

Equations103

y (t) = c = 1 \sum C i = 1 \sum N_{c} x_{c, i} h_{c} (t - τ_{c, i}) + v (t) for t \in [0, T_{0}),

y (t) = c = 1 \sum C i = 1 \sum N_{c} x_{c, i} h_{c} (t - τ_{c, i}) + v (t) for t \in [0, T_{0}),

y [n] = c = 1 \sum C i = 1 \sum N_{c} x_{c, i} h_{c} [n -

y [n] = c = 1 \sum C i = 1 \sum N_{c} x_{c, i} h_{c} [n -

for n = 1, \dots, ⌊ \frac{T _{0}}{f _{s}} ⌋, n_{c, i} = ⌊ \frac{τ _{c, i}}{f _{s}} ⌋,

y = [H_{1} ∣ \dots ∣ H_{C}] x_{1} ⋮ x_{C} + v = Hx + v,

y = [H_{1} ∣ \dots ∣ H_{C}] x_{1} ⋮ x_{C} + v = Hx + v,

y^{j} = H x^{j} + v^{j} for j = 1, \dots, J,

y^{j} = H x^{j} + v^{j} for j = 1, \dots, J,

(P_{0}) : x^{j} min ∥ x^{j} ∥_{0} s.t. H x^{j} = y^{j} .

(P_{0}) : x^{j} min ∥ x^{j} ∥_{0} s.t. H x^{j} = y^{j} .

(P_{1}) : x^{j} min ∥ x^{j} ∥_{1} s.t. H x^{j} = y^{j} .

(P_{1}) : x^{j} min ∥ x^{j} ∥_{1} s.t. H x^{j} = y^{j} .

(P_{2}) : x^{j} min ∥ x^{j} ∥_{1} s.t. ∥ y^{j} - H x^{j} ∥_{2} \leq ϵ .

(P_{2}) : x^{j} min ∥ x^{j} ∥_{1} s.t. ∥ y^{j} - H x^{j} ∥_{2} \leq ϵ .

{x^{j}}_{j = 1}^{J}, {h_{c}}_{c = 1}^{C} min j = 1 \sum J ∥ x^{j} ∥_{1} s.t.

{x^{j}}_{j = 1}^{J}, {h_{c}}_{c = 1}^{C} min j = 1 \sum J ∥ x^{j} ∥_{1} s.t.

∥ h_{c} ∥_{2} = 1 for c = 1, \dots, C,

x^{j} min \frac{1}{2} ∥ y^{j} - H x^{j} ∥_{2}^{2} + λ ∥ x^{j} ∥_{1},

x^{j} min \frac{1}{2} ∥ y^{j} - H x^{j} ∥_{2}^{2} + λ ∥ x^{j} ∥_{1},

{h_{c}}_{c = 1}^{C} min j = 1 \sum J \frac{1}{2} ∥ y^{j} - H x^{j} ∥_{2}^{2} s.t. ∥ h_{c} ∥_{2} = 1 for c = 1, \dots, C .

{h_{c}}_{c = 1}^{C} min j = 1 \sum J \frac{1}{2} ∥ y^{j} - H x^{j} ∥_{2}^{2} s.t. ∥ h_{c} ∥_{2} = 1 for c = 1, \dots, C .

y = Hx + v,

y = Hx + v,

y ∣ x, H; σ^{2} \sim N (Hx, σ^{2} I),

y ∣ x, H; σ^{2} \sim N (Hx, σ^{2} I),

P_{Y} (y ∣ x, H; σ^{2}) = \frac{1}{( 2 π σ ^{2} ) ^{\frac{N}{2}}} e^{\frac{- 1}{2 σ ^{2}} ∥ y - Hx ∥_{2}^{2}}

P_{Y} (y ∣ x, H; σ^{2}) = \frac{1}{( 2 π σ ^{2} ) ^{\frac{N}{2}}} e^{\frac{- 1}{2 σ ^{2}} ∥ y - Hx ∥_{2}^{2}}

P (x_{c} ∣ λ) = k = 1 \prod N_{e} \frac{λ}{2} e^{- λ ∣ x_{c} [k] ∣} = (\frac{λ}{2})^{N_{e}} e^{- λ ∥ x_{c} ∥_{1}}

P (x_{c} ∣ λ) = k = 1 \prod N_{e} \frac{λ}{2} e^{- λ ∣ x_{c} [k] ∣} = (\frac{λ}{2})^{N_{e}} e^{- λ ∥ x_{c} ∥_{1}}

P (λ) = \frac{δ ^{r}}{Γ ( r )} λ^{r - 1} e^{- δ λ},

P (λ) = \frac{δ ^{r}}{Γ ( r )} λ^{r - 1} e^{- δ λ},

P (x, H, λ) = c = 1 \prod C (\frac{λ}{2})^{N_{e}} e^{- λ ∥ x_{c} ∥_{1}} P (H) P (λ),

P (x, H, λ) = c = 1 \prod C (\frac{λ}{2})^{N_{e}} e^{- λ ∥ x_{c} ∥_{1}} P (H) P (λ),

lo g P (y, x, H, λ; σ^{2})

lo g P (y, x, H, λ; σ^{2})

= lo g P (y ∣ x, H, λ; σ^{2}) + lo g P (x, H, λ) + const.

= - \frac{1}{2 σ ^{2}} ∥ y - Hx ∥_{2}^{2} - λ ∥ x ∥_{1}

+ C lo g P (λ) + N_{e} C lo g λ + const.,

Q (θ ∣ θ^{(l - 1)}) = E_{x} [lo g P (y ∣ x, H, λ; σ^{2}) ∣ y, H^{(l - 1)}, λ^{(l - 1)}; σ^{2}] .

Q (θ ∣ θ^{(l - 1)}) = E_{x} [lo g P (y ∣ x, H, λ; σ^{2}) ∣ y, H^{(l - 1)}, λ^{(l - 1)}; σ^{2}] .

x^{(l)}

x^{(l)}

\displaystyle=\operatorname*{arg\,min}_{\mathbf{x}}\ \frac{1}{2\sigma^{2}}\|\mathbf{y}-\mathbf{H}^{(l\scalebox{0.75}[1.0]{$-$}1)}\mathbf{x}\|^{2}_{2}+\lambda^{(l\scalebox{0.75}[1.0]{$-$}1)}\|\mathbf{x}\|_{1}.

Q (θ ∣ θ^{(l - 1)})

Q (θ ∣ θ^{(l - 1)})

\propto - \frac{1}{2 σ ^{2}} ∥ y - H x^{(l)} ∥_{2}^{2} - λ ∥ x^{(l)} ∥_{1}

+ C lo g P (λ) + N_{e} C lo g λ .

H^{(l)} = H arg min \frac{1}{2} ∥ y - H x^{(l)} ∥_{2}^{2}

H^{(l)} = H arg min \frac{1}{2} ∥ y - H x^{(l)} ∥_{2}^{2}

λ^{(l)} = λ arg min λ (∥ x^{(l)} ∥_{1} + C δ) - (N_{e} + r - 1) C lo g λ .

λ^{(l)} = λ arg min λ (∥ x^{(l)} ∥_{1} + C δ) - (N_{e} + r - 1) C lo g λ .

x min \frac{1}{2 σ ^{2}} ∥ y - Hx ∥_{2}^{2} + λ ∥ x ∥_{1},

x min \frac{1}{2 σ ^{2}} ∥ y - Hx ∥_{2}^{2} + λ ∥ x ∥_{1},

x_{t} = S_{\frac{λ σ ^{2}}{L}} (x_{t - 1} + \frac{1}{L} H^{T} (y - H x_{t - 1})),

x_{t} = S_{\frac{λ σ ^{2}}{L}} (x_{t - 1} + \frac{1}{L} H^{T} (y - H x_{t - 1})),

(S_{b} (z))_{n}

(S_{b} (z))_{n}

= ReLU (z [n] - b) - ReLU (- z [n] - b),

CRsAE (y, h, λ, σ, L) = H CRsAE_{enc.} (y, h, λ, σ, L) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

ds2p/crsae
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

Methods*Communicated@Fast*How Do I Communicate to Expedia?

Full text

Preprint Notice:

**(c) 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. **

Deep Residual Autoencoders for Expectation Maximization-inspired Dictionary Learning

Bahareh Tolooshams, Sourav Dey, and Demba Ba B. Tolooshams and D. Ba is with the School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, 02138 USA e-mail: ([email protected]; [email protected])S. Dey is with Manifold AI, Oakland, CA e-mail:([email protected])Manuscript accepted June, 2020.

Abstract

We introduce a neural-network architecture, termed the constrained recurrent sparse autoencoder (CRsAE), that solves convolutional dictionary learning problems, thus establishing a link between dictionary learning and neural networks. Specifically, we leverage the interpretation of the alternating-minimization algorithm for dictionary learning as an approximate Expectation-Maximization algorithm to develop autoencoders that enable the simultaneous training of the dictionary and regularization parameter (ReLU bias). The forward pass of the encoder approximates the sufficient statistics of the E-step as the solution to a sparse coding problem, using an iterative proximal gradient algorithm called FISTA. The encoder can be interpreted either as a recurrent neural network or as a deep residual network, with two-sided ReLU non-linearities in both cases. The M-step is implemented via a two-stage back-propagation. The first stage relies on a linear decoder applied to the encoder and a norm-squared loss. It parallels the dictionary update step in dictionary learning. The second stage updates the regularization parameter by applying a loss function to the encoder that includes a prior on the parameter motivated by Bayesian statistics. We demonstrate in an image-denoising task that CRsAE learns Gabor-like filters, and that the EM-inspired approach for learning biases is superior to the conventional approach. In an application to recordings of electrical activity from the brain, we demonstrate that CRsAE learns realistic spike templates and speeds up the process of identifying spike times by 900x compared to algorithms based on convex optimization.

Index Terms:

Dictionary Learning, Convolutional Sparse Coding, autoencoders, Deep Residual Networks, Expectation-Maximization, Spike Sorting, Image Denoising.

I Introduction

SPARSE coding has become a popular method for extracting features from data. Sparse coding problems assume the data $\mathbf{y}$ can be represented as a sparse linear combination of columns (features) of a matrix $\mathbf{H}$ , termed a dictionary. Given the dictionary $\mathbf{H}$ , methods such as orthogonal matching pursuit [1] and basis pursuit [2] find the sparse representation. The Lasso is an alternative method to estimate the sparse code by penalizing the $\ell_{1}$ -norm of the linear regression coefficients [3]. The performance of this regression highly depends on the choice of the regularization parameter. The sparse coding problem also has connections to Bayesian statistics, where Monte-Carlo Expectation-Maximization (EM) [4] can be used for estimation of hyperparameters such as the regularization parameter [5, 6]. In this line of thought, treating the sparse codes as continuous mixtures of Gaussians makes it possible to obtain Gibbs samples from the missing data of interest, namely the sparse codes and the hidden variance of each of the Gaussians in the scale mixture. The M-step uses the Gibbs samples to approximate the intractable expectations necessary to update the hyperparameters [6]. As with most Monte-Carlo methods, the Gibbs sampling step can be computationally expensive, particularly when the sparse codes are high-dimensional.

The dictionary in the sparse coding problem can be analytical (e.g., discrete wavelet transform [7]), have random structures (e.g., Gaussian random matrix), or be learned from data for a better adaptation. In the signal processing literature, dictionary learning is the de-facto method for learning the sparse representations in an unsupervised manner. The method of optimal directions [8] and K-SVD [9] are two examples of popular dictionary learning algorithms. In convolutional dictionary learning (CDL), which has drawn attention in signal and image processing due to its ability to produce shift-invariant sparse representations, $\mathbf{H}$ has a block-Toeplitz structure consisting of the concatenation of a finite number of Toeplitz matrices, each corresponding to a convolutional filter (atom) [10, 11]. CDL is a non-convex optimization problem. A popular method to circumvent this non-convexity is to alternate between a convolutional sparse coding (CSC) step and a convolutional dictionary update step until a convergence criterion is met, a procedure referred to as alternating minimization [12]. The state-of-the-art algorithms solve the CSC step through variants of ADMM [13]. While efficient, these algorithms lack a scalable infrastructure to solve the CSC step in parallel for thousands of examples.

A recent line of work has sought to draw connections between autoencoders (AE)s and dictionary learning [14, 15, 16]. Expressing sparse coding and dictionary learning as a feed-forward neural network lets us take advantage of the parallelism offered by GPUs to speed up learning. In this line of work, the encoder defines a mapping from the input data space to a sparse vector similar to the sparse coding step, and the decoder, imitating the dictionary update step, uses a set of filters (dictionary) to reconstruct the input data from the approximated sparse codes. We showed in our previous work [17] that, unless the weights of the encoder and decoder are tied, the architectures from [14] and [15] do not, strictly speaking, perform dictionary learning. That is, for the weights of an AE to be interpretable as convolutional filters in a CDL problem, encoder and decoder weights must necessarily be tied. Besides, we showed using simulated data that the AE introduced in [14] is not able to learn the ground-truth underlying convolutional dictionary because of a) encoder and decoder weights are not tied, and b) the codes produced by the encoder are not sparse enough to enable dictionary learning. In the context of dense (as opposed to convolutional) dictionary learning, the $k$ -sparse AE [16] imposes the afore-mentioned constraint and uses a variant of iterative hard-thresholding [18] in its encoder to ensure that the encoder outputs sparse codes. The efficacy of limiting the code to be $k$ -sparse through hard-thresholding is not apparent in the convolutional case. Additionally, in all of the afore-mentioned works, a principled method for training the regularization parameters (ReLU biases) seems to be missing.

The next section, Section II, summarizes our contributions. Section III introduces notation and the generative model. We review classical CDL in Section IV. Section V discusses the EM approach to dictionary learning. We introduce a deep residual AE architecture motivated by EM in Section VI. In Section VII, we compare the EM-inspired deep residual AE to existing algorithms for CDL and apply it to spike sorting and image denoising. We conclude in Section VIII.

II Summary of Contributions

The contributions of this paper are

A deep residual AE for CDL: We propose an AE architecture for CDL (Figure 1) that is an extension of that introduced in [17]. The encoder in this architecture is a variant of residual [19] and unrolled recurrent networks [20, 15] (Figure 3).

A training algorithm inspired by EM: We extend our work from [17] and provide a prescription for simultaneously training the filters and biases from the generative model that underlies CDL. This procedure is motivated by EM and Bayesian statistics. The encoder, mimics the sparse coding step in dictionary learning, and approximates the E-step of EM. The M-step is a two-stage back-propagation procedure (Figure 2). In the first stage, we perform back-propagation through the AE, a step that parallels the dictionary update step in dictionary learning. In the second stage, we perform back-propagation through the encoder using a loss function motivated by Bayesian statistics. This last step, which allows us to train the regularization parameter (ReLU bias) (Figure 7), does not have an equivalent in dictionary learning. We show in Section VII-C that training the bias with this EM-inspired approach results in a network that performs image denoising better than one that trains it using the conventional approach in deep learning, namely by minimizing a global reconstruction loss (Table VI).

An architecture that is interpretable and robust: We demonstrate that, when trained on data simulated from a generative model with known filters, this AE architecture motivated by CDL can a) successfully learn ground-truth filters (Figures 5 and 6), and b) is robust to noise at a range of signal-to-noise ratio (SNR) (Figure 4). That is, ours is an AE architecture for which the filters are interpretable as coming from a generative model. This is corroborated by experiments on natural images, absent in [17], which show that the architecture learns Gabor-like filters (Figure 12).

An application to spike sorting: We show that the architecture can perform source separation when the support of the sources is much smaller than the observed data. We demonstrate this ability by separating the activity of neurons in recordings of electrical activity from the brain of rats (Figure 9). The encoder from the architecture performs spike sorting, namely, identifies the location of action potentials from individual neurons in the recordings (Figure 10).

A demonstration of significant computational gains: We extend our work from [17] and demonstrate that this approach, which can readily employ off-the-shelf tools for training networks on GPUs, is $5$ x faster than a state-of-the-art CDL algorithm based on convex optimization (Figure 8), and performs spike sorting on hours of raw electrophysiological recordings $900$ x faster than existing optimization-based methods (Table IV, Spike Sorting).

A neural network for image denoising: We demonstrate that our framework can successfully denoise images, and rivals state-of-the-art methods with fewer parameters. In addition, we explore the effects of encoder depth, weight sharing between encoder and decoder, as well as different methods of training biases, on denoising performance (Table VI). This application was not demonstrated in [17].

III Generative Model

III-A Notation

We follow the notational conventions summarized in Table I.

We focus on signals with a one-dimensional domain, namely time. A generalization to signals such as images or videos is simply a matter of replacing one-dimensional convolutions with multi-dimensional ones.

III-B Continuous-time Model

Consider a set of continuous-time filters $(h_{c}(t))_{c=1}^{C}$ localized in time where $C$$\in$${\mathbb{N}}^{+}$ is the number of filters. Let $y(t)$ be the continuous-time signal that is the linear mixture of time-shifted and scaled versions of $(h_{c}(t))_{c=1}^{C}$ , and formally defined as

[TABLE]

where $v(t)\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\mathcal{N}(0,\sigma^{2})$ is additive noise, $N_{c}\in{\mathbb{N}}$ is the number of appearances of filter $c$ in the signal, and $x_{c,i}\in{\mathbb{R}}$ and $\tau_{c,i}\in{\mathbb{R}}^{+}$ encode, respectively, the amplitude and position of the $i^{\text{th}}$ appearance of filter $c$ in the signal.

III-C Discrete-time Model

Assuming $y(t)$ is sampled at the rate $f_{s}$ , its discrete-time version $y[n]$ is given by

[TABLE]

where $x_{c}[n]=\sum_{i=1}^{N_{c}}x_{c,i}\delta[n-n_{c,i}]$ and $\{h_{c}[n]\}_{n=0}^{K-1}$ is the discrete-time analog of $h_{c}(t)$ , for $c=1,\ldots,C$ . We can express Eq. (2) in linear-algebraic form as follows

[TABLE]

where $\mathbf{y}$$\in{\mathbb{R}}^{M}$ , and for $c=1,\ldots,C$ , $\mathbf{x}_{c}=[x_{c}[0],\ldots,x_{c}[M-K]]^{\text{T}}\in{\mathbb{R}}^{N_{e}=M-K+1}$ , $\mathbf{h}_{c}=[h_{c}[0],\ldots,h_{c}[K-1]]^{\text{T}}\in{\mathbb{R}}^{K}$ , $\mathbf{H}_{c}\in{\mathbb{R}}^{M\times N_{e}}$ is the matrix whose columns consist of delayed versions of the filter $\mathbf{h}_{c}$ . The matrix $\mathbf{H}\in{\mathbb{R}}^{M\times N_{e}C}$ is the concatenation of convolutional operators that, in practice, we do not need to store explicitly, and $\mathbf{x}\in{\mathbb{R}}^{N_{e}C}$ .

The model in Eq. (3) has promising applications in signal and image processing, where the signal of interest is the linear superposition of translated discrete-time filters or templates. Examples are dictionary learning [10] and image representation by learned features [21]. Similar to [21], we assume that the support of the filters are much smaller than the size of the signal ( $K$$\ll$$M$$=$$\left\lfloor\frac{T_{0}}{f_{s}}\right\rfloor$ (Assumption I)). In addition, in dictionary learning [10] or sparse coding problems [22], $\mathbf{x}$ is assumed to be a sparse vector (Assumption II), $\sum_{c=1}^{C}N_{c}\ll N_{e}C$ , resulting in $\mathbf{x}_{c}$ being an $N_{c}$ -sparse vector. Both of these are plausible assumptions in spike sorting [23], where $y[n]$ is a recording of extracellular voltage from the brain, the filters represent action potentials from an ensemble of neurons, and the firing rate of the neurons is assumed to be much smaller than the duration of the recording $M$ .

In the case of recordings from an array of sensors, we assume that each of the sensors follows the model of Eq. (1). In practice, due to memory constraints, we divide $\mathbf{y}$ into $J$ windows, each of size $N$ . As we detail in Section VII, this also lets us take advantage of batch-based methods for training neural networks. Following Assumptions I & II, we let $K\ll N\ll M$ . For each example $j$ , Eq. (3) becomes

[TABLE]

where $\mathbf{H}\in{\mathbb{R}}^{N\times N_{e}C}$ , $N_{e}$$=$$N$$-$$K$$+$$1$ , and $\mathbf{x}^{j}$ is $N_{c}$ -sparse.

IV Classical Dictionary Learning

Sparse coding problems assume that the data $\mathbf{y}^{j}\in{\mathbb{R}}^{N}$ can be represented as a linear combination of a few columns of the matrix $\mathbf{H}$ , termed a dictionary. Given a set of signals $\{\mathbf{y}^{j}\}_{j=1}^{J}$ and the filters $\mathbf{H}$ , sparse coding attempts to find the sparsest representation $\mathbf{x}^{j}$ by solving the following problem [22]

[TABLE]

$(P_{0})$ is a non-convex optimization problem. Basis pursuit [2] is a convex relaxation of $(P_{0})$ that instead solves

[TABLE]

In the case of a signal $\mathbf{y}^{j}=\mathbf{H}\mathbf{x}^{j}+\mathbf{v}^{j}$ observed in the presence of bounded noise $\mathbf{v}^{j}$ such that $\|\mathbf{v}^{j}\|_{2}\leq\epsilon$ , $(P_{1})$ can be extended as follows [24]

[TABLE]

Given the dictionary, $(P_{2})$ finds the sparse codes given noisy observations. In source separation problems such as spike sorting, the filters are unknown, and so must be learned along with the codes. CDL assumes that the dictionary $\mathbf{H}\in{\mathbb{R}}^{N\times N_{e}C}$ has block-Toeplitz structure and is a linear operator that performs convolution with the filters $\{\mathbf{h}_{c}\}_{c=1}^{C}$ . The goal is to find these filters by solving an extension of $(P_{2})$ given by

[TABLE]

where the constraint on the filters is to avoid scaling ambiguities. Solving for the sparse codes $\{\mathbf{x}^{j}\}_{j=1}^{J}$ and the filters $\{\mathbf{h}_{c}\}_{c=1}^{C}$ simultaneously in Eq. (8) is a non-convex optimization problem. The alternating-minimization method solves Eq. (8) in two stages: given an initial estimate of the filters, the algorithm alternates between a CSC step to estimate sparse codes $\{\mathbf{x}^{j}\}_{j=1}^{J}$ and a dictionary update step to estimate the filters $\{\mathbf{h}_{c}\}_{c=1}^{C}$ given the newly-estimated sparse codes [12]. In the case of dense (as opposed to convolutional) dictionary, [12] shows that the alternating minimization algorithm converges to the true dictionary whenever the dictionary satisfies RIP [25].

IV-A Convolutional Sparse Coding Update

Given the filters $\mathbf{H}$ , the CSC step is separable over the $J$ examples. We can solve for the $j^{\text{th}}$ sparse code $\mathbf{x}^{j}$ using the unconstrained form of $(P_{2})$ given by

[TABLE]

where $\lambda>0$ is a regularization parameter that depends on $\epsilon$ and encourages sparsity. State-of-the-art CSC algorithms use ADMM to solve Eq. (9) [26]. At present, convex optimization algorithms, such as ADMM, cannot be deployed easily on GPUs to enable the computation of the solution to Eq. (9) for all $J$ examples in parallel. In section VI, we introduce an architecture that allows us to solve the CSC problem for all $J$ examples in parallel using GPUs.

IV-B Convolutional Dictionary Update

Given the sparse codes, the filters are updated as follows

[TABLE]

Unlike the CSC step, the dictionary update is not parallelizable over the $J$ examples, which makes it computationally expensive. The work in [10] has proposed various methods to solve Eq. (10), of which we briefly describe two. The first solves Eq. (10) using ADMM by introducing a consensus variable [10, 13]. The second takes advantage of the symmetry of the convolution and solves a problem equivalent to Eq. (10) using gradient-based methods in the DFT-domain [10]. In both of these methods for updating the filters, the regularization parameter is treated as a hyperparameter to be tuned through cross-validation. In the next section, we propose an approach inspired by Bayesian statistics and EM to estimate both the filters $\mathbf{H}$ and regularization parameter $\lambda$ .

V Expectation-Maximization-inspired Dictionary Learning and Parameter Estimation

The objective is to estimate the sparse codes, filters, and regularization parameter. We explain how this is achieved using an approximation to the EM algorithm in a Bayesian generative setting. The approximate EM algorithm motivates the AE architecture proposed in section VI that is able to learn both the filters and regularization parameter in classical CDL.

V-A Bayesian Generative Model

In the remainder of the treatment, we drop the superscript $j$ from Eq. (4) to simplify the notation. This yields

[TABLE]

where we assume $\mathbf{v}\sim\mathcal{N}(\textbf{0},\sigma^{2}\mathbf{I})$ . Hence, given the sparse codes, the filters, and $\sigma^{2}$ , the data are distributed according to a multivariate Gaussian distribution. That is

[TABLE]

resulting in the data likelihood

[TABLE]

To encourage sparsity, we assume that conditioned on $\lambda$ , each sparse code $\mathbf{x}_{c}$ , $c=1,\ldots,C$ , is drawn according to distribution whose density is the product of one-dimensional (1D) $i.i.d.$ Laplace probability density functions. That is

[TABLE]

We assume the marginal prior on the filters $P(\mathbf{H})$ is non-informative, and that $\lambda$ follows a Gamma prior

[TABLE]

where $r$ , and $\delta$ are the shape and rate parameters of the density, respectively. Hence, $E[\lambda]=\frac{r}{\delta}$ .

We are interested in the so-called complete-data likelihood, i.e., the joint distribution $P(\mathbf{y},\mathbf{x},\mathbf{H},\lambda;\sigma^{2})$ under the prior

[TABLE]

assuming $\sigma^{2}$ is known, and a non-informative (flat) prior $P(\mathbf{H})$ on $\mathbf{H}$ .

The log of the complete-data likelihood is of the form

[TABLE]

where const. contains the terms that do not depend on $\mathbf{x}$ , $\mathbf{H}$ or $\lambda$ , including the flat prior on $\mathbf{H}$ .

V-B EM Algorithm

The EM algorithm is an algorithm to maximize the marginal log-likelihood $\log{P(\mathbf{y},\mathbf{H},\lambda;\sigma^{2})}$ with respect to (w.r.t.) $\mathbf{H}$ and $\lambda$ . Because the marginal likelihood is typically not available in closed form, EM introduces the hidden variable or missing data $\mathbf{x}$ and operates instead on the complete-data log-likelihood $\log{P(\mathbf{y},\mathbf{x},\mathbf{H},\lambda;\sigma^{2})}$ . The algorithm iteratively alternates between an E-step and M-step, until convergence.

E-step: Let $\theta=(\mathbf{H},\lambda)$ . At iteration $l$ , the E-step computes the function $\mathcal{Q}(\theta|\theta^{(l-1)})$ , which is the expectation of the complete-data likelihood w.r.t. the posterior of $\mathbf{x}$ given the data $\mathbf{y}$ and the parameters $\mathbf{H}^{(l\scalebox{0.75}[1.0]{$ - $}1)}$ and $\lambda^{(l\scalebox{0.75}[1.0]{$ - $}1)}$

[TABLE]

Let $\mathbf{x}^{(l)}$ be the maximum a posteriori (MAP) estimate of $\mathbf{x}$ at iteration $l$ , i.e., the posterior mode, given by

[TABLE]

Assuming the posterior at iteration $l$ is concentrated around its mode, i.e., $P(\mathbf{x}\mid\mathbf{y},\mathbf{H}^{(l\scalebox{0.75}[1.0]{$ - $}1)},\lambda^{(l\scalebox{0.75}[1.0]{$ - $}1)};\sigma^{2})\approx\delta(\mathbf{x}-\mathbf{x}^{(l)})$ , we can approximate Eq. (18) by evaluating the complete-data log-likelihood at the posterior mode

[TABLE]

Unless otherwise stated, the mode of the posterior is the sufficient statistics of the E-step, assuming the posterior concentrates around it. In Section VI, we argue that approximate EM gives a useful interpretation of AEs.

M-step: The M-step of the EM algorithm maximizes Eq. (20) w.r.t. $\mathbf{H}$ and $\lambda$ . Hence, given $\mathbf{x}^{(l)}$ , we update $\mathbf{H}$ and $\lambda$ as follows

[TABLE]

Both objectives are convex w.r.t. to the parameter of interest, and the updates are available in closed-form. We also note that $\mathbf{H}$ is a convolutional filter, and a constraint on the norm of the filters can be enforced similar to Eq. (8).

The E-step parallels the CSC step in CDL. Eq. (21) parallels the dictionary update step, while Eq. (22) does not have an analogue in CDL. In principle, given $J$ independent examples $\mathbf{y}^{j}$ , both the CSC step and the E-step are parallelizable. However, CDL and EM lack the infrastructure that would make it possible to solve the respective steps in parallel. The interpretation of CDL as EM motivates the AE we introduce in the next section. This architecture, called constrained recurrent sparse autoencoder (CRsAE), leverages the parallelism offered by GPUs to learn $\mathbf{H}$ by gradient descent at a fraction of the time required by state-of-the-art CDL algorithms. Besides, CRsAE is able to learn $\lambda$ , unlike existing CDL algorithms.

VI Proposed Network: Constrained Recurrent Sparse autoencoders

When we first introduced CRsAE [17], it was as an interpretable AE for CDL. The original architecture, shown in Figure 1 and which assumes known $\lambda$ , consists of an encoder, a decoder, and the norm-squared loss. It performs CDL by back-propagation through the AE to learn $\mathbf{H}$ . Here, we leverage the connection between CDL and EM to extend the architecture further in such a way that both $\mathbf{H}$ and $\lambda$ can be learned.

VI-A CRsAE Architecture for EM-inspired Dictionary Learning

CRsAE consists of an encoder, a decoder and two loss functions. We learn the filters $\mathbf{H}$ and the regularization parameter $\lambda$ through a two-stage back-propagation that uses one of the loss functions to update $\mathbf{H}$ and the other to update $\lambda$ . Figure 2 shows the EM-inspired CRsAE architecture.

VI-A1 Encoder, E-step and sparse coding

Given the filters $\mathbf{H}$ and the regularization parameter $\lambda$ , the goal of the encoder is to map inputs $\mathbf{y}$ into sparse codes. The forward pass of the encoder obtains the sparse codes as the solution to the E-step (Eq. (19)), which is equivalent to a CSC problem (Eq. (9)). This optimization problem is given by

[TABLE]

where, for notational convenience, we ignore the indices $l$ and $j$ from Eqs. (19) and (9), respectively.

CSC by ISTA: The ISTA algorithm [27] is a popular iterative method to solve Eq. (23). Let $\mathbf{x}_{t-1}$ represent the sparse code after $t$$-$$1$ iterations, ISTA updates $\mathbf{x}_{t-1}$ using the proximal gradient mapping

[TABLE]

where the constant $L\geq\sigma_{\text{max}}(\mathbf{H}^{\text{T}}\mathbf{H})$ , and the operator, $\mathcal{S}_{b}:{\mathbb{R}}^{N_{e}C}\to{\mathbb{R}}^{N_{e}C}$ is an element-wise two-sided ReLU

[TABLE]

also known as shrinkage or soft-thresholding in the signal processing literature [2]. The function $\textrm{sgn}(\cdot)$ applies element-wise to its input, and equals $1$ if the entry is positive, $-1$ if it is negative, and [math] otherwise. As detailed in [11], the ReLU arises because of the regularization on the $\ell_{1}$ -norm of $\mathbf{x}$ in Eq. (23). The ReLU is two-sided as the entries of $\mathbf{x}$ can be negative. When the entries cannot be negative, the operator becomes the ReLU nonlinearity commonly used in neural networks. By unfolding $T$ iterations of ISTA, we would obtain a neural network with ReLU nonlinearities that computes sparse codes given an input $\mathbf{y}$ . We build the encoder of CRsAE by unfolding $T$ iterations of FISTA [28], fast ISTA, to solve Eq. (23) (or equivalently the E-step of Eq. (19)). The sparsity of the encoder’s output is a consequence of the depth $T$ of the encoder and the shrinkage operator. For convolutional dictionaries, where the columns of $\mathbf{H}$ are highly correlated, the encoder must be deep to result in a sparse code at its output.

Fast ISTA: FISTA is an extension of ISTA that introduces a momentum term to accelerate convergence. Let $\mathbf{x}_{-1}=\mathbf{x}_{0}$ , and let the “state” vector $\mathbf{z}_{t}=\begin{bmatrix}\mathbf{z}_{t}^{(1)}\ \mathbf{z}_{t}^{(2)}\end{bmatrix}^{\text{T}}=\begin{bmatrix}\mathbf{x}_{t}\ \mathbf{x}_{t-1}\end{bmatrix}^{\text{T}}$ . Algorithm 1 shows the encoder of CRsAE. It implements the FISTA algorithm. The constant $L$ is as defined previously, and $\mathbf{x}_{t}$ represents the sparse code after $t$ iterations of FISTA.

Recurrent encoder by unfolding FISTA: Algorithm 1 defines a recurrence relation $\mathbf{z}_{t}=f(\mathbf{y},\mathbf{z}_{t-1},\mathbf{H})$ where $f(.)$ represents the operation in a single FISTA iteration. The encoder performs deep unfolding[29] and is built by unfolding this recurrent relation $T$ times, as depicted in the dashed box from Figure 1. Each step in the recurrence employs linear operators $\mathbf{H}$ and $\mathbf{H}^{\text{T}}$ and the non-linear shrinkage operator $\mathcal{S}_{b}$ (two-sided ReLU). In our convolutional setting, $\mathbf{H}:{\mathbb{R}}^{N_{e}C}\to{\mathbb{R}}^{N}$ is a linear mapping from encoded sparse codes space to the data space performing a sum of convolutions as in Eq. (4). $\mathbf{H}^{\textrm{T}}:{\mathbb{R}}^{N}\to{\mathbb{R}}^{N_{e}C}$ maps the input data into sparse code space while computing correlation between its input and filters $(\mathbf{h}_{c})_{c=1}^{C}$ . The output of the encoder is the code at the last iteration, namely $\mathbf{x}_{T}=\mathbf{z}_{T}^{(1)}$ . Similar to the network in [15], all the time steps of this recurrent neural network share the same input $\mathbf{y}$ , unlike classical recurrent architectures in which the input is fed in a sequence. This recurrent behaviour, shown as a for loop in Algorithm 1, allows us to produce approximately sparse codes, which are essential for dictionary learning [12].

Encoder and ResNet: The encoder in CRsAE resembles a deep residual network [19]. In fact, when unfolded, both ISTA and FISTA can be interpreted as variants of deep residual networks as also noted in [30]. Figure 3(a) shows one block of a residual network. As depicted in the figure, ISTA (Figure 3(b)) and FISTA (Figure 3(c)) each imposes a specific structure on the form of the residual mapping $\mathcal{F}(\mathbf{x})$ .

VI-A2 Decoder, M-step and loss functions

The M-step of EM from Section V dictates the form of the decoder, and the loss functions to update the trainable parameters. Specifically, the M-step suggests a two-stage back-propagation procedure (Figure 2) for updating $\mathbf{H}$ and $\lambda$ that is the subject of the next sub-section. Here, we argue that the two-stage training is the natural consequence of using Eqs. (21) and (22) to augment the encoder described previously.

M-step/ $\mathbf{H}$ update: The objective from Eq. (21) motivates a linear decoder, which applies $\mathbf{H}$ to the output of the encoder to reconstruct $\mathbf{y}$ , and the norm-squared as the loss to apply to the output of the decoder. The weights of the encoder and decoder are tied to each other. Hence, one branch of the AE, the one that will be used to update $\mathbf{H}$ , is constructed as

[TABLE]

In Figure 1, we refer to $\mathbf{z}_{T}^{(1)}$ as the sparse code (output of the encoder). Constraining the encoder and decoder makes the filters interpretable in terms of the model in Eq. (2) and also the dictionary learning optimization problem in Eq. (8). Moreover, it reduces the number of trainable parameters by a factor of $\frac{1}{3}$ compared to the AE in [14]. Eq. (21) suggests the following loss function, similar to [31, 16], to learn $\mathbf{H}$ :

[TABLE]

M-step/ $\lambda$ update: The objective function from Eq. (22) suggests that we use the loss function from Eq. (22) to update $\lambda$ . Therefore, we use the following to learn $\lambda$ :

[TABLE]

Together, Algorithm 1, Eqs. (26), (27) and (28) fully specify the CRsAE architecture and the losses we use to train it. The structure of CRsAE resembles that of a variational autoencoder (VAE) [32, 33]. VAEs are probabilistic AEs. The output of the encoder from a VAE is an approximate sample from the posterior distribution of the latent (e.g., $\mathbf{x}$ ) given the data. Section VIII discusses this connection and what a CRsAE-like VAE would resemble.

VI-B Back-propagation

We train CRsAE by a two-stage back-propagation procedure. Figure 2 illustrates the steps involved in this procedure. Minimizing the losses in this procedure corresponds to maximizing the expectation of the log complete-data likelihood using the approximate sufficient statistics, i.e., the MAP estimate of the codes, implicitly computed in the forward pass of CRsAE.

In the first stage, we perform back-propagation through the AE using the loss function from Eq. (27). Algorithm 2 is the back-propagation algorithm for computing the gradient of the loss function in Eq. (27) w.r.t. $\mathbf{H}$ . The output of this algorithm is the gradient of the loss function in Eq. (27), denoted by $\delta\cdot$ [20], w.r.t. the set of filters $\{\mathbf{h}_{c}\}_{c=1}^{C}$ shared across all layers. The number of trainable parameters is only $K\times C$ and is equal to the number of filter parameters in one layer of the network.

In the second stage, we perform back-propagation through the encoder using the loss function from Eq. (28). This latter stage is unique to our approach and does not have an equivalent either in dictionary learning, where $\lambda$ is typically not trained, or in deep learning. The back-propagation algorithm for computing the gradient of the loss function in Eq. (28) w.r.t. $\lambda$ is given in Algorithm 3. The term $-(N_{e}+r-1)C\log{\lambda}$ plays an important role in the successful estimation of $\lambda$ as it prevents its convergence to zero during learning. In other words, training $\lambda$ naïvely through back-propagation by only minimizing the first component of Eq. (28) may result in convergence of $\lambda$ to zero. If $\lambda$ were to converge to zero, the encoder would fail to produce sparse codes. This implies that, in the absence of the second term from Eq. (28), if the AE were fed simulated data generated by known filters $\mathbf{H}$ and sparse codes, the overall two-stage procedure would fail to learn $\mathbf{H}$ because the success of dictionary learning relies on the encoder producing sparse codes [12].

The derivations for Algorithms 2 and 3 are provided in detail in the Appendix. Algorithm 4 summarizes the two-stage back-propagation procedure, which, to our knowledge, is novel. Any gradient-based method can be used for training. For simplicity, Algorithm 4 uses gradient descent to learn the parameters. Table II summarizes the parallel between the steps in classical CDL, EM-inspired CDL, and CRsAE.

VI-C Training

We trained CRsAE through back-propagation on a GPU (NVIDIA Tesla V100) provided by AWS, using the ADAM optimizer [34]. When a validation set is available, we pick the parameters that minimize the validation loss over all epochs.

**Gamma hyper-prior parameters **( $\delta$ , r): Given a dense dictionary, a suggested value for $\lambda$ motivated by theory [2] is $\frac{\sqrt{2\log(C\times N_{e})}}{\sigma}$ . We set the parameters of the gamma prior on $\lambda$ such that the prior is centered around the suggested estimate. In CDL, we do not have access to the true filters, which results in an additional source of noise: $\mathbf{y}=\mathbf{H}_{0}\mathbf{x}+\underbrace{(\mathbf{H}-\mathbf{H}_{0})\mathbf{x}+\mathbf{v}}_{\mathbf{v}_{0}}$ where $\mathbf{v}_{0}$ contains observation noise and noise from the lack of knowledge of $\mathbf{H}$ . Hence, it is reasonable to expect $\lambda$ to deviate from the suggested estimate during training. To allow for this, we chose the $\delta$ such that the distribution is wide enough for $\lambda$ to deviate from its mean during training while maintaining stability by avoiding the convergence of $\lambda$ to 0 or $\infty$ . For the simulated data from Section VII-A, $\sigma$ is known. For the real data from Section VII-B, we can estimate $\sigma$ from “silent” periods in the signal.

Encoder hyperparameters ( $L$ , $T$ ): We choose $L$ to be greater than $\sigma_{\text{max}}(\mathbf{H}^{\textrm{T}}\mathbf{H})$ [28]. For neural data, we can estimate $L$ from an existing collection of action potentials. The value of $T$ does not affect the trainable parameters as the encoder is a recurrent network with shared parameters. However, for the encoder to produce sparse codes, $T$ should be large.

Optimizer hyperparameters ( $\eta_{\mathbf{h}}$ , $\eta_{\lambda}$ , $B$ ): The learning rates ( $\eta_{\mathbf{h}}$ for $\mathbf{h}$ and $\eta_{\lambda}$ for $\lambda$ ) of the optimizer depend on the smoothness of the loss function, which is a function of the model and dataset. For the filters, we found an optimal $\eta_{\mathbf{h}}$ range by varying it from $10^{-5}$ to $10^{-1}$ while monitoring the validation loss. As suggested in [35], in our 1D experiments, we picked the optimal $\eta_{\mathbf{h}}$ as the one that leads to the sharpest drop in the validation loss. We tuned the $\eta_{\lambda}$ to a value in the range between $0.1$ and $5$ . We chose $\eta_{\lambda}$ to be large so that the effective learning rate for the bias $b=\lambda\sigma^{2}/L$ is appropriate for training, i.e., not too small. In addition, choosing a larger $\eta_{\lambda}$ compared to $\eta_{\mathbf{h}}$ allows the value of $\lambda$ to stabilize faster than $\mathbf{H}$ within each epoch. As discussed in Section VII, we found that this behavior was crucial in allowing the training to converge to the ground-truth filters $\mathbf{H}$ .

The batch size $B$ corresponds to the number of examples used in every gradient update step of back-propagation. For the time-series dataset consisting of recordings of extracellular voltage from neurons, we found that it was important to relate $B$ to the expected number of spikes in each batch. We can estimate the expected number of spikes in a window, given its length $N$ and the firing rate of neurons. In turn, this calculation yields the expected number of spikes in each batch. In our experiments, we chose $B$ so as to have enough spikes in each mini-batch but still have a large number of batches to take advantage of stochasticity during training.

Augmentation: We used two forms of augmentation for 1D experiments: flipping and circular rotation, which do not change the generative model (Eq. (2)), but are useful as the training is done through mini-batch gradient descent on noisy data [36]. In our model, augmentation by flipping changes the sign of the sparse codes and does not increase the complexity of the model, as the encoder uses two-sided shrinkage (Eq. (25)), i.e., as opposed to ReLU [37]. In augmentation by circular rotation, the sparse code positions are shifted. This augmentation is done by circularly rotating each example by an integer delay randomly generated between $1$ to $N$ .

VII Experiments

In this section, we apply CRsAE111https://github.com/ds2p/crsae to two different 1D datasets and one two-dimensional (2D) dataset. The first 1D dataset consists of simulated recordings of extracellular voltage from neurons with known ground-truth filters. We use this example to compare the ability of CRsAE to learn, in an unsupervised manner, the ground-truth filters to the AE called learned convolutional sparse coding (LCSC) from [14]. The encoder of LCSC uses $3$ ISTA iterations. Unlike in CRsAE, the decoder is unconstrained, i.e., it is not tied to the encoder. In LCSC, each filter is associated with its own regularization parameter. The overall objective of LCSC is to minimize the least-squares (LS) reconstruction loss from input to output. We also use the simulated example to compare the computational efficiency of CRsAE to the state-of-art optimization-based CDL algorithm from [10], which is implemented in the Sporco library [38] (we refer to this method by Sporco). Sporco solves the CSC update through ADMM [26], and solves the dictionary update by an accelerated proximal gradient method.

The second 1D dataset consists of recordings of extracellular voltage from neurons in the Hippocampus of a rat [39]. We use this dataset to compare CRsAE to continuous basis pursuit (CBP) [40], the state-of-the-art optimization-based algorithm for spike sorting, which is the process of identifying the location of action potentials (sparse codes) and their corresponding neurons in extracellular recordings.

The third dataset is a dataset of natural images. We use it to compare CRsAE to LCSC on an image denoising task. In this task, we study the effect of encoder depth, weight sharing between encoder and decoder, as well as the method for training the regularization parameter, on denoising performance.

VII-A Simulated Data

In this experiment, we demonstrate the ability of CRsAE to recover, in an unsupervised setting, filters that appear in a 1D dataset. We use the following standard error [12]

[TABLE]

where $\hat{\mathbf{h}}_{c}$ is one of the learned filters, to compare the learned filters to the true ones. This error ranges from [math] to $-\infty$ . The closer the learned filter to the true one, the smaller $\text{err}(\mathbf{h}_{c},\hat{\mathbf{h}}_{c})$ . We characterize the sensitivity of CRsAE to noise by computing this error for a range of SNR values.

Simulated extra-cellular data: We simulated a recording (Eq. (1)) of length approximately $T_{0}=17$ minutes consisting of the sum electrical-voltage activity from $C=4$ neurons, each with an average firing rate of $30$ Hz. The unit of the recording was in mV. Consistent with the biophysics of neurons [23], we picked filters (action potentials) of length $1.8$ ms. We chose the amplitudes $\{x_{c,i}\}_{i=1}^{N_{c}}$ and times of occurrence $\{\tau_{c,i}\}_{i=1}^{N_{c}}$ of the filters independently for each neuron, respectively, according to a $\mathcal{N}(180,30)$ distribution and a Poisson process with rate $30$ Hz. The cross-correlation between the filters ranged from $0.5$ to $0.95$ .

We divided the voltage recording into windows of length $100$ ms, resulting in a total number $J=10{,}100$ windows (examples). Assuming a sampling rate of $f_{s}=10$ kHz, the length of each window was $N=1{,}000$ , and each filter was $K=18$ samples long. Therefore, $\mathbf{y}^{j}\in{\mathbb{R}}^{1,000}$ and $\mathbf{h}_{c}\in{\mathbb{R}}^{18}$ . For each example $j$ , each of the vectors $\{\mathbf{x}_{c}^{j}\}_{c=1}^{C}$ was $3$ -sparse. Because of the refractory period of neurons, whose length is on the same order as the filter length, occurrences of filters from a given neuron cannot overlap within the signal. This implies that, in each sparse code $\mathbf{x}_{c}^{j}$ , the non-zero entries were at least $K$ samples apart. Filters from different neurons were allowed to overlap [23]. We added Gaussian noise $\mathbf{v}^{j}\sim\mathcal{N}(\textbf{0},\sigma^{2}\mathbf{I})$ to each example $\mathbf{y}^{j}$ , where the noise variance $\sigma^{2}$ was set according to a specified SNR. Finally, as is standard practice [41], we normalized the dataset by its maximum absolute value. Note that this scaling did not affect the generative model, as all the examples were re-scaled by the same constant. The first column of Table III summarizes the data and training parameters of the simulation. The depth of the network is $T+1=181$ layers, and the number of channels at every layer is $C=4$ , the number of neurons. The width of each channel is $N-K+1=983$ .

SNR analysis: We simulated data at four different equally-spaced SNRs in a log-scale ranging from $7$ to $16$ dB. We trained CRsAE and computed the measure from Eq. (29), which quantifies its ability to recover the filters used in the simulation, as a function of SNR. For training, we initialized the filters $\{\mathbf{h}^{\text{init}}_{c}\}_{c=1}^{C}$ by adding random Gaussian noise to the true filters $\{\mathbf{h}_{c}\}_{c=1}^{C}$ such that $\text{err}(\mathbf{h}_{c},{\mathbf{h}}^{\text{init}}_{c})$ was between $-3$ to $-4$ for all the filters. Figure 4 depicts $\text{err}(\mathbf{h}_{c},\hat{\mathbf{h}}_{c})$ as a function of SNR. The error was averaged over $20$ independent experiments and is only shown for $c=4$ as the curves for all the other filters were similar. The vertical line presents the variance of the error, where we can see that the variance of CRsAE error is very small. The figure demonstrates that, compared to LCSC, CRsAE is able to recover the underlying filters and does so in a manner that is robust to noise. That is, CRsAE performs well for various level of the SNR and, as SNR increases, CRsAE learns filters that are closer to the true underlying ones. Interestingly, in spite of the fact that LCSC was able to reconstruct the noisy data very well, it failed to learn the underlying filters. Moreover, the filters that LCSC learned rarely correlated with the true filters as measured by Eq. (29). CRsAE, on the other hand, is able to perform denoising and learn the true filters. We attribute the success of CRsAE to two factors. First, it is well-known in the dictionary-learning literature [12] that, given a noisy initial dictionary, the success of dictionary learning depends on the ability of the sparse coding step to produce sparse codes that are close to those that generated the examples. In CRsAE, we unfold for $T=180$ iterations, compared to LCSC that only ran on the order of $<10$ iterations of ISTA. That’s why the CRsAE encoder yielded sparse codes close to the true ones, while LCSC did not. Second, the fact that the encoder and decoder weights are tied in CRsAE means that it has fewer parameters to learn than LCSC with the same amount of data.

Dictionary learning: Figure 5 shows a plot of the distance error from Eq. (29) as a function of the number of epochs for one of the experiments from Figure 4 with $16$ dB SNR. After $3$ to $4$ epochs, the distance between all learned and true filters decreases to less than $-14$ . The learned dictionary corresponds to epoch 8, where the validation loss is minimized. The fast convergence of CRsAE is likely due to a) the low number of parameters to learn due to the sharing of parameters between the encoder and the decoder, and b) a large number of training examples. In the figure, the fact that the error increases slightly after it reaches a minimum is likely the effect of noise due to the stochastic nature of mini-batch gradient descent.

Figure 6 shows the filters corresponding to the error plots from Figure 5. In Figure 6, gray color (denoted by $\bigtriangledown$ ) indicates the initial filters. The true filters are shown in black. CRsAE is able to learn filters (red color denoted by $\star$ ) that are indistinguishable from the true filters, but LCSC (denoted by $\circ$ and depicted in green) fails the task. Together, Figures 4, 5 and 6 demonstrate the ability of CRsAE to perform convolutional dictionary learning, and its robustness to noise.

Learning the regularization parameter (ReLU bias): The EM-inspired CRsAE architecture lets us automatically train and learn the regularization parameter using noisy data. Figure 7 shows the values of $\lambda$ for CRsAE as a function of epochs in a simulation with SNR of $16$ dB. The figure shows that $\lambda$ for CRsAE converges after $4$ to $5$ epochs. We initialized $\lambda$ as detailed in Section VI-C to a value of $168.04$ . The learned $\lambda$ converged to $187.68$ . For LCSC, $\lambda$ for all filters converged to a very small value, resulting in over-fitting the noise in the dataset. We note that for a fair comparison, we trained LCSC using the same reconstruction loss as CRsAE. However, even if we had optimized the LCSC loss function, $\lambda$ would have behaved the same due to the absence of the regularization on the parameter as in the Bayesian loss function from Eq. (28), specifically the second term.

Speed analysis: We compared the speed of CRsAE applied to CDL with Sporco. For fairness of comparison, we compared the speeds given a fixed desired dictionary learning accuracy.

We used simulated data at $16$ dB SNR and set $\lambda$ to the learned value from Section VII-A. We used the same regularization parameter ( $\lambda\sigma^{2}=187.68\ (0.024)^{2}\approx 0.11$ for SNR of $16$ dB) for Sporco. Other hyperparameters of Sporco $(\rho,L)$ were tuned to $(4,10^{5})$ by random grid-search guided by the hyperparameters used in [10]. We ran CRsAE and Sporco for 30 and 200 iterations, respectively. Here, $1$ iteration refers to going over all of the dataset, i.e., an epoch in neural-networks terminology. Figure 8(a) plots the signal reconstruction error for CRsAE and Sporco as a function of time. The figure shows that CRsAE performs CDL $5$ x faster than Sporco. Indeed, CRsAE took $69.27$ s to train, and Sporco reached the same accuracy as CRsAE in $319.52$ s. Figure 8(b) shows that CRsAE took $7$ epochs to train, and the training loss reached its minimum in $10$ epochs (denoted by $\bigtriangledown$ ). Sporco took $89$ iterations to reach the same accuracy/loss as CRsAE. The first row of Table IV summarizes the speed comparison between CRsAE and Sporco. We chose to plot the signal reconstruction error, as opposed to Eq. (29), to demonstrate that CRsAE can also perform the signal reconstruction. Both CRsAE and Sporco converged to the filters in Figure 6.

VII-B Spike Sorting: Real Data

Extracellular data from rat Hippocampus: The dataset consists of extracellular tetrode recordings from the rat Hippocampus with simultaneous intracellular recording. The intracellular recording provides the ground-truth data that is used to assess the ability of CRsAE to identify spikes successfully. The reader can find additional details about these data in [39]. The goal is to sort the spikes that appear in the extracellular recordings. Spike sorting is an important source-separation problem in computational neuroscience [23]. It refers to the process of identifying the time of occurrence of action potentials (filters) in an extracellular voltage recording and their assignment to a neuron (source). Assuming the extracellular data follows the model of Eq. (2), we cast spike sorting as a CDL problem and apply CRsAE to it.

Pre-processing: We high-passed filter the raw data (Channel 0) at $400$ Hz and whiten the noise. To minimize boundary effects, we picked a large window length of $6$ s, resulting in a total of $24$ windows (examples). We set the length of the filters to be learned to $3.5$ ms. The sampling rate of the extracellular data was $f_{s}$$=$$10$ kHz. Therefore, each window had length $N$$=$$60{,}000$ samples, and each filter had length $K$$=$$35$ samples. In the recording, there were mainly two neurons spiking [39], so we set $C$$=$$2$ . The data were divided by the maximum absolute value of the recording. Prior to normalization, we estimated the standard deviation of the noise from “silent” periods in the recording, yielding a value of $1$ in fractions of mV. We do not specify the units more precisely because the data are made available to us in an already-normalized form. Following normalization by the maximum absolute value of the signal, we tuned $\sigma$ to a value of $0.03$ . For training, we split the data into $21$ examples for training and $3$ examples for validation. To improve the learning performance, the training set was tripled by data augmentation (see Section VI-C), resulting in $63$ examples. The second column of Table III summarizes this dataset and the parameters used for training. The depth of the network is $T+1=601$ layers, and the number of channels at every layer is $C=2$ , the number of neurons. The width of each channel is $N-K+1=59{,}966$ .

Dictionary learning: To initialize the filters, we used the following standard procedure [40]. We spotted the location of potential spikes using a pre-computed threshold and windowed the data around the spotted positions. Then, we computed the two first principal components of the matrix whose rows comprise the obtained windows. Finally, we performed k-means clustering on the windows in principal-component space and picked the initial filters as the center of the clusters. We initialized $\lambda$ to $161.21$ , and it converged to $219.25$ upon training. Figure 9 shows the filters before and after training CRsAE. The initial and learned filters are in gray (denoted by $\triangledown$ ) and in red (denoted by $\star$ ), respectively. The figure shows that CRsAE is able to learn filters that resemble the action potentials of a neuron. These filters are interpretable, in the sense that they are those that allow us to best represent the data in the form of the convolutional generative model of Eq. (2).

Spike sorting: In this experiment, we did not have access to the ground-truth filters (true action potentials) from the extracellular recordings. However, the intracellular voltage recording was available for the neuron corresponding to the filter $\mathbf{h}_{1}$ . We used the intracellular data to assess how well CRsAE can perform spike sorting. Unlike traditional methods for spike sorting that use principal component analysis to identify distinct templates within the data, CRsAE uses raw extracellular data. We compared CRsAE to CBP. Unlike CRsAE, CBP does not perform CDL: it simply performs CSC to identify the time of occurrence of filters given the filters.

We quantify the ability of an algorithm to perform spike sorting by comparing the spike trains it estimates to the true spikes (sparse codes). The true spikes are provided through intracellular voltage recordings. Similar to [40], for a given threshold, we compute the proportion of true (intracellular) spikes that are not identified correctly by CRsAE (true miss), as well as the proportion of spikes identified by CRsAE for which there is no true matching spike (false alarm). In practice, to identify a spike, given the neuron of interest $c$ , we reconstruct the component of the data corresponding only to the neuron of interest. Then, we use a threshold and identify any non-zero spiking activity greater than the threshold to be a spike. A low threshold value results in identifying background noise or other artifacts as spikes (false alarm), whereas a high threshold may result in missing spikes (true miss).

Figure 10 shows the trade-off between the false alarm proportion and the true miss proportion as a function of threshold and highlights the competitive performance of CRsAE compared to CBP. We also compared the speed of CRsAE to that of CBP for spike sorting of the simulated data lasting $17$ minutes. CRsAE learned the filters in $69.27$ s and performed spike sorting in only $0.93$ s. The implementation of CBP at our disposal is not efficient enough to be applied to a $17$ -minute-long recording. We estimated that CBP would take $17$ hours to perform spike sorting given the filters. We estimated this timing by running CBP on $1$ , $2$ , $4$ , and $8$ s of the data and observed a linear increase in the timing CBP needed for spike sorting [40]. The second row of Table IV compares the speed of the two algorithms of the two algorithms applied to spike sorting.

VII-C Image Denoising

We trained CRsAE to denoise images from the PASCAL VOC image dataset [42], consisting of $J=5{,}717$ training images. We used a set of $9$ commonly used test images for evaluation purposes, where the peak signal-to-noise-ratio (PSNR) is used as the evaluation criterion. We trained CRsAE in supervised settings. The supervised setting assumes access, during training, to pairs that consist of a noisy and the associated clean image.

We learned $C=64$ filters, each of size $7$$\times$$7$ . We used convolutions with a stride of $5$ pixels in each dimension. In order to capture the patterns appearing in shifts missed by the strides, we followed a similar approach to [31] for augmentation and reconstruction. In total, we compared seven networks. We considered three networks to compare different methods of training the bias. CRsAE- $\lambda_{\text{EM}}$ learned $\lambda\in{\mathbb{R}}^{64}$ (one for each filter) through the EM-inspired approach. CRsAE- $\lambda_{\text{hyp}}$ treated $\lambda$ as a hyperparameter and tuned it by grid search (we used the estimate $\lambda=\frac{\sqrt{2\log(C\times N_{e})}}{\sigma}$ [2] and tuned $\sigma$ by grid search in a neighborhood of the noise level of the noisy dataset). CRsAE- $\lambda_{\text{LS}}$ learned $\lambda$ using the standard approach in deep learning, namely by minimizing the global reconstruction loss [14, 31, 30]. For the cases in which $\lambda$ was tuned, we absorbed $\sigma^{2}$ into $\lambda$ when reporting numbers. Similarly, for the case in which $\lambda$ was trained through a reconstruction loss, we absorbed $\sigma^{2}$ and $L$ into $\lambda$ .

We also studied the effect of learning $\lambda$ when the weights of encoder/decoder are not tied, similar to [14]. We call the resulting architectures CRsAE- $\lambda_{\text{EM}}$ -untied and CRsAE- $\lambda_{\text{LS}}$ -untied. For both cases, the encoder had $T=30$ layers. We studied the effect of the depth of encoder unfolding by training an architecture, CRsAE- $\lambda_{\text{EM}}$ -shallow, for which $T=3$ . At last, to emphasize the significance of the method we propose to train $\lambda$ , we trained CRsAE- $\lambda_{\text{EM}}$ -untied to minimize a combination of the $\ell_{1}$ and multiscale structural similarity [43] (MS-SSIM) losses, similar to [14]. We call this network CRsAE- $\lambda_{\text{EM}}$ -untied-msssim. In this case, the main difference between LCSC and CRsAE is the training procedure.

We initialized the filters using i.i.d. draws from a standard Gaussian random variable. We trained the network for $200$ epochs. At every iteration, we cropped a $128\times 128$ patch from the training image and added random Gaussian noise with standard deviation $\frac{20}{255}$ . We set the initial learning rate $\eta_{\mathbf{h}}$ to $10^{-2}$ and decreased it by a factor of $0.7$ every $10$ epochs. We set $\eta_{\lambda}$ to $0.1$ for EM and $0.01$ for the LS method. We summarize the parameters used for training in Table V.

Table VI presents the performances on the denoising task.

Encoder depth improves denoising: Comparing CRsAE- $\lambda_{\text{EM}}$ -shallow and CRsAE- $\lambda_{\text{EM}}$ suggests that having a deep encoder can significantly improve performance–in this case, by $1$ - $1.5$ dB in PSNR.

EM-inspired training improves denoising: We can draw three conclusions by comparing CRsAE- $\lambda_{\text{hyp}}$ , CRsAE- $\lambda_{\text{LS}}$ , and CRsAE- $\lambda_{\text{EM}}$ . First, training $\lambda$ by grid search results in poorer performance compared to training it using the standard approach or the EM-inspired approach. Second, and more importantly, the EM-inspired method outperforms the standard approach. This behaviour persists for the untied case as well. Indeed, the performance of CRsAE- $\lambda_{\text{EM}}$ -untied-msssim rivals that of LCSC [14], which trains $\lambda$ by minimizing a global reconstruction loss.

Depth and weight-tying affect learned dictionary: Figure 11 depcits the result of denoising the test image (Lena) using CRsAE- $\lambda_{\text{EM}}$ -shallow and CRsAE- $\lambda_{\text{EM}}$ -untied-msssim. Figure 12 shows the filters that each of the networks learned. The figure demonstrates that a network with a deep encoder, and for which the encoder and decoder share weigths, is better able to learn filters that are Gabor-like edge detectors [44], compared to a network with a shallow encoder or a network whose encoder and decoder do not share weights.

VIII Conclusion

Training an autoencoder for convolutional sparse coding to result in very low input/output prediction error is not a challenging task and has been done before [14]. The challenging problem is to demonstrate that the weights learned by an autoencoder are interpretable as convolutional filters in the context of convolutional dictionary learning or source separation, a problem that, to our knowledge, existing autoencoder architectures for convolutional sparse coding have not solved. In this paper, we framed convolutional dictionary learning as training of an autoencoder and proposed the constrained recurrent sparse autoencoder (CRsAE). CRsAE is the first autoencoder architecture to learn interpretable filters corresponding to a generative model. We highlighted this ability of CRsAE and its robustness to noise through simulation. CRsAE owes it success at performing dictionary learning and source separation to a) the deep recurrent and residual neural network architecture, inherited from FISTA, of its encoder, b) the tying of the weights to be learned in the encoder and decoder, and c) the Expectation Maximization-inspired training of the regularization parameter to ensure the sparseness of codes output by the encoder. The tying constraint on the encoder and decoder reduces the trainable parameters by a factor of $\frac{1}{3}$ . In addition, the recurrent architecture of CRsAE with shared weights allows us to produce sparse representations as the depth increases with a fixed number of trainable parameters.

We highlighted the similarities and differences of methods such as classical dictionary learning, Expectation Maximization, and CRsAE for learning sparse representations from data. The architecture for CRsAE and the training procedure are motivated by the Expectation Maximization algorithm. The encoder in CRsAE produces sparse codes, a similar step to the sparse coding update in dictionary learning, and the E-step of Expectation Maximization. CRsAE learns the filters and the regularization parameter using a two-stage back-propagation procedure, a step corresponding to the M-step in Expectation Maximization. In the first stage, the filters of interest are trained by back-propagation through the autoencoder. This step parallels the dictionary update step in convolutional dictionary learning. In the second stage, the regularization parameter is updated by back-propagation through the encoder using a loss function motivated by Bayesian statistics. This two-stage back-propagation and the loss function for training the regularization parameter are the keys to ensuring the sparseness of the codes, output by the encoder, and preventing the convergence of the regularization parameter (bias) to zero.

CRsAE applies to the case when the posterior concentrates around the MAP estimate, an assumption that is inherited from its parallel with the alternating-minimization algorithm for dictionary learning. An interesting line of work is to extend CRsAE to cases when this assumption fails. In such cases, the output of the encoder ought to approximate samples from the posterior distribution of the codes. The resulting architecture would resemble variational autoencoders [32, 33], in the sense that the latent representation is random. Moreover, the encoder and decoder would be tied as in CRsAE.

We benchmarked CRsAE and showed that it performs dictionary learning $5$ x faster than the state-of-the-art algorithm [10] for convolutional dictionary learning based on alternating-minimization. We showed that the encoder of CRsAE could identify the location of action potentials in extracellular voltage recordings from the brain of rats. In particular, we showed that CRsAE could perform spike sorting as well as the state-of-the-art algorithm based on convex optimization known as continuous basis pursuit. At the same time, compared to continuous basis pursuit, CRsAE significantly speeds up the process of spike sorting by $900$ x.

Finally, we demonstrated the performance of CRsAE on the task of image denoising. We showed that, when learning the regularization parameter through the proposed EM-inspired approach, CRsAE denoises images better than networks that train the regularization parameter (ReLU bias) by minimizing a global reconstruction loss, the standard approach in deep learning. We also showed that the performance of our method rivals that of LCSC [14]. At last, we highlighted that both a deep encoder and the sharing of weights between the encoder and decoder promote learning of Gabor-like filters [44].

Acknowledgments

The authors gratefully acknowledge supports by NSF Simons Center for Mathematical and Statistical Analysis of Biology at Harvard University (grant no. DMS-1764269), the Harvard FAS Quantitative Biology Initiative. This research is also supported by AWS Machine Learning Research Awards.

Here, we derive the back-propagation algorithm for CRsAE. This derivation is only for completeness. In practice, the gradients are computed through autograd function while training. We assume, without loss of generality, number of examples $J=1$ . Let $\mathbf{h}=[\mathbf{h}_{1}^{\text{T}},\ldots,\mathbf{h}_{C}^{\text{T}}]^{\text{T}}$ where $\mathbf{h}_{c}=[h_{c}[0],h_{c}[1],\ldots,h_{c}[K-1]]^{\text{T}}$ , and $\mathbf{w}_{t}=[\mathbf{w}_{t,1}^{\text{T}},\ldots,\mathbf{w}_{t,C}^{\text{T}}]^{\text{T}}$ where $\mathbf{w}_{t,c}=[w_{t,c}[0],w_{t,c}[1],\ldots,w_{t,c}[N_{e}-1]]^{\text{T}}$ . Let $\mathbf{c}_{T+1}=\hat{\mathbf{y}}=\mathbf{H}\mathbf{c}_{T}$ . The gradient of the loss function in Eq. (27) from the main paper is

[TABLE]

where

[TABLE]

To evaluate Eq. (31), we first compute $\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{c}_{t}}$ as

[TABLE]

where $b=\frac{\lambda\sigma^{2}}{L}$ . Then, we evaluate $\delta\mathbf{z}_{t-1}$ through the following recursion

[TABLE]

Having $\mathbf{c}_{T+1}=\hat{\mathbf{y}}$ , we initialize the recursion with

[TABLE]

We then evaluate $\frac{\partial\mathbf{c}_{t}}{\partial\mathbf{z}_{t-1}}$ needed for Eq. (33) as

[TABLE]

Finally, to finish the full gradient propagation procedure, we compute $\frac{\partial\mathbf{c}_{t}}{\partial\mathbf{h}}$ needed for Eq. (30) as

[TABLE]

where

[TABLE]

and for $t=1,\ldots,T$ ,

[TABLE]

for $a,b=1,\ldots,C$ , $m=1,\ldots,K$ , and $n=1,\ldots,N_{e}$ . $c_{\mathbf{h}_{b}\mathbf{h}_{g}}[k]$ is the deterministic cross-correlation function between $\mathbf{h}_{b}$ and $k$ -delayed samples of $\mathbf{h}_{g}$ as follows

[TABLE]

Next, we derive the back-propagation algorithm for CRsAE for training $\lambda$ . Without loss of generality, we assume $\sigma=1$ . The gradient of the loss function in Eq. (28) from the main paper w.r.t. $\lambda$ is

[TABLE]

We evaluate $\frac{\partial\|\mathbf{z}_{T}^{(1)}\|_{1}}{\partial\mathbf{z}_{t-1}}$ through the following recursion

[TABLE]

We initialize the recursion with

[TABLE]

where $\frac{\partial\|\mathbf{z}_{T}^{(1)}\|_{1}}{\partial\mathbf{z}_{T}}\in{\mathbb{R}}^{2N_{e}C\times 1}$ (Note that for when $\mathbf{z}_{T}[i]=0$ , we have used the sub-gradient as the gradient is not defined). Then we evaluate

[TABLE]

where $\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{c}_{t}}$ and $\frac{\partial\mathbf{c}_{t}}{\partial\mathbf{z}_{t-1}}$ are derived previously in the derivation of back-propagation for the filters.

Finally, to finish the full gradient propagation procedure, we compute $\frac{\partial\mathbf{z}_{t}}{\partial\lambda}\in{\mathbb{R}}^{1\times 2N_{e}C}$ needed for Eq. (40) from $\mathbf{z}_{t}^{(1)}=\mathcal{S}_{b}(\mathbf{c}_{t})$

[TABLE]

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] S. Chen, S. A. Billings, and W. Luo, “Orthogonal least squares methods and their application to non-linear system identification,” International Journal of Control , vol. 50, no. 5, pp. 1873–96, 1989.
2[2] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Review , vol. 43, pp. 129–59, 1998.
3[3] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological) , vol. 58, no. 1, pp. 267–88, 1996.
4[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the Royal Statistical Society: Series B (Methodological) , vol. 39, no. 1, pp. 1–22, 1977.
5[5] X. F. Zhu, B. Li, and J. D. Wang, “L 1-norm sparse learning and its application,” Applied Mechanics and Materials , vol. 88-89, pp. 379–85, 2011.
6[6] T. Park and G. Casella, “The bayesian lasso,” Journal of the American Statistical Association , vol. 103, no. 482, pp. 681–86, 2008.
7[7] S. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 11, pp. 674–693, 1989.
8[8] K. Engan, S. Aase, and J. Hakon Husoy, “Method of optimal directions for frame design,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (ICASSP) , vol. 5, 1999, pp. 2443–46.