General Convolutional Sparse Coding with Unknown Noise

Yaqing Wang; James T. Kwok; and Lionel M. Ni

arXiv:1903.03253·cs.LG·April 22, 2020


TL;DR

This paper introduces a convolutional sparse coding model that effectively handles unknown, complex noise by modeling it with Gaussian mixture models, improving robustness and representation quality in noisy data.

Contribution

It proposes a novel CSC framework using Gaussian mixture models for noise, with an efficient EM-based solution and simultaneous dictionary and code updates.

Findings

1. Effective modeling of complex noise in biomedical data
2. Comparable computational complexity to existing CSC methods
3. High-quality filters and representations achieved

Abstract

Convolutional sparse coding (CSC) can learn representative shift-invariant patterns from multiple kinds of data. However, existing CSC methods can only model noise drawn from a Gaussian distribution, which is restrictive and unrealistic. In this paper, we propose a general CSC model capable of dealing with complicated unknown noise. The noise is modeled by a Gaussian mixture model, which can approximate any continuous probability density function. We use the expectation-maximization algorithm to solve the problem and design an efficient method for the weighted CSC problem in the maximization step. The crux is to speed up the convolution in the frequency domain while keeping the other computations involving the weight matrix in the spatial domain. In addition, we simultaneously update the dictionary and codes by a nonconvex accelerated proximal gradient algorithm without introducing extra alternating loops.…

Tables

Table 1. TABLE I: Comparing the proposed GCSC with existing CSC algorithms.

method	convolution operation	noise modeling	algorithm
DeconvNet [5]	spatial	Gaussian	BCD, updating codes and dictionary by gradient descent
ConvCod [19]	spatial	Gaussian	BCD, updating dictionary by gradient descent and codes by encoder
FCSC [20]	frequency	Gaussian	BCD, updating codes and dictionary by ADMM
FFCSC [7]	frequency	Gaussian	BCD, updating codes and dictionary by ADMM
CBPDN [21]	frequency	Gaussian	BCD, updating codes and dictionary by ADMM
CONSENSUS [22]	frequency	Gaussian	BCD, updating codes and dictionary by ADMM
SBCSC [23]	spatial	Gaussian	BCD, updating dictionary by ADMM and codes by LARS[34]
OCDL-Degraux [35]	spatial	Gaussian	BCD, updating codes and dictionary by projected BCD
OCDL-Liu [36]	frequency	Gaussian	BCD, updating codes and dictionary by proximal gradient descent
OCSC [37]	frequency	Gaussian	BCD, updating codes and dictionary by ADMM
CCSC [9]	frequency	Gaussian	BCD, updating codes and dictionary by ADMM
$α$ CSC [14]	spatial	alpha-stable	BCD, updating codes and dictionary by L-BFGS [38]
GCSC	frequency	Gaussian mixture	niAPG

Table 2. TABLE III: Performance on the synthetic data. The best and comparable results (according to the pairwise t-test with 95% confidence) are highlighted in bold.

		MAE	RMSE	time (seconds)
no noise	CSC- $ℓ_{2}$	0.000359 $\pm$ 0.000027	0.000847 $\pm$ 0.000109	419.64 $\pm$ 37.86
	CSC- $ℓ_{1}$	0.000364 $\pm$ 0.000171	0.000871 $\pm$ 0.000183	651.67 $\pm$ 98.94
	$α$ CSC	0.000362 $\pm$ 0.000109	0.000868 $\pm$ 0.000122	2217.44 $\pm$ 345.96
	GCSC	0.000330 $\pm$ 0.000125	0.000849 $\pm$ 0.000141	414.34 $\pm$ 34.48
Gaussian noise	CSC- $ℓ_{2}$	0.00368 $\pm$ 0.00036	0.00775 $\pm$ 0.00072	246.33 $\pm$ 55.39
	CSC- $ℓ_{1}$	0.0104 $\pm$ 0.0012	0.0249 $\pm$ 0.0008	715.44 $\pm$ 93.86
	$α$ CSC	0.00353 $\pm$ 0.00017	0.00766 $\pm$ 0.00052	1986.08 $\pm$ 262.47
	GCSC	0.00343 $\pm$ 0.00021	0.00762 $\pm$ 0.00026	238.52 $\pm$ 19.78
Laplace noise	CSC- $ℓ_{2}$	0.00835 $\pm$ 0.00042	0.0114 $\pm$ 0.0010	470.13 $\pm$ 27.44
	CSC- $ℓ_{1}$	0.00347 $\pm$ 0.00026	0.00735 $\pm$ 0.00038	358.64 $\pm$ 175.40
	$α$ CSC	0.00692 $\pm$ 0.00042	0.00973 $\pm$ 0.00013	2309.94 $\pm$ 724.53
	GCSC	0.00335 $\pm$ 0.00017	0.00732 $\pm$ 0.00020	350.76 $\pm$ 68.33
alpha-stable noise	CSC- $ℓ_{2}$	0.0702 $\pm$ 0.0060	0.0840 $\pm$ 0.0091	597.83 $\pm$ 90.70
	CSC- $ℓ_{1}$	0.0160 $\pm$ 0.0025	0.0337 $\pm$ 0.0047	476.24 $\pm$ 37.92
	$α$ CSC	0.00416 $\pm$ 0.00039	0.00821 $\pm$ 0.00024	2198.32 $\pm$ 470.57
	GCSC	0.00402 $\pm$ 0.00030	0.00815 $\pm$ 0.00041	412.43 $\pm$ 71.22
zero-mean mixture noise	CSC- $ℓ_{2}$	0.0321 $\pm$ 0.0007	0.0545 $\pm$ 0.0010	344.08 $\pm$ 27.44
	CSC- $ℓ_{1}$	0.0604 $\pm$ 0.0055	0.0849 $\pm$ 0.0059	588.57 $\pm$ 88.89
	$α$ CSC	0.0114 $\pm$ 0.0002	0.0158 $\pm$ 0.0003	1120.70 $\pm$ 463.70
	GCSC	0.00531 $\pm$ 0.00021	0.00971 $\pm$ 0.00082	336.00 $\pm$ 77.85
nonzero-mean mixture noise	CSC- $ℓ_{2}$	0.0732 $\pm$ 0.0015	0.0151 $\pm$ 0.0011	642.03 $\pm$ 84.44
	CSC- $ℓ_{1}$	0.0670 $\pm$ 0.0037	0.0130 $\pm$ 0.0013	788.60 $\pm$ 88.89
	$α$ CSC	0.0667 $\pm$ 0.0002	0.0127 $\pm$ 0.0014	2882.26 $\pm$ 907.28
	GCSC	0.00556 $\pm$ 0.00024	0.00818 $\pm$ 0.00037	471.40 $\pm$ 87.90

Table 3. TABLE II: Types of noise added to synthetic data.

noise	distribution	SNR (dB)
Gaussian	$𝒩 (0, {0.01}^{2})$	13.04
Laplace	$ℒ (0, 0.01)$	13.03
alpha-stable	$𝒮 (1, 0, {0.01}^{2}, 0)$	10.43
zero-mean mixture	20% from $𝒰 (- 0.01, 0.01)$ ,	10.98
	20% from $𝒩 (0, {0.01}^{2})$ ,
	60% from $𝒩 (0, {0.015}^{2})$
nonzero-mean mixture	20% from $𝒰 (- 0.01, 0.01)$ ,	14.16
	20% from $𝒩 (0.01, {0.01}^{2})$ ,
	60% from $𝒩 (- 0.005, {0.005}^{2})$

Table 4. TABLE IV: Performance of GCSC with different solvers for ( 12 ) on the synthetic data with nonzero-mean mixture noise.

	MAE	RMSE	time (seconds)
BCD	0.00557 $\pm$ 0.00031	0.00821 $\pm$ 0.00044	2562.37 $\pm$ 400.11
niAPG	0.00556 $\pm$ 0.00024	0.00818 $\pm$ 0.00037	471.40 $\pm$ 87.90

Table 5. TABLE V: Timing results (seconds) on the LFP data.

	LFP-cortical	LFP-striatal
CSC- $ℓ_{2}$	721.71 $\pm$ 47.21	802.37 $\pm$ 51.12
CSC- $ℓ_{1}$	737.62 $\pm$ 68.88	783.16 $\pm$ 76.10
$α$ CSC	2919.33 $\pm$ 290.77	3004.87 $\pm$ 320.81
GCSC	607.23 $\pm$ 57.56	611.24 $\pm$ 69.91

Table 6. TABLE VI: Performance on the retinal image data sets.

		AUC	best F-score
DRIVE	Expert	-	0.8935 $\pm$ 0.0000
	Hessian	0.7314 $\pm$ 0.0000	0.8755 $\pm$ 0.0000
	CSC- $ℓ_{2}$	0.9044 $\pm$ 0.0067	0.9806 $\pm$ 0.0065
	CSC- $ℓ_{1}$	0.9383 $\pm$ 0.0071	0.9821 $\pm$ 0.0063
	$α$ CSC	0.9401 $\pm$ 0.0051	0.9850 $\pm$ 0.0064
	GCSC	0.9504 $\pm$ 0.0048	0.9969 $\pm$ 0.0066
STARE	Expert	-	0.7790 $\pm$ 0.0000
	Hessian	0.6623 $\pm$ 0.0000	0.8495 $\pm$ 0.0000
	CSC- $ℓ_{2}$	0.9033 $\pm$ 0.0089	0.9838 $\pm$ 0.0080
	CSC- $ℓ_{1}$	0.8964 $\pm$ 0.0087	0.9757 $\pm$ 0.0073
	$α$ CSC	0.9101 $\pm$ 0.0056	0.9907 $\pm$ 0.0062
	GCSC	0.9203 $\pm$ 0.0066	0.9999 $\pm$ 0.0065

Equations

\tilde{x}_{i}=\sum_{k=1}^{K}d_{k}*z_{ik}.

\min_{\{d_{k}\}\in\mathcal{D},\{z_{ik}\}}\sum_{i=1}^{N}\frac{1}{2}\Big\|x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}\Big\|_{2}^{2}+\sum_{k=1}^{K}\beta\|z_{ik}\|_{1},

\min_{\{z_{ik}\}}\frac{1}{2}\Big\|x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}\Big\|_{2}^{2}+\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}.

\min_{\{z_{ik}\}}\frac{1}{2P}\Big\|\mathcal{F}(x_{i})-\sum_{k=1}^{K}\mathcal{F}(d_{k})\odot\mathcal{F}(z_{ik})\Big\|_{2}^{2}+\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}.

\min_{\{d_{k}\}\in\mathcal{D}}

\min_{\{d_{k}\}}

\|C\mathcal{F}^{-1}(\mathcal{F}(C^{\top}d_{k}))\|_{2}^{2}\leq 1,\ \forall k,

\min_{x}F(x)\equiv f(x)+r(x),

\text{prox}_{\eta r}(z)=\arg\min_{x}\frac{1}{2}\|x-z\|_{2}^{2}+\eta r(x)

p(\epsilon_{i})=\sum_{g=1}^{G}p(\epsilon_{i}\mid\phi_{i}=g)\,p(\phi_{i}=g),

\log\mathcal{P}=\sum_{i=1}^{N}\Big(\log\sum_{g=1}^{G}p(x_{i}\mid\phi_{i}=g)+\log\sum_{g=1}^{G}p(\phi_{i}=g)+\sum_{k=1}^{K}\sum_{p=1}^{P}\log p(z_{ik}(p))\Big).

p(\phi_{i}=g\mid x_{i})

\arg\max_{\Theta}\sum_{i=1}^{N}\Big(\sum_{g=1}^{G}\gamma_{gi}\log\frac{p(x_{i},\phi_{i})}{\gamma_{gi}}+\sum_{k=1}^{K}\sum_{p=1}^{P}\log p(z_{ik}(p))\Big)

=\arg\max_{\Theta}\sum_{i=1}^{N}\Big(\sum_{g=1}^{G}\gamma_{gi}\log p(x_{i}\mid\phi_{i}=g)-\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}\Big)

=\arg\max_{\Theta}\sum_{i=1}^{N}\Big(-\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}+\sum_{g=1}^{G}\Big(\gamma_{gi}\log\pi_{g}-\frac{\gamma_{gi}}{2}\log(|\Sigma_{g}|)-\frac{\gamma_{gi}}{2}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)^{\top}\Sigma_{g}^{-1}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)\Big)\Big),

\max_{\{\pi_{g},\mu_{g},\Sigma_{g}\}}\sum_{i=1}^{N}\sum_{g=1}^{G}\Big(\gamma_{gi}\log\pi_{g}-\frac{\gamma_{gi}}{2}\log(|\Sigma_{g}|)-\frac{\gamma_{gi}}{2}(x_{i}-\tilde{x}_{i}-\mu_{g})^{\top}\Sigma_{g}^{-1}(x_{i}-\tilde{x}_{i}-\mu_{g})\Big).

\pi_{g}

\mu_{g}

\Sigma_{g}

\min_{\{d_{k}\}\in\mathcal{D},\{z_{ik}\}}\sum_{i=1}^{N}\Big(\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}+\sum_{g=1}^{G}\frac{\gamma_{gi}}{2}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)^{\top}\Sigma_{g}^{-1}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)\Big).

\min_{\{d_{k}\}\in\mathcal{D},\{z_{ik}\}}F(\{d_{k}\},\{z_{ik}\})\equiv f(\{d_{k}\},\{z_{ik}\})+r(\{d_{k}\},\{z_{ik}\}),

f(\{d_{k}\},\{z_{ik}\})\equiv\frac{1}{2}\sum_{i=1}^{N}\sum_{g=1}^{G}\Big\|w_{gi}\odot\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)\Big\|_{F}^{2},

r(\{d_{k}\},\{z_{ik}\})\equiv\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}+I_{\mathcal{D}}(\{d_{k}\}).

\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial d_{k}}=\frac{\partial g_{7k}}{\partial d_{k}}\frac{\partial g_{6k}}{\partial g_{7k}}\frac{\partial g_{4}}{\partial g_{6k}}\frac{\partial g_{3}}{\partial g_{4}}\frac{\partial g_{2}}{\partial g_{3}}\frac{\partial g_{1}}{\partial g_{2}}\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial g_{1}}=-C\mathcal{F}^{-1}\Big(\sum_{i=1}^{N}(\mathcal{F}(z_{ik}))^{\star}\odot\mathcal{F}(u_{i})\Big),

\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial z_{ik}}=\frac{\partial g_{5k}}{\partial z_{ik}}\frac{\partial g_{4}}{\partial g_{5k}}\frac{\partial g_{3}}{\partial g_{4}}\frac{\partial g_{2}}{\partial g_{3}}\frac{\partial g_{1}}{\partial g_{2}}\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial g_{1}}=-\mathcal{F}^{-1}\big((\mathcal{F}(d_{k}))^{\star}\odot\mathcal{F}(u_{i})\big).

\min_{\{d_{k}\}\in\mathcal{D},\{z_{ik}\}}\sum_{i=1}^{N}\Big(\frac{1}{2}\Big\|x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}\Big\|_{1}+\sum_{k=1}^{K}\beta\|z_{ik}\|_{1}\Big).

\sum_{p=1}^{P}\frac{|\sigma_{a}^{2}(p)-\sigma_{b}^{2}(p)|}{\sigma_{a}^{2}(p)+\sigma_{b}^{2}(p)},\ \forall a,b\in[1,\dots,G].

\pi_{a}

\Sigma_{a}

\text{MAE}=

Taxonomy

Methods: Convolution

Full text

General Convolutional Sparse Coding with Unknown Noise

Yaqing Wang, James T. Kwok, and Lionel M. Ni

Y. Wang, J. T. Kwok and L. M. Ni are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong.

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

Convolutional sparse coding (CSC) can learn representative shift-invariant patterns from multiple kinds of data. However, existing CSC methods can only model noise drawn from a Gaussian distribution, which is restrictive and unrealistic. In this paper, we propose a general CSC model capable of dealing with complicated unknown noise. The noise is modeled by a Gaussian mixture model, which can approximate any continuous probability density function. We use the expectation-maximization algorithm to solve the problem and design an efficient method for the weighted CSC problem in the maximization step. The crux is to speed up the convolution in the frequency domain while keeping the other computations involving the weight matrix in the spatial domain. In addition, we simultaneously update the dictionary and codes by a nonconvex accelerated proximal gradient algorithm without introducing extra alternating loops. The resultant method has time and space complexity comparable to existing CSC methods. Extensive experiments on synthetic and real noisy biomedical data sets validate that our method can model noise effectively and obtain high-quality filters and representations.

Index Terms:

Convolutional sparse coding, Noise modeling, Gaussian mixture model

I Introduction

Given a set of samples, sparse coding tries to learn an over-complete dictionary and then represent each sample as a sparse combination (code) of the dictionary atoms. It has been used in various signal processing [1, 2] and computer vision applications [3, 4]. Despite its popularity, sparse coding cannot capture shifted local patterns in the data. Hence, pre-processing (such as the extraction of sample patches) and post-processing (such as aggregating patch representations back into a sample representation) are needed; otherwise, redundant representations will be learned.

Convolutional sparse coding (CSC) [5] is a recent method which improves sparse coding by learning a shift-invariant dictionary. This is done by replacing the multiplication between codes and dictionary with a convolution operation, which can capture local patterns at shifted locations in the data. Therefore, no pre-processing or post-processing is needed, and the sample can be optimized as a whole and represented as the sum of a set of filters from the dictionary convolved with the corresponding codes. It has been successfully used on various data types, including trajectories [6], images [7], audio [8], video [9], multi-spectral and light field images [10] and biomedical data [11, 12, 13, 14]. It has also succeeded in a variety of applications on these data, such as recovering non-rigid structure from motion [6], image super-resolution [15], image denoising and inpainting [10], music transcription [8], video deblurring [9], and neuronal assembly detection [16].

All the above CSC works use the square loss, and thus assume that the noise in the data follows a Gaussian distribution. However, this can be restrictive and does not suit many real-world problems. For example, although CSC is popularly used for biomedical data sets [11, 13, 14, 16] where shifting patterns abound due to cell division, it cannot handle the various complicated noises in the data. In fact, biomedical data sets usually contain artifacts introduced during recording, e.g., biomedical heterogeneities, large variations in luminance and contrast, and disturbance due to other small living animals [11, 17]. Moreover, as the target biomedical structures are often tiny and delicate, the presence of noise heavily degrades the quality of the learned filters and representations [14].

Many algorithms have been proposed for CSC with the square loss. While the objective is not jointly convex, it is convex when either the codes or the dictionary is fixed. Thus, CSC is mainly solved by alternately updating the codes and dictionary by block coordinate descent (BCD) [18]. The methods differ mainly in how they solve the two subproblems (code update and dictionary update). The pioneering work, the Deconvolutional network [5], uses gradient descent for both subproblems. ConvCoD [19] uses stochastic gradient descent for the dictionary update and additionally learns an encoder to output the codes. More recent works [7, 9, 20, 21, 22, 23] use the alternating direction method of multipliers (ADMM) [24]. ADMM is favored since it can decompose each subproblem into smaller ADMM subproblems which usually have closed-form solutions. The decomposition allows solving CSC by performing the faster convolution in the frequency domain while enforcing the translation-variant constraints and regularizers in the spatial domain. However, one needs another alternating loop between these ADMM subproblems so that they agree on the solution of the original subproblem.

Recently, one work has modeled the noise in CSC as non-Gaussian. Jas et al. [14] proposed the alpha-stable CSC ( $\alpha$ CSC), which models the noise in 1D signals by the symmetric alpha-stable distribution [25]. This distribution includes a range of heavy-tailed distributions, and is known to be more robust to noise and outliers. However, the probability density function of the alpha-stable distribution does not have an analytical form, and its inference needs to be approximated by a Markov chain Monte Carlo (MCMC) [26] procedure, which is known to be computationally expensive. Moreover, as shown in Figure 1, the alpha-stable distribution still restricts the noise to one particular type fixed in advance, which is inappropriate when the ground-truth noise type is unknown.

In this paper, we propose a general CSC model (GCSC) which enables CSC to deal with complicated unknown noise. Specifically, we model the noise in CSC by the Gaussian mixture model (GMM), which can approximate any continuous probability density function. The proposed model is then solved by the expectation-maximization (EM) algorithm. However, the maximization step becomes a weighted variant of the CSC problem which cannot be efficiently solved by existing algorithms, e.g., BCD and ADMM, since they introduce extra inner loops in the M-step. Besides, the weight matrix prevents us from solving the whole objective in the frequency domain to speed up the convolution. In our proposed method, we develop a new solver that updates the dictionary and codes together by a nonconvex accelerated proximal algorithm without alternating loops. Moreover, we speed up the convolution in the frequency domain while computing the part involving the weight matrix in the spatial domain. The resultant algorithm has time and space complexity comparable to state-of-the-art CSC algorithms (for the square loss). Extensive experiments are performed on both synthetic data and real-world biomedical data such as local field potential signals and retinal scan images. Results show that the proposed method can model the complex underlying data noise, and obtain high-quality filters and representations.

The rest of the paper is organized as follows. Section II briefly reviews CSC and the proximal algorithm. Section III describes the proposed method. Experimental results are presented in Section IV, and the last section gives some concluding remarks.

Notations: For vector $a\in\mathbb{R}^{m}$ , its $i$ th element is denoted $a(i)$ , its $\ell_{2}$ -norm is $\|a\|_{2}=\sqrt{\sum_{i=1}^{m}(a(i))^{2}}$ , its $\ell_{1}$ -norm is $\|a\|_{1}=\sum_{i=1}^{m}|a(i)|$ , and $\text{Diag}(a)$ reshapes $a$ to a diagonal matrix with elements $a(i)$ ’s. Given another vector $b\in\mathbb{R}^{n}$ , the convolution $a*b$ produces a vector $c\in\mathbb{R}^{m+n-1}$ , with $c(k)=\sum_{j=\max(1,k+1-n)}^{\min(k,m)}a(j)b(k-j+1)$ . For matrix $A$ , $A^{\top}$ denotes its transpose, $A^{\star}$ denotes its complex conjugate, $A^{\dagger}$ is its conjugate transpose (i.e., $A^{\dagger}=({A^{\top}})^{\star}$ ). $\odot$ denotes pointwise product. The identity matrix is denoted $I$ . $\mathcal{F}(x)$ is the fast Fourier transform that maps $x$ from the spatial domain to the frequency domain, while $\mathcal{F}^{-1}(x)$ is the inverse operator which maps $\mathcal{F}(x)$ back to $x$ .
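The convolution formula in the notation paragraph can be checked numerically. The sketch below is only an illustration: the helper name `conv_full` and the test vectors are assumptions, not part of the paper. It evaluates the 1-indexed sum directly and compares it against NumPy's `np.convolve`, which computes the same full convolution.

```python
import numpy as np

def conv_full(a, b):
    """Full convolution c = a * b using the 1-indexed formula
    c(k) = sum_{j=max(1,k+1-n)}^{min(k,m)} a(j) b(k-j+1)."""
    m, n = len(a), len(b)
    c = np.zeros(m + n - 1)
    for k in range(1, m + n):                  # k = 1, ..., m+n-1
        lo, hi = max(1, k + 1 - n), min(k, m)
        for j in range(lo, hi + 1):
            # shift the 1-indexed a(j), b(k-j+1) to 0-indexed arrays
            c[k - 1] += a[j - 1] * b[k - j]
    return c

a = np.array([1.0, 2.0, 3.0])   # illustrative values
b = np.array([0.5, -1.0])
c = conv_full(a, b)
```

The output has length $m+n-1$, as stated in the text.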

II Related Work

II-A Convolutional Sparse Coding

Given $N$ samples ${x}_{i}$ ’s, where each ${x}_{i}\in\mathbb{R}^{P}$ , CSC learns a dictionary of $K$ filters $d_{k}$ ’s, each of length $M$ , such that each ${x}_{i}$ can be well represented as

$$\tilde{x}_{i}=\sum_{k=1}^{K}d_{k}*z_{ik}. \tag{1}$$

Here, $*$ is the (spatial) convolution operation, and $z_{ik}$ ’s are the codes for $x_{i}$ , each of length $P$ . The filters and codes are obtained by solving the following optimization problem

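$$\min_{\{d_{k}\}\in\mathcal{D},\{z_{ik}\}}\sum_{i=1}^{N}\frac{1}{2}\Big\|x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}\Big\|_{2}^{2}+\sum_{k=1}^{K}\beta\|z_{ik}\|_{1}, \tag{2}$$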

where $\mathcal{D}=\{D:\|d_{k}\|_{2}\leq 1,k=1,\dots,K\}$ ensures that the filters are normalized, and the $\ell_{1}$ regularizer encourages the codes to be sparse.

To solve (2), block coordinate descent (BCD) [18] is typically used [5, 7, 19, 20, 21, 22]. The codes and dictionary are updated in an alternating manner as follows.

II-A1 Code Update

Given $d_{k}$ ’s, the corresponding $z_{ik}$ ’s are obtained as

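$$\min_{\{z_{ik}\}}\frac{1}{2}\Big\|x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}\Big\|_{2}^{2}+\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}. \tag{3}$$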

Convolution can be performed much faster in the frequency domain via the convolution theorem [27]: $\mathcal{F}(d_{k}*z_{ik})=\mathcal{F}(d_{k})\odot\mathcal{F}(z_{ik})$ , where $d_{k}$ is first zero-padded to $P$ -dimensional. Combining this with Parseval’s theorem [27] (for $a\in\mathbb{R}^{P}$ , $\|a\|_{2}^{2}=\frac{1}{P}\|\mathcal{F}(a)\|_{2}^{2}$ ) and the linearity of the FFT, problem (3) is reformulated in [20, 7, 21] as:

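$$\min_{\{z_{ik}\}}\frac{1}{2P}\Big\|\mathcal{F}(x_{i})-\sum_{k=1}^{K}\mathcal{F}(d_{k})\odot\mathcal{F}(z_{ik})\Big\|_{2}^{2}+\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}. \tag{4}$$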
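The two identities used in this reformulation can be verified numerically. The following is a minimal sketch, assuming NumPy's unnormalized forward FFT convention and circular (wrap-around) convolution of length $P$ after zero-padding the filter; the sizes and variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
P, M = 16, 4                       # signal length and filter length (illustrative)
d = rng.standard_normal(M)         # a filter d_k
z = rng.standard_normal(P)         # a code z_ik

# zero-pad the filter to P dimensions before taking the FFT
d_pad = np.concatenate([d, np.zeros(P - M)])

# circular convolution computed directly in the spatial domain
circ = np.array([sum(d_pad[j] * z[(p - j) % P] for j in range(P))
                 for p in range(P)])

# the same quantity via the convolution theorem: F(d * z) = F(d) ⊙ F(z)
via_fft = np.fft.ifft(np.fft.fft(d_pad) * np.fft.fft(z)).real

# Parseval's theorem: ||a||_2^2 = (1/P) ||F(a)||_2^2
parseval_lhs = np.sum(z ** 2)
parseval_rhs = np.sum(np.abs(np.fft.fft(z)) ** 2) / P
```

The direct sum costs $O(P^{2})$ operations per filter, while the FFT route costs $O(P\log P)$, which is the source of the speedup mentioned above.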

II-A2 Dictionary Update

Given $z_{ik}$ ’s, the $d_{k}$ ’s are updated by solving

[TABLE]

Similar to the code update, it is more efficient to perform convolution in the frequency domain, as:

[TABLE]

where $C\in\mathbb{R}^{M\times P}$ is a matrix with $C(i,i)=1$ and $C(i,j)=0$ for $i\neq j$ which is used to crop the extra dimension to recover the original spatial support, and $C^{\top}$ can pad $d_{k}$ to be $P$ -dimensional. The constraint scales all filters to unit norm.
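The role of $C$ and $C^{\top}$ can be illustrated with a short sketch. The sizes and filter values below are arbitrary assumptions; `np.eye(M, P)` builds an $M\times P$ matrix with ones on the main diagonal, matching the description of $C$.

```python
import numpy as np

M, P = 3, 8                         # filter length and signal length (illustrative)
C = np.eye(M, P)                    # C(i,i) = 1, C(i,j) = 0 for i != j

d = np.array([1.0, -2.0, 0.5])      # a filter d_k of length M
d_pad = C.T @ d                     # C^T pads d_k with zeros to length P
d_back = C @ d_pad                  # C crops back to the original support

# cropping after an FFT round trip recovers the filter, so the constraint
# ||C F^{-1}(F(C^T d_k))||_2^2 <= 1 is a constraint on ||d_k||_2^2
round_trip = C @ np.fft.ifft(np.fft.fft(d_pad)).real
```

In practice one would slice or zero-pad arrays rather than multiply by an explicit matrix; the matrix form is used here only to mirror the notation.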

The alternating direction method of multipliers (ADMM) [24] has been commonly used for the code update and dictionary update subproblems (4) and (5) [7, 20, 21, 22]. Each ADMM subproblem has a closed-form solution that is easy to compute. Besides, with the introduction of auxiliary variables, ADMM separates computations in the frequency domain (involving convolutions) and spatial domain (involving the $\ell_{1}$ regularizer and unit norm constraint).

II-B Proximal Algorithm

The proximal algorithm [28] is used to solve composite optimization problems of the form

$$\min_{x}F(x)\equiv f(x)+r(x),$$

where $f$ is smooth, $r$ is nonsmooth, and both are convex. To make the proximal algorithm efficient, its underlying proximal step (with stepsize $\eta$ )

$$\text{prox}_{\eta r}(z)=\arg\min_{x}\frac{1}{2}\|x-z\|_{2}^{2}+\eta r(x)$$

has to be inexpensive.

Recently, the proximal algorithm has been extended to nonconvex problems where both $f$ and $r$ can be nonconvex. A state-of-the-art method is the nonconvex inexact accelerated proximal gradient (niAPG) algorithm [29], shown in Algorithm 1. It can efficiently converge to a critical point of the objective.
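To make the proximal step concrete, here is a minimal sketch of the basic (non-accelerated) proximal gradient iteration on a toy convex problem. It is not niAPG: the acceleration, inexactness handling and safeguards of Algorithm 1 are omitted, and the function names, step size and toy objective are all assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient(grad_f, prox, x0, eta, n_iter=200):
    """Plain proximal gradient: x <- prox_{eta r}(x - eta * grad f(x))."""
    x = x0.copy()
    for _ in range(n_iter):
        x = prox(x - eta * grad_f(x), eta)
    return x

# toy problem: f(x) = 0.5 * ||x - b||_2^2 (smooth), r(x) = beta * ||x||_1 (nonsmooth)
b = np.array([3.0, -0.5, 1.5])
beta = 1.0
x_hat = proximal_gradient(
    grad_f=lambda x: x - b,
    prox=lambda v, eta: soft_threshold(v, beta * eta),
    x0=np.zeros_like(b),
    eta=0.5,
)
```

For this toy objective the minimizer is known in closed form, namely the soft-thresholding of `b` at level `beta`, which gives a simple way to check the iteration.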

III Proposed Method

The square loss in (2) implicitly assumes that the noise is normally distributed. In this section, we relax this assumption, and assume the noise to be generated from a Gaussian mixture model (GMM). It is well-known that a GMM can approximate any continuous probability density function [30].

As in other applications of GMM, we will use the EM algorithm for inference. However, as will be shown, the M-step involves a difficult weighted CSC problem. In Section III-C, we design an efficient solver based on a nonconvex accelerated proximal algorithm, with time and space complexity comparable to state-of-the-art CSC algorithms (for the square loss).

III-A GMM Noise

We assume that the noise $\epsilon_{i}$ associated with ${x}_{i}$ follows the GMM distribution:

$$p(\epsilon_{i})=\sum_{g=1}^{G}p(\epsilon_{i}\mid\phi_{i}=g)\,p(\phi_{i}=g),$$

where $G$ is the number of Gaussian components, $\phi_{i}$ is the latent variable denoting which Gaussian component $\epsilon_{i}$ belongs to, and $\pi_{g}$ ’s are mixing coefficients with $\sum_{g=1}^{G}\pi_{g}=1$ . Variable $\phi_{i}$ follows the multinomial distribution $\text{Multinomial}(\{\pi_{g}\})$ , and the conditional distribution of $\epsilon_{i}$ given $\phi_{i}=g$ follows the normal distribution $\mathcal{N}(\mu_{g},\Sigma_{g})$ with mean $\mu_{g}$ and diagonal covariance matrix $\Sigma_{g}=\text{Diag}(\sigma^{2}_{g}(1),\dots,\sigma^{2}_{g}(P))$ . The $\ell_{1}$ regularizer in (2) corresponds to the prior Laplace distribution ( $\text{Laplace}(0,\frac{1}{\beta})$ ) on each $p$ th element of $z_{ik}$ : $p(z_{ik}(p))=\frac{\beta}{2}\exp(-\beta|z_{ik}(p)|)$ .
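The generative assumptions above can be sketched as a sampling procedure. The component count, mixing weights, means and variances below are made-up values for illustration; only the structure (one latent component per sample, diagonal covariance, Laplace prior density on code entries) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, G = 500, 8, 2                        # samples, dimension, components (illustrative)

pi = np.array([0.3, 0.7])                  # mixing coefficients, sum to 1
mu = np.stack([np.full(P, 0.01),           # mean mu_g of each component
               np.full(P, -0.005)])
sigma2 = np.stack([np.full(P, 0.01 ** 2),  # diagonal of Sigma_g for each component
                   np.full(P, 0.005 ** 2)])

# phi_i ~ Multinomial({pi_g}): one latent component index per sample
phi = rng.choice(G, size=N, p=pi)

# eps_i | phi_i = g  ~  N(mu_g, Diag(sigma2_g))
eps = mu[phi] + np.sqrt(sigma2[phi]) * rng.standard_normal((N, P))

def laplace_prior(z, beta):
    """Density p(z) = (beta / 2) * exp(-beta * |z|) placed on each code entry."""
    return 0.5 * beta * np.exp(-beta * np.abs(z))
```

Taking the negative logarithm of `laplace_prior` gives $\beta|z|$ up to a constant, which is how the $\ell_{1}$ regularizer arises.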

III-B Using Expectation Maximization (EM) Algorithm

Let $\Theta$ denote the collection of all parameters $\pi_{g}$ ’s, $\mu_{g}$ ’s, $\Sigma_{g}$ ’s, $d_{k}$ ’s and $z_{ik}$ ’s. The log posterior probability for $\Theta$ is:

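$$\log\mathcal{P}=\sum_{i=1}^{N}\Big(\log\sum_{g=1}^{G}p(x_{i}\mid\phi_{i}=g)+\log\sum_{g=1}^{G}p(\phi_{i}=g)+\sum_{k=1}^{K}\sum_{p=1}^{P}\log p(z_{ik}(p))\Big). \tag{6}$$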

This can be maximized by the Expectation Maximization (EM) algorithm [31].

The E-step computes $p(\phi_{i}\!=\!g|{x}_{i})$ , the posterior probability that $\phi_{i}$ belongs to the $g$ th Gaussian given ${x}_{i}$ . Using Bayes rule, we have

[TABLE]

where ${x}_{i}|(\phi_{i}=g)\sim\mathcal{N}(\tilde{x}_{i}+\mu_{g},\Sigma_{g})$ , and $\tilde{x}_{i}=\sum_{k=1}^{K}d_{k}*z_{ik}$ in (1).
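A possible implementation of the E-step is sketched below. It assumes the standard Bayes-rule form $\gamma_{gi}\propto\pi_{g}\,\mathcal{N}(x_{i};\tilde{x}_{i}+\mu_{g},\Sigma_{g})$ with diagonal $\Sigma_{g}$, and normalizes in the log domain for numerical stability. The function name, array layout and toy inputs are assumptions made for this example.

```python
import numpy as np

def e_step(X, X_tilde, pi, mu, sigma2):
    """Responsibilities gamma[g, i] = p(phi_i = g | x_i).

    X, X_tilde: (N, P) samples and reconstructions; pi: (G,);
    mu, sigma2: (G, P) component means and diagonal variances.
    """
    R = X - X_tilde                                   # residuals x_i - x~_i, (N, P)
    diff = R[None, :, :] - mu[:, None, :]             # (G, N, P)
    var = sigma2[:, None, :]
    # log N(x_i; x~_i + mu_g, Diag(sigma2_g)) summed over the P dimensions
    log_lik = -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=2)
    log_post = np.log(pi)[:, None] + log_lik          # unnormalized log posterior
    log_post -= log_post.max(axis=0, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_post)
    return gamma / gamma.sum(axis=0, keepdims=True)

# toy check: two well-separated components, one sample drawn at each mean
X_tilde = np.zeros((2, 4))
X = np.array([[0.0] * 4, [5.0] * 4])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0] * 4, [5.0] * 4])
sigma2 = np.ones((2, 4))
gamma = e_step(X, X_tilde, pi, mu, sigma2)
```

The cost is one pass over $G\times N\times P$ entries, in line with the $O(GNP)$ figure given in the complexity analysis.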

The M-step obtains $\Theta$ by maximizing the following lower bound of $\log\mathcal{P}$ in (6):

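$$\begin{aligned}&\arg\max_{\Theta}\sum_{i=1}^{N}\Big(\sum_{g=1}^{G}\gamma_{gi}\log\frac{p(x_{i},\phi_{i})}{\gamma_{gi}}+\sum_{k=1}^{K}\sum_{p=1}^{P}\log p(z_{ik}(p))\Big)\\=\;&\arg\max_{\Theta}\sum_{i=1}^{N}\Big(\sum_{g=1}^{G}\gamma_{gi}\log p(x_{i}\mid\phi_{i}=g)-\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}\Big)\\=\;&\arg\max_{\Theta}\sum_{i=1}^{N}\Big(-\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}+\sum_{g=1}^{G}\Big(\gamma_{gi}\log\pi_{g}-\frac{\gamma_{gi}}{2}\log(|\Sigma_{g}|)-\frac{\gamma_{gi}}{2}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)^{\top}\Sigma_{g}^{-1}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)\Big)\Big),\end{aligned} \tag{8}$$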

where $\gamma_{gi}=p(\phi_{i}=g|{x}_{i})$ .

III-B1 Updating $\pi_{g}$ ’s, $\mu_{g}$ ’s, $\Sigma_{g}$ ’s in $\Theta$

Given $d_{k}$ ’s, $z_{ik}$ ’s and $\gamma_{gi}$ ’s, and with $\tilde{x}_{i}=\sum_{k=1}^{K}d_{k}\!*\!z_{ik}$ as in (1), we obtain $\pi_{g}$ ’s, $\mu_{g}$ ’s and $\Sigma_{g}$ ’s by optimizing (8) as:

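$$\max_{\{\pi_{g},\mu_{g},\Sigma_{g}\}}\sum_{i=1}^{N}\sum_{g=1}^{G}\Big(\gamma_{gi}\log\pi_{g}-\frac{\gamma_{gi}}{2}\log(|\Sigma_{g}|)-\frac{\gamma_{gi}}{2}(x_{i}-\tilde{x}_{i}-\mu_{g})^{\top}\Sigma_{g}^{-1}(x_{i}-\tilde{x}_{i}-\mu_{g})\Big). \tag{9}$$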

Setting the derivative of the objective to zero, the following closed-form solutions can be easily obtained:

[TABLE]
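The paper's closed-form expressions are not reproduced in this copy. As a hedged illustration, the sketch below uses the standard responsibility-weighted GMM updates that maximize an objective of this form under a diagonal covariance and the constraint $\sum_{g}\pi_{g}=1$; treating these as the intended updates is an assumption. The function name `m_step_noise` and the small variance floor `eps` are also illustrative additions.

```python
import numpy as np

def m_step_noise(R, gamma, eps=1e-12):
    """Weighted GMM updates on residuals R = X - X_tilde.

    R: (N, P) residuals; gamma: (G, N) responsibilities.
    Returns pi (G,), mu (G, P), sigma2 (G, P) for a diagonal covariance.
    """
    Ng = gamma.sum(axis=1)                         # effective count per component
    pi = Ng / gamma.shape[1]                       # pi_g = sum_i gamma_gi / N
    mu = (gamma @ R) / Ng[:, None]                 # responsibility-weighted mean
    diff = R[None, :, :] - mu[:, None, :]          # (G, N, P)
    # responsibility-weighted per-dimension variance; eps guards against zero variance
    sigma2 = np.einsum('gn,gnp->gp', gamma, diff ** 2) / Ng[:, None] + eps
    return pi, mu, sigma2

# toy check with hard assignments: two residuals in component 0, one in component 1
R = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
gamma = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
pi, mu, sigma2 = m_step_noise(R, gamma)
```

With hard assignments these reduce to per-cluster proportions, means and variances, which makes the toy case easy to verify by hand.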

III-B2 Updating $d_{k}$ ’s and $z_{ik}$ ’s in $\Theta$

Given $\pi_{g}$ ’s, $\mu_{g}$ ’s, $\Sigma_{g}$ ’s and $\gamma_{gi}$ ’s, we obtain $d_{k}$ ’s and $z_{ik}$ ’s from (8) as:

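$$\min_{\{d_{k}\}\in\mathcal{D},\{z_{ik}\}}\sum_{i=1}^{N}\Big(\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}+\sum_{g=1}^{G}\frac{\gamma_{gi}}{2}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)^{\top}\Sigma_{g}^{-1}\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)\Big). \tag{11}$$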

This can be rewritten as

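$$\min_{\{d_{k}\}\in\mathcal{D},\{z_{ik}\}}F(\{d_{k}\},\{z_{ik}\})\equiv f(\{d_{k}\},\{z_{ik}\})+r(\{d_{k}\},\{z_{ik}\}), \tag{12}$$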

where

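$$f(\{d_{k}\},\{z_{ik}\})\equiv\frac{1}{2}\sum_{i=1}^{N}\sum_{g=1}^{G}\Big\|w_{gi}\odot\Big(x_{i}-\sum_{k=1}^{K}d_{k}*z_{ik}-\mu_{g}\Big)\Big\|_{F}^{2}, \tag{13}$$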

with $w_{gi}(p)=\sqrt{\frac{\gamma_{gi}}{\sigma^{2}_{g}(p)}}$ , and

$$r(\{d_{k}\},\{z_{ik}\})\equiv\beta\sum_{k=1}^{K}\|z_{ik}\|_{1}+I_{\mathcal{D}}(\{d_{k}\}). \tag{14}$$

Here, $I_{\mathcal{D}}(\cdot)$ is the indicator function on $\mathcal{D}$ (i.e., $I_{\mathcal{D}}(\{d_{k}\})=0$ if $\{d_{k}\}\in\mathcal{D}$ , and $\infty$ otherwise).

Compared with the standard CSC problem in (2), problem (12) can be viewed as a weighted CSC problem (with weight $w_{gi}$ ). A CSC variant [7] puts weights on $\sum_{k=1}^{K}d_{k}*z_{ik}$ , while ours are on ${x}_{i}-\sum_{k=1}^{K}d_{k}\!*\!z_{ik}-\mu_{g}$ . In [14], the model also leads to a weighted CSC problem. However, the authors there mentioned that it is not clear how to solve a weighted CSC problem in the frequency domain. Instead, they resorted to solving it in the spatial domain, which is less efficient as discussed in Section II-A.

III-C Solving the Weighted CSC Subproblem (12)

In this section, we solve the weighted CSC problem in (12) using niAPG [29] (Algorithm 1). Note that the weights $w_{gi}$ ’s in $f$ (13) are what prevent us from transforming the whole objective (12) to the frequency domain. Recall that (3) is transformed to (4) by first transforming everything in the $\ell_{2}$ norm to the frequency domain by Parseval’s theorem, separately computing $\mathcal{F}(x_{i})$ and $\mathcal{F}(\sum_{k=1}^{K}d_{k}\!*\!z_{ik})$ by linearity of the FFT, then replacing the convolution in $\mathcal{F}(\sum_{k=1}^{K}d_{k}\!*\!z_{ik})$ by a pointwise product using the convolution theorem. However, with $w_{gi}$ ’s, $\mathcal{F}(w_{gi}\odot(\sum_{k=1}^{K}d_{k}\!*\!z_{ik}))$ cannot be sped up by the convolution theorem. Therefore, the key in designing an efficient solver is to transform only the terms involving convolutions to the frequency domain, while leaving the weight $w_{gi}$ in the spatial domain. Hence we replace the ${x}_{i}\!-\!\sum_{k=1}^{K}d_{k}\!*\!z_{ik}\!-\!\mu_{g}$ term in (13) by $\mathcal{F}^{-1}{(\mathcal{F}{({x}_{i}-\mu_{g})}-\sum_{k=1}^{K}\mathcal{F}{(C^{\top}{d_{k}})}\odot\mathcal{F}{(z_{ik})})}$ .
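The split between frequency-domain convolution and spatial-domain weighting can be sketched for a single sample as follows. This is an illustration under stated assumptions: circular convolution of length $P$, NumPy's FFT convention, and hypothetical names and sizes. `np.fft.fft(D, n=P, axis=1)` zero-pads each filter to length $P$, playing the role of $\mathcal{F}(C^{\top}d_{k})$.

```python
import numpy as np

def weighted_loss(x, mu, w, D, Z):
    """f for one sample: 0.5 * sum_g || w_g ⊙ (x - sum_k d_k * z_k - mu_g) ||^2.

    x: (P,); mu, w: (G, P); D: (K, M) filters; Z: (K, P) codes.
    """
    P = x.shape[0]
    D_hat = np.fft.fft(D, n=P, axis=1)           # F(C^T d_k), zero-padded to P
    Z_hat = np.fft.fft(Z, axis=1)                # F(z_k)
    recon_hat = np.sum(D_hat * Z_hat, axis=0)    # sum_k F(C^T d_k) ⊙ F(z_k)
    total = 0.0
    for g in range(mu.shape[0]):
        # residual formed in the frequency domain, then mapped back
        res = np.fft.ifft(np.fft.fft(x - mu[g]) - recon_hat).real
        # the weights are applied pointwise in the spatial domain
        total += 0.5 * np.sum((w[g] * res) ** 2)
    return total

rng = np.random.default_rng(0)
P, M, K, G = 12, 3, 2, 2                         # illustrative sizes
x = rng.standard_normal(P)
mu = 0.1 * rng.standard_normal((G, P))
w = rng.uniform(0.5, 1.5, size=(G, P))
D = rng.standard_normal((K, M))
Z = rng.standard_normal((K, P))
value = weighted_loss(x, mu, w, D, Z)
```

Only FFTs and pointwise products appear, so the weights never have to be carried into the frequency domain.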

The core steps in the niAPG algorithm are computing (i) the gradient $\nabla f(\cdot)$ w.r.t. $d_{k}$ ’s and $z_{ik}$ ’s , and (ii) the proximal step $\text{prox}_{\eta r}(\cdot)$ . We first introduce the following Lemmas.

Lemma 1.

Let $f(x)=a\odot x$ for $a,x\in\mathbb{C}^{P}$ . Then, $\nabla_{x}f(x)=\text{Diag}(a^{\star})$ , which reshapes $a^{\star}$ to a diagonal matrix with elements $a^{\star}(p)$ ’s.

Proof.

$f(x)=Ax$ , where $A=\text{Diag}(a)$ . Then, $\nabla_{x}f(x)=A^{\dagger}=\text{Diag}(a^{\star})$ . ∎

Lemma 2.

[32] For $x\in\mathbb{R}^{P}$ , $\mathcal{F}(x)=\Phi x$ and $\mathcal{F}^{-1}(x)=\frac{1}{P}{\Phi}^{\dagger}x$ , where $\Phi=[\omega^{jk}]\in\mathbb{C}^{P\times P}$ , $\omega=e^{\frac{-2\pi i}{P}}$ is a primitive $P$ th root of unity, and $i=\sqrt{-1}$ . Moreover, $\nabla_{x}\mathcal{F}(x)=\Phi^{\dagger}=P\mathcal{F}^{-1}(\cdot)$ and $\nabla_{x}\mathcal{F}^{-1}(x)=\frac{1}{P}{\Phi}=\frac{1}{P}\mathcal{F}(\cdot)$ .

Proposition 3.

For $f$ in (13),

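$$\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial d_{k}}=-C\mathcal{F}^{-1}\Big(\sum_{i=1}^{N}(\mathcal{F}(z_{ik}))^{\star}\odot\mathcal{F}(u_{i})\Big), \tag{15}$$

$$\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial z_{ik}}=-\mathcal{F}^{-1}\big((\mathcal{F}(d_{k}))^{\star}\odot\mathcal{F}(u_{i})\big), \tag{16}$$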

where $u_{i}=\sum_{g=1}^{G}w_{gi}\odot w_{gi}\odot\mathcal{F}^{-1}(\mathcal{F}({x}_{i}-\mu_{g})-\sum_{k=1}^{K}\mathcal{F}(C^{\top}{d_{k}})\odot\mathcal{F}({z_{ik}}))$ .

Proof.

$f$ can be rewritten as $f(\{d_{k}\},\{z_{ik}\})=\sum_{i=1}^{N}\sum_{g=1}^{G}\frac{1}{2}\|g_{1}\|_{2}^{2}$ , where $g_{1}=w_{gi}\odot g_{2}$ , $g_{2}=\mathcal{F}^{-1}(g_{3})$ , $g_{3}=\mathcal{F}({x}_{i}-\mu_{g})-g_{4}$ , $g_{4}=\sum_{k=1}^{K}g_{6k}\odot g_{5k}$ , $g_{5k}=\mathcal{F}({z_{ik})}$ , $g_{6k}=\mathcal{F}(g_{7k})$ , $g_{7k}=C^{\top}{d_{k}}$ .

Using Lemma 1, $\frac{\partial g_{1}}{\partial g_{2}}=\text{Diag}(w_{gi})$ , $\frac{\partial g_{4}}{\partial g_{5k}}=\text{Diag}(g^{\star}_{6k})$ , and $\frac{\partial g_{4}}{\partial g_{6k}}=\text{Diag}(g^{\star}_{5k})$ . Using Lemma 2, $\frac{\partial g_{2}}{\partial g_{3}}=\frac{1}{P}{\Phi}$ , $\frac{\partial g_{5k}}{\partial z_{ik}}=\Phi^{\dagger}$ , $\frac{\partial g_{6k}}{\partial g_{7k}}=\Phi^{\dagger}$ . Finally $\frac{\partial g_{3}}{\partial g_{4}}=-1$ , $\frac{\partial g_{7k}}{\partial d_{k}}=C$ and $\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial g_{1}}=\sum_{i=1}^{N}\sum_{g=1}^{G}g_{1}=\sum_{i=1}^{N}\sum_{g=1}^{G}w_{gi}\odot\mathcal{F}^{-1}(\mathcal{F}({x}_{i}-\mu_{g})-\sum_{k=1}^{K}\mathcal{F}(C^{\top}{d_{k}})\odot\mathcal{F}({z_{ik}}))$ .

Combining all these, using chain rule for denominator layout, we obtain

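$$\begin{aligned}\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial d_{k}}&=\frac{\partial g_{7k}}{\partial d_{k}}\frac{\partial g_{6k}}{\partial g_{7k}}\frac{\partial g_{4}}{\partial g_{6k}}\frac{\partial g_{3}}{\partial g_{4}}\frac{\partial g_{2}}{\partial g_{3}}\frac{\partial g_{1}}{\partial g_{2}}\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial g_{1}}\\&=-C\mathcal{F}^{-1}\Big(\sum_{i=1}^{N}(\mathcal{F}(z_{ik}))^{\star}\odot\mathcal{F}(u_{i})\Big),\\\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial z_{ik}}&=\frac{\partial g_{5k}}{\partial z_{ik}}\frac{\partial g_{4}}{\partial g_{5k}}\frac{\partial g_{3}}{\partial g_{4}}\frac{\partial g_{2}}{\partial g_{3}}\frac{\partial g_{1}}{\partial g_{2}}\frac{\partial f(\{d_{k}\},\{z_{ik}\})}{\partial g_{1}}\\&=-\mathcal{F}^{-1}\big((\mathcal{F}(d_{k}))^{\star}\odot\mathcal{F}(u_{i})\big).\end{aligned}$$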

∎
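The gradient expressions of Proposition 3 can be checked against finite differences. The sketch below handles a single sample ($N=1$) and assumes circular convolution with NumPy's FFT convention; the residual is formed directly in the spatial domain, which equals the $\mathcal{F}^{-1}(\cdot)$ expression in $u_{i}$ by linearity. Function and variable names are illustrative.

```python
import numpy as np

def loss_and_grads(x, mu, w, D, Z):
    """Weighted loss f and its gradients w.r.t. filters D (K, M) and codes Z (K, P)."""
    P, M = x.shape[0], D.shape[1]
    D_hat = np.fft.fft(D, n=P, axis=1)                    # F(C^T d_k)
    Z_hat = np.fft.fft(Z, axis=1)                         # F(z_k)
    recon = np.fft.ifft(np.sum(D_hat * Z_hat, axis=0)).real
    res = x[None, :] - mu - recon[None, :]                # (G, P) residual per component
    f = 0.5 * np.sum((w * res) ** 2)
    u = np.sum(w * w * res, axis=0)                       # u = sum_g w_g ⊙ w_g ⊙ residual
    U_hat = np.fft.fft(u)
    # df/dz_k = -F^{-1}( conj(F(d_k)) ⊙ F(u) )
    grad_Z = -np.fft.ifft(np.conj(D_hat) * U_hat[None, :], axis=1).real
    # df/dd_k = -C F^{-1}( conj(F(z_k)) ⊙ F(u) ); C keeps the first M entries
    grad_D = -np.fft.ifft(np.conj(Z_hat) * U_hat[None, :], axis=1).real[:, :M]
    return f, grad_D, grad_Z

rng = np.random.default_rng(1)
P, M, K, G = 12, 3, 2, 2                                  # illustrative sizes
x = rng.standard_normal(P)
mu = 0.1 * rng.standard_normal((G, P))
w = rng.uniform(0.5, 1.5, size=(G, P))
D = rng.standard_normal((K, M))
Z = rng.standard_normal((K, P))
f, grad_D, grad_Z = loss_and_grads(x, mu, w, D, Z)
```

Each gradient costs a handful of FFTs plus pointwise products, in line with the complexity analysis in Section III-D.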

Note that $r(\{d_{k}\},\{z_{ik}\})$ in (14) is separable (a function $r(x,y)$ is separable if $r(x,y)=r_{1}(x)+r_{2}(y)$ ). This simplifies the associated proximal step, as shown by the following Lemma.

Lemma 4.

[28] If $r(x,y)$ is separable, $\text{prox}_{r}(v,w)=(\text{prox}_{r_{1}}(v),\text{prox}_{r_{2}}(w))$ .

Using Lemma 4, the component proximal steps can be easily computed in closed form as [28]: $\text{prox}_{\eta I_{\mathcal{D}}}(d_{k})=d_{k}/\max(\|d_{k}\|_{2},1)$ , and $\text{prox}_{\beta\eta\|\cdot\|_{1}}(z_{ik}(p))=\text{sign}(z_{ik}(p))\odot\max(|z_{ik}(p)|-\beta\eta,0)$ . In the sequel, we avoid tuning $\eta$ by using line search [33], which also speeds up convergence empirically. The procedure for solving the weighted CSC subproblem (12) is shown in Algorithm 2. The whole algorithm, which will be called general CSC (GCSC), is shown in Algorithm 3.
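The two closed-form proximal steps translate directly into code. The helper names below are illustrative; the formulas are the ones stated above.

```python
import numpy as np

def prox_unit_ball(d):
    """prox of the indicator of {d : ||d||_2 <= 1}: projection onto the unit ball."""
    return d / max(np.linalg.norm(d), 1.0)

def soft_threshold(z, t):
    """prox of t * ||.||_1 applied elementwise: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

d_big = prox_unit_ball(np.array([3.0, 4.0]))      # norm 5, rescaled to norm 1
d_small = prox_unit_ball(np.array([0.3, 0.4]))    # norm 0.5, left unchanged
z_new = soft_threshold(np.array([-2.0, 0.5, 3.0]), 1.0)
```

Because $r$ is separable, the proximal step on the pair (filters, codes) applies `prox_unit_ball` to each filter and `soft_threshold` (with threshold $\beta\eta$) to each code independently.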

III-D Complexity Analysis

In each EM iteration, the E-step in (7) takes $O(GNP)$ time. The M-step is dominated by gradient computations in (15) and (16). These take $O(NKP\log P)$ time for the underlying FFT and inverse FFT operations, and $O(GNP)$ time for the pointwise product. Thus, each EM iteration takes a total of $O(JGNP+JNKP\log P)$ time, where $J$ is the number of niAPG iterations. Empirically, $J$ is around 50. As for space, this is dominated by the $K$ $P$ -dimensional codes for each of the $N$ samples, leading to a space complexity of $O(NKP)$ .

In comparison, the state-of-the-art batch CSC method (which uses the square loss) [7] takes $O(NK^{2}P+NKP\log P)$ time per iteration and $O(NKP)$ space. Usually, $JG\ll K^{2}$ .

III-E Discussion with Existing CSC Works

Table I compares GCSC with existing CSC algorithms. The key differences are in noise modeling and algorithm design. First, all methods except GCSC and $\alpha$ CSC model the noise by a Gaussian distribution, while $\alpha$ CSC uses a symmetric alpha-stable distribution. Since a GMM can approximate any continuous distribution, the noise models considered previously are all special cases of GMM noise. Second, all algorithms except GCSC use BCD, which alternately updates the codes and dictionary, and a majority of methods then update the codes and dictionary separately by ADMM. As GCSC already has one alternating loop between the E-step and M-step, using BCD and then ADMM would bring in two more alternating loops, resulting in a much slower algorithm than existing CSC algorithms. Therefore, we use niAPG to directly update the codes and dictionary together. Empirical results in the next section validate the efficiency of solving the weighted CSC problem (12) in GCSC by niAPG, rather than BCD.

IV Experiments

In this section, we perform experiments on both synthetic and real-world data sets. Experiments are performed on a PC with Intel i7 4GHz CPU with 32GB memory.

IV-A Baseline Algorithms

The proposed GCSC is compared with the following state-of-the-art CSC methods:

1. CSC- $\ell_{2}$ [7] (http://www.cs.ubc.ca/labs/imager/tr/2015/FastFlexibleCSC/), which models noise by the Gaussian distribution.

2. Alpha-stable CSC ( $\alpha$ CSC) [14] (https://alphacsc.github.io/), which uses the symmetric alpha-stable distribution to model noise by setting the parameters of the alpha-stable distribution $\mathcal{S}(\alpha,\beta,\sigma,\mu)$ (with stability parameter $\alpha$ , skewness parameter $\beta$ , scale parameter $\sigma$ and position parameter $\mu$ ) as $\beta=0,\sigma=\frac{1}{\sqrt{2}}$ and $\mu=\tilde{x}_{i}(p)$ where $\tilde{x}_{i}$ is defined as in (1). In [14], $\alpha$ is simply set to 1.2. Here, we choose $\alpha$ by using a validation set, which is obtained by randomly sampling 20% of the samples.

As a further baseline, we also compare with CSC- $\ell_{1}$ , a CSC variant which models noise by the Laplace distribution. It is formulated as the following optimization problem:

[TABLE]

Details are in Appendix A.

We follow the automatic pruning strategy in [39] to select the number of mixture components $G$ in GCSC. We start with a relatively large $G=10$ . At each EM iteration, the relative differences between all pairs of Gaussians are computed:

[TABLE]

For the Gaussian pair with the smallest relative difference, if this value is small (less than 0.1), they are merged as

[TABLE]

Stopping Criteria

The optimization problems in $\alpha$ CSC and GCSC are solved by the EM algorithm. We stop the EM iterations when the relative change of log posterior in consecutive iterations is smaller than ${10}^{-4}$ . In the M-step, we stop the updating of weighted CSC (Algorithm 2 for GCSC, and Algorithm in Appendix B of $\alpha$ CSC paper [14]) if the relative change of the respective objective value is smaller than ${10}^{-4}$ .

The optimization problems in CSC- $\ell_{1}$ and CSC- $\ell_{2}$ are solved by BCD. Alternating minimization is stopped when the relative change of objective value ((2) for CSC- $\ell_{2}$ and (17) for CSC- $\ell_{1}$ ) in consecutive iterations is smaller than ${10}^{-4}$ . As for the optimization subproblems of $d_{k}$ ’s (given $z_{ik}$ ’s) and $z_{ik}$ ’s (given $d_{k}$ ’s), we stop when the relative change of objective value is smaller than ${10}^{-4}$ .
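The stopping rules above all share one test: the relative change of a scalar objective between consecutive iterations falls below ${10}^{-4}$ . A minimal sketch of that test follows. The text does not spell out the normalization, so dividing by the magnitude of the previous value (with a small floor `eps` to avoid division by zero) is an assumption, and the helper names are hypothetical.

```python
def relative_change(prev, curr, eps=1e-12):
    """Relative change between consecutive objective values.

    Normalizing by |prev| is an assumption; the text only says "relative change".
    """
    return abs(curr - prev) / max(abs(prev), eps)

def has_converged(prev, curr, tol=1e-4):
    """Stop when the relative change drops below the tolerance used in the text."""
    return relative_change(prev, curr) < tol
```

The same helper would apply to the log posterior in the EM loop and to the objective value in the inner weighted CSC or BCD loops.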

IV-B Synthetic Data

In this experiment, we first demonstrate the performance on synthetic data. Following [14], we use $K=3$ filters $d_{k}$ ’s (triangle, square, and sine), each of length $M=65$ (Figure 2(a)). Each $d_{k}$ is normalized to have zero mean and unit variance. Each $z_{ik}$ has only one nonzero entry, whose magnitude is uniformly drawn from $[0,1]$ (Figure 2(b)). $N=100$ clean samples, each of length $P=512$ , are generated as: ${x}^{\text{clean}}_{i}=\sum_{k=1}^{K}d_{k}\!*\!z_{ik}$ (Figure 2(c)).
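The data generation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' script: the exact filter shapes, the use of circular convolution, codes of length $P$ , and the random seed are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def make_filters(M=65):
    """K = 3 filters (triangle, square, sine), each with zero mean and unit variance."""
    t = np.linspace(0.0, 1.0, M)
    triangle = 1.0 - np.abs(2.0 * t - 1.0)
    square = np.where((t > 0.25) & (t < 0.75), 1.0, 0.0)
    sine = np.sin(2.0 * np.pi * t)
    d = np.stack([triangle, square, sine])
    d = d - d.mean(axis=1, keepdims=True)
    return d / d.std(axis=1, keepdims=True)

def make_clean_samples(d, N=100, P=512, seed=0):
    """Each code z_ik has one nonzero entry with magnitude drawn uniformly from [0, 1]."""
    rng = np.random.default_rng(seed)
    K, M = d.shape
    z = np.zeros((N, K, P))
    for i in range(N):
        for k in range(K):
            z[i, k, rng.integers(P)] = rng.uniform(0.0, 1.0)
    d_pad = np.zeros((K, P))
    d_pad[:, :M] = d                                  # zero-pad filters to length P
    prod = np.fft.fft(d_pad)[None] * np.fft.fft(z, axis=-1)
    x = np.fft.ifft(prod, axis=-1).real.sum(axis=1)   # x_i = sum_k d_k * z_ik (circular)
    return x, z

d = make_filters()
x_clean, z = make_clean_samples(d)
```

Noise of the chosen type would then be added to `x_clean` to form the observations.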

Noise is then added to generate observations ${x}_{i}$ ’s. Following [39], different types of noise are considered (Table II). The alpha-stable noise we consider follows the Cauchy distribution, which is a representative symmetric alpha-stable distribution other than the Gaussian, and should be modeled well by both $\alpha$ CSC and GCSC.

IV-B1 Quantitative Evaluation

Following [39, 40], performance is evaluated by the mean absolute error (MAE) and root mean squared error (RMSE):

[TABLE]

where $\tilde{x}_{i}=\sum_{k=1}^{K}d_{k}*z_{ik}$ is the reconstruction based on the obtained $d_{k}$ ’s and $z_{ik}$ ’s. Results are averaged over five runs with different initializations of $d_{k}$ ’s and $z_{ik}$ ’s.
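The two error measures can be sketched in NumPy as below. The paper's exact normalization is given in its own equation and is not reproduced here; averaging over all $N\times P$ entries is an assumption, and `x_ref` denotes whichever reference signals the evaluation compares against. The function names are hypothetical.

```python
import numpy as np

def mae(x_ref, x_rec):
    """Mean absolute error over all entries (averaging convention is an assumption)."""
    return np.mean(np.abs(x_ref - x_rec))

def rmse(x_ref, x_rec):
    """Root mean squared error over all entries (averaging convention is an assumption)."""
    return np.sqrt(np.mean((x_ref - x_rec) ** 2))
```

Here `x_rec` would hold the reconstructions $\tilde{x}_{i}$ stacked as rows, one per sample.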

Results are shown in Table III. When there is no noise, all methods obtain MAE and RMSE in the same order of magnitude. When there is noise, intuitively, the model whose underlying noise assumption matches the actual noise distribution will perform the best. Empirically, GCSC is the best or comparable with the best method on all types of noise.

As for time, CSC- $\ell_{2}$ and GCSC are the fastest in general. CSC- $\ell_{1}$ is slower as it needs more auxiliary variables in ADMM to handle the nonsmooth $\ell_{1}$ loss (details are in Appendix A). $\alpha$ CSC is the slowest, as it performs convolution in the spatial domain, which is costly. Moreover, it requires expensive Markov chain Monte Carlo (MCMC) in its E-step.

IV-B2 Visual Comparison

Figure 3 compares the ground truth noise with those fitted by the models. As can be seen, GCSC models the noise well for all types of noise, while the other methods only model the noise well when its underlying noise distribution matches the actual noise.

Figure 4 shows the learned filters. As can be seen, GCSC can recover the underlying filters more reliably. Figure 5 further shows the reconstructions on synthetic data with nonzero-mean noise. GCSC is the only method that denoises well and recovers the underlying clean data.

IV-B3 Solving (12): niAPG vs BCD

We first consider solving (12) in the M-step of one EM iteration. The details of the BCD solver are in Appendix B. Figure 6 shows the convergence over time when solving (12) on synthetic data with nonzero-mean noise. As shown, niAPG converges much faster than BCD. It starts reducing the objective quickly and converges to a smaller objective value. Thus, using niAPG to solve the (weighted) CSC problem is a more efficient choice.

Further, Table IV shows the performance of the whole GCSC with GMM loss when different solvers are used for (12) in the M-step. Although the two solvers obtain similar MAE and RMSE, the BCD solver takes much longer than the niAPG solver.

IV-C Local Field Potential Data

In this section, experiments are performed on two real local field potential (LFP) data sets from [14]. LFP is an electrophysiological signal recording the collective activities of a group of nearby neurons. It is closely related to cognitive mechanisms such as attention, high-level visual processing and motor control. The first signal (LFP-cortical) is recorded in the rat cortex [17], while the second one (LFP-striatal) is recorded in the rat striatum [41]. Figure 8 shows samples from these two data sets. Note that LFP-striatal contains heavier artifacts, as shown in the local segment. Following [14], we extract $N=100$ non-overlapping segments, each of length $P=2500$ , from each data set. The other preprocessing steps and parameter settings ( $K=3$ and $M=350$ ) are the same as in [14].

Figure 7 shows the learned filters. As there is no ground truth, we can only evaluate the results qualitatively. For LFP-cortical, the learned filters are similar to the local regions in segments. As for LFP-striatal, severe artifacts contaminate the filters learned by CSC- $\ell_{1}$ and CSC- $\ell_{2}$ , but do not prevent GCSC and $\alpha$ CSC from learning filters similar in shape to the clean part of the segments. We also compare the time in Table V. On both LFP-cortical and LFP-striatal, GCSC is the fastest.

IV-D Retinal Image Data

In this section, we perform vessel segmentation via pixelwise classification of retinal image data sets. The pixel on the retinal vessels is classified as 1 while the pixel on the background is classified as 0. Two popular retinal image data sets, DRIVE [42] and STARE [43] obtained from [44], are used. DRIVE contains 40 images of size $584\times 565$ and STARE contains 20 images of size $605\times 700$ . Training and testing images are split in half as in [44]. Both data sets are provided with manual segmentation results from two experts. Following [12, 44], we use the first expert’s segmentation as ground truth.

The proposed GCSC is compared with CSC- $\ell_{2}$ , CSC- $\ell_{1}$ and $\alpha$ CSC (all with $K=50$ and $M=11\times 11$ ). Following [44], each pixel, represented by the $K$ learned codes, is classified using gradient boosting [45] with 500 weak learners. From each image, 15,000 vessel pixels are sampled as positive, and 15,000 background pixels are sampled as negative. As further baselines, we compare with the provided second expert’s manual segmentation (denoted “Expert”) and the state-of-the-art handcrafted multi-scale Hessian filter (denoted “Hessian”) [46] (https://www.mathworks.com/matlabcentral/fileexchange/63171-jerman-enhancement-filter). The experiment is repeated five times.

Figure 9 shows the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Table VI shows the corresponding Area Under ROC Curve (AUC) and the best F-score. As can be seen, GCSC outperforms all the other methods. $\alpha$ CSC performs slightly better than CSC- $\ell_{1}$ and CSC- $\ell_{2}$ . The multi-scale Hessian filter is much worse. The classification performance of Expert is low, which is also noted in [12].

Figures 10 and 11 show the segmentation results from a test image from DRIVE and STARE, respectively. As can be seen, the segmentation results produced by $\alpha$ CSC, CSC- $\ell_{2}$ and CSC- $\ell_{1}$ are still noisy. The Hessian filter shows clearer vessels, but enlarges the pupil and shrinks some tiny vessels. In contrast, GCSC obtains cleaner segmented vessels.

V Conclusion

In this paper, we propose a CSC method which is able to deal with various kinds of noise. We model the noise by a Gaussian mixture model, and solve the resulting problem by the expectation-maximization algorithm. In the maximization step, the problem reduces to a weighted CSC problem, which we solve by a nonconvex and inexact accelerated proximal gradient algorithm without alternating. Extensive experiments on synthetic and real noisy biomedical data sets show that our method can model complicated noise well and in turn obtain high-quality filters and representations.

Appendix A CSC- $\ell_{1}$ : Detailed Algorithm

In robust learning, the $\ell_{1}$ loss, which corresponds to the Laplace distribution, is often used to handle outliers. Here, we present the detailed algorithm for CSC with $\ell_{1}$ loss, which is called CSC- $\ell_{1}$ in Section IV.

The objective of CSC- $\ell_{1}$ is:

[TABLE]

$d_{k}$ ’s and $z_{ik}$ ’s are updated by BCD until convergence.

A-A Dictionary Update

Given $z_{ik}$ ’s, dictionary $d_{k}$ ’s are obtained by reformulating (18) as:

[TABLE]

where $e_{i}$ ’s and $v_{k}$ ’s are auxiliary variables. This can then be solved by ADMM. We first form the augmented Lagrangian as

[TABLE]

where $\rho$ is the ADMM penalty parameter, $\theta_{k}$ ’s and $\alpha_{i}$ ’s are dual variables. At $\tau$ th iteration, ADMM alternately updates $d^{\tau}_{k}$ ’s, $e_{i}^{\tau}$ ’s, $v^{\tau}_{k}$ ’s, $\alpha^{\tau}_{i}$ ’s and $\theta^{\tau}_{k}$ ’s until convergence.

$d^{\tau}_{k}$ ’s are updated as

[TABLE]

Since convolution is more efficient in the frequency domain, we update $d_{k}$ ’s therein. Let $\eta^{\tau-1}_{i}=x_{i}-e^{\tau-1}_{i}-\frac{\alpha^{\tau-1}_{i}}{\rho}$ . We transform all variables to the frequency domain (denoted by a hat) using FFT as $\hat{d}_{k}=\mathcal{F}(C^{\top}d_{k})$ , $\hat{v}_{k}^{\tau-1}=\mathcal{F}(C^{\top}v_{k}^{\tau-1})$ , $\hat{\theta}_{k}^{\tau-1}=\mathcal{F}(C^{\top}\theta_{k}^{\tau-1})$ , $\hat{z}_{ik}=\mathcal{F}(z_{ik})$ , $\hat{\eta}_{i}^{\tau-1}=\mathcal{F}(\eta_{i}^{\tau-1})$ and $\hat{\alpha}_{i}^{\tau-1}=\mathcal{F}(\alpha_{i}^{\tau-1})$ where $C$ is the padding matrix used as in (5). Using these frequency-domain variables, (19) can be written as:

[TABLE]

This has a closed-form solution. Using the reordering trick in [21, 37], we first put all $\hat{d}_{k}$ ’s in the columns of $\hat{D}=[\hat{d}_{1},\dots,\hat{d}_{K}]$ , and similarly define $\hat{V}^{\tau-1}=[\hat{v}^{\tau-1}_{1},\dots,\hat{v}^{\tau-1}_{K}]$ , $\hat{\Theta}^{\tau-1}=[\hat{\theta}^{\tau-1}_{1},\dots,\hat{\theta}^{\tau-1}_{K}]$ , and $\hat{Z}_{i}=[\hat{z}_{i1},\dots,\hat{z}_{iK}]$ . Then the $p$ th row $\hat{D}^{\tau}(p,:)$ is updated as

[TABLE]

Then $d_{k}^{\tau}$ can be recovered as $C\mathcal{F}^{-1}(\hat{d}_{k}^{\tau})$ .
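The move between a length- $M$ filter and its length- $P$ frequency-domain counterpart can be sketched as below. This is a minimal illustration, assuming that $C^{\top}$ zero-pads at the end of the filter and that $C$ keeps the first $M$ entries; the padding position and the function names are assumptions.

```python
import numpy as np

def to_frequency(d, P):
    """Compute F(C^T d): zero-pad the length-M filter to length P, then apply the FFT."""
    padded = np.zeros(P)
    padded[:len(d)] = d
    return np.fft.fft(padded)

def from_frequency(d_hat, M):
    """Compute C F^{-1}(d_hat): inverse FFT, then crop back to the first M entries."""
    return np.fft.ifft(d_hat).real[:M]

d = np.array([1.0, -2.0, 0.5])
d_back = from_frequency(to_frequency(d, P=16), M=3)
```

The round trip returns the original filter up to floating-point error. After a frequency-domain update, however, the inverse FFT is generally nonzero outside the first $M$ entries, so the crop by $C$ discards those entries.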

Each $e^{\tau}_{i}$ is then independently updated as

[TABLE]

with closed-form solution for $p$ th element:

[TABLE]

where $\nu={x}_{i}-\sum_{k=1}^{K}d^{\tau}_{k}*z_{ik}-\frac{\alpha^{\tau-1}_{i}}{\rho}$ .

$v^{\tau}_{k}$ is updated as

[TABLE]

with closed-form solution

[TABLE]

Finally, $\alpha^{\tau}_{i}$ and $\theta^{\tau}_{k}$ are updated as:

[TABLE]

A-B Code Update

Given ${d}_{k}$ ’s, the codes ${z}_{ik}$ ’s for each sample $i$ can be obtained one by one by rewriting (18) as:

[TABLE]

where $e_{i}$ and $u_{ik}$ ’s are auxiliary variables.

Using ADMM, we introduce $\alpha_{i}$ and $\lambda_{ik}$ ’s as dual variables, then the augmented Lagrangian is constructed as

[TABLE]

At $\tau$ th iteration, ADMM alternately updates $z^{\tau}_{ik}$ ’s, $e_{i}^{\tau}$ , $u^{\tau}_{ik}$ ’s, $\alpha^{\tau}_{i}$ and $\lambda^{\tau}_{ik}$ ’s until convergence.

$z^{\tau}_{ik}$ ’s are updated as

[TABLE]

Similar to how we solve (19), we let $\eta^{\tau-1}_{i}=x_{i}-e^{\tau-1}_{i}-\frac{\alpha^{\tau-1}_{i}}{\rho}$ , and transform all variables to frequency domain as $\hat{d}_{k}=\mathcal{F}(C^{\top}d_{k})$ , $\hat{z}_{ik}=\mathcal{F}(z_{ik})$ , $\hat{u}_{ik}^{\tau-1}=\mathcal{F}(u_{ik}^{\tau-1})$ , $\hat{\lambda}_{ik}^{\tau-1}=\mathcal{F}(\lambda_{ik}^{\tau-1})$ , $\hat{\eta}_{i}^{\tau-1}=\mathcal{F}(\eta_{i}^{\tau-1})$ and $\hat{\alpha}_{i}^{\tau-1}=\mathcal{F}(\alpha_{i}^{\tau-1})$ . Then (21) can be written as:

[TABLE]

Using the reordering trick, we define $\hat{D}=[\hat{d}_{1},\dots,\hat{d}_{K}]$ , $\hat{Z}_{i}=[\hat{z}_{i1},\dots,\hat{z}_{iK}]$ , $\hat{U}^{\tau-1}_{i}=[\hat{u}^{\tau-1}_{i1},\dots,\hat{u}^{\tau-1}_{iK}]$ , and $\hat{\Lambda}^{\tau-1}_{i}=[\hat{\lambda}^{\tau-1}_{i1},\dots,\hat{\lambda}^{\tau-1}_{iK}]$ . Then the $p$ th row $\hat{Z}^{\tau}_{i}(p,:)$ is updated as

[TABLE]

Then $z^{\tau}_{ik}$ is recovered as $\mathcal{F}^{-1}(\hat{z}^{\tau}_{ik})$ .

Each $e^{\tau}_{i}$ is then independently updated as

[TABLE]

Similar to (20), $e^{\tau}_{i}(p)$ is updated in closed-form as

[TABLE]

where $\nu={x}_{i}-\sum_{k=1}^{K}d_{k}*z^{\tau}_{ik}-\frac{\alpha^{\tau-1}_{i}}{\rho}$ .

Each $u^{\tau}_{ik}$ is updated as

[TABLE]

with closed-form solution:

[TABLE]

where $\psi=z^{\tau}_{ik}+\frac{\lambda^{\tau-1}_{ik}}{\rho}$ .

Finally, $\alpha^{\tau}_{i}$ and $\lambda^{\tau}_{ik}$ are updated as:

[TABLE]

Appendix B Solving (12) by BCD

As stated in the main text, solving (12) by niAPG is faster than solving it by BCD with both $d_{k}$ ’s and $z_{ik}$ ’s updated by ADMM. Here we detail how to solve (12) by BCD.

B-A Filter Update

Given $z_{ik}$ ’s, $d_{k}$ ’s are obtained by solving the following objective:

[TABLE]

where $e_{gi}$ ’s and $v_{k}$ ’s are auxiliary variables. This can then be solved by ADMM. We first form the augmented Lagrangian as

[TABLE]

where $\rho$ is the ADMM penalty parameter, $\theta_{k}$ ’s and $\alpha_{gi}$ ’s are dual variables. At $\tau$ th iteration, ADMM alternately updates $d^{\tau}_{k}$ ’s, $e_{gi}^{\tau}$ ’s, $v^{\tau}_{k}$ ’s, $\alpha^{\tau}_{gi}$ ’s and $\theta^{\tau}_{k}$ ’s until convergence.

$d^{\tau}_{k}$ ’s are updated as

[TABLE]

Since convolution is more efficient in frequency domain, we update $d_{k}$ ’s therein. Let $\eta^{\tau-1}_{gi}=x_{i}-\mu_{g}-e^{\tau-1}_{gi}-\frac{\alpha^{\tau-1}_{gi}}{\rho}$ , and transform all variables to frequency domain (denoted as symbol with hat) using FFT as $\hat{d}_{k}=\mathcal{F}(C^{\top}d_{k})$ , $\hat{v}_{k}^{\tau-1}=\mathcal{F}(C^{\top}v_{k}^{\tau-1})$ , $\hat{\theta}_{k}^{\tau-1}=\mathcal{F}(C^{\top}\theta_{k}^{\tau-1})$ , $\hat{z}_{ik}=\mathcal{F}(z_{ik})$ , $\hat{\eta}_{gi}^{\tau-1}=\mathcal{F}(\eta_{gi}^{\tau-1})$ and $\hat{\alpha}_{gi}^{\tau-1}=\mathcal{F}(\alpha_{gi}^{\tau-1})$ . Using these frequency-domain variables, (22) can be written as:

[TABLE]

This has a closed-form solution. Using the reordering trick in [21, 37], we first put all $\hat{d}_{k}$ ’s in the columns of $\hat{D}=[\hat{d}_{1},\dots,\hat{d}_{K}]$ , and similarly define $\hat{V}^{\tau-1}=[\hat{v}^{\tau-1}_{1},\dots,\hat{v}^{\tau-1}_{K}]$ , $\hat{\Theta}^{\tau-1}=[\hat{\theta}^{\tau-1}_{1},\dots,\hat{\theta}^{\tau-1}_{K}]$ , and $\hat{Z}_{i}=[\hat{z}_{i1},\dots,\hat{z}_{iK}]$ . Then the $p$ th row $\hat{D}^{\tau}(p,:)$ is updated as

[TABLE]

where $\zeta=\sum_{g=1}^{G}\hat{\eta}^{\tau-1}_{gi}$ . $d_{k}^{\tau}$ can be recovered as $C\mathcal{F}^{-1}(\hat{d}_{k}^{\tau})$ .

Each $e^{\tau}_{gi}$ is independently updated as

[TABLE]

with closed-form solution for $p$ th element:

[TABLE]

where $\nu={x}_{i}-\sum_{k=1}^{K}d^{\tau}_{k}*z_{ik}-\mu_{g}-\frac{\alpha^{\tau-1}_{gi}}{\rho}$ .

$v^{\tau}_{k}$ is updated as

[TABLE]

with closed-form solution

[TABLE]

Finally, $\alpha^{\tau}_{gi}$ and $\theta^{\tau}_{k}$ are updated as:

[TABLE]

B-B Code Update

In the codes update subproblem, $z_{ik}$ ’s can be independently updated for each $i$ . The objective to solve is:

[TABLE]

where $e_{gi}$ ’s and $u_{ik}$ ’s are auxiliary variables.

Using ADMM, we introduce $\alpha_{gi}$ ’s and $\lambda_{ik}$ ’s as dual variables, then the augmented Lagrangian is constructed as

[TABLE]

At $\tau$ th iteration, ADMM alternately updates $z^{\tau}_{ik}$ ’s, $e_{gi}^{\tau}$ ’s, $u^{\tau}_{ik}$ ’s, $\alpha^{\tau}_{gi}$ ’s and $\lambda^{\tau}_{ik}$ ’s until convergence.

$z^{\tau}_{ik}$ ’s are updated as

[TABLE]

Let $\eta^{\tau-1}_{gi}=x_{i}-\mu_{g}-e^{\tau-1}_{gi}-\frac{\alpha^{\tau-1}_{gi}}{\rho}$ , and transform all variables to frequency domain as $\hat{d}_{k}=\mathcal{F}(C^{\top}d_{k})$ , $\hat{z}_{ik}=\mathcal{F}(z_{ik})$ , $\hat{u}_{ik}^{\tau-1}=\mathcal{F}(u_{ik}^{\tau-1})$ , $\hat{\lambda}_{ik}^{\tau-1}=\mathcal{F}(\lambda_{ik}^{\tau-1})$ , $\hat{\eta}_{gi}^{\tau-1}=\mathcal{F}(\eta_{gi}^{\tau-1})$ and $\hat{\alpha}_{gi}^{\tau-1}=\mathcal{F}(\alpha_{gi}^{\tau-1})$ . Then (24) can be written as:

[TABLE]

Using the reordering trick, we define $\hat{D}=[\hat{d}_{1},\dots,\hat{d}_{K}]$ , $\hat{Z}_{i}=[\hat{z}_{i1},\dots,\hat{z}_{iK}]$ , $\hat{U}^{\tau-1}_{i}=[\hat{u}^{\tau-1}_{i1},\dots,\hat{u}^{\tau-1}_{iK}]$ , and $\hat{\Lambda}^{\tau-1}_{i}=[\hat{\lambda}^{\tau-1}_{i1},\dots,\hat{\lambda}^{\tau-1}_{iK}]$ . Then the $p$ th row $\hat{Z}^{\tau}_{i}(p,:)$ is updated as

[TABLE]

where $\zeta=\sum_{g=1}^{G}\hat{\eta}_{gi}^{\tau-1}(p)$ . $z^{\tau}_{ik}$ is recovered as $\mathcal{F}^{-1}(\hat{z}^{\tau}_{ik})$ .

Each $e^{\tau}_{gi}$ is independently updated as

[TABLE]

Then similar to (23), $e^{\tau}_{gi}(p)$ is updated as

[TABLE]

where $\nu={x}_{i}-\sum_{k=1}^{K}d_{k}*z^{\tau}_{ik}-\mu_{g}-\frac{\alpha^{\tau-1}_{gi}}{\rho}$ .

Each $u^{\tau}_{ik}$ is updated as

[TABLE]

with closed-form solution:

[TABLE]

where $\psi=z^{\tau}_{ik}+\frac{\lambda^{\tau-1}_{ik}}{\rho}$ .

Finally, $\alpha^{\tau}_{gi}$ and $\lambda^{\tau}_{ik}$ are updated as:

[TABLE]

Bibliography

[1] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
[2] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems, 2007, pp. 801–808.
[3] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Non-local sparse models for image restoration,” in International Conference on Computer Vision, 2009, pp. 2272–2279.
[4] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Conference on Computer Vision and Pattern Recognition, 2009, pp. 1794–1801.
[5] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus, “Deconvolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2528–2535.
[6] Y. Zhu and S. Lucey, “Convolutional sparse coding for trajectory reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 529–540, 2015.
[7] F. Heide, W. Heidrich, and G. Wetzstein, “Fast and flexible convolutional sparse coding,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5135–5143.
[8] A. Cogliati, Z. Duan, and B. Wohlberg, “Context-dependent piano music transcription with convolutional sparse coding,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 24, no. 12, pp. 2218–2230, 2016.
