Convolutional Dictionary Learning in Hierarchical Networks
Javier Zazo, Bahareh Tolooshams, Demba Ba

TL;DR
This paper introduces a hierarchical deep generative model for piecewise smooth signals, combining convolutional dictionary learning with a recursive multi-scale structure, and demonstrates its effectiveness on image data.
Contribution
It proposes a novel hierarchical convolutional dictionary learning framework that integrates sparse coding with deep neural network structures for modeling images.
Findings
Model captures multi-scale image features effectively.
Learned features improve classification performance on MNIST.
Algorithm efficiently alternates between coefficient estimation and filter updates.
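Table 1: Classification accuracy on MNIST and number of trainable parameters.

| Network model | Train accuracy (%) | Test accuracy (%) | Trainable parameters |
|---|---|---|---|
| 1 layer | 98.41 | 97.47 | 800 |
| 3 layers (tied) | 99.10 | 98.11 | 800 |
| 3 layers (ML-CSC) | | 98.85 | 1,664,800 |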
Abstract
Filter banks are a popular tool for the analysis of piecewise smooth signals such as natural images. Motivated by the empirically observed properties of scale and detail coefficients of images in the wavelet domain, we propose a hierarchical deep generative model of piecewise smooth signals that is a recursion across scales: the low pass scale coefficients at one layer are obtained by filtering the scale coefficients at the next layer, and adding a high pass detail innovation obtained by filtering a sparse vector. This recursion describes a linear dynamic system that is a non-Gaussian Markov process across scales and is closely related to multilayer-convolutional sparse coding (ML-CSC) generative model for deep networks, except that our model allows for deeper architectures, and combines sparse and non-sparse signal representations. We propose an alternating minimization algorithm for learning the filters in this hierarchical model given observations at layer zero, e.g., natural images. The algorithm alternates between a coefficient-estimation step and a filter update step. The coefficient update step performs sparse (detail) and smooth (scale) coding and, when unfolded, leads to a deep neural network. We use MNIST to demonstrate the representation capabilities of the model, and its derived features (coefficients) for classification.
**Index Terms:** Convolutional dictionary learning, sparse coding, deep networks, hierarchical models.
1 Introduction
With the advent of neural networks and current state-of-the-art performance on many machine learning applications [1], deep learning has become a ubiquitous framework with which to address problems in a wide range of domains. In particular, convolutional neural networks (CNNs) have been very successful for image classification [2], as they are able to reduce the number of trainable parameters, and still capture latent representations for discriminative problems.
However, little is known about how to obtain more efficient representations, or how to train smaller networks that perform as well as CNNs without requiring exhaustive architecture search [3]. The importance of such an advancement lies not only in obtaining more systematic, interpretable and efficient models, but also in reducing the economic and ecological footprint of neural networks [4].
Representation analysis with wavelets is a classical, interpretable and well-understood theory that decomposes images or signals into a linear combination of basis functions at different scales, and recovers them perfectly via convolution operations [5, 6].
Alternatively, convolutional sparse coding (CSC) [8] and convolutional dictionary learning (CDL) [9] use sparse representations of images and dictionaries learned from the images in the database. More recently, [10] uses sparse representations and dictionaries obtained at different levels of a network.
This paper proposes a convolutional generative hierarchical model of signals, e.g., images, where the filters are learned from the data, and the images are decomposed into scale and detail signals. Additionally, there is a one-to-one correspondence between the sparse coding, or inference step, and deep CNNs. Such representation is inspired by a combination of wavelet analysis [6], sparse coding [11], and dictionary learning [12]. The scale consists of a dense signal, while the detail is a sparse representation that selects a few dictionary atoms.
Related work: A notable precursor to our work is the deconvolutional network [13], which describes a generative hierarchical model inspired by CNNs where, starting with a sparse signal encoding, an image is obtained through a cascade of convolutional operations. However, a significant limitation of such networks is that the attainable level of sparsity at each layer decreases with the depth of the architecture.
The formal analysis of CSC was introduced by [14], which developed sufficient theoretical guarantees for exact sparse signal recovery. The sparse coding problem is non-convex and NP-hard [15], but when specific sparsity levels of the signal are satisfied (w.r.t. the mutual coherence of the employed dictionaries) an optimal solution can be retrieved [11]. In such a case, a relaxed formulation of the problem yields optimal results, making it possible to recover the generating signal. However, the feature maps (signal encodings) from stacks of convolutional layers become less sparse as the model becomes deeper, which makes the sparse inverse problem harder to solve with increasing depth.
The work by Sulam et al. [10] addresses this problem of multilayer CSC (ML-CSC) by also enforcing sparsity across dictionaries. If the convolutional filters are very sparse, the subsequent layers will contain a reduced number of non-zeros, and some guarantees for a unique representation can be established. However, this model requires dictionary sparsity levels above 99%, and a large number of channels.
Our proposed model builds on top of existing results by considering an inverse problem that decomposes a source image into smooth and sparse signals jointly. As previously mentioned, this decomposition is quite natural in wavelet analysis, or scattering networks [7], and combines the tools of CSC and CDL together. We remark that the deconvolutional model [13], and the ML-CSC model [10] cannot have arbitrary depths, because of limitations on the signal sparsity levels. In contrast, our proposed model is not limited by depth as sparse details are added separately at every layer, and the recursion is on the scale, which does not have to be sparse.
2 Model Description
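Given a scale signal $\mathbf{x}_\ell$ and detail signals $\mathbf{u}_\ell$, we propose the following recursive generative model
$$\mathbf{x}_{\ell-1} = \mathbf{A}_\ell * \mathbf{x}_\ell + \mathbf{B}_\ell * \mathbf{u}_\ell + \boldsymbol{\varepsilon}_\ell, \quad \forall \ell \in \{1,\dots,L\}, \tag{1}$$
and assume the following latent prior distributions:
$$\begin{aligned}
\boldsymbol{\varepsilon}_\ell &\sim \mathcal{N}(0, \sigma_\ell^2), \quad \forall \ell \in \{1,\dots,L\}, && \text{(2a)} \\
\mathbf{u}_\ell &\sim \mathrm{Laplace}(0, \lambda_\ell), && \text{(2b)} \\
\mathbf{x}_\ell &\sim \mathcal{N}(0, \sigma_{x_\ell}^2). && \text{(2c)}
\end{aligned}$$
Here, $\ell$ indicates the layer index, where $\ell = 0$ refers to the input signal, and $\ell > 0$ refers to a deeper encoding. The model has total depth $L$, and $*$ indicates the full convolution operation between two signals. We remark that the model given by Eq. (1) is a non-Gaussian Markovian dynamical system [16].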
We represent filters with capital bold letters, i.e., $\mathbf{A}_\ell$ and $\mathbf{B}_\ell$, and signals with lower case bold letters, i.e., $\mathbf{x}_\ell$, $\mathbf{u}_\ell$. Filters are four-dimensional tensors whose dimensions refer to the number of output channels, depth (input channels), height and width of the filters, respectively. Explicitly, the convolution operation involves the following computation,
$$(\mathbf{A}_\ell * \mathbf{x}_\ell)[c] = \sum_{d} \mathbf{A}_\ell[c, d] * \mathbf{x}_\ell[d],$$
where $d$ indexes the input feature map channels, and $c$ indicates output channels. We simplify the whole convolution operation by simply writing $\mathbf{A}_\ell * \mathbf{x}_\ell$.
Eq. (1) indicates a recursive relation between input and output signals across layers. We refer to $\mathbf{x}_\ell$ as the scale signal at layer $\ell$. It contributes smoothly to $\mathbf{x}_{\ell-1}$ because of the averaging effect of convolution and its non-sparse form.
Eq. (2) further specifies priors for our model. For example, $\mathbf{u}_\ell$ is assumed to follow a Laplace distribution of mean zero and diversity coefficient $\lambda_\ell$. This prior supports that $\mathbf{u}_\ell$ is sparse, and adds high frequency information to the signal $\mathbf{x}_{\ell-1}$ when convolved with the detail filter. We refer to $\mathbf{u}_\ell$ as the detail signal, following wavelet terminology.
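The recursion described above can be illustrated with a minimal one-dimensional, single-channel sketch. It is not the authors' implementation: the function name `synthesize`, the filter length, the signal sizes and the noise level are all assumptions chosen for illustration, and NumPy's full convolution stands in for the multi-channel 2D operation.

```python
import numpy as np

def synthesize(x_top, details, scale_filters, detail_filters, noise_std=0.0, rng=None):
    """Run x_{l-1} = a_l * x_l + b_l * u_l + eps_l from layer L down to layer 0.

    All lists are ordered by layer index 1..L (hypothetical convention).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x_top, dtype=float)
    layers = zip(reversed(scale_filters), reversed(detail_filters), reversed(details))
    for a, b, u in layers:
        # Full convolution lengthens the signal by len(filter) - 1 at every layer.
        x = np.convolve(a, x, mode="full") + np.convolve(b, u, mode="full")
        x = x + noise_std * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
k, depth, n0 = 5, 3, 32                                   # assumed sizes
sizes = [n0 - l * (k - 1) for l in range(1, depth + 1)]   # lengths of layers 1..L
a = np.ones(k) / k                                        # tied averaging (scale) filter
b = rng.standard_normal(k)                                # tied detail filter
details = [rng.laplace(0.0, 0.1, size=n) for n in sizes]  # Laplace-distributed details
x_top = rng.standard_normal(sizes[-1])                    # Gaussian scale signal at layer L
x0 = synthesize(x_top, details, [a] * depth, [b] * depth, noise_std=0.01, rng=rng)
```

Passing the same filter at every layer, as in the usage lines, corresponds to the tied variant; passing a different filter per layer gives the untied one.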
Hierarchical model with tied filters: A simple modification can be made to Eq. (1) to resemble wavelet analysis, by tying the scale and detail filters across layers, similar to a wavelet being repeatedly used across decomposition levels. Additionally, up/down sampling operations can be incorporated to obtain a multiscale CSC/CDL model, although such analysis is outside the scope of this paper.
Figure 1 summarizes the model described by Eqs. (1) and (2) for three layers. As we noted in the related work (Section 1), our model does not have limitations in terms of attainable depth restricted by sparsity requirements. This is because all terms are independent across layers.
Finally, we note that by setting for all layers in Eq. (1) and establishing a prior such that (), our model simplifies to a deconvolutional network [13]. By further imposing sparsity on the filters , then our model simplifies to ML-CSC [10].
3 Hierarchical CSC
We can synthesize images from the scale representation and the detail signals across layers using Eq. (1). However, the analysis step requires solving the inverse problem to find appropriate encodings for an image across layers, i.e., $\mathbf{x}_\ell$ and $\mathbf{u}_\ell$. We refer to this problem as hierarchical convolutional sparse coding (H-CSC).
The correct representation for scale and detail signals can be obtained by maximizing the log-posterior of the state sequence and the input under our model (assuming fixed scale and detail filters). The problem is coupled across layers by the scale signals, but for simplicity we solve each layer separately, similar to [10, 13]. This leads to the following problem
[TABLE]
[TABLE]
for every layer $\ell$. Here, $\mathbf{x}_0$ is given as the input image, and subsequent estimates are obtained after solving Eq. (3).
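As a hedged reconstruction from the stated priors, and not necessarily the exact form of Eq. (3), the per-layer negative log-posterior for fixed filters can be written out as below. The filter symbols $\mathbf{A}_\ell$, $\mathbf{B}_\ell$ and the weights $\lambda$, $\gamma$ are assumed names, and the Laplace density is assumed to be parameterized by its scale, $p(u) \propto \exp(-|u|/\lambda_\ell)$.

```latex
\begin{aligned}
-\log p(\mathbf{x}_\ell, \mathbf{u}_\ell \mid \mathbf{x}_{\ell-1})
  &= \frac{1}{2\sigma_\ell^2}
     \bigl\lVert \mathbf{x}_{\ell-1} - \mathbf{A}_\ell * \mathbf{x}_\ell
                 - \mathbf{B}_\ell * \mathbf{u}_\ell \bigr\rVert_2^2
   + \frac{1}{\lambda_\ell} \lVert \mathbf{u}_\ell \rVert_1
   + \frac{1}{2\sigma_{x_\ell}^2} \lVert \mathbf{x}_\ell \rVert_2^2
   + \text{const}, \\[4pt]
\text{so that}\quad
(\hat{\mathbf{x}}_\ell, \hat{\mathbf{u}}_\ell)
  &= \arg\min_{\mathbf{x}_\ell,\, \mathbf{u}_\ell}\;
     \tfrac{1}{2} \bigl\lVert \mathbf{x}_{\ell-1} - \mathbf{A}_\ell * \mathbf{x}_\ell
                 - \mathbf{B}_\ell * \mathbf{u}_\ell \bigr\rVert_2^2
   + \lambda \lVert \mathbf{u}_\ell \rVert_1
   + \tfrac{\gamma}{2} \lVert \mathbf{x}_\ell \rVert_2^2,
\qquad
\lambda = \frac{\sigma_\ell^2}{\lambda_\ell},\quad
\gamma = \frac{\sigma_\ell^2}{\sigma_{x_\ell}^2}.
\end{aligned}
```

The Gaussian noise yields the quadratic data-fidelity term, the Laplace prior yields the $\ell_1$ penalty on the detail signal, and the Gaussian prior on the scale signal yields the quadratic (smoothness) penalty.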
Relationship with CNNs and ReLU activations: Eq. (3) incorporates two important regularizers into the model. The $\ell_1$-norm enforces sparsity on the detail signal $\mathbf{u}_\ell$, and accomplishes this result with high resemblance to standard CNNs. Consider the solution to the following problem,
$$\min_{\mathbf{u}} \; \tfrac{1}{2} \lVert \mathbf{u} - \mathbf{b} \rVert_2^2 + \lambda \lVert \mathbf{u} \rVert_1,$$
which can be written succinctly via soft-thresholding:
$$\mathcal{S}_\lambda(\mathbf{b}) = \mathrm{ReLU}(\mathbf{b} - \lambda) - \mathrm{ReLU}(-\mathbf{b} - \lambda).$$
The equivalence with the soft-thresholding operation emphasizes that CNNs with ReLU activations have a similar response to regularized problems like Eq. (3). This remark has been previously discussed in [17, 18], and justifies our motivation to induce a sparse prior on $\mathbf{u}_\ell$. A second regularizer conveys smoothness, but also guarantees uniqueness of the solution when the data-fidelity term alone is not strongly convex.
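The ReLU form of soft-thresholding can be checked numerically against the more common sign-and-shrink expression. The following is a small NumPy sketch; the function names and the test vector are illustrative assumptions.

```python
import numpy as np

def relu(v):
    """Elementwise rectified linear unit."""
    return np.maximum(v, 0.0)

def soft_threshold(b, lam):
    """Soft-thresholding written as a difference of two shifted ReLUs."""
    return relu(b - lam) - relu(-b - lam)

def soft_threshold_reference(b, lam):
    """Textbook form: shrink magnitudes by lam and keep the sign."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
shrunk = soft_threshold(b, 1.0)   # entries with |b| <= 1 are set to zero
```

Entries whose magnitude is below the threshold become exactly zero, which is how the $\ell_1$ penalty produces sparse detail signals.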
FISTA derivation: Because Eq. (3) is non-smooth and convolutional, it can be solved via iterative proximal algorithms. These methods evaluate the gradient on the smooth part of the function, and apply a proximal operation on the non-smooth part (such as soft-thresholding). Accelerated algorithms like FISTA [19] incorporate past estimates in the update formula, and achieve faster convergence speeds.
The FISTA algorithm admits an efficient implementation on GPUs with known gradients, and can parallelize the computations across examples. This is particularly useful because GPUs implement convolution operations efficiently, and we can exploit these subroutines to reduce the computational requirements. The whole procedure is detailed in Algorithm 1, which uses an appropriate step-size. The algorithmic derivation requires obtaining the gradients of Eq. (4) w.r.t. $\mathbf{x}_\ell$ and $\mathbf{u}_\ell$. Specifically, we get
[TABLE]
[TABLE]
where refers to valid correlation between two signals.
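A minimal sketch of one layer of this coding step is given below, for one-dimensional single-channel signals. It is an illustration under assumptions rather than the paper's Algorithm 1: the objective is taken to be a least-squares fit with an $\ell_1$ penalty on the detail signal and a quadratic penalty on the scale signal, the names `fista_layer`, `lam` and `gamma` are hypothetical, and the step size is a conservative bound derived from the filter $\ell_1$ norms. Valid correlation appears as the adjoint of full convolution.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(y, a, b, x, u, lam, gamma):
    r = y - np.convolve(a, x) - np.convolve(b, u)
    return 0.5 * r @ r + lam * np.abs(u).sum() + 0.5 * gamma * x @ x

def fista_layer(y, a, b, lam=0.1, gamma=0.1, n_iter=200):
    """Estimate scale x and sparse detail u such that y ~ a * x + b * u.

    Assumes len(a) == len(b); both codes have length len(y) - len(a) + 1.
    """
    n = len(y) - len(a) + 1
    x, u = np.zeros(n), np.zeros(n)
    zx, zu = x.copy(), u.copy()          # extrapolated (momentum) points
    # Upper bound on the Lipschitz constant of the smooth part.
    step = 1.0 / (np.abs(a).sum() ** 2 + np.abs(b).sum() ** 2 + gamma)
    t = 1.0
    for _ in range(n_iter):
        r = y - np.convolve(a, zx) - np.convolve(b, zu)
        grad_x = -np.correlate(r, a, mode="valid") + gamma * zx
        grad_u = -np.correlate(r, b, mode="valid")
        x_new = zx - step * grad_x                                 # gradient step (smooth)
        u_new = soft_threshold(zu - step * grad_u, step * lam)     # proximal step (sparse)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        zx = x_new + (t - 1.0) / t_new * (x_new - x)
        zu = u_new + (t - 1.0) / t_new * (u_new - u)
        x, u, t = x_new, u_new, t_new
    return x, u

rng = np.random.default_rng(0)
a = np.ones(5) / 5.0                              # assumed averaging scale filter
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0])          # assumed difference detail filter
x_true = rng.standard_normal(24)
u_true = np.zeros(24)
u_true[[3, 15]] = [2.0, -1.5]
y = np.convolve(a, x_true) + np.convolve(b, u_true)
x_hat, u_hat = fista_layer(y, a, b)
```

Unfolding the loop for a fixed number of iterations gives a feed-forward network whose nonlinearity is the soft-thresholding step.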
4 Convolutional Dictionary Learning
The negative of the log-posterior that results from the generative model presented by Eqs. (1) and (2) is non-convex (bilinear) in the filters and the variables $\mathbf{x}_\ell$, $\mathbf{u}_\ell$ across layers. However, it is natural to propose an alternating optimization scheme that solves the problem for specific variables while fixing the rest of them. We already described in Section 3 how to solve for the variables $\mathbf{x}_\ell$ and $\mathbf{u}_\ell$ with fixed scale and detail filters, which we referred to as the analysis step.
To update the filter variables, a simple approach consists of fixing the scale and detail signals and updating the filters with a first-order gradient method. The loss function is a concatenation of example images solving Eq. (3). With a slight abuse of notation, we denote , the ’th encoding estimate across a database :
[TABLE]
Updating the scale and detail filters requires computing gradients from Eq. (8). Current approaches can exploit the autoencoder relation of a generative model (first finding a latent representation, then reconstructing) to obtain gradients through backpropagation and automatic differentiation [20]. This autoencoder formulation makes it possible to use GPUs directly with appropriate automatic differentiation software.
A second approach computes gradients in the Fourier domain, updates the filters, and converts the updated filters back to the time domain via the inverse Fourier transform [9]. To the best of our knowledge, this procedure does not currently run efficiently on GPUs, making it unsuitable for large datasets or images.
Filter gradients: Our approach computes the gradients directly on the loss function in Eq. (8). This mechanism avoids automatic differentiation, which uses GPU memory and computation time, and also avoids converting to the Fourier domain and back on every update. Filters are four-dimensional tensors, and images are three-dimensional. To compute the gradient w.r.t. the filter, we can extend the image with an extra dimension, and operate with 3D correlation (rather than 2D). Then, we obtain the exact gradient expression as follows:
[TABLE]
Here, each of the variables is extended with the extra dimension. The previous expression provides an efficient way to compute gradients directly on GPU and update filters accordingly. The gradient w.r.t. the other filter has a form analogous to Eq. (9).
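The idea of a closed-form filter gradient can be sketched in one dimension with a single channel, where it reduces to a valid correlation between the reconstruction residual and the corresponding code. This is a simplified illustration, not the paper's Eq. (9): the function names, the squared-error loss without regularizers, and the learning rate are assumptions, and the multi-channel 3D correlation is not reproduced.

```python
import numpy as np

def reconstruction_loss(y, a, b, x, u):
    """Squared reconstruction error for fixed codes x, u and filters a, b."""
    r = y - np.convolve(a, x) - np.convolve(b, u)
    return 0.5 * r @ r

def filter_gradients(y, a, b, x, u):
    """Exact gradients of the loss w.r.t. both filters, without autodiff."""
    r = y - np.convolve(a, x) - np.convolve(b, u)
    grad_a = -np.correlate(r, x, mode="valid")   # one entry per filter tap
    grad_b = -np.correlate(r, u, mode="valid")
    return grad_a, grad_b

rng = np.random.default_rng(1)
x = rng.standard_normal(20)        # scale code (held fixed during the update)
u = rng.standard_normal(20)        # detail code (held fixed during the update)
a = rng.standard_normal(5)
b = rng.standard_normal(5)
y = rng.standard_normal(24)        # observed signal, length 20 + 5 - 1

grad_a, grad_b = filter_gradients(y, a, b, x, u)
lr = 0.01                          # assumed learning rate
a_next, b_next = a - lr * grad_a, b - lr * grad_b   # one gradient-descent step
```

A finite-difference comparison on a single filter tap is a quick way to confirm that such a closed-form expression matches the loss it is derived from.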
5 Experimental Results
To illustrate our results, we trained a set of hierarchical models of different depths on the MNIST database, comprising 60,000 training and 10,000 test grayscale digit images of $28 \times 28$ pixels. The training step uses Algorithm 1 to solve the inverse problem, and then updates the filters with stochastic gradient descent following Section 4. The whole filter training procedure is unsupervised, and minimizes the reconstruction error of the input images.
After the model had converged, we used the H-CSC features to train a multiclass logistic regression classifier, using the estimated encodings as inputs. Our classification results are shown in Table 1, reaching an accuracy above 98.1% on the test set with a 3-layer hierarchical model and tied filters. We note that our classification results are similar to those reported by ML-CSC [10].
We also indicate the number of trainable parameters of each model in Table 1. The multiscale model has significantly fewer trainable parameters because the filters are repeated between layers. On the other hand, ML-CSC used a 3-layer network with 1,664,800 trainable parameters, although most of them are zero. What we show is that our model is capable of finding appropriate encodings with a recursive structure, tied filters and a reduced number of parameters.
Simulation parameters: Parameters of the model and algorithm were chosen with grid search, for a total of 12 simulated models. The same parameters were used in all layers. The FISTA regularization was varied between 1.0 and , and was selected as giving the best accuracy. This result indicates that high sparsity levels help obtain better classification performance.
The FISTA learning rate was chosen between 0.01 and 0.001, and yielded best results. Similarly, we unfolded the network for FISTA iterations at every layer for every dictionary gradient update, where the best parameter was . A larger number of FISTA iterations did not help achieve better accuracy results.
Finally, the scale filter was selected with spatial dimensions, and single input and output channels. The single channel was fixed to a constant value during training. This construction aimed to detect the low frequencies of the images. On the other hand, the detail filter had same spatial dimensions and 32 output channels. We found experimentally that reducing the number of scale filters improved the classification performance. Still, having at least a single scale filter allows our model to reach arbitrary depths.
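The reported count of 800 trainable parameters is consistent with a simple back-of-the-envelope check, sketched below. The $5 \times 5$ spatial size is an inference from $800 / 32 = 25$ and is not stated in the surviving text; the check also assumes that the detail filter has a single channel on its other side and that the fixed scale filter contributes no trainable parameters.

```python
def conv_filter_params(out_channels, in_channels, height, width):
    """Number of weights in a convolutional filter bank without biases."""
    return out_channels * in_channels * height * width

# Assumed shapes: 32 detail channels, 1 image channel, 5 x 5 kernels (inferred).
detail_params = conv_filter_params(32, 1, 5, 5)

# With tied filters the same bank is reused at every layer, so depth adds nothing.
tied_three_layer_params = detail_params
```

Under these assumptions both the 1-layer and the tied 3-layer models have the same count, which matches the two rows of Table 1.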
Our implementations are built on PyTorch and run on NVIDIA GTX 1080 or Titan Xp GPUs.
Visualization: In Figures 2 and 3, we provide a visualization of the encoded features derived by the hierarchical model with untied filters. Figure 2 shows the scale representation and the corresponding learned filters with dimensions. We can observe that the filters learn a varied set of features, where some seem to be low frequency, but others are high frequency as well. This visualization indicates that as the number of scale filters increases, the channels become more expressive and the representation error decreases.
Figure 3 displays the detail signal and filter on 16 channels. We observe that the encodings are very sparse, and that the filters become more specialized and use more contrasting shapes. This shows that the filters learned for the scale and detail signals are of different kinds.
6 Conclusion
We proposed a generative convolutional model to analyze signals based on a smooth representation (scale) and sparse contributions (detail). This model used a recursive procedure where the scale signals were further decomposed into subsequent scale and detail components, providing higher order representations. Such a decomposition used a hierarchical structure of filters, which can be shared between layers (tied) or independent (untied). Tied filters employed fewer trainable parameters and resembled the analytical process of wavelets. We evaluated the model on a classification task on MNIST and reached 98.1% accuracy using only 800 trainable parameters. Future work will extend these systems with up/down sampling operations to obtain multiscale representations with trainable filters.
References
- [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436, 2015.
- [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- [3] Jonathan Frankle and Michael Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” arXiv preprint arXiv:1803.03635, 2018.
- [4] Emma Strubell, Ananya Ganesh, and Andrew McCallum, “Energy and policy considerations for deep learning in NLP,” arXiv preprint arXiv:1906.02243, 2019.
- [5] Stéphane Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, pp. 674–693, 1989.
- [6] Stéphane Mallat, A Wavelet Tour of Signal Processing, Elsevier, 1999.
- [7] Joan Bruna and Stéphane Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
- [8] Hilton Bristow, Anders Eriksson, and Simon Lucey, “Fast convolutional sparse coding,” in Proc. 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 391–398.
