Stochastic Subsampling for Factorizing Huge Matrices

Arthur Mensch (PARIETAL; NEUROSPIN); Julien Mairal (Thoth); Bertrand Thirion (PARIETAL; NEUROSPIN); Gaël Varoquaux (NEUROSPIN; PARIETAL)

arXiv:1701.05363·stat.ML·November 15, 2017

TL;DR

This paper introduces a scalable matrix-factorization algorithm that efficiently handles massive matrices by streaming and subsampling, with proven convergence and demonstrated success on large real-world datasets.

Contribution

The proposed method combines streaming and subsampling techniques for scalable matrix factorization with convergence guarantees, suitable for large-scale data and various factor types.

Findings

01

Achieves significant speed-ups over state-of-the-art algorithms.

02

Successfully applied to 2 TB MRI data and 103 GB hyperspectral images.

03

Provides convergence guarantees to a stationary point.

Abstract

We present a matrix-factorization algorithm that scales to input matrices with both huge number of rows and columns. Learned factors may be sparse or dense and/or non-negative, which makes our algorithm suitable for dictionary learning, sparse component analysis, and non-negative matrix factorization. Our algorithm streams matrix columns while subsampling them to iteratively learn the matrix factors. At each iteration, the row dimension of a new sample is reduced by subsampling, resulting in lower time complexity compared to a simple streaming algorithm. Our method comes with convergence guarantees to reach a stationary point of the matrix-factorization problem. We demonstrate its efficiency on massive functional Magnetic Resonance Imaging data (2 TB), and on patches extracted from hyperspectral images (103 GB). For both problems, which involve different penalties on rows and columns,…

Tables

Table 1. TABLE I: Comparison of estimators used for code computation

Columns: Est. | $\boldsymbol{\beta}_{t}$ | ${\mathbf{G}}_{t}$ | Convergence | Extra mem. cost | 1st epoch perform.
(a): Masked; ✓
(b): Averaged; ✓; $n k^{2}$; ✓
(c): Averaged; Exact; ✓; $n k$

Table 2. TABLE II: Summary of experimental settings

Field	Functional MRI		Hyperspectral imaging
Dataset	ADHD	HCP	Patches from AVIRIS
Factors	$𝐃$ sparse, $𝐀$ dense		$𝐃$ dense, $𝐀$ sparse
# samples $n$	$7 \cdot 10^{3}$	$2 \cdot 10^{6}$	$2 \cdot 10^{6}$
# features $p$	$6 \cdot 10^{4}$	$2 \cdot 10^{5}$	$6 \cdot 10^{4}$
$𝐗$ size	2 GB	2 TB	103 GB
Use case ex.	Extracting predictive feature		Recognition / denoising

Table 3. TABLE III: Time to reach convergence ($<1\%$ test objective)

Dataset	ADHD		AVIRIS (NMF)		AVIRIS (DL)		HCP
Algorithm	omf	somf	omf	somf	omf	somf	omf	somf
Conv. time	$6 min$	$𝟐𝟖 s$	$2 h 30$	$𝟒𝟑 min$	$1 h 16$	$𝟏𝟏 min$	$3 h 50$	$𝟏𝟕 min$
Speed-up	11.8		3.36		6.80		13.31

Equations

{\mathbf{X}}\approx{\mathbf{D}}{\mathbf{A}}\quad\text{with}\quad{\mathbf{D}}\in{\mathbb{R}}^{p\times k}\text{ and }{\mathbf{A}}\in{\mathbb{R}}^{k\times n},

\min_{\begin{subarray}{c}{\mathbf{D}}\in\mathcal{C}\\ {\mathbf{A}}\in{\mathbb{R}}^{k\times n}\end{subarray}}\quad\sum_{i=1}^{n}\frac{1}{2}\bigl{\|}{\mathbf{x}}^{(i)}-{\mathbf{D}}\boldsymbol{\alpha}^{(i)}\bigr{\|}_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}^{(i)}),

\Omega(\boldsymbol{\alpha})\triangleq(1-\nu)\|\boldsymbol{\alpha}\|_{1}+\frac{\nu}{2}\|\boldsymbol{\alpha}\|_{2}^{2},

\mathcal{C}\triangleq\bigl\{{\mathbf{D}}\in{\mathbb{R}}^{p\times k}\;/\;\|{\mathbf{d}}^{(j)}\|\triangleq(1-\mu)\|{\mathbf{d}}^{(j)}\|_{1}+\frac{\mu}{2}\|{\mathbf{d}}^{(j)}\|_{2}^{2}\leq 1\bigr\}.

\displaystyle{\mathbf{D}}\in\operatornamewithlimits{\mathrm{argmin}}_{{\mathbf{D}}\in\mathcal{C}}\Big{(}\bar{f}({\mathbf{D}})\triangleq\frac{1}{n}\sum_{i=1}^{n}f({\mathbf{D}},{\mathbf{x}}^{(i)})\Big{)},

\displaystyle\text{where}\quad f({\mathbf{D}},{\mathbf{x}})\triangleq\min_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl{\|}{\mathbf{x}}-{\mathbf{D}}\boldsymbol{\alpha}\bigr{\|}_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}),

\min_{{\mathbf{A}}\in{\mathbb{R}}^{k\times n}}\sum_{i=1}^{n}\frac{1}{2}\bigl{\|}{\mathbf{x}}^{(i)}-{\mathbf{D}}\boldsymbol{\alpha}^{(i)}\bigr{\|}_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}^{(i)}).

\min_{{\mathbf{D}}\in\mathcal{C}}\;\bar{f}({\mathbf{D}})\triangleq{\mathbb{E}}_{{\mathbf{x}}}[f({\mathbf{D}},{\mathbf{x}})],

\boldsymbol{\alpha}_{t}\triangleq\operatorname*{argmin}_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{x}}_{t}-{\mathbf{D}}_{t-1}\boldsymbol{\alpha}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}).

{\mathbf{D}}_{t}\in\operatornamewithlimits{\mathrm{argmin}}_{{\mathbf{D}}\in\mathcal{C}}\Bigl{(}\bar{g}_{t}({\mathbf{D}})\triangleq\frac{1}{t}\sum_{s=1}^{t}\frac{1}{2}\bigl{\|}{\mathbf{x}}_{s}-{\mathbf{D}}\boldsymbol{\alpha}_{s}\bigr{\|}_{2}^{2}+\lambda\Omega(\boldsymbol{\alpha}_{s})\Bigr{)}.

\begin{split}\bar{\mathbf{C}}_{t}&=\Big{(}1-\frac{1}{t}\Big{)}\bar{\mathbf{C}}_{t-1}+\frac{1}{t}\boldsymbol{\alpha}_{t}\boldsymbol{\alpha}_{t}^{\top}.\\ \bar{\mathbf{B}}_{t}&=\Big{(}1-\frac{1}{t}\Big{)}\bar{\mathbf{B}}_{t-1}+\frac{1}{t}{\mathbf{x}}_{t}\boldsymbol{\alpha}_{t}^{\top}.\end{split}

{\mathbf{D}}_{t}=\operatorname*{argmin}_{{\mathbf{D}}\in\mathcal{C}}\frac{1}{2}\operatorname{Tr}({\mathbf{D}}^{\top}{\mathbf{D}}\bar{\mathbf{C}}_{t})-\operatorname{Tr}({\mathbf{D}}^{\top}\bar{\mathbf{B}}_{t}).

{\mathbf{D}}\to\frac{1}{2}\operatorname{Tr}({\mathbf{D}}^{\top}{\mathbf{D}}\bar{\mathbf{C}}_{t}^{\top})-\operatorname{Tr}({\mathbf{D}}^{\top}\bar{\mathbf{B}}_{t}),

\bar{\mathbf{B}}_{t}=\frac{1}{t}\sum_{s=1}^{t}{\mathbf{x}}_{s}\boldsymbol{\alpha}_{s}^{\top},\qquad\bar{\mathbf{C}}_{t}=\frac{1}{t}\sum_{s=1}^{t}\boldsymbol{\alpha}_{s}\boldsymbol{\alpha}_{s}^{\top}.

\bar{f}_{t}({\mathbf{D}})\triangleq\frac{1}{t}\sum_{s=1}^{t}\min_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{x}}_{s}-{\mathbf{D}}\boldsymbol{\alpha}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha})\leq\bar{g}_{t}({\mathbf{D}}).

g_{t}\geq f_{t},\qquad g_{t}(\theta_{t-1})=f_{t}(\theta_{t-1}).

\bar{g}_{t}=(1-w_{t})\,\bar{g}_{t-1}+w_{t}\,g_{t}.

\theta_{t}=\operatorname*{argmin}_{\theta\in\Theta}\bar{g}_{t}(\theta).

\bar{f}_{t}\triangleq(1-w_{t})\,\bar{f}_{t-1}+w_{t}\,f_{t},\qquad\bar{g}_{t}\triangleq(1-w_{t})\,\bar{g}_{t-1}+w_{t}\,g_{t},

f_{t}({\mathbf{D}})\triangleq\min_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{x}}_{t}-{\mathbf{D}}\boldsymbol{\alpha}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}),\qquad g_{t}({\mathbf{D}})\triangleq\frac{1}{2}\bigl\|{\mathbf{x}}_{t}-{\mathbf{D}}\boldsymbol{\alpha}_{t}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}_{t}).

{\mathbb{P}}\bigl{[}{\mathbf{M}}_{t}[j,j]=r\bigr{]}=\frac{1}{r},\;\;\;{\mathbb{P}}\bigl{[}{\mathbf{M}}_{t}[j,j]=0\bigr{]}=1-\frac{1}{r}.

{\mathbb{E}}\bigl{[}\|{\mathbf{M}}_{t}{\mathbf{x}}_{t}\|_{0}\bigr{]}=\frac{p}{r}=q\qquad{\mathbb{E}}\bigl{[}{\mathbf{M}}_{t}{\mathbf{x}}_{t}\bigr{]}={\mathbf{x}}_{t}.

c^{(i)}

β_{t}^{(i)}

G_{t}^{(i)}

\boldsymbol{\alpha}_{t}\leftarrow\operatorname*{argmin}_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\boldsymbol{\alpha}^{\top}{\mathbf{G}}_{t}\boldsymbol{\alpha}-\boldsymbol{\alpha}^{\top}\boldsymbol{\beta}_{t}+\lambda\,\Omega(\boldsymbol{\alpha}).

\bar{\mathbf{C}}_{t}\leftarrow(1-w_{t})\,\bar{\mathbf{C}}_{t-1}+w_{t}\,\boldsymbol{\alpha}_{t}\boldsymbol{\alpha}_{t}^{\top},\qquad{\mathbf{P}}_{t}\bar{\mathbf{B}}_{t}\leftarrow(1-w_{t})\,{\mathbf{P}}_{t}\bar{\mathbf{B}}_{t-1}+w_{t}\,{\mathbf{P}}_{t}{\mathbf{x}}_{t}\boldsymbol{\alpha}_{t}^{\top}.

{\mathbf{P}}_{t}{\mathbf{D}}_{t}

{\mathbf{P}}_{t}^{\perp}\bar{\mathbf{B}}_{t}

\boldsymbol{\alpha}_{t}^{\star}\in\operatorname*{argmin}_{\boldsymbol{\alpha}}\frac{1}{2}\boldsymbol{\alpha}^{\top}{\mathbf{G}}_{t}^{\star}\boldsymbol{\alpha}-\boldsymbol{\alpha}^{\top}\boldsymbol{\beta}_{t}^{\star}+\lambda\,\Omega(\boldsymbol{\alpha}),

{\mathbf{G}}_{t}={\mathbf{D}}_{t-1}^{\top}{\mathbf{M}}_{t}{\mathbf{D}}_{t-1},\qquad\boldsymbol{\beta}_{t}={\mathbf{D}}_{t-1}^{\top}{\mathbf{M}}_{t}{\mathbf{x}}_{t}.

\min_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{M}}_{t}({\mathbf{x}}_{t}-{\mathbf{D}}_{t-1}\boldsymbol{\alpha})\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}).

\boldsymbol{\beta}_{t}^{(i)}=(1-\gamma)\,\boldsymbol{\beta}_{t-1}^{(i)}+\gamma\,{\mathbf{D}}_{t-1}^{\top}{\mathbf{M}}_{t}{\mathbf{x}}^{(i)},\qquad{\mathbf{G}}_{t}^{(i)}=(1-\gamma)\,{\mathbf{G}}_{t-1}^{(i)}+\gamma\,{\mathbf{D}}_{t-1}^{\top}{\mathbf{M}}_{t}{\mathbf{D}}_{t-1},

Code & Models

Repositories

arthurmensch/modl (official)

Full text

Stochastic Subsampling for Factorizing Huge Matrices

Arthur Mensch, Julien Mairal, Bertrand Thirion, and Gaël Varoquaux

A. Mensch, B. Thirion, G. Varoquaux are with Parietal team, Inria, CEA, Paris-Saclay University, Neurospin, at Gif-sur-Yvette, France. J. Mairal is with Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK at Grenoble, France. The research leading to these results was supported by the ANR (MACARON project, ANR-14-CE23-0003-01; NiConnect project, ANR-11-BINF-0004NiConnect). It has received funding from the European Union’s Horizon 2020 Framework Programme for Research and Innovation under Grant Agreement No 720270 (Human Brain Project SGA1). Corresponding author: Arthur Mensch ([email protected])

Abstract

We present a matrix-factorization algorithm that scales to input matrices with both huge number of rows and columns. Learned factors may be sparse or dense and/or non-negative, which makes our algorithm suitable for dictionary learning, sparse component analysis, and non-negative matrix factorization. Our algorithm streams matrix columns while subsampling them to iteratively learn the matrix factors. At each iteration, the row dimension of a new sample is reduced by subsampling, resulting in lower time complexity compared to a simple streaming algorithm. Our method comes with convergence guarantees to reach a stationary point of the matrix-factorization problem. We demonstrate its efficiency on massive functional Magnetic Resonance Imaging data (2 TB), and on patches extracted from hyperspectral images (103 GB). For both problems, which involve different penalties on rows and columns, we obtain significant speed-ups compared to state-of-the-art algorithms.

Index Terms:

Matrix factorization, dictionary learning, NMF, stochastic optimization, majorization-minimization, randomized methods, functional MRI, hyperspectral imaging

I Introduction

Matrix factorization is a flexible approach to uncover latent factors in low-rank or sparse models. With sparse factors, it is used in dictionary learning, and has proven very effective for denoising and visual feature encoding in signal and computer vision [see e.g., 1]. When the data admit a low-rank structure, matrix factorization has proven very powerful for various tasks such as matrix completion [2, 3], word embedding [4, 5], or network models [6]. It is flexible enough to accommodate a large set of constraints and regularizations, and has gained significant attention in scientific domains where interpretability is a key aspect, such as genetics [7] and neuroscience [8]. In this paper, our goal is to adapt matrix-factorization techniques to huge-dimensional datasets, i.e., with large number of columns $n$ and large number of rows $p$ . Specifically, our work is motivated by the rapid increase in sensor resolution, as in hyperspectral imaging or fMRI, and the challenge that the resulting high-dimensional signals pose to current algorithms.

As a widely-used model, the literature on matrix factorization is very rich and two main classes of formulations have emerged. The first one addresses a convex-optimization problem with a penalty promoting low-rank structures, such as the trace or max norms [2]. This formulation has strong theoretical guarantees [3], but lacks scalability for huge datasets or sparse factors. For these reasons, our paper is focused on a second type of approach, which relies on nonconvex optimization. Stochastic (or online) optimization methods have been developed in this setting. Unlike classical alternate minimization procedures, they learn matrix decompositions by observing a single matrix column (or row) at each iteration. In other words, they stream data along one matrix dimension. Their cost per iteration is significantly reduced, leading to faster convergence in various practical contexts. More precisely, two approaches have been particularly successful: stochastic gradient descent [9] and stochastic majorization-minimization methods [10, 11]. The former has been widely used for matrix completion [see 12, 13, 14, and references therein], while the latter has been used for dictionary learning with sparse and/or structured regularization [15]. Despite those efforts, stochastic algorithms for dictionary learning are currently unable to deal efficiently with matrices that are large in both dimensions.

We propose a new matrix-factorization algorithm that can handle such matrices. It builds upon the stochastic majorization-minimization framework of [10], which we generalize for our problem. In this framework, the objective function is minimized by iteratively improving an upper-bound surrogate of the function (majorization step) and minimizing it to obtain new estimates (minimization step). The core idea of our algorithm is to approximate these steps to perform them faster. We carefully introduce and control approximations, so as to extend the convergence results of [10] to the case where neither the majorization nor the minimization step is performed exactly.

For this purpose, we borrow ideas from randomized methods in machine learning and signal processing. Indeed, quite orthogonally to stochastic optimization, efficient approaches to tackle the growth of dataset dimension have exploited random projections [16, 17] or sampling, reducing data dimension while preserving signal content. Large-scale datasets often have an intrinsic dimension which is significantly smaller than their ambient dimension. Good examples are biological datasets [18] and physical acquisitions with an underlying sparse structure enabling compressed sensing [19]. In this context, models can be learned using only random data summaries, also called sketches. For instance, randomized methods [see 20, for a review] are efficient to compute PCA [21], a classic matrix-factorization approach, and to solve constrained or penalized least-square problems [22, 23]. On a theoretical level, recent works on sketching [24, 25] have provided bounds on the risk of using random summaries in learning.

Using random projections as a pre-processing step is not appealing in our applicative context since factors learned on reduced data are not interpretable. On the other hand, it is possible to exploit random sampling to approximate the steps of online matrix factorization. Factors are learned in the original space whereas the dimension of each iteration is reduced together with the computational cost per iteration.

Contribution

The contribution of this paper is both practical and theoretical. We introduce a new matrix factorization algorithm, called subsampled online matrix factorization (somf), which is faster than state-of-the-art algorithms by an order of magnitude on large real-world datasets (hyperspectral images, large fMRI data). It leverages random sampling with stochastic optimization to learn sparse and dense factors more efficiently. To prove the convergence of somf, we extend the stochastic majorization-minimization framework [10] and make it robust to some time-saving approximations. We then show convergence guarantees for somf under reasonable assumptions. Finally, we propose an extensive empirical validation of the subsampling approach.

In a first version of this work [26], presented at the International Conference on Machine Learning (ICML), we proposed an algorithm similar to somf, without any theoretical guarantees. The algorithm that we present here has such guarantees, which we express in a more general framework, stochastic majorization-minimization. It is validated for new sparsity settings and a new domain of application. An efficient open-source Python package is provided.

Notations

Matrices are written using bold capital letters and vectors using bold small letters (e.g., ${\mathbf{X}}$ , $\boldsymbol{\alpha}$ ). We use superscript to specify the column (sample or component) number, and write ${\mathbf{X}}=[{\mathbf{x}}^{(1)},\ldots,{\mathbf{x}}^{(n)}]$ . We use subscripts to specify the iteration number, as in ${\mathbf{x}}_{t}$ . The floating bar, as in $\bar{g}_{t}$ , is used to stress that a given value is an average over iterations, or an expectation. The superscript ⋆ is used to denote an exact value, when it has to be compared to an inexact value, e.g., to compare $\boldsymbol{\alpha}_{t}^{\star}$ (exact) to $\boldsymbol{\alpha}_{t}$ (approximation).

II Prior art: matrix factorization with stochastic majorization-minimization

Below, we introduce the matrix-factorization problem and recall a specific stochastic algorithm to solve it observing one column (or a mini-batch) at every iteration. We cast this algorithm in the stochastic majorization-minimization framework [10], which we will use in the convergence analysis.

II-A Problem statement

In our setting, the goal of matrix factorization is to decompose a matrix ${\mathbf{X}}\in{\mathbb{R}}^{p\times n}$ — typically $n$ signals of dimension $p$ — as a product of two smaller matrices:

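$${\mathbf{X}}\approx{\mathbf{D}}{\mathbf{A}}\quad\text{with}\quad{\mathbf{D}}\in{\mathbb{R}}^{p\times k}\text{ and }{\mathbf{A}}\in{\mathbb{R}}^{k\times n},$$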

with potential sparsity or structure requirements on ${\mathbf{D}}$ and ${\mathbf{A}}$ . In signal processing, sparsity is often enforced on the code ${\mathbf{A}}$ , in a problem called dictionary learning [27]. In such a case, the matrix ${\mathbf{D}}$ is called the “dictionary” and ${\mathbf{A}}$ the sparse code. We use this terminology throughout the paper.

Learning the factorization is typically performed by minimizing a quadratic data-fitting term, with constraints and/or penalties over the code and the dictionary:

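$$\min_{\substack{{\mathbf{D}}\in\mathcal{C}\\ {\mathbf{A}}\in{\mathbb{R}}^{k\times n}}}\;\sum_{i=1}^{n}\frac{1}{2}\bigl\|{\mathbf{x}}^{(i)}-{\mathbf{D}}\boldsymbol{\alpha}^{(i)}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}^{(i)}),\tag{1}$$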

where ${\mathbf{A}}\triangleq[\boldsymbol{\alpha}^{(1)},\ldots,\boldsymbol{\alpha}^{(n)}]$, $\mathcal{C}$ is a column-wise separable convex set of ${\mathbb{R}}^{p\times k}$ and $\Omega:{\mathbb{R}}^{k}\rightarrow{\mathbb{R}}$ is a penalty over the code. Both the constraint set and the penalty may enforce structure or sparsity, though $\mathcal{C}$ has traditionally been used as a technical requirement to ensure that the penalty on ${\mathbf{A}}$ does not vanish with ${\mathbf{D}}$ growing arbitrarily large. Two choices of $\mathcal{C}$ and $\Omega$ are of particular interest. The problem of dictionary learning sets $\mathcal{C}$ as the $\ell_{2}$ ball for each atom and $\Omega$ to be the $\ell_{1}$ norm. Due to the sparsifying effect of the $\ell_{1}$ penalty [28], the dataset admits a sparse representation in the dictionary. Conversely, finding a sparse set in which to represent a given dataset, with a goal akin to sparse PCA [29], requires setting $\mathcal{C}$ as the $\ell_{1}$ ball for each atom and $\Omega$ to be the $\ell_{2}$ norm. Our work considers the elastic-net constraints and penalties [30], which encompass both special cases. Fixing $\nu$ and $\mu$ in $[0,1]$, we denote by $\Omega(\cdot)$ and $\|\cdot\|$ the elastic-net penalties in ${\mathbb{R}}^{k}$ and ${\mathbb{R}}^{p}$, respectively:

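$$\Omega(\boldsymbol{\alpha})\triangleq(1-\nu)\|\boldsymbol{\alpha}\|_{1}+\frac{\nu}{2}\|\boldsymbol{\alpha}\|_{2}^{2},\qquad\mathcal{C}\triangleq\bigl\{{\mathbf{D}}\in{\mathbb{R}}^{p\times k}\;/\;\|{\mathbf{d}}^{(j)}\|\triangleq(1-\mu)\|{\mathbf{d}}^{(j)}\|_{1}+\frac{\mu}{2}\|{\mathbf{d}}^{(j)}\|_{2}^{2}\leq 1\bigr\}.\tag{2}$$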
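As a small illustration of how the elastic-net penalty $\Omega$ interpolates between its two special cases, the sketch below evaluates it for a few values of $\nu$. The function name `elastic_net` and the example vector are hypothetical and only serve the illustration.

```python
def elastic_net(alpha, nu):
    """Elastic-net penalty (1 - nu) * ||alpha||_1 + (nu / 2) * ||alpha||_2^2."""
    l1 = sum(abs(a) for a in alpha)
    l2_sq = sum(a * a for a in alpha)
    return (1.0 - nu) * l1 + 0.5 * nu * l2_sq


alpha = [1.0, -2.0, 0.0]            # illustrative code vector
print(elastic_net(alpha, nu=0.0))   # nu = 0: pure l1 norm
print(elastic_net(alpha, nu=1.0))   # nu = 1: half the squared l2 norm
print(elastic_net(alpha, nu=0.5))   # intermediate nu: mixture of both terms
```

With $\nu=0$ the penalty is sparsity-inducing, while $\nu=1$ gives a smooth, ridge-like penalty.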

Following [15], we can also enforce the positivity of ${\mathbf{D}}$ and/or ${\mathbf{A}}$ by replacing ${\mathbb{R}}$ by ${\mathbb{R}}^{+}$ in $\mathcal{C}$ , and adding positivity constraints on ${\mathbf{A}}$ in (1), as in non-negative sparse coding [31]. We rewrite (1) as an empirical risk minimization problem depending on the dictionary only. The matrix ${\mathbf{D}}$ solution of (1) is indeed obtained by minimizing the empirical risk $\bar{f}$

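$${\mathbf{D}}\in\operatorname*{argmin}_{{\mathbf{D}}\in\mathcal{C}}\Bigl(\bar{f}({\mathbf{D}})\triangleq\frac{1}{n}\sum_{i=1}^{n}f({\mathbf{D}},{\mathbf{x}}^{(i)})\Bigr),\quad\text{where}\quad f({\mathbf{D}},{\mathbf{x}})\triangleq\min_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{x}}-{\mathbf{D}}\boldsymbol{\alpha}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}),\tag{3}$$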

and the matrix ${\mathbf{A}}$ is obtained by solving the linear regression

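$$\min_{{\mathbf{A}}\in{\mathbb{R}}^{k\times n}}\sum_{i=1}^{n}\frac{1}{2}\bigl\|{\mathbf{x}}^{(i)}-{\mathbf{D}}\boldsymbol{\alpha}^{(i)}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}^{(i)}).\tag{4}$$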

The problem (1) is non-convex in the parameters $({\mathbf{D}},{\mathbf{A}})$, and hence (3) is not convex. However, the problem (1) is convex in each of ${\mathbf{D}}$ and ${\mathbf{A}}$ when fixing one variable and optimizing with respect to the other. As such, it is naturally solved by alternate minimization over ${\mathbf{D}}$ and ${\mathbf{A}}$, which asymptotically provides a stationary point of (3). Yet, ${\mathbf{X}}$ typically has to be observed hundreds of times before a good dictionary is obtained. Alternate minimization is therefore not suited to datasets with many samples.

II-B Online matrix factorization

When ${\mathbf{X}}$ has a large number of columns but a limited number of rows, the stochastic optimization method of [15] outputs a good dictionary much more rapidly than alternate-minimization. In this setting [see 32], learning the dictionary is naturally formalized as an expected risk minimization

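$$\min_{{\mathbf{D}}\in\mathcal{C}}\;\bar{f}({\mathbf{D}})\triangleq{\mathbb{E}}_{{\mathbf{x}}}[f({\mathbf{D}},{\mathbf{x}})],\tag{5}$$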

where ${\mathbf{x}}$ is drawn from the data distribution and forms an i.i.d. stream $({\mathbf{x}}_{t})_{t}$ . In the finite-sample setting, (5) reduces to (3) when ${\mathbf{x}}_{t}$ is drawn uniformly at random from $\{{\mathbf{x}}^{(i)},i\in[1,n]\}$ . We then write $i_{t}$ the sample number selected at time $t$ .

The online matrix factorization algorithm proposed in [15] is summarized in Alg. 1. It draws a sample ${\mathbf{x}}_{t}$ at each iteration, and uses it to improve the current iterate ${\mathbf{D}}_{t-1}$ . For this, it first computes the code $\boldsymbol{\alpha}_{t}$ associated to ${\mathbf{x}}_{t}$ on the current dictionary:

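$$\boldsymbol{\alpha}_{t}\triangleq\operatorname*{argmin}_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{x}}_{t}-{\mathbf{D}}_{t-1}\boldsymbol{\alpha}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}).\tag{6}$$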
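The code computation step is an elastic-net regression of ${\mathbf{x}}_{t}$ on the current dictionary. The text does not specify a solver; the sketch below uses plain proximal gradient descent (ISTA) as one possible choice, which is an assumption made for illustration. The names `compute_code` and `soft_threshold` are hypothetical, and matrices are stored as nested lists to keep the example dependency-free.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.| applied to a scalar."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def compute_code(x, D, lam, nu, n_iter=200):
    """Approximate argmin_a 0.5 * ||x - D a||^2 + lam * Omega(a) by proximal gradient.

    D is a list of p rows of length k, x has length p.
    Omega is the elastic-net penalty with parameter nu.
    """
    p, k = len(D), len(D[0])
    # The squared Frobenius norm upper-bounds the Lipschitz constant of the gradient.
    lipschitz = sum(D[j][i] ** 2 for j in range(p) for i in range(k))
    step = 1.0 / lipschitz
    alpha = [0.0] * k
    for _ in range(n_iter):
        residual = [sum(D[j][i] * alpha[i] for i in range(k)) - x[j] for j in range(p)]
        grad = [sum(D[j][i] * residual[j] for j in range(p)) for i in range(k)]
        # Proximal step of the elastic net: soft-thresholding, then shrinkage.
        alpha = [
            soft_threshold(alpha[i] - step * grad[i], step * lam * (1.0 - nu))
            / (1.0 + step * lam * nu)
            for i in range(k)
        ]
    return alpha


# Toy check with an identity dictionary: the solution is a soft-thresholding of x.
D = [[1.0, 0.0], [0.0, 1.0]]
x = [3.0, 0.5]
alpha = compute_code(x, D, lam=1.0, nu=0.0)
```

With an identity dictionary and $\nu=0$, the minimizer is the soft-thresholding of ${\mathbf{x}}$ at level $\lambda$, which makes the toy case easy to check by hand.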

Then, it updates ${\mathbf{D}}_{t}$ to make it optimal in reconstructing past samples $({\mathbf{x}}_{s})_{s\leq t}$ from previously computed codes $(\boldsymbol{\alpha}_{s})_{s\leq t}$ :

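$${\mathbf{D}}_{t}\in\operatorname*{argmin}_{{\mathbf{D}}\in\mathcal{C}}\Bigl(\bar{g}_{t}({\mathbf{D}})\triangleq\frac{1}{t}\sum_{s=1}^{t}\frac{1}{2}\bigl\|{\mathbf{x}}_{s}-{\mathbf{D}}\boldsymbol{\alpha}_{s}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}_{s})\Bigr).\tag{7}$$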

Importantly, minimizing $\bar{g}_{t}$ is equivalent to minimizing the quadratic function

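$${\mathbf{D}}\to\frac{1}{2}\operatorname{Tr}({\mathbf{D}}^{\top}{\mathbf{D}}\bar{\mathbf{C}}_{t}^{\top})-\operatorname{Tr}({\mathbf{D}}^{\top}\bar{\mathbf{B}}_{t}),$$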

where $\bar{\mathbf{B}}_{t}$ and $\bar{\mathbf{C}}_{t}$ are small matrices that summarize previously seen samples and codes:

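$$\bar{\mathbf{B}}_{t}=\frac{1}{t}\sum_{s=1}^{t}{\mathbf{x}}_{s}\boldsymbol{\alpha}_{s}^{\top},\qquad\bar{\mathbf{C}}_{t}=\frac{1}{t}\sum_{s=1}^{t}\boldsymbol{\alpha}_{s}\boldsymbol{\alpha}_{s}^{\top}.$$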

As the constraints $\mathcal{C}$ have a separable structure per atom, [15] uses projected block coordinate descent to minimize $\bar{g}_{t}$ . The function gradient writes $\nabla\bar{g}_{t}({\mathbf{D}})={\mathbf{D}}\bar{\mathbf{C}}_{t}-\bar{\mathbf{B}}_{t}$ , and it is therefore enough to maintain $\bar{\mathbf{B}}_{t}$ and $\bar{\mathbf{C}}_{t}$ in memory to solve (7). $\bar{\mathbf{B}}_{t}$ and $\bar{\mathbf{C}}_{t}$ are updated online, using the rules (8) (Alg. 1).
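To make the online bookkeeping concrete, the following sketch checks on toy data that updating $\bar{\mathbf{B}}_{t}$ and $\bar{\mathbf{C}}_{t}$ recursively with weights $\frac{1}{t}$ gives the same matrices as averaging all past outer products. The helper names and the toy samples and codes are hypothetical; the codes are fixed by hand rather than computed from a dictionary.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def running_update(avg, new, t):
    """Return (1 - 1/t) * avg + (1/t) * new, entrywise on nested lists."""
    w = 1.0 / t
    return [[(1.0 - w) * a + w * b for a, b in zip(ra, rb)] for ra, rb in zip(avg, new)]


xs = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]  # toy samples x_s, p = 3
alphas = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]              # toy codes alpha_s, k = 2
p, k = 3, 2

# Online updates: only B (p x k) and C (k x k) are kept in memory.
B = [[0.0] * k for _ in range(p)]
C = [[0.0] * k for _ in range(k)]
for t, (x, a) in enumerate(zip(xs, alphas), start=1):
    C = running_update(C, outer(a, a), t)
    B = running_update(B, outer(x, a), t)

# Batch averages over all past samples, for comparison.
n = len(xs)
B_batch = [[sum(x[j] * a[i] for x, a in zip(xs, alphas)) / n for i in range(k)] for j in range(p)]
C_batch = [[sum(a[i] * a[l] for a in alphas) / n for l in range(k)] for i in range(k)]
```

The memory footprint of the two statistics is independent of the number of samples seen, which is what makes the surrogate cheap to maintain.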

The function $\bar{g}_{t}$ is an upper-bound surrogate of the true current empirical risk, whose definition involves the regression minima computed on current dictionary ${\mathbf{D}}$ :

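$$\bar{f}_{t}({\mathbf{D}})\triangleq\frac{1}{t}\sum_{s=1}^{t}\min_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{x}}_{s}-{\mathbf{D}}\boldsymbol{\alpha}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha})\leq\bar{g}_{t}({\mathbf{D}}).\tag{11}$$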

Using empirical processes theory [33], it is possible to show that minimizing $\bar{f}_{t}$ at each iteration asymptotically yields a stationary point of the expected risk (5). Unfortunately, minimizing (11) is expensive as it involves the computation of optimal current codes for every previously seen sample at each iteration, which boils down to naive alternate-minimization.

In contrast, $\bar{g}_{t}$ is much cheaper to minimize than $\bar{f}_{t}$ , using block coordinate descent. It is possible to show that $\bar{g}_{t}$ converges towards a locally tight upper-bound of the objective $\bar{f}_{t}$ and that minimizing $\bar{g}_{t}$ at each iteration also asymptotically yields a stationary point of the expected risk (5). This establishes the correctness of the online matrix factorization algorithm (omf). In practice, the omf algorithm performs a single pass of block coordinate descent: the minimization step is inexact. This heuristic will be justified by our theoretical contribution in Section IV.
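A single pass of projected block coordinate descent over the atoms can be sketched as follows. This is a minimal sketch under a simplifying assumption: the constraint set is taken to be the unit $\ell_{2}$ ball for each atom (the dictionary-learning special case), not the general elastic-net ball, so that the projection is a simple rescaling. The function names and the toy matrices are hypothetical.

```python
import math


def surrogate(D, B, C):
    """0.5 * Tr(D^T D C) - Tr(D^T B), with D, B of size p x k and C of size k x k."""
    p, k = len(D), len(C)
    quad = sum(D[r][i] * D[r][j] * C[i][j] for r in range(p) for i in range(k) for j in range(k))
    lin = sum(D[r][j] * B[r][j] for r in range(p) for j in range(k))
    return 0.5 * quad - lin


def bcd_pass(D, B, C):
    """One pass of projected block coordinate descent over the columns (atoms) of D."""
    p, k = len(D), len(C)
    D = [row[:] for row in D]  # work on a copy
    for j in range(k):
        if C[j][j] == 0.0:
            continue  # atom never used so far: leave it unchanged
        # Unconstrained minimizer over column j, all other columns being fixed.
        u = [
            D[r][j] + (B[r][j] - sum(D[r][i] * C[i][j] for i in range(k))) / C[j][j]
            for r in range(p)
        ]
        # Projection onto the unit l2 ball (assumed constraint set).
        scale = max(1.0, math.sqrt(sum(v * v for v in u)))
        for r in range(p):
            D[r][j] = u[r] / scale
    return D


C = [[1.0, 0.2], [0.2, 2.0]]                 # toy symmetric statistics, k = 2
B = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]     # toy statistics, p = 3
D0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # feasible initial dictionary
D1 = bcd_pass(D0, B, C)
```

Because each column update is an exact constrained minimization of the quadratic surrogate in that block, a pass started from a feasible dictionary cannot increase the surrogate value.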

Extensions

For efficiency, it is essential to use mini-batches $\{{\mathbf{x}}_{s},\,s\in\mathcal{T}_{t}\}$ of size $\eta$ instead of single samples in the iterations [15]. The surrogate parameters $\bar{\mathbf{B}}_{t}$, $\bar{\mathbf{C}}_{t}$ are then updated by the mean value of $\{({\mathbf{x}}_{s}\boldsymbol{\alpha}_{s}^{\top},\boldsymbol{\alpha}_{s}\boldsymbol{\alpha}_{s}^{\top})\}_{s\in\mathcal{T}_{t}}$ over the batch. The optimal size of the mini-batches is usually close to $k$. The update rules (8) use the sequence of weights ${(\frac{1}{t})}_{t}$ to update the parameters $\bar{\mathbf{B}}_{t}$ and $\bar{\mathbf{C}}_{t}$. [15] replaces these weights with a sequence ${(w_{t})}_{t}$, which can decay more slowly to give more importance to recent samples in $\bar{g}_{t}$. These weights will prove important in our analysis.

II-C Stochastic majorization-minimization

Online matrix factorization belongs to a wider category of algorithms introduced in [10] that minimize locally tight upper-bounding surrogates instead of a more complex objective, in order to solve an expected risk minimization problem. Generalizing online matrix factorization, we introduce in Alg. 2 the stochastic majorization-minimization (smm) algorithm, which is at the core of our theoretical contribution.

In online matrix factorization, the true empirical risk functions $\bar{f}_{t}$ and their surrogates $\bar{g}_{t}$ follow the update rules, with generalized weight ${(w_{t})}_{t}$ set to ${(\frac{1}{t})}_{t}$ in (7) – (11):

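$$\bar{f}_{t}\triangleq(1-w_{t})\,\bar{f}_{t-1}+w_{t}\,f_{t},\qquad\bar{g}_{t}\triangleq(1-w_{t})\,\bar{g}_{t-1}+w_{t}\,g_{t},$$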

where the pointwise loss function and its surrogate are

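$$f_{t}({\mathbf{D}})\triangleq\min_{\boldsymbol{\alpha}\in{\mathbb{R}}^{k}}\frac{1}{2}\bigl\|{\mathbf{x}}_{t}-{\mathbf{D}}\boldsymbol{\alpha}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}),\qquad g_{t}({\mathbf{D}})\triangleq\frac{1}{2}\bigl\|{\mathbf{x}}_{t}-{\mathbf{D}}\boldsymbol{\alpha}_{t}\bigr\|_{2}^{2}+\lambda\,\Omega(\boldsymbol{\alpha}_{t}).$$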

The function $g_{t}$ is a majorizing surrogate of $f_{t}$: $g_{t}\geq f_{t}$, and $g_{t}$ is tangent to $f_{t}$ at ${\mathbf{D}}_{t-1}$, i.e., $g_{t}({\mathbf{D}}_{t-1})=f_{t}({\mathbf{D}}_{t-1})$ and $\nabla(g_{t}-f_{t})({\mathbf{D}}_{t-1})=0$. At each step of online matrix factorization:

•

The surrogate $g_{t}$ is computed along with $\boldsymbol{\alpha}_{t}$ , using (6).

•

The parameters $\bar{\mathbf{B}}_{t},\bar{\mathbf{C}}_{t}$ are updated following (8). They define the aggregated surrogate $\bar{g}_{t}$ up to a constant.

•

The quadratic function $\bar{g}_{t}$ is minimized efficiently by block coordinate descent, using parameters $\bar{\mathbf{B}}_{t}$ and $\bar{\mathbf{C}}_{t}$ to compute its gradient.

The stochastic majorization-minimization framework simply formalizes the three steps above, for a larger variety of loss functions $f_{t}(\theta)\triangleq f(\theta,{\mathbf{x}}_{t})$ , where $\theta$ is the parameter we want to learn ( ${\mathbf{D}}$ in the online matrix factorization setting). At iteration $t$ , a surrogate $g_{t}$ of the loss $f_{t}$ is computed to update the aggregated surrogate $\bar{g}_{t}$ following (14). The surrogate functions $(g_{t})_{t}$ should be upper-bounds of loss functions $(f_{t})_{t}$ , tight in the current iterate $\theta_{t-1}$ (e.g., the dictionary ${\mathbf{D}}_{t-1}$ ). This simply means that $f_{t}(\theta_{t-1})=g_{t}(\theta_{t-1})$ and $\nabla(f_{t}-g_{t})(\theta_{t-1})=0$ . Computing $\bar{g}_{t}$ can be done if $g_{t}$ is defined simply, as in omf where it is linearly parametrized by $(\boldsymbol{\alpha}_{t}\boldsymbol{\alpha}_{t}^{\top},{\mathbf{x}}_{t}\boldsymbol{\alpha}_{t}^{\top})$ . $\bar{g}_{t}$ is then minimized to obtain a new iterate $\theta_{t}$ .
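The majorization property can be checked numerically in the simplest possible case. The sketch below assumes a scalar setting ($p=k=1$) with a pure $\ell_{1}$ penalty ($\nu=0$), where the optimal code has a closed form; these simplifications and all names are choices made for illustration only.

```python
def soft_threshold(v, tau):
    """Scalar soft-thresholding."""
    return max(abs(v) - tau, 0.0) * (1.0 if v >= 0 else -1.0)


def best_code(d, x, lam):
    """Closed-form minimizer over a of 0.5 * (x - d * a)^2 + lam * |a|."""
    if d == 0.0:
        return 0.0
    return soft_threshold(d * x, lam) / (d * d)


def loss(d, a, x, lam):
    return 0.5 * (x - d * a) ** 2 + lam * abs(a)


x, lam, d_prev = 2.0, 0.5, 1.0
a_t = best_code(d_prev, x, lam)  # code computed on the current iterate d_prev


def f(d):
    """True loss: the code is re-optimized for every dictionary d."""
    return loss(d, best_code(d, x, lam), x, lam)


def g(d):
    """Surrogate: the code is frozen at a_t."""
    return loss(d, a_t, x, lam)


grid = [i / 10.0 for i in range(-30, 31)]
gaps = [g(d) - f(d) for d in grid]  # non-negative everywhere, zero at d_prev
```

The gap $g-f$ vanishes at the current iterate and is non-negative elsewhere, which is the property the framework requires of each surrogate.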

It can be shown following [10] that stochastic majorization-minimization algorithms asymptotically find a stationary point of the expected risk ${\mathbb{E}}_{\mathbf{x}}[f(\theta,{\mathbf{x}})]$ under mild assumptions recalled in Section IV. smm admits the same mini-batch and decaying weight extensions (used in Alg. 2) as omf.

In this work, we extend the smm framework and allow both majorization and minimization steps to be approximated. As a side contribution, our extension proves that performing a single pass of block coordinate descent to update the dictionary, an important heuristic in [15], is indeed correct. We first introduce the new matrix factorization algorithm at the core of this paper and then present the extended smm framework.

III Stochastic subsampling for high dimensional data decomposition

The online algorithm presented in Section II is very efficient to factorize matrices that have a large number of columns (i.e., with a large number of samples $n$ ), but a reasonable number of rows — the dataset is not very high dimensional. However, it is not designed to deal with very high number of rows: the cost of a single iteration depends linearly on $p$ . On terabyte-scale datasets from fMRI with $p=2\cdot 10^{5}$ features, the original online algorithm requires one week to reach convergence. This is a major motivation for designing new matrix factorization algorithms that scale in both directions.

In the large-sample regime $p\gg k$ , the underlying dimensionality of columns may be much lower than the actual $p$ : the rows of a single column drawn at random are therefore correlated and redundant. This guides us on how to scale online matrix factorization with regard to the number of rows:

•

The online algorithm omf uses a single column (or mini-batch) of ${\mathbf{X}}$ at each iteration to enrich the average surrogate and update the whole dictionary.

•

We go a step beyond and use a fraction of a single column of ${\mathbf{X}}$ to refine a fraction of the dictionary.

More precisely, we draw a column and observe only some of its rows at each iteration, to refine these rows of the dictionary, as illustrated in Figure 1. To take into account all features from the dataset, rows are selected at random at each iteration: we call this technique stochastic subsampling. Stochastic subsampling reduces the efficiency of the dictionary update per iteration, as less information is incorporated in the current iterate ${\mathbf{D}}_{t}$ . On the other hand, with a correct design, the cost of a single iteration can be considerably reduced, as it grows with the number of observed features. Section V shows that the proposed algorithm is an order of magnitude faster than the original omf on large and redundant datasets.

First, we formalize the idea of working with a fraction of the $p$ rows at a single iteration. We adapt the online matrix factorization algorithm, to reduce the iteration cost by a factor close to the ratio of selected rows. This defines a new online algorithm, called subsampled online matrix factorization (somf). At each iteration, it uses $q$ rows of the column ${\mathbf{x}}_{t}$ to update the sequence of iterates ${({\mathbf{D}}_{t})}_{t}$ . As in Section II, we introduce a more general algorithm, stochastic approximate majorization-minimization (samm), of which somf is an instance. It extends the stochastic majorization-minimization framework, with similar theoretical guarantees but potentially faster convergence.

III-A Subsampled online matrix factorization

Formally, as in online matrix factorization, we consider a sample stream ${({\mathbf{x}}_{t})}_{t}$ in ${\mathbb{R}}^{p}$ that cycles through a finite sample set $\{{\mathbf{x}}^{(i)},i\in[1,n]\}$, and minimize the empirical risk (3). (Footnote 1: Note that we solve the fully observed problem despite the use of subsampled data, unlike other recent work on low-rank factorization [34].)

III-A1 Stochastic subsampling and algorithm outline

We want to reduce the time complexity of a single iteration. In the original algorithm, the complexity depends linearly on the sample dimension $p$ in three aspects:

•

${\mathbf{x}}_{t}\in{\mathbb{R}}^{p}$ is used to compute the code $\boldsymbol{\alpha}_{t}$ ,

•

it is used to update the surrogate parameters $\bar{\mathbf{B}}_{t}\in{\mathbb{R}}^{p\times k}$ ,

•

${\mathbf{D}}_{t}\in{\mathbb{R}}^{p\times k}$ is fully updated at each iteration.

Our algorithm reduces the dimensionality of these steps at each iteration, such that $p$ becomes $q=\frac{p}{r}$ in the time complexity analysis, where $r>1$ is a reduction factor. Formally, we randomly draw, at iteration $t$ , a mask ${\mathbf{M}}_{t}$ that “selects” a random subset of ${\mathbf{x}}_{t}$ . We use it to drop a part of the features of ${\mathbf{x}}_{t}$ and to “freeze” these features in dictionary ${\mathbf{D}}$ at iteration $t$ .

It is convenient to consider ${\mathbf{M}}_{t}$ as a random diagonal matrix in ${\mathbb{R}}^{p\times p}$, such that each coefficient is a Bernoulli variable with parameter $\frac{1}{r}$, normalized to be $1$ in expectation. $\forall j\in[0,p-1]$,

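$$\mathbb{P}\big[{\mathbf{M}}_{t}[j,j]=r\big]=\frac{1}{r},\qquad\mathbb{P}\big[{\mathbf{M}}_{t}[j,j]=0\big]=1-\frac{1}{r}.$$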

Thus, $\frac{1}{r}$ is the average proportion of observed features, and ${\mathbf{M}}_{t}{\mathbf{x}}_{t}$ is an unbiased, low-dimensional estimator of ${\mathbf{x}}_{t}$ :

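$${\mathbb{E}}\big[\|{\mathbf{M}}_{t}{\mathbf{x}}_{t}\|_{0}\big]=\frac{p}{r}=q,\qquad{\mathbb{E}}[{\mathbf{M}}_{t}{\mathbf{x}}_{t}]={\mathbf{x}}_{t}$$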

with $\|\cdot\|_{0}$ counting the number of non-zero coefficients. We define the pair of orthogonal projectors ${\mathbf{P}}_{t}\in{\mathbb{R}}^{q\times p}$ and ${\mathbf{P}}_{t}^{\perp}\in{\mathbb{R}}^{(p-q)\times p}$ that project ${\mathbb{R}}^{p}$ onto $\mathrm{Im}({\mathbf{M}}_{t})$ and $\mathrm{Ker}({\mathbf{M}}_{t})$. In other words, ${\mathbf{P}}_{t}{\mathbf{Y}}$ and ${\mathbf{P}}_{t}^{\perp}{\mathbf{Y}}$ are the submatrices of ${\mathbf{Y}}\in{\mathbb{R}}^{p\times y}$ with rows respectively selected and not selected by ${\mathbf{M}}_{t}$. In algorithms, by an abuse of notation, ${\mathbf{P}}_{t}{\mathbf{Y}}\leftarrow{\mathbf{Z}}\in{\mathbb{R}}^{q\times y}$ assigns the rows of ${\mathbf{Z}}$ to the rows of ${\mathbf{Y}}$ selected by ${\mathbf{P}}_{t}$.
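The mask and the row-selection notation can be illustrated with a small NumPy sketch. It assumes the mask takes the value $r$ with probability $\frac{1}{r}$ and $0$ otherwise, as described above; the helper name `draw_mask` and all sizes are hypothetical choices made for this illustration.

```python
import numpy as np

def draw_mask(p, r, rng):
    """Return the diagonal of M_t: each entry is r with probability 1/r, else 0."""
    selected = rng.random(p) < 1.0 / r
    return np.where(selected, float(r), 0.0)

rng = np.random.default_rng(0)
p, r = 1000, 4
x = rng.standard_normal(p)

# Unbiasedness: the average of M_t x over many independent masks approaches x.
n_draws = 2000
mean_masked = np.mean([draw_mask(p, r, rng) * x for _ in range(n_draws)], axis=0)
rel_err = float(np.linalg.norm(mean_masked - x) / np.linalg.norm(x))

# P_t Y is plain row selection; "P_t Y <- Z" writes Z back into those rows.
mask = draw_mask(p, r, rng)
rows = np.flatnonzero(mask)       # indices of the rows selected by M_t
Y = rng.standard_normal((p, 3))
Y_before = Y.copy()
PY = Y[rows]                      # P_t Y, shape (q, 3) with q close to p / r
Y[rows] = 2.0 * PY                # "P_t Y <- Z" with Z = 2 * P_t Y
```

In this sketch the projectors are never formed as matrices: integer index arrays play the role of ${\mathbf{P}}_{t}$, which keeps every operation proportional to the number of selected rows.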

In brief, subsampled online matrix factorization, defined in Alg. 3, follows the outer loop of online matrix factorization, with the following major modifications at iteration $t$ :

•

it uses ${\mathbf{M}}_{t}{\mathbf{x}}_{t}$ and low-size statistics instead of ${\mathbf{x}}_{t}$ to estimate the code $\boldsymbol{\alpha}_{t}$ and the surrogate $g_{t}$ ,

•

it updates a subset of the dictionary ${\mathbf{P}}_{t}{\mathbf{D}}_{t-1}$ to reduce the surrogate value $\bar{g}_{t}({\mathbf{D}})$ . Relevant parameters of $\bar{g}_{t}$ are computed using ${\mathbf{P}}_{t}{\mathbf{x}}_{t}$ and $\boldsymbol{\alpha}_{t}$ only.

We now present somf in detail. For comparison purposes, we write all variables that would be computed following the omf rules at iteration $t$ with a ⋆ superscript. For simplicity, in Alg. 3 and in the following paragraphs, we assume that we use one sample per iteration; in practice, we use mini-batches of size $\eta$. The derivations below carry over when a batch $I_{t}$ is drawn at iteration $t$ instead of a single sample $i_{t}$.

III-A2 Code computation

In the omf algorithm presented in Section II, $\boldsymbol{\alpha}_{t}^{\star}$ is obtained by solving (6), namely

[TABLE]

where ${\mathbf{G}}_{t}^{\star}={\mathbf{D}}_{t-1}^{\top}{\mathbf{D}}_{t-1}$ and $\boldsymbol{\beta}_{t}^{\star}={\mathbf{D}}_{t-1}^{\top}{\mathbf{x}}_{t}$ . For large $p$ , the computation of ${\mathbf{G}}_{t}^{\star}$ and $\boldsymbol{\beta}_{t}^{\star}$ dominates the complexity of the regression step, which depends almost linearly on $p$ . To reduce this complexity, we use estimators for ${\mathbf{G}}_{t}^{\star}$ and $\boldsymbol{\beta}_{t}^{\star}$ , computed at a cost proportional to the reduced dimension $q$ . We propose three kinds of estimators with different properties.

Masked loss

The simplest unbiased estimation of ${\mathbf{G}}_{t}^{\star}$ and $\boldsymbol{\beta}_{t}^{\star}$ whose computation cost depends on $q$ is obtained by subsampling matrix products with ${\mathbf{M}}_{t}$ :

[TABLE]

This is the strategy proposed in [26]. We use ${\mathbf{G}}_{t}$ and $\boldsymbol{\beta}_{t}$ in (18), which amounts to minimizing the masked loss

[TABLE]

${\mathbf{G}}_{t}$ and $\boldsymbol{\beta}_{t}$ are computed in a number of operations proportional to $q$, which brings a speed-up factor of almost $r$ in the code computation for large $p$. On large data, using estimators (a) instead of exact ${\mathbf{G}}_{t}^{\star}$ and $\boldsymbol{\beta}_{t}^{\star}$ proves very efficient during the first epochs (cycles over the columns). (Footnote 2: Estimators (a) are also available in the infinite sample setting, when minimizing expected risk (5) from an i.i.d. sample stream ${({\mathbf{x}}_{t})}_{t}$.) However, due to the masking, ${\mathbf{G}}_{t}$ and $\boldsymbol{\beta}_{t}$ are not consistent estimators: they do not converge to ${\mathbf{G}}_{t}^{\star}$ and $\boldsymbol{\beta}_{t}^{\star}$ for large $t$, which breaks theoretical guarantees on the algorithm output. Empirical results in Section V-E show that the sequence of iterates approaches a critical point of the risk (3), but may then oscillate around it.
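A minimal sketch of the masked estimators (a) follows. It assumes that they take the form ${\mathbf{G}}_{t}={\mathbf{D}}_{t-1}^{\top}{\mathbf{M}}_{t}{\mathbf{D}}_{t-1}$ and $\boldsymbol{\beta}_{t}={\mathbf{D}}_{t-1}^{\top}{\mathbf{M}}_{t}{\mathbf{x}}_{t}$, and that the penalty is $\Omega=\|\cdot\|_{2}^{2}$ so that the code has a closed form. The function names `masked_estimators` and `ridge_code` are hypothetical, and the sparse penalties used in the experiments would need an iterative solver instead.

```python
import numpy as np

def masked_estimators(D, x, rows, r):
    """Estimate G* = D^T D and beta* = D^T x from the selected rows only.

    Costs O(q k^2) and O(q k) with q = len(rows); the factor r rescales the
    partial sums so that both estimates are unbiased.
    """
    D_sel = D[rows]
    G = r * (D_sel.T @ D_sel)
    beta = r * (D_sel.T @ x[rows])
    return G, beta

def ridge_code(G, beta, lam):
    """Closed-form code for the l2 penalty (an assumption of this sketch):
    argmin_a 0.5 a^T G a - a^T beta + lam ||a||^2."""
    k = G.shape[0]
    return np.linalg.solve(G + 2.0 * lam * np.eye(k), beta)

rng = np.random.default_rng(0)
p, k, r, lam = 2000, 5, 4, 0.1
D = rng.standard_normal((p, k)) / np.sqrt(p)
x = D @ rng.standard_normal(k) + 0.01 * rng.standard_normal(p)

rows = np.flatnonzero(rng.random(p) < 1.0 / r)   # rows selected by the mask
G, beta = masked_estimators(D, x, rows, r)
alpha = ridge_code(G, beta, lam)                 # code from subsampled statistics
alpha_star = ridge_code(D.T @ D, D.T @ x, lam)   # code from exact statistics
```

Comparing `alpha` with `alpha_star` over several random masks gives a feel for the noise that the masking introduces in the code.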

Averaging over epochs

At iteration $t$, the sample ${\mathbf{x}}_{t}$ is drawn from a finite set of samples ${\{{\mathbf{x}}^{(i)}\}}_{i}$. This allows us to average estimators over previously seen samples and address the non-consistency issue of (a). Namely, we keep in memory $2n$ estimators, written ${({\mathbf{G}}_{t}^{(i)},\boldsymbol{\beta}^{(i)}_{t})}_{1\leq i\leq n}$. We observe the sample $i=i_{t}$ at iteration $t$ and use it to update the $i$-th estimators $\bar{\mathbf{G}}_{t}^{(i)}$, $\bar{\boldsymbol{\beta}}^{(i)}_{t}$ following

[TABLE]

where $\gamma$ is a weight factor determined by the number of times sample $i$ has previously been observed at time $t$. Precisely, given ${(\gamma_{c})}_{c}$ a decreasing sequence of weights,

[TABLE]

All other estimators ${\{{\mathbf{G}}^{(j)}_{t},\boldsymbol{\beta}^{(j)}_{t}\}}_{j\neq i}$ are left unchanged from iteration $t-1$. The set ${\{{\mathbf{G}}_{t}^{(i)},\boldsymbol{\beta}^{(i)}_{t}\}}_{1\leq i\leq n}$ is used to define the averaged estimators

[TABLE]

where $\gamma_{s,t}^{(i)}=\gamma_{c^{(i)}_{t}}\prod_{s<t,{\mathbf{x}}_{s}={\mathbf{x}}^{(i)}}(1-\gamma_{c^{(i)}_{s}})$ . Using $\boldsymbol{\beta}_{t}$ and ${\mathbf{G}}_{t}$ in (18), $\boldsymbol{\alpha}_{t}$ minimizes the masked loss averaged over the previous iterations where sample $i$ appeared:

[TABLE]

The sequences ${({\mathbf{G}}_{t})}_{t}$ and ${(\boldsymbol{\beta}_{t})}_{t}$ are consistent estimations of ${({\mathbf{G}}_{t}^{\star})}_{t}$ and ${(\boldsymbol{\beta}_{t}^{\star})}_{t}$: consistency arises from the fact that a single sample ${\mathbf{x}}^{(i)}$ is observed with different masks along iterations. Solving (24) thus becomes closer and closer to solving (21), which ensures the correctness of the algorithm (see Section IV). Yet, computing the estimators (b) is no more costly than computing (a) and still speeds up a single iteration by a factor close to $r$. In the mini-batch setting, for every $i\in I_{t}$, we use the estimators ${\mathbf{G}}_{t}^{(i)}$ and $\boldsymbol{\beta}_{t}^{(i)}$ to compute $\boldsymbol{\alpha}_{t}^{(i)}$. This method has a memory cost of $\mathcal{O}(n\,k^{2})$, which is reasonable compared to the dataset size if $p\gg k^{2}$. (Footnote 3: It is also possible to efficiently swap the estimators $({\mathbf{G}}_{t}^{(i)})_{i}$ on disk, as they are only accessed for $i=i_{t}$ at iteration $t$.)
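The averaging scheme (b) can be sketched as a per-sample running average of masked estimates. The update form used here, $(1-\gamma)\times$ previous value $+\,\gamma\times$ new masked estimate with $\gamma=c^{-v}$, is an assumption consistent with the description above; the class `AveragedEstimators` is a hypothetical helper. For the demonstration only, the dictionary is held fixed so that the effect of averaging over masks is isolated.

```python
import numpy as np

class AveragedEstimators:
    """Per-sample running averages of masked estimates (hypothetical helper)."""

    def __init__(self, n, k, v=0.8):
        self.G = np.zeros((n, k, k))         # n Gram estimators: O(n k^2) memory
        self.beta = np.zeros((n, k))         # n beta estimators: O(n k) memory
        self.count = np.zeros(n, dtype=int)  # number of visits of each sample
        self.v = v

    def update(self, i, D, x, rows, r):
        self.count[i] += 1
        gamma = float(self.count[i]) ** (-self.v)   # gamma_c = c^{-v}, gamma_1 = 1
        D_sel = D[rows]
        G_masked = r * (D_sel.T @ D_sel)
        beta_masked = r * (D_sel.T @ x[rows])
        self.G[i] = (1.0 - gamma) * self.G[i] + gamma * G_masked
        self.beta[i] = (1.0 - gamma) * self.beta[i] + gamma * beta_masked
        return self.G[i], self.beta[i], beta_masked

rng = np.random.default_rng(0)
p, k, r = 1000, 4, 5
D = rng.standard_normal((p, k)) / np.sqrt(p)   # held fixed in this demo
x = rng.standard_normal(p)
est = AveragedEstimators(n=1, k=k)

single_errors = []
for _ in range(400):                           # 400 visits of the same sample
    rows = np.flatnonzero(rng.random(p) < 1.0 / r)
    G_avg, beta_avg, beta_masked = est.update(0, D, x, rows, r)
    single_errors.append(np.linalg.norm(beta_masked - D.T @ x))
avg_error = float(np.linalg.norm(beta_avg - D.T @ x))
```

After many visits, the averaged estimate is much closer to ${\mathbf{D}}^{\top}{\mathbf{x}}$ than a typical single masked estimate, which is the consistency property that estimators (a) lack.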

Exact Gram computation

To reduce the memory usage, another strategy is to use the true Gram matrix ${\mathbf{G}}_{t}$ and the estimator $\boldsymbol{\beta}_{t}$ from (b):

[TABLE]

As previously, the consistency of ${(\boldsymbol{\beta}_{t})}_{t}$ ensures that (5) is correctly solved despite the approximation in ${(\boldsymbol{\alpha}_{t})}_{t}$ computation. With the partial dictionary update step we propose, it is possible to maintain ${\mathbf{G}}_{t}$ at a cost proportional to $q$ . The time complexity of the coding step is thus similarly reduced when replacing (b) or (c) estimators in (21), but the latter option has a memory usage in ${\mathcal{O}}(n\,k)$ . Although estimators (c) are slightly less performant in the first epochs, they are a good compromise between resource usage and convergence. We summarize the characteristics of the three estimators (a)–(c) in Table I, anticipating their empirical comparison in Section V.

Surrogate computation

The computation of $\boldsymbol{\alpha}_{t}$ using one of the estimators above defines a surrogate $g_{t}({\mathbf{D}})\triangleq\frac{1}{2}\|{\mathbf{x}}_{t}-{\mathbf{D}}\boldsymbol{\alpha}_{t}\|_{2}^{2}+\lambda\Omega(\boldsymbol{\alpha}_{t})$, which we use to update the aggregated surrogate $\bar{g}_{t}\triangleq(1-w_{t})\bar{g}_{t-1}+w_{t}g_{t}$, as in online matrix factorization. We follow (8) (with weights ${(w_{t})}_{t}$) to update the matrices $\bar{\mathbf{B}}_{t}$ and $\bar{\mathbf{C}}_{t}$, which define $\bar{g}_{t}$ up to constant factors. The update of $\bar{\mathbf{B}}_{t}$ requires a number of operations proportional to $p$. Fortunately, we will see in the next paragraph that it is possible to update ${\mathbf{P}}_{t}\bar{\mathbf{B}}_{t}$ in the main thread with a number of operations proportional to $q$ and to complete the update of ${\mathbf{P}}_{t}^{\perp}\bar{\mathbf{B}}_{t}$ in parallel with the dictionary update step.

Weight sequences

Specific $(w_{t})_{t}$ and $(\gamma_{c})_{c}$ in Alg. 3 are required. We provide them in Assumption (B) of the analysis: $w_{t}=\frac{1}{t^{u}}$ and $\gamma_{c}=\frac{1}{c^{v}}$, where $u\in(\frac{11}{12},1)$ and $v\in\big(\frac{3}{4},3u-2\big)$ to ensure convergence. Weights have little impact on convergence speed in practice.
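The admissible ranges can be checked programmatically. The sketch below uses a hypothetical helper `make_weights`; the default values of `u` and `v` are arbitrary choices inside the stated intervals, not values taken from the experiments.

```python
def make_weights(u=0.95, v=0.8):
    """Return w(t) = t^-u and gamma(c) = c^-v after checking the stated ranges."""
    if not (11 / 12 < u < 1):
        raise ValueError("u must lie in (11/12, 1)")
    if not (3 / 4 < v < 3 * u - 2):
        raise ValueError("v must lie in (3/4, 3u - 2)")
    return (lambda t: t ** (-u)), (lambda c: c ** (-v))

w, gamma = make_weights(u=0.95, v=0.8)
first_weights = [w(t) for t in (1, 10, 100)]   # decreasing, starting at 1.0
```

Note that $3u-2>\frac{3}{4}$ exactly when $u>\frac{11}{12}$, so the lower bound on $u$ is what keeps the interval for $v$ non-empty.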

III-A3 Dictionary update

In the original online algorithm, the whole dictionary ${\mathbf{D}}_{t-1}$ is updated at iteration $t$. To reduce the time complexity of this step, we add a “freezing” constraint to the minimization (7) of $\bar{g}_{t}$. Every row $r$ of ${\mathbf{D}}$ that corresponds to an unseen row $r$ at iteration $t$ (such that ${\mathbf{M}}_{t}[r,r]=0$) remains unchanged. This casts the problem (7) into a lower dimensional space. Formally, the freezing operation amounts to an additional constraint in (7):

[TABLE]

The constraints are separable into two blocks of rows. Recalling the notations of (2), for each atom ${\mathbf{d}}^{(j)}$ , the rules $\|{\mathbf{d}}^{(j)}\|\leq 1$ and ${\mathbf{P}}_{t}^{\perp}{\mathbf{d}}^{(j)}={\mathbf{P}}_{t}^{\perp}{\mathbf{d}}^{(j)}_{t-1}$ can indeed be rewritten

[TABLE]

Solving (25) is therefore equivalent to solving the following problem in ${\mathbb{R}}^{q\times k}$, with $\bar{\mathbf{B}}_{t}^{r}\triangleq{\mathbf{P}}_{t}\bar{\mathbf{B}}_{t}$,

[TABLE]

The rows of ${\mathbf{D}}_{t}$ selected by ${\mathbf{P}}_{t}$ are then replaced with ${\mathbf{D}}^{r}$, while the other rows of ${\mathbf{D}}_{t}$ are unchanged from iteration $t-1$. Formally, ${\mathbf{P}}_{t}{\mathbf{D}}_{t}={\mathbf{D}}^{r}$ and ${\mathbf{P}}_{t}^{\perp}{\mathbf{D}}_{t}={\mathbf{P}}_{t}^{\perp}{\mathbf{D}}_{t-1}$. We solve (27) by a projected block coordinate descent (BCD) similar to the one used in the original algorithm, but performed in a subspace of size $q$. We compute each column $j$ of the gradient that we use in the block coordinate descent loop with $q\times k$ operations, as it reads ${\mathbf{D}}^{r}\bar{\mathbf{c}}_{t}^{(j)}-\bar{\mathbf{b}}_{t}^{r(j)}\in{\mathbb{R}}^{q}$, where $\bar{\mathbf{c}}_{t}^{(j)}$ and $\bar{\mathbf{b}}_{t}^{r(j)}$ are the $j$-th columns of $\bar{\mathbf{C}}_{t}$ and $\bar{\mathbf{B}}_{t}^{r}$. Each reduced atom ${\mathbf{d}}^{r(j)}$ is projected onto the elastic-net ball of radius $r_{t}^{(j)}$, at an average cost in ${\mathcal{O}}(q)$ following [15]. This makes the complexity of a single-column update proportional to $q$. Performing the projection requires keeping in memory the values ${\{n_{t}^{(j)}\triangleq 1-\|{\mathbf{d}}_{t}^{(j)}\|\}}_{j}$, which can be updated online at a negligible cost.

We provide the reduced dictionary update step in Alg. 4, where we use the function $\mathtt{enet\_projection}({\mathbf{u}},r)$ that performs the orthogonal projection of ${\mathbf{u}}\in{\mathbb{R}}^{q}$ onto the elastic-net ball of radius $r$ . As in the original algorithm, we perform a single pass over columns to solve (27). Dictionary update is now performed with a number of operations proportional to $q$ , instead of $p$ in the original algorithm. Thanks to the random nature of ${({\mathbf{M}}_{t})}_{t}$ , updating ${\mathbf{D}}_{t-1}$ into ${\mathbf{D}}_{t}$ reduces $\bar{g}_{t}$ enough to ensure convergence.
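The reduced update can be sketched in NumPy under two simplifying assumptions that depart from Alg. 4. First, the elastic-net projection `enet_projection` is swapped for a plain $\ell_{2}$-ball projection. Second, because the $\ell_{2}$ norm is not additive over blocks of rows, the radius of each reduced atom is derived from a squared-norm budget. The surrogate is taken as $\frac{1}{2}\mathrm{Tr}({\mathbf{D}}^{\top}{\mathbf{D}}\bar{\mathbf{C}}_{t})-\mathrm{Tr}({\mathbf{D}}^{\top}\bar{\mathbf{B}}_{t})$, and the function names are hypothetical.

```python
import numpy as np

def reduced_dictionary_update(D, B_bar, C_bar, rows):
    """One pass of projected block coordinate descent on the selected rows.

    Rows of D outside `rows` stay frozen; each atom stays in the unit l2 ball
    (l2 projection is a stand-in for the elastic-net projection of Alg. 4).
    """
    D = D.copy()
    D_r = D[rows]                  # reduced dictionary, shape (q, k)
    B_r = B_bar[rows]              # reduced surrogate parameter, shape (q, k)
    # Squared norm carried by the frozen rows of each atom.
    frozen_sq = np.sum(D ** 2, axis=0) - np.sum(D_r ** 2, axis=0)
    for j in range(D.shape[1]):
        grad = D_r @ C_bar[:, j] - B_r[:, j]       # q * k operations
        u = D_r[:, j] - grad / C_bar[j, j]         # exact coordinate minimizer
        radius = np.sqrt(max(1.0 - frozen_sq[j], 0.0))
        norm = np.linalg.norm(u)
        if norm > radius:
            u = u * (radius / norm)                # l2-ball projection
        D_r[:, j] = u
    D[rows] = D_r                  # "P_t D_t <- D^r"; other rows are untouched
    return D

def surrogate(D, B_bar, C_bar):
    return 0.5 * np.trace(D.T @ D @ C_bar) - np.trace(D.T @ B_bar)

rng = np.random.default_rng(0)
p, k, n, r = 200, 5, 50, 4
D0 = rng.standard_normal((p, k))
D0 /= np.linalg.norm(D0, axis=0)                   # feasible start: unit atoms
A = rng.standard_normal((k, n))
X = rng.standard_normal((p, n))
C_bar, B_bar = A @ A.T / n, X @ A.T / n
rows = np.flatnonzero(rng.random(p) < 1.0 / r)
D1 = reduced_dictionary_update(D0, B_bar, C_bar, rows)
```

Each coordinate step exactly minimizes an isotropic quadratic over a ball, so the surrogate cannot increase as long as the starting dictionary is feasible.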

Gram matrix computation

Performing partial updates of ${\mathbf{D}}_{t}$ makes it possible to maintain the full Gram matrix ${\mathbf{G}}_{t}={\mathbf{G}}_{t}^{\star}$ with a cost in ${\mathcal{O}}(q\,k^{2})$ per iteration, as mentioned in III-A2. It is indeed enough to compute the reduced Gram matrix ${\mathbf{D}}^{\top}{\mathbf{P}}_{t}{\mathbf{D}}$ before and after the dictionary update:

[TABLE]
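The Gram maintenance described above can be sketched as follows: subtract the contribution of the selected rows before the update and add it back afterwards. The function name `update_gram` is hypothetical, and random values stand in for the result of the dictionary update.

```python
import numpy as np

def update_gram(G, D_old_rows, D_new_rows):
    """Refresh G = D^T D after only some rows of D changed, in O(q k^2)."""
    return G - D_old_rows.T @ D_old_rows + D_new_rows.T @ D_new_rows

rng = np.random.default_rng(0)
p, k, r = 500, 6, 5
D = rng.standard_normal((p, k))
G = D.T @ D                                      # full Gram, computed once
rows = np.flatnonzero(rng.random(p) < 1.0 / r)   # rows selected at this iteration
old_rows = D[rows].copy()
D[rows] = rng.standard_normal((rows.size, k))    # stand-in for the BCD update
G = update_gram(G, old_rows, D[rows])
```

The result agrees with recomputing ${\mathbf{D}}^{\top}{\mathbf{D}}$ from scratch up to floating-point error, while touching only the selected rows.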

Parallel surrogate computation

Performing block coordinate descent on $\bar{g}_{t}^{r}$ requires access to $\bar{\mathbf{B}}_{t}^{r}={\mathbf{P}}_{t}\bar{\mathbf{B}}_{t}$ only. Assuming we may use more than two threads, this allows us to parallelize the dictionary update step with the update of ${\mathbf{P}}_{t}^{\perp}\bar{\mathbf{B}}_{t}$. In the main thread, we compute ${\mathbf{P}}_{t}\bar{\mathbf{B}}_{t}$ following

[TABLE]

which has a cost proportional to $q$ . Then, we update in parallel the dictionary and the rows of $\bar{\mathbf{B}}_{t}$ that are not selected by ${\mathbf{M}}_{t}$ :

[TABLE]

This update requires $k(p-q)\eta$ operations (one matrix-matrix product) for a mini-batch of size $\eta$. In contrast, with appropriate implementation, the dictionary update step requires $4\,k\,q^{2}$ to $6\,k\,q^{2}$ operations, among which $2\,k\,q^{2}$ come from slower matrix-vector products. Assuming $k\sim\eta$, updating $\bar{\mathbf{B}}_{t}$ is faster than updating the dictionary up to $r\sim 10$, and performing (7) on a second thread is seamless in terms of wall-clock time. More threads may be used for larger reduction or batch size.
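The row-wise split of the $\bar{\mathbf{B}}_{t}$ update can be illustrated as below, assuming rule (8) has the usual weighted form $\bar{\mathbf{B}}_{t}=(1-w_{t})\bar{\mathbf{B}}_{t-1}+w_{t}{\mathbf{x}}_{t}\boldsymbol{\alpha}_{t}^{\top}$. The helper `update_B_rows` is hypothetical, and the two calls run sequentially here; the point is only that they touch disjoint rows and could therefore run on separate threads.

```python
import numpy as np

def update_B_rows(B_bar, x, alpha, w, rows):
    """In-place update of the given rows: B <- (1 - w) B + w x alpha^T."""
    B_bar[rows] = (1.0 - w) * B_bar[rows] + w * np.outer(x[rows], alpha)

rng = np.random.default_rng(0)
p, k, r, w = 300, 4, 5, 0.1
B_prev = rng.standard_normal((p, k))
x = rng.standard_normal(p)
alpha = rng.standard_normal(k)

selected = rng.random(p) < 1.0 / r
rows, other_rows = np.flatnonzero(selected), np.flatnonzero(~selected)

B_bar = B_prev.copy()
update_B_rows(B_bar, x, alpha, w, rows)         # main thread, cost ~ q k
# ... the dictionary update would read only B_bar[rows] at this point ...
update_B_rows(B_bar, x, alpha, w, other_rows)   # independent, cost ~ (p - q) k

B_full = (1.0 - w) * B_prev + w * np.outer(x, alpha)   # reference full update
```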

III-A4 Subsampling and time complexity

Subsampling may be used in only some of the steps of Alg. 3, with the other steps following Alg. 1. Whether to use subsampling or not in each step depends on the trade-off between the computational speed-up it brings and the approximations it makes. It is useful to understand how complexity of omf evolves with $p$ . We write $s$ the average number of non-zero coefficients in ${(\boldsymbol{\alpha}_{t})}_{t}$ ( $s=k$ when $\Omega=\|\cdot\|_{2}^{2}$ ). omf complexity has three terms:

(i)

${\mathcal{O}}(p\,k^{2})$ : computation of the Gram matrix ${\mathbf{G}}_{t}$ , update of the dictionary ${\mathbf{D}}_{t}$ with block coordinate descent,

(ii)

${\mathcal{O}}(p\,k\,\eta)$ : computation of $\boldsymbol{\beta}_{t}={\mathbf{D}}_{t-1}^{\top}{\mathbf{x}}_{t}$ and of $\bar{\mathbf{B}}_{t}$ using ${\mathbf{x}}_{t}\boldsymbol{\alpha}_{t}^{\top}$ ,

(iii)

${\mathcal{O}}(k\,s^{2}\,\eta)$ : computation of $\boldsymbol{\alpha}_{t}$ using ${\mathbf{G}}_{t}$ and $\boldsymbol{\beta}_{t}$ , using matrix inversion or elastic-net regression.

Using subsampling turns $p$ into $q=\frac{p}{r}$ in the expressions above. It improves single iteration time when the cost of regression ${\mathcal{O}}(k\,s^{2}\,\eta)$ is dominated by another term. This happens whenever $\frac{p}{r}>s^{2}$, where $r$ is the reduction factor used in the algorithm. Subsampling can bring performance improvement up to $r\sim\frac{p}{s^{2}}$. It can be introduced in either computations from (i) or (ii), or both. When using small batch size, i.e., when $\eta<k$, computations from (i) dominate complexity, and subsampling should be first introduced in dictionary update (i), and for code computation (ii) beyond a certain reduction ratio. On the other hand, with large batch size $\eta>k$, subsampling should be first introduced in code computation, then in the dictionary update step. The reasoning above ignores potentially large constants. The best trade-offs in using subsampling must be empirically determined, which we do in Section V.
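The three cost terms and the bound $r\sim\frac{p}{s^{2}}$ can be tabulated with a few lines of Python. The values $p=2\cdot 10^{5}$ and $k=100$ echo the fMRI setting of Section V, while $\eta=100$, $s=10$ and $r=8$ are illustrative assumptions; constants are ignored, so the outputs are orders of magnitude only.

```python
def iteration_costs(p, k, eta, s, r=1):
    """Leading-order operation counts per iteration (constants ignored)."""
    q = p / r
    return {
        "gram_and_dictionary": q * k ** 2,   # term (i)
        "beta_and_B": q * k * eta,           # term (ii)
        "regression": k * s ** 2 * eta,      # term (iii), unaffected by r
    }

def max_useful_reduction(p, s):
    """Reduction factor beyond which regression dominates: r ~ p / s^2."""
    return p / s ** 2

costs_full = iteration_costs(p=200_000, k=100, eta=100, s=10)
costs_sub = iteration_costs(p=200_000, k=100, eta=100, s=10, r=8)
r_max = max_useful_reduction(p=200_000, s=10)
```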

III-B Stochastic approximate majorization-minimization

The somf algorithm can be understood within the stochastic majorization-minimization framework. The modifications that we propose are indeed perturbations to the first and third steps of the smm presented in Algorithm 2:

•

The code is computed approximately: the surrogate is only an approximate majorizing surrogate of $f_{t}$ near ${\mathbf{D}}_{t-1}$ .

•

The surrogate objective is only reduced and not minimized, due to the added constraint and the fact that we perform only one pass of block coordinate descent.

We propose a new stochastic approximate majorization-minimization (samm) framework handling these perturbations:

•

A majorization step (12 – Alg. 2) computes an approximate surrogate of $f_{t}$ near $\theta_{t-1}$ : $g_{t}\approx g_{t}^{\star}$ , where $g_{t}^{\star}$ is a true upper-bounding surrogate of $f_{t}$ .

•

A minimization step (13 – Alg. 2) finds $\theta_{t}$ by sufficiently reducing the objective $\bar{g}_{t}$ : $\theta_{t}\approx\theta_{t}^{\star}\triangleq\operatornamewithlimits{\mathrm{argmin}}_{\theta\in\Theta}\bar{g}_{t}(\theta)$ , which implies $\bar{g}_{t}(\theta_{t})\gtrsim\bar{g}_{t}(\theta_{t}^{\star})$ .

The samm framework is general, in the sense that approximations are not specified. The next section provides a theoretical analysis of the approximation of samm and establishes how somf is an instance of samm. It concludes by establishing Proposition 1, which provides convergence guarantees for somf, under the same assumptions made for omf in [15].

IV Convergence analysis

We establish the convergence of somf under reasonable assumptions. For the sake of clarity, we first state our principal result (Proposition 1), which guarantees somf convergence. It is a corollary of a more general result on samm algorithms. To present this broader result, we recall the theoretical guarantees of the stochastic majorization-minimization algorithm [10] (Proposition 2); then, we show how the algorithm can withstand perturbations (Proposition 3). Proofs are reported in Appendix A. samm convergence is proven before establishing somf convergence as a corollary of this broader result.

IV-A Convergence of somf

Similar to [15, 34], we show that the sequence of iterates $({\mathbf{D}}_{t})_{t}$ asymptotically reaches a critical point of the empirical risk (3). We introduce the same hypothesis on the code covariance estimation $\bar{\mathbf{C}}_{t}$ as in [15] and a similar one on ${\mathbf{G}}_{t}$ — they ensure strong convexity of the surrogate and boundedness of ${(\boldsymbol{\alpha}_{t})}_{t}$ . They do not cause any loss of generality as they are met in practice after a few iterations, if $r$ is chosen reasonably low, so that $q>k$ . The following hypothesis can also be guaranteed by adding small $\ell_{2}$ regularizations to $\bar{f}$ .

(A)

There exists $\rho>0$ such that for all $t>0$ , $\bar{\mathbf{C}}_{t},{\mathbf{G}}_{t}\succ\rho{\mathbf{I}}$ .

We further assume that the weights ${(w_{t})}_{t}$ and ${(\gamma_{c})}_{c}$ decay at specific rates. We specify simple weight sequences, but the proofs can be adapted for more complex ones.

(B)

There exist $u\in(\frac{11}{12},1)$ and $v\in\big(\frac{3}{4},3u-2\big)$ such that, for all $t>0$, $c>0$, $w_{t}=t^{-u}$, $\gamma_{c}\triangleq c^{-v}$.

The following convergence result then applies to any sequence ${({\mathbf{D}}_{t})}_{t}$ produced by somf, using estimators (b) or (c). $\bar{f}$ is the empirical risk defined in (3).

Proposition 1 (somf convergence).

Under assumptions (A) and (B), $\bar{f}({\mathbf{D}}_{t})$ converges with probability one and every limit point ${\mathbf{D}}_{\infty}$ of ${({\mathbf{D}}_{t})}_{t}$ is a stationary point of $\bar{f}$ : for all ${\mathbf{D}}\in\mathcal{C}$

[TABLE]

This result applies for any positive subsampling ratio $r$ , which may be set arbitrarily high. However, selecting a reasonable ratio remains important for performance.

Proposition 1 is a corollary of a stronger result on samm algorithms. As it provides insights on the convergence mechanisms, we formalize this result in the following.

IV-B Basic assumptions and results on smm convergence

We first recall the main results on stochastic majorization-minimization algorithms, established in [10], under assumptions that we slightly tighten for our purpose. In our setting, we consider the empirical risk minimization problem

[TABLE]

where $f:{\mathbb{R}}^{K}\times\mathcal{X}\to{\mathbb{R}}$ is a loss function and

(C)

$\Theta\subset{\mathbb{R}}^{K}$ and the support $\mathcal{X}$ of the data are compact.

This is a special case of (5) where the samples ${({\mathbf{x}}_{t})}_{t}$ are drawn uniformly from the set $\{{\mathbf{x}}^{(i)}\}_{i}$ . The loss functions $f_{t}\triangleq f(\cdot,{\mathbf{x}}_{t})$ defined on ${\mathbb{R}}^{K}$ can be non-convex. We instead assume that they meet reasonable regularity conditions:

(D)

${(f_{t})}_{t}$ is uniformly $R$ -Lipschitz continuous on ${\mathbb{R}}^{K}$ and uniformly bounded on $\Theta$ .

(E)

The directional derivatives [35] $\nabla f_{t}(\theta,\theta^{\prime}-\theta)$ and $\nabla\bar{f}(\theta,\theta^{\prime}-\theta)$ exist for all $\theta$ and $\theta^{\prime}$ in ${\mathbb{R}}^{K}$ .

Assumption (E) allows us to characterize the stationary points of problem (30), namely $\theta\in\Theta$ such that $\nabla\bar{f}(\theta,\theta^{\prime}-\theta)\geq 0$ for all $\theta^{\prime}\in\Theta$. Intuitively, a point is stationary when there is no local direction in which the objective can be improved.

Let us now recall the definition of first-order surrogate functions used in the smm algorithm. ${(g_{t})}_{t}$ are selected in the set $\mathcal{S}_{\rho,L}(f_{t},\theta_{t-1})$ , hereby introduced.

Definition 1 (First-order surrogate function).

Given a function $f:{\mathbb{R}}^{K}\to{\mathbb{R}}$ , $\theta\in\Theta$ and $\rho,L>0$ , we define $\mathcal{S}_{\rho,L}(f,\theta)$ as the set of functions $g:{\mathbb{R}}^{K}\rightarrow{\mathbb{R}}$ such that

•

$g$ is majorizing $f$ on $\Theta$ and $g$ is $\rho$ -strongly convex,

•

$g$ and $f$ are tight at $\theta$ — i.e., $g(\theta)=f(\theta)$ , $g-f$ is differentiable, $\nabla(g-f)$ is $L$ -Lipschitz, $\nabla(g-f)(\theta)=0$ .
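As a simple illustration of Definition 1 (an example added here, not one used by somf), suppose $f$ is differentiable with an $L_{f}$-Lipschitz gradient, where $L_{f}$ is a symbol introduced for this example. The quadratic upper bound below is majorizing by the standard descent lemma, is $L_{f}$-strongly convex, and is tight at $\theta$. The last line gives $\nabla(g-f)(\theta)=0$ and, by the triangle inequality, shows that $\nabla(g-f)$ is $2L_{f}$-Lipschitz. Hence $g\in\mathcal{S}_{\rho,L}(f,\theta)$ with $\rho=L_{f}$ and $L=2L_{f}$.

```latex
\begin{aligned}
g(\theta') &= f(\theta) + \nabla f(\theta)^{\top}(\theta'-\theta)
              + \frac{L_f}{2}\,\|\theta'-\theta\|_2^2, \\
g(\theta') &\geq f(\theta') \quad \text{for all } \theta', \qquad g(\theta) = f(\theta), \\
\nabla (g-f)(\theta') &= \nabla f(\theta) - \nabla f(\theta') + L_f\,(\theta'-\theta).
\end{aligned}
```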

In omf, $g_{t}$ defined in (15) is a variational surrogate of $f_{t}$. (Footnote 4: In this case as in somf, $g_{t}$ is not $\rho$-strongly convex but $\bar{g}_{t}$ is, thanks to assumption (A). This is sufficient in the proofs of convergence.) We refer the reader to [36] for further examples of first-order surrogates. We also require that $\bar{g}_{t}$ be parametrized and thus representable in memory. The following assumption is met in omf, as $\bar{g}_{t}$ is parametrized by the matrices $\bar{\mathbf{C}}_{t}$ and $\bar{\mathbf{B}}_{t}$.

(F) Parametrized surrogates.

The surrogates $(\bar{g}_{t})_{t}$ are parametrized by vectors in a compact set $\mathcal{K}\subset{\mathbb{R}}^{P}$. Namely, for all $t>0$, there exists $\boldsymbol{\kappa}_{t}\in\mathcal{K}$ such that $\bar{g}_{t}$ is unequivocally defined as $\bar{g}_{t}\triangleq\bar{g}_{\boldsymbol{\kappa}_{t}}$.

Finally, we ensure that the weights ${(w_{t})}_{t}$ used in Alg. 2 decrease at a certain rate.

(G)

There exists $u\in(\frac{3}{4},1)$ such that $w_{t}=t^{-u}$ .

When $(\theta_{t})_{t}$ is the sequence yielded by Alg. 2, the following result (Proposition 3.4 in [10]) establishes the convergence of ${(\bar{f}(\theta_{t}))}_{t}$ and states that $\theta_{t}$ is asymptotically a stationary point of the finite sum problem (30), as a special case of the expected risk minimization problem (5).

Proposition 2 (Convergence of smm, from [10]).

Under assumptions (C) – (G), ${(\bar{f}(\theta_{t}))}_{t\geq 1}$ converges with probability one. Every limit point $\theta_{\infty}$ of ${(\theta_{t})}_{t}$ is a stationary point of the risk $\bar{f}$ defined in (30). That is,

[TABLE]

The correctness of the online matrix factorization algorithm can be deduced from this proposition.

IV-C Convergence of samm

We now introduce assumptions on the approximations made in samm, before extending the result of Proposition 2. We make hypotheses on both the surrogate computation (majorization) step and the iterate update (minimization) step. The principles of samm are illustrated in Figure 2, which provides a geometric interpretation of the approximations introduced in the following assumptions (H) and (I).

IV-C1 Approximate surrogate computation

The smm algorithm selects a surrogate for $f_{t}$ at point $\theta_{t-1}$ within the set $\mathcal{S}_{\rho,L}(f_{t},\theta_{t-1})$. Surrogates within this set are tight at $\theta_{t-1}$ and greater than $f_{t}$ everywhere. In samm, we allow the use of surrogates that are only approximately majorizing $f_{t}$ and approximately tight at $\theta_{t-1}$. This is indeed what somf does when using estimators in the code computation step. For that purpose, we introduce the set $\mathcal{T}_{\rho,L}(f,\theta,\epsilon)$, which contains all functions that are $\epsilon$-close to a surrogate in $\mathcal{S}_{\rho,L}(f,\theta)$ for the $\ell_{\infty}$-norm:

Definition 2 (Approximate first-order surrogate function).

Given a function $f:{\mathbb{R}}^{K}\to{\mathbb{R}}$ , $\theta\in\Theta$ and $\epsilon>0$ , $\mathcal{T}_{\rho,L}(f,\theta,\epsilon)$ is the set of $\rho$ -strongly convex functions $g:{\mathbb{R}}^{K}\rightarrow{\mathbb{R}}$ such that

•

$g$ is $\epsilon$ -majorizing $f$ on $\Theta$ : $\forall\>\kappa\in\Theta,\,g(\kappa)-f(\kappa)\geq-\epsilon$ ,

•

$g$ and $f$ are $\epsilon$-tight at $\theta$, i.e., $g(\theta)-f(\theta)\leq\epsilon$, $g-f$ is differentiable, $\nabla(g-f)$ is $L$-Lipschitz.

We assume that samm selects an approximate surrogate in $\mathcal{T}_{\rho,L}(f_{t},\theta_{t-1},\epsilon_{t})$ at each iteration, where ${(\epsilon_{t})}_{t}$ is a deterministic or random non-negative sequence that vanishes at a sufficient rate.

(H)

For all $t>0$ , there exists $\epsilon_{t}>0$ such that $g_{t}\in\mathcal{T}_{\rho,L}(f_{t},\theta_{t-1},\epsilon_{t})$ . There exists a constant $\eta>0$ such that ${\mathbb{E}}[\epsilon_{t}]\in\mathcal{O}(t^{2(u-1)-\eta})$ and $\epsilon_{t}\to_{\infty}0$ almost surely.

As illustrated in Figure 2, given the omf surrogate $g_{t}^{\star}\in\mathcal{S}_{\rho,L}(f_{t},\theta_{t-1})$ defined in (15), any function $g_{t}$ such that $\|g_{t}-g_{t}^{\star}\|_{\infty}<\epsilon$ is in $\mathcal{T}_{\rho,L}(f_{t},\theta_{t-1},\epsilon)$, e.g., where $g_{t}$ uses an approximate $\boldsymbol{\alpha}_{t}$ in (15). This assumption can also be met in matrix factorization settings with difficult code regularizations, which require approximating the code.

IV-C2 Approximate surrogate minimization

We do not require $\theta_{t}$ to be the minimizer of $\bar{g}_{t}$ any longer, but ensure that the surrogate objective function $\bar{g}_{t}$ decreases “fast enough”. Namely, $\theta_{t}$ obtained from partial minimization should be closer to a minimizer of $\bar{g}_{t}$ than $\theta_{t-1}$ . We write ${(\mathcal{F}_{t})}_{t}$ and ${(\mathcal{F}_{t-\frac{1}{2}})}_{t}$ the filtrations induced by the past of the algorithm, respectively up to the end of iteration $t$ and up to the beginning of the minimization step in iteration $t$ . Then, we assume

(I)

For all $t>0$ , $\bar{g}_{t}(\theta_{t})<\bar{g}_{t}(\theta_{t-1})$ . There exists $\mu>0$ such that, for all $t>0$ , where $\theta_{t}^{\star}=\operatornamewithlimits{\mathrm{argmin}}_{\theta\in\Theta}\bar{g}_{t}(\theta)$ ,

[TABLE]

Assumption (I) is met by choosing an appropriate method for the inner $\bar{g}_{t}$ minimization step: a large variety of gradient-descent algorithms indeed have convergence rates of the form (32). In somf, the block coordinate descent with frozen coordinates meets this property, relying on results from [37]. When both assumptions are met, samm enjoys the same convergence guarantees as smm.

IV-C3 Asymptotic convergence guarantee

The following proposition guarantees that the stationary point condition of Proposition 2 holds for the samm algorithm, despite the use of approximate surrogates and approximate minimization.

Proposition 3 (Convergence of samm).

Under assumptions (C) – (I), the conclusion of Proposition 2 holds for samm.

Assumption (H) is essential to bound the errors introduced by the sequence $(\epsilon_{t})_{t}$ in the proof of Proposition 3, while (I) is the key element to show that the sequence of iterates $(\theta_{t})_{t}$ is stable enough to ensure convergence. The result holds for any subsampling ratio $r$ , provided that (A) remains true.

IV-C4 Proving somf convergence

Assumptions (A) and (B) readily imply (C)–(G). With Proposition 3 at hand, proving Proposition 1 reduces to ensuring that the surrogate sequence of somf meets (H) while its iterate sequence meets (I).

V Experiments

The somf algorithm is designed for datasets with large number of samples $n$ and large dimensionality $p$ . Indeed, as detailed in Section III-A, subsampling removes the computational bottlenecks that arise from high dimensionality. Proposition 1 establishes that the subsampling used in somf is safe, as it enjoys the same guarantees as omf. However, as with omf, no convergence rate is provided. We therefore perform a strong empirical validation of subsampling.

We tackle two different problems, in functional Magnetic Resonance Imaging (fMRI) and hyperspectral imaging. Both involve the factorization of very large matrices ${\mathbf{X}}$ with sparse factors. As the data we consider are huge, subsampling reduces the time of a single iteration by a factor close to $\frac{p}{q}$. Yet the data are also highly redundant: somf makes only small approximations, and accessing only a fraction of the features per iteration should not hinder the refinement of the dictionary much. Hence high speed-ups are expected, and indeed obtained. All experiments can be reproduced using open-source code.

V-A Problems and datasets

V-A1 Functional MRI

Matrix factorization has long been used on functional Magnetic Resonance Imaging [18]. Data are temporal series of 3D images of brain activity and are decomposed into spatial modes capturing regions that activate synchronously. They form a matrix ${\mathbf{X}}$ where columns are the 3D images, and rows correspond to voxels. Interesting dictionaries for neuroimaging capture spatially-localized components, with a few brain regions. This can be obtained by enforcing sparsity on the dictionary: we use an $\ell_{2}$ penalty and the elastic-net constraint. somf streams subsampled 3D brain records to learn the sparse dictionary ${\mathbf{D}}$. Data can be huge: we use the whole HCP dataset [38], with $n=2.4\cdot 10^{6}$ (2000 records, 1 200 time points) and $p=2\cdot 10^{5}$, totaling 2 TB of dense data. For comparison, we also use a smaller public dataset (ADHD200 [39]) with 40 records, $n=7000$ samples and $p=6\cdot 10^{4}$ voxels. Historically, brain decompositions have been obtained by minimizing the classical dictionary learning objective on transposed data [40]: the code ${\mathbf{A}}$ holds sparse spatial maps and voxel time-series are streamed. This is not a natural streaming order for fMRI data as ${\mathbf{X}}$ is stored columnwise on disk, which makes the sparse dictionary formulation more appealing. Importantly, we seek a low-rank factorization to keep the decomposition interpretable: $k\sim 100\ll p$.

V-A2 Hyperspectral imaging

Hyperspectral cameras acquire images with many channels that correspond to different spectral bands. They are used heavily in remote sensing (satellite imaging), and material study (microscopic imaging). They yield digital images with around $1$ million pixels, each associated with hundreds of spectral channels. Sparse matrix factorization has been widely used on these data for image classification [41, 42] and denoising [43, 44]. All methods rely on the extraction of full-band patches representing a local image neighborhood with all channels included. These patches are very high dimensional, due to the number of spectral bands. From one image of the AVIRIS project [45], we extract $n=2\cdot 10^{6}$ patches of size $16\times 16$ with $224$ channels, hence $p=6\cdot 10^{4}$. A dense dictionary is learned from these patches. It should allow a sparse representation of samples: we either use the classical dictionary learning setting ($\ell_{1}$/elastic-net penalty), or further add positivity constraints to the dictionary and codes. Both methods may be used and deserve to be benchmarked. We seek a dictionary of reasonable size: we use $k\sim 256\ll p$.

V-B Experimental design

To validate the introduction of subsampling and the usefulness of somf, we perform two major experiments.

• We measure the performance of somf when increasing the reduction factor, and show benefits of stochastic dimension reduction on all datasets.

• We assess the importance of subsampling in each of the steps of somf. We compare the different approaches proposed for code computation.

Validation

We compute the objective function (3) over a test set to rule out any overfitting effect: a dictionary should be a good representation of unseen samples. This criterion is always plotted against wall-clock time, as we are interested in the performance of somf for practitioners.
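As an illustration of this validation criterion, the sketch below evaluates a held-out objective for a fixed dictionary. It makes several simplifying assumptions that are not taken from the paper: the penalty is a plain $\ell_{1}$ norm, the codes are computed with proximal gradient (ISTA) rather than coordinate descent, and the function names are hypothetical.

```python
import numpy as np


def soft_threshold(z, thresh):
    """Proximal operator of thresh * ||.||_1, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)


def lasso_codes(X, D, lam, n_iter=200):
    """Approximate argmin_A 0.5 * ||X - D A||_F^2 + lam * ||A||_1 with ISTA."""
    G = D.T @ D  # Gram matrix of the dictionary
    B = D.T @ X  # correlations between atoms and samples
    # Step size 1 / L, where L is the largest eigenvalue of the Gram matrix.
    step = 1.0 / max(np.linalg.norm(G, 2), 1e-12)
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = soft_threshold(A - step * (G @ A - B), step * lam)
    return A


def heldout_objective(X_test, D, lam):
    """Average penalized reconstruction loss of D over the columns of X_test."""
    A = lasso_codes(X_test, D, lam)
    residual = X_test - D @ A
    total = 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(A))
    return total / X_test.shape[1]


rng = np.random.default_rng(0)
D = rng.standard_normal((20, 5))
X_test = rng.standard_normal((20, 7))
value = heldout_objective(X_test, D, lam=0.1)
```

Because the codes are recomputed on unseen columns, a dictionary that merely memorizes training samples obtains no advantage on this criterion.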

Tools

To perform a valid benchmark, we implement omf and somf using Cython [46]. We use coordinate descent [47] to solve Lasso problems with optional positivity constraints. Code computation is parallelized to handle mini-batches. Experiments use scikit-learn [48] for numerics, and nilearn [49] for handling fMRI data. We have released the code in an open-source Python package (https://github.com/arthurmensch/modl). Experiments were run on 3 cores of an Intel Xeon 2.6GHz, in which case computing ${\mathbf{P}}_{t}^{\perp}\bar{\mathbf{B}}_{t}$ is faster than updating ${\mathbf{P}}_{t}{\mathbf{D}}_{t}$.
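For reference, a cyclic coordinate descent solver for a Lasso problem written in Gram form, $\min_{\boldsymbol{\alpha}}\frac{1}{2}\boldsymbol{\alpha}^{\top}{\mathbf{G}}\boldsymbol{\alpha}-\boldsymbol{\alpha}^{\top}\boldsymbol{\beta}+\lambda\|\boldsymbol{\alpha}\|_{1}$, can be sketched as follows. This is a plain NumPy illustration under the assumption of a pure $\ell_{1}$ penalty; the function name `lasso_cd` is hypothetical and the Cython code of the released package is likely organized differently.

```python
import numpy as np


def lasso_cd(G, b, lam, positive=False, n_iter=100):
    """Cyclic coordinate descent for 0.5 * a^T G a - a^T b + lam * ||a||_1.

    If `positive` is True, the constraint a >= 0 is enforced coordinate-wise.
    """
    k = G.shape[0]
    a = np.zeros(k)
    for _ in range(n_iter):
        for j in range(k):
            if G[j, j] == 0.0:
                continue  # a zero atom cannot contribute to the fit
            # Correlation of atom j with the residual, excluding its own term.
            rho = b[j] - G[j] @ a + G[j, j] * a[j]
            if positive:
                a[j] = max(rho - lam, 0.0) / G[j, j]
            else:
                a[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[j, j]
    return a


G = np.eye(3)
b = np.array([3.0, -2.0, 0.5])
codes = lasso_cd(G, b, lam=1.0)
codes_positive = lasso_cd(G, b, lam=1.0, positive=True)
```

With an identity Gram matrix each coordinate reduces to a soft-thresholding of the corresponding entry of `b`, which makes the effect of the positivity constraint easy to inspect.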

Parameter setting

Setting the number of components $k$ and the amount of regularization $\lambda$ is a hard problem in the absence of ground truth. Those are typically set by cross-validation when matrix factorization is part of a supervised pipeline. For fMRI, we set $k=70$ to obtain interpretable networks, and set $\lambda$ so that the decomposition approximately covers the whole brain (i.e., every map is $\frac{k}{70}$ sparse). For hyperspectral images, we set $k=256$ and select $\lambda$ to obtain a dictionary on which codes are around $3\%$ sparse. We cycle randomly through the data (fMRI records, image patches) until convergence, using mini-batches of size $\eta=200$ for HCP and AVIRIS, and $\eta=50$ for ADHD (small number of samples). Hyperspectral patches are normalized in the dictionary learning setting, but not in the non-negative setting, which is the classical pre-conditioning for each case. We use $u=0.917$ and $v=0.751$ for weight sequences.
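The exponents $u$ and $v$ can be read as decay rates of two step-size sequences. The sketch below assumes the power-law forms $w_{t}=t^{-u}$ (consistent with the identity $t^{1/2}w_{t}^{2}=t^{1/2-2u}$ used in Appendix A) and $\gamma_{c}=c^{-v}$; the latter form and the helper names are assumptions made for illustration.

```python
import numpy as np


def power_law_weights(n_steps, exponent):
    """Return the sequence t ** (-exponent) for t = 1, ..., n_steps."""
    t = np.arange(1, n_steps + 1, dtype=float)
    return t ** (-exponent)


def weighted_running_average(values, weights):
    """Apply avg_t = (1 - w_t) * avg_{t-1} + w_t * value_t for each step."""
    avg = 0.0
    history = []
    for w, value in zip(weights, values):
        avg = (1.0 - w) * avg + w * value
        history.append(avg)
    return np.array(history)


u, v = 0.917, 0.751  # values reported in the text
w = power_law_weights(1000, u)      # surrogate aggregation weights (assumed form)
gamma = power_law_weights(1000, v)  # per-sample averaging weights (assumed form)

# Averaging a constant signal returns that constant at every step, because
# the first weight equals 1 and later updates are convex combinations.
averaged = weighted_running_average(np.full(1000, 3.0), w)
```

A smaller exponent gives weights that decay more slowly, so recent samples keep more influence on the running average.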

V-C Reduction brings speed-up at all data scales

We benchmark somf for various reduction factors against the original online matrix factorization algorithm omf [15], on the three presented datasets. We stream data in the same order for all reduction factors. Using variant (c) (true Gram matrix, averaged $\boldsymbol{\beta}_{t}$) performs slightly better on fMRI datasets, whereas (b) (averaged Gram matrix and $\boldsymbol{\beta}_{t}$) is slightly faster for hyperspectral decomposition. For comparison purposes, we display results using estimators (b) only.

Figure 3 plots the test objective against CPU time. First, we observe that all algorithms find dictionaries with very close objective function values for all reduction factors, on each dataset. This is not a trivial observation as the matrix factorization problem (3) is not convex and different runs of omf and somf may converge towards minima with different values. Second, and most importantly, somf provides significant improvements in convergence speed for three different sizes of data and three different factorization settings. Both observations confirm the relevance of the subsampling approach.

Quantitatively, we summarize the speed-ups obtained in Table III. On fMRI data, on both large and medium datasets, somf provides more than an order of magnitude speed-up. Practitioners working on datasets akin to HCP can decompose their data in 20 minutes instead of $4\,\textrm{h}$ previously, while working on a single machine. We obtain the highest speed-ups for the largest dataset, which we attribute to the extra redundancy that usually appears when dataset size increases. Up to $r\sim 8$, the speed-up is of the order of $r$: subsampling induces little noise in the iterate sequence, compared to omf. Hyperspectral decomposition is performed nearly $7\times$ faster than with omf in the classical dictionary learning setting, and $3\times$ faster in the non-negative setting, which further demonstrates the versatility of somf. Qualitatively, given a certain time budget, Figure 4 compares the results of omf and the results of somf with a subsampling ratio $r=24$, in the non-negative setting. Our algorithm yields a valid smooth bank of filters much faster. The same comparison has been made for fMRI in [26].

Comparison with stochastic gradient descent

It is possible to solve (3) using the projected stochastic gradient (sgd) algorithm [50]. On all tested settings, for high precision convergence, sgd (with the best step-size among a grid) is slower than omf and even slower than somf. In the dictionary learning setting, sgd is somewhat faster than omf but slower than somf in the first epochs. Compared to somf and omf, sgd further requires selecting the step-size by grid search.

Limitations

Table III reports convergence time within $1\%$, which is enough for application in practice. somf is less beneficial when a very high precision is required: for convergence within $0.01\%$, the speed-up for HCP is $3.4$. This is expected as somf trades approximation for speed. For high precision convergence, the reduction ratio can be reduced after a few epochs. As expected, there exists an optimal reduction ratio, depending on the problem and precision, beyond which performance degrades: $r=12$ yields better results than $r=24$ on AVIRIS (dictionary learning) and ADHD, for $1\%$ precision.

Our first experiment establishes the power of stochastic subsampling as a whole. In the following two experiments, we refine our analysis to show that subsampling is indeed useful in the three steps of online matrix factorization.

V-D For each step of somf, subsampling removes a bottleneck

In Section III, we have provided theoretical guidelines on when to introduce subsampling in each of the three steps of an iteration of somf. This analysis predicts that, for $\eta\sim k$, we should first use partial dictionary update, before using approximate code computation and asynchronous parameter aggregation. We verify this by measuring the time spent by somf on each of the updates for various reduction factors, on the HCP dataset. Results are presented in Figure 5. We observe that block coordinate descent is indeed the bottleneck in omf. Introducing partial dictionary update removes this bottleneck, and as the reduction factor increases, code computation and surrogate aggregation become the major bottlenecks. Introducing subsampling as described in somf overcomes these bottlenecks, which rationalizes all steps of somf from a computational point of view.

V-E Code subsampling becomes useful for high reduction

It remains to assess the performance of approximate code computation and averaging techniques used in somf. Indeed, subsampling for code computation introduces noise that may undermine the computational speed-up. To understand the impact of approximate code computation, we compare three strategies to compute $(\boldsymbol{\alpha}_{t})_{t}$ on the HCP dataset. First, we compute ${(\boldsymbol{\alpha}_{t}^{\star})}_{t}$ from ${({\mathbf{x}}_{t})}_{t}$ using (21). Subsampling is thus used only in dictionary update. Second, we rely on masked, non-consistent estimators (a), as in [26] — this breaks convergence guarantees. Third, we use averaged estimators $(\boldsymbol{\beta}_{t},{\mathbf{G}}_{t})$ from (c) to reduce the variance in ${(\boldsymbol{\alpha}_{t})}_{t}$ computation.
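The difference between the three strategies can be illustrated on the correlation vector $\boldsymbol{\beta}={\mathbf{D}}^{\top}{\mathbf{x}}$ alone. The sketch below is a simplified illustration, not the estimators of the paper: it keeps the dictionary fixed across visits, rescales masked estimates by $\frac{p}{q}$, and uses an assumed averaging weight $\gamma_{c}=c^{-v}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, reduction = 2000, 10, 20
q = p // reduction
D = rng.standard_normal((p, k)) / np.sqrt(p)
x = rng.standard_normal(p)

# First strategy: exact correlations, no subsampling in code computation.
beta_exact = D.T @ x


def masked_beta(D, x, rows, scale):
    """Estimate D.T @ x from the sampled rows only, rescaled by `scale`."""
    return scale * (D[rows].T @ x[rows])


# Second strategy: a single masked estimate, recomputed from scratch.
rows = rng.choice(p, size=q, replace=False)
beta_masked = masked_beta(D, x, rows, p / q)

# Third strategy: average masked estimates over successive visits of x.
v = 0.751
beta_avg = np.zeros(k)
for c in range(1, 201):
    rows = rng.choice(p, size=q, replace=False)
    gamma = c ** (-v)  # assumed power-law averaging weight
    beta_avg = (1.0 - gamma) * beta_avg + gamma * masked_beta(D, x, rows, p / q)

err_masked = np.linalg.norm(beta_masked - beta_exact)
err_avg = np.linalg.norm(beta_avg - beta_exact)
```

Averaging over visits reduces the variance of the masked estimate, which is the mechanism the third strategy relies on. In the actual algorithm the dictionary also changes between visits, which this toy example ignores.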

Fig. 6 compares the three strategies for $r\in\{12,24\}$. Partial minimization at each step is the most important part to accelerate convergence: subsampling the dictionary updates already allows somf to outperform omf. This is expected, as dictionary update constitutes the main bottleneck of omf in large-scale settings. Yet, for large reduction factors, using subsampling in code computation is important to further accelerate convergence. This clearly appears when comparing the plain and dashed black curves. Using past estimates to better approximate ${(\boldsymbol{\alpha}_{t})}_{t}$ yields faster convergence than the non-converging, masked loss strategy (a) proposed in [26].

VI Conclusion

In this paper, we introduce somf, a matrix-factorization algorithm that can handle input data with very large numbers of rows and columns. It leverages subsampling within the inner loop of a streaming algorithm to make iterations faster and accelerate convergence. We show that somf provides a stationary point of the non-convex matrix factorization problem. To prove this result, we extend the stochastic majorization-minimization framework to two major approximations. We assess the performance of somf on real-world large-scale problems, with different sparsity/positivity requirements on learned factors. In particular, on fMRI and hyperspectral data decomposition, we show that the use of subsampling can speed up decomposition up to $13$ times. The larger the dataset, the more somf outperforms state-of-the-art techniques, which is very promising for future applications. This calls for adaptation of our approach to learn more complex models.

Appendix A Proofs of convergence

This appendix contains the detailed proofs of Proposition 3 and Proposition 1. We first introduce three lemmas that will be crucial to prove samm convergence, before establishing it by proving Proposition 3. Finally, we show that somf is indeed an instance of samm (i.e. meets the assumptions (C)–(I)), proving Proposition 1.

A-A Basic properties of the surrogates, estimate stability

We derive an important result on the stability and optimality of the sequence $(\theta_{t})_{t}$, formalized in Lemma 3, which was introduced in the main text. We first introduce a numerical lemma on the boundedness of well-behaved deterministic and random sequences. The proof is detailed in Appendix B.

Lemma 1 (Bounded quasi-geometric sequences).

Let ${(x_{t})}_{t}$ be a sequence in ${\mathbb{R}}^{+}$ , $u:{\mathbb{R}}\times{\mathbb{R}}\to{\mathbb{R}}$ , $t_{0}\in\mathbb{N}$ and $\alpha\in[0,1)$ such that, for all $t\geq t_{0},\,x_{t}\leq\alpha x_{t-1}+u(x_{t},x_{t-1})$ , where $u(x,y)\in o(x+y)$ for $x,y\to\infty$ . Then ${(x_{t})}_{t}$ is bounded.

Let now $(X_{t})_{t}$ be a random sequence in ${\mathbb{R}}^{+}$, such that ${\mathbb{E}}[X_{t}]<\infty$. We define ${(\mathcal{F}_{t})}_{t}$ as the filtration adapted to ${(X_{t})}_{t}$. If, for all $t>t_{0}$, there exists a $\sigma$-algebra $\mathcal{F}_{t^{\prime}}$ such that $\mathcal{F}_{t-1}\subseteq\mathcal{F}_{t^{\prime}}\subseteq\mathcal{F}_{t}$ and

[TABLE]

then $(X_{t})_{t}$ is bounded almost surely.
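The deterministic part of the lemma can be checked numerically on a toy recursion. The example below, with contraction factor $\alpha=0.5$ and perturbation $u(x,y)=\sqrt{y}+1$ (which is $o(x+y)$), is a hypothetical illustration chosen for this sketch and does not come from the paper.

```python
import math


def iterate(x0, alpha, n_steps):
    """Iterate x_t = alpha * x_{t-1} + sqrt(x_{t-1}) + 1 starting from x0."""
    xs = [x0]
    for _ in range(n_steps):
        prev = xs[-1]
        xs.append(alpha * prev + math.sqrt(prev) + 1.0)
    return xs


alpha = 0.5
# Fixed point: x = alpha * x + sqrt(x) + 1 with alpha = 0.5 gives
# sqrt(x) = 1 + sqrt(3), the positive root of 0.5 * s**2 - s - 1 = 0.
fixed_point = (1.0 + math.sqrt(3.0)) ** 2

from_above = iterate(1e6, alpha, 200)  # decreases towards the fixed point
from_below = iterate(0.0, alpha, 200)  # increases towards the fixed point
```

Whatever the starting value, the sublinear perturbation cannot compensate for the geometric contraction once the iterates are large, so the sequence stays bounded.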

We first derive some properties of the approximate surrogate functions used in samm. The proof is adapted from [10].

Lemma 2 (Basic properties of approximate surrogate functions).

Consider any sequence of iterates ${(\theta_{t})}_{t}$ and assume there exists $\epsilon>0$ such that $g_{t}\in\mathcal{T}_{L,\rho}(f_{t},\theta_{t-1},\epsilon)$ for all $t\geq 1$ . Define $h_{t}\triangleq g_{t}-f_{t}$ for all $t\geq 1$ , $\bar{h}_{0}\triangleq h_{0}$ and $\bar{h}_{t}\triangleq(1-w_{t})\bar{h}_{t-1}+w_{t}h_{t}$ . Under assumptions (D) – (G),

(i) ${(\nabla h_{t}(\theta_{t-1}))}_{t>0}$ is uniformly bounded and there exists $R^{\prime}$ such that ${\{\nabla h_{t}\}}_{t}$ is uniformly bounded by $R^{\prime}$.

(ii) ${(h_{t})}_{t}$ and ${(\bar{h}_{t})}_{t}$ are uniformly $R^{\prime}$-Lipschitz, ${(g_{t})}_{t}$ and ${(\bar{g}_{t})}_{t}$ are uniformly $(R+R^{\prime})$-Lipschitz.

Proof.

We first prove (i). We set $\alpha>0$ and define $\theta^{\prime}=\theta_{t}-\alpha\frac{\nabla h_{t}(\theta_{t})}{\|\nabla h_{t}(\theta_{t})\|_{2}}$. As $h_{t}$ has an $L$-Lipschitz gradient on ${\mathbb{R}}^{K}$, using Taylor’s inequality (see Appendix B)

[TABLE]

where we use $h_{t}(\theta_{t})<\epsilon$ and $-h_{t}(\theta^{\prime})\leq\epsilon$ from the assumption $g_{t}\in\mathcal{T}_{L,\rho}(f_{t},\theta_{t-1},\epsilon)$. Moreover, by definition, $\nabla h_{t}$ exists and is $L$-Lipschitz for all $t$. Therefore, $\forall\,t\geq 1$,

[TABLE]

Since $\Theta$ is compact and ${({\|\nabla h_{t}(\theta_{t})\|}_{2})}_{t\geq 1}$ is bounded in (34), $\nabla h_{t}$ is bounded by $R^{\prime}$ independent of $t$ . (ii) follows by basic considerations on Lipschitz functions. ∎
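For readers following the bound in (i), the key step can be spelled out as follows. This is a sketch that assumes Taylor's inequality in its standard form for functions with $L$-Lipschitz gradient, writes $g$ as shorthand for $\nabla h_{t}(\theta_{t})$, and uses $|h_{t}|\leq\epsilon$ at the two points involved. The closed-form choice of $\alpha$ in the last line is an added illustration.

```latex
\begin{aligned}
h_t(\theta') &\le h_t(\theta_t) + g^\top(\theta' - \theta_t) + \frac{L}{2}\,\|\theta' - \theta_t\|_2^2
              = h_t(\theta_t) - \alpha\,\|g\|_2 + \frac{L\alpha^2}{2}, \\
\|g\|_2 &\le \frac{h_t(\theta_t) - h_t(\theta')}{\alpha} + \frac{L\alpha}{2}
         \le \frac{2\epsilon}{\alpha} + \frac{L\alpha}{2}, \\
\|g\|_2 &\le 2\sqrt{L\epsilon} \qquad \text{for the choice } \alpha = 2\sqrt{\epsilon / L}.
\end{aligned}
```

The bound depends on $t$ only through $\epsilon$ and $L$, which is what makes it uniform.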

Finally, we prove a result on the stability of the estimates, that derives from combining the properties of $(g_{t})_{t}$ and the geometric decrease assumption (I).

Lemma 3 (Estimate stability under samm approximation).

In the same setting as Lemma 2, with the additional assumption (I) (expected linear decrease of $\bar{g}_{t}$ suboptimality), the sequence $\|\theta_{t}-\theta_{t-1}\|_{2}$ converges to $0$ as fast as ${(w_{t})}_{t}$, and $\theta_{t}$ is asymptotically an exact minimizer. Namely, almost surely,

[TABLE]

Proof.

We first establish the result when a deterministic version of (I) holds, as it makes derivations simpler to follow.

A-A1 Deterministic decrease rate

We temporarily assume that decays are deterministic.

(Idet)

For all $t>0$ , $\bar{g}_{t}(\theta_{t})<\bar{g}_{t}(\theta_{t-1})$ . Moreover, there exists $\mu>0$ such that, for all $t>0$

[TABLE]

We introduce the following auxiliary positive values, that we will seek to bound in the proof:

[TABLE]

Our goal is to bound $A_{t}$ . We first relate it to $C_{t}$ and $B_{t}$ using convexity of $\ell_{2}$ norm:

[TABLE]

As $\theta_{t}^{\star}$ is the minimizer of $\bar{g}_{t}$ , by strong convexity of $(\bar{g}_{t})_{t}$ ,

[TABLE]

while we also have

[TABLE]

The second inequality holds because $\theta_{t-1}^{\star}$ is a minimizer of $\bar{g}_{t-1}$ and $g_{t}$ is $Q$-Lipschitz, where $Q\triangleq R+R^{\prime}$, using Lemma 2. Replacing (40) and (41) in (39) yields

[TABLE]

and we are left to show that $D_{t}\in{\mathcal{O}}(w_{t}^{2})$ to conclude. For this, we decompose the inequality from (Idet) into

[TABLE]

where the second inequality holds for the same reasons as in (41). Injecting (40) and (42) in (43), we obtain

[TABLE]

where we define $\tilde{D}_{t}\triangleq\frac{D_{t}}{w_{t}^{2}}$. It is easy to show (see algebraic details in Appendix B) that the perturbation term $u(\tilde{D}_{t},\tilde{D}_{t-1})\in o(\tilde{D}_{t}+\tilde{D}_{t-1})$ if $\tilde{D}_{t}\to\infty$. Using the deterministic result of Lemma 1, this ensures that $\tilde{D}_{t}$ is bounded, which combined with (40) allows us to conclude.

A-A2 Stochastic decrease rates

In the general case (I), the inequalities (40), (41) and (42) hold, and (44) is replaced by

[TABLE]

Taking the expectation of this inequality and using Jensen's inequality, we show that (43) holds when replacing $\tilde{D}_{t}$ by ${\mathbb{E}}[\tilde{D}_{t}]$. This shows that ${\mathbb{E}}[D_{t}]\in{\mathcal{O}}(w_{t}^{2})$ and thus ${\mathbb{E}}[D_{t}]<\infty$. The result follows from Lemma 1, which applies as $\mathcal{F}_{t-1}\subseteq\mathcal{F}_{t-\frac{1}{2}}\subseteq\mathcal{F}_{t}$. ∎

A-B Convergence of samm — Proof of Proposition 3

We now proceed to prove Proposition 3, which extends the stochastic majorization-minimization framework to allow approximations in both the majorization and minimization steps.

Proof of Proposition 3.

We adapt the proof of Proposition 3.3 from [10] (reproduced as Proposition 2 in our work). Relaxing the tightness and majorization hypotheses introduces some extra error terms in the derivations. Assumption (H) allows us to control these extra terms without breaking convergence. The stability Lemma 3 is important in steps 3 and 5.

A-B1 Almost sure convergence of $(\bar{g}_{t}(\theta_{t}))$

We control the positive expected variation of ${(\bar{g}_{t}(\theta_{t}))}_{t}$ to show that it is a converging quasi-martingale. By construction of $\bar{g}_{t}$ and properties of the surrogates $g_{t}\in\mathcal{T}_{\rho,L}(f_{t},\theta_{t-1},\epsilon_{t})$, where $\epsilon_{t}$ is a non-negative sequence that meets (H),

[TABLE]

where the average error sequence ${(\bar{\epsilon}_{t})_{t}}$ is defined recursively: $\bar{\epsilon}_{0}\triangleq\epsilon_{0}$ and $\bar{\epsilon}_{t}\triangleq(1-w_{t})\bar{\epsilon}_{t-1}+w_{t}\epsilon_{t}$. The first inequality uses $\bar{g}_{t}(\theta_{t})\leq\bar{g}_{t}(\theta_{t-1})$. To obtain the fourth inequality we observe $g_{t}(\theta_{t-1})-f_{t}(\theta_{t-1})<\epsilon_{t}$ by definition of $\epsilon_{t}$ and $\bar{f}_{t}(\theta_{t-1})-\bar{g}_{t}(\theta_{t-1})\leq\bar{\epsilon}_{t}$, which can easily be shown by induction on $t$. Then, taking the conditional expectation with respect to $\mathcal{F}_{t-1}$,

[TABLE]

We have used the fact that $\epsilon_{t-1}$ is deterministic with respect to $\mathcal{F}_{t-1}$ . To ensure convergence, we must bound both terms in (A-B1): the first term is the same as in the original proof with exact surrogate, while the second is the perturbative term introduced by the approximation sequence ${(\epsilon_{t})}_{t}$ . We use Lemma B.7 from [10], issued from the theory of empirical processes: ${\mathbb{E}}[\sup_{\theta\in\Theta}|f(\theta)-\bar{f}_{t-1}(\theta)|]=\mathcal{O}(w_{t}t^{1/2})$ , and thus

[TABLE]

where $C$ is a constant, as $t^{1/2}w_{t}^{2}=t^{1/2-2u}$ and $u>3/4$ from (G). Let us now focus on the second term of (A-B1). Defining, for all $1\leq i\leq t$ , $w_{i}^{t}=w_{i}\prod_{j=i+1}^{t}(1-w_{j})$ ,

[TABLE]

We set $\eta>0$ so that $2(u-1)-\eta>-1$. Assumption (H) ensures ${\mathbb{E}}[\epsilon_{t}]\in\mathcal{O}(t^{2(u-1)-\eta})$, which allows us to bound the partial sum $\sum_{i=1}^{t}{\mathbb{E}}[\epsilon_{i}]\in\mathcal{O}(t^{2u-1-\eta})$. Therefore

[TABLE]

where we use $u<1$ on the third line and the definition of ${(w_{t})}_{t}$ on the second line. Thus $\sum_{t=1}^{\infty}w_{t}{\mathbb{E}}[\bar{\epsilon}_{t-1}+{\mathbb{E}}[\epsilon_{t}|\mathcal{F}_{t-1}]]<\infty$. We use quasi-martingale theory to conclude, as in [10]. We define the variable $\delta_{t}$ to be $1$ if ${\mathbb{E}}[\bar{g}_{t}(\theta_{t})-\bar{g}_{t-1}(\theta_{t-1})|\mathcal{F}_{t-1}]\geq 0$, and $0$ otherwise. As all terms of (A-B1) are positive:

[TABLE]

As $\bar{g}_{t}$ are bounded from below ($\bar{f}_{t}$ is bounded from (D) and we easily show that $\bar{\epsilon}_{t}$ is bounded), we can apply Theorem A.1 from [10], which is a quasi-martingale convergence theorem originally found in [51]. It ensures that ${(\bar{g}_{t}(\theta_{t}))}_{t\geq 1}$ converges almost surely to an integrable random variable $g^{\star}$, and that $\sum_{t=1}^{\infty}{\mathbb{E}}[|{\mathbb{E}}[\bar{g}_{t}(\theta_{t})-\bar{g}_{t-1}(\theta_{t-1})|\mathcal{F}_{t-1}]|]<\infty$ almost surely.

A-B2 Almost sure convergence of $\bar{f}_{t}(\theta_{t})$

We rewrite the second inequality of (A-B1), adding $\bar{\epsilon}_{t}$ on both sides:

[TABLE]

where the left side bound has been obtained in the last paragraph by induction and the right side bound arises from the definition of $\epsilon_{t}$ . Taking the expectation of (A-B2) conditioned on $\mathcal{F}_{t-1}$ , almost surely,

[TABLE]

We separately study the three terms of the previous upper bound. The first two terms can undergo the same analysis as in [10]. First, almost sure convergence of $\sum_{t=1}^{\infty}{\mathbb{E}}\big{[}|{\mathbb{E}}[\bar{g}_{t}(\theta_{t})-\bar{g}_{t-1}(\theta_{t-1})|\mathcal{F}_{t-1}]|\big{]}$ implies that ${\mathbb{E}}\big{[}\bar{g}_{t}(\theta_{t})-\bar{g}_{t-1}(\theta_{t-1})|\mathcal{F}_{t-1}\big{]}$ is the summand of an almost surely converging sum. Second, $w_{t}\big{(}f(\theta_{t-1})-\bar{f}_{t-1}(\theta_{t-1})\big{)}$ is the summand of an absolutely converging sum with probability one, as it would otherwise contradict (48). To bound the third term, we have once more to control the perturbation introduced by ${(\epsilon_{t})}_{t}$. We have $\sum_{t=1}^{\infty}w_{t}\bar{\epsilon}_{t-1}+w_{t}{\mathbb{E}}[\epsilon_{t}|\mathcal{F}_{t-1}]<\infty$ almost surely, otherwise Fubini’s theorem would invalidate (A-B1).

As the three terms are the summands of absolutely converging sums, the positive term $w_{t}(\bar{g}_{t-1}(\theta_{t-1})-\bar{f}_{t-1}(\theta_{t-1})+\bar{\epsilon}_{t-1})$ is the summand of an almost surely convergent sum. This is not enough to prove that $\bar{h}_{t}(\theta_{t})\triangleq\bar{g}_{t}(\theta_{t})-\bar{f}_{t}(\theta_{t})\to_{\infty}0$, hence we follow [10] and make use of its Lemma A.6. We define $X_{t}\triangleq\bar{h}_{t-1}(\theta_{t-1})+\bar{\epsilon}_{t-1}$. As (H) holds, we use Lemma 3, which ensures that ${(\bar{h}_{t})}_{t\geq 1}$ are uniformly $R^{\prime}$-Lipschitz and $\|\theta_{t}-\theta_{t-1}\|_{2}=\mathcal{O}(w_{t})$. Hence,

[TABLE]

From assumption (H), $(\epsilon_{t})_{t}$ and $(\bar{\epsilon}_{t})_{t}$ are bounded. Therefore $|\bar{\epsilon}_{t}-\bar{\epsilon}_{t-1}|\leq w_{t}(|\epsilon_{t}|+|\bar{\epsilon}_{t-1}|)\in\mathcal{O}(w_{t})$ and hence

[TABLE]

Lemma A.6 from [10] then ensures that $X_{t}$ converges to zero with probability one. Assumption (H) ensures that $\epsilon_{t}\to_{\infty}0$ almost surely, from which we can easily deduce $\bar{\epsilon}_{t}\to_{\infty}0$ almost surely. Therefore $\bar{h}_{t}(\theta_{t})\to 0$ with probability one and ${(\bar{f}_{t}(\theta_{t}))}_{t\geq 1}$ converges almost surely to $g^{\star}$.

A-B3 Almost sure convergence of $\bar{f}(\theta_{t})$

Lemma B.7 of [10], based on empirical process theory [33], ensures that $\bar{f}_{t}$ uniformly converges to $\bar{f}$ . Therefore, ${(\bar{f}(\theta_{t}))}_{t\geq 1}$ converges almost surely to $g^{\star}$ .

A-B4 Asymptotic stationary point condition

Preliminary to the final result, we establish the asymptotic stationary point condition (57) as in [10]. This requires adapting the original proof to take into account the errors in surrogate computation and minimization. We set $\alpha>0$. By definition, $\nabla\bar{h}_{t}$ is $L$-Lipschitz over ${\mathbb{R}}^{K}$. Following the same computation as in (34), we obtain, for all $\alpha>0$,

[TABLE]

where we use $|\bar{h}_{t}(\theta)|\leq\bar{\epsilon}_{t}$ for all $\theta\in{\mathbb{R}}^{K}$ . As $\bar{\epsilon}_{t}\to 0$ and the inequality (56) is true for all $\alpha$ , $\|\nabla\bar{h}_{t}(\theta_{t})\|_{2}\to_{\infty}0$ almost surely. From the strong convexity of $\bar{g}_{t}$ and Lemma 3, $\|\theta_{t}-\theta_{t}^{\star}\|_{2}$ converges to zero, which ensures

[TABLE]

A-B5 Parametrized surrogates

We use assumption (F) to finally prove the property, adapting the proof of Proposition 3.4 in [10]. We first recall the derivations of [10] for obtaining (58). We define $(\boldsymbol{\kappa}_{t})_{t}$ such that $\bar{g}_{t}=g_{\boldsymbol{\kappa}_{t}}$ for all $t>0$. We assume that $\theta_{\infty}$ is a limit point of ${(\theta_{t})}_{t}$. As $\Theta$ is compact, there exists an increasing sequence $(t_{k})_{k}$ such that $(\theta_{t_{k}})_{k}$ converges toward $\theta_{\infty}$. As $\mathcal{K}$ is compact, a converging subsequence of $(\boldsymbol{\kappa}_{t_{k}})_{k}$ can be extracted, that converges towards $\boldsymbol{\kappa}_{\infty}\in\mathcal{K}$. For the sake of simplicity, we drop subindices and assume without loss of generality that $\theta_{t}\to\theta_{\infty}$ and $\boldsymbol{\kappa}_{t}\to\boldsymbol{\kappa}_{\infty}$. From the compact parametrization assumption, we easily show that ${(\bar{g}_{\boldsymbol{\kappa}_{t}})}_{t}$ uniformly converges towards $\bar{g}_{\infty}\triangleq\bar{g}_{\boldsymbol{\kappa}_{\infty}}$. Then, defining $\bar{h}_{\infty}=\bar{g}_{\infty}-\bar{f}$, for all $\theta\in\Theta$,

[TABLE]

We first show that $\nabla\bar{g}_{\infty}(\theta_{\infty},\theta-\theta_{\infty})\geq 0$ for all $\theta\in\Theta$. We consider the sequence ${(\theta_{t}^{\star})}_{t}$. From Lemma 3, $\|\theta_{t}-\theta_{t}^{\star}\|_{2}\to 0$, which implies $\theta_{t}^{\star}\to\theta_{\infty}$. $\bar{g}_{t}$ converges uniformly towards $\bar{g}_{\infty}$, which implies ${(\bar{g}_{t}(\theta_{t}^{\star}))}_{t}\to\bar{g}_{\infty}(\theta_{\infty})$. Furthermore, as $\theta_{t}^{\star}$ minimizes $\bar{g}_{t}$, for all $t>0$ and $\theta\in\Theta$, $\bar{g}_{t}(\theta_{t}^{\star})\leq\bar{g}_{t}(\theta)$. This implies $\bar{g}_{\infty}(\theta_{\infty})\leq\inf_{\theta\in\Theta}\bar{g}_{\infty}(\theta)$ by taking the limit for $t\to\infty$. Therefore $\theta_{\infty}$ is the minimizer of $\bar{g}_{\infty}$ and thus $\nabla\bar{g}_{\infty}(\theta_{\infty},\theta-\theta_{\infty})\geq 0$.

Adapting [10], we perform the first-order expansion of $\bar{h}_{t}$ around $\theta_{t}^{\star}$ (instead of $\theta_{t}$ in the original proof) and show that $\nabla\bar{h}_{\infty}(\theta_{\infty},\theta-\theta_{\infty})=0$, as $\bar{h}_{t}$ is differentiable, $\|\nabla\bar{h}_{t}(\theta_{t}^{\star})\|_{2}\to 0$ and $\theta_{t}^{\star}\to\theta_{\infty}$. This is sufficient to conclude. ∎

A-C Convergence of somf — Proof of Proposition 1

Proof of Proposition 1.

From assumption (D), ${(x_{t})}_{t}$ is $\ell_{2}$-bounded by a constant $X$. With assumption (A), it implies that ${(\boldsymbol{\alpha}_{t})}_{t}$ is $\ell_{2}$-bounded by a constant $A$. This is enough to show that $(g_{t})_{t}$ and $(\theta_{t})_{t}$ meet basic assumptions (C)–(F). Assumption (G) immediately implies (B). It remains to show that $(g_{t})_{t}$ and $(\theta_{t})_{t}$ meet the assumptions (H) and (I). This will allow us to cast somf as an instance of samm and conclude.

A-C1 The computation of ${\mathbf{D}}_{t}$ verifies (I)

We define ${\mathbf{D}}_{t}^{\star}=\operatornamewithlimits{\mathrm{argmin}}_{{\mathbf{D}}\in\mathcal{C}}\bar{g}_{t}({\mathbf{D}})$ . We show that performing subsampled block coordinate descent on $\bar{g}_{t}$ is sufficient to meet assumption (I), where $\theta_{t}={\mathbf{D}}_{t}$ . We separately analyse the exceptional case where no subsampling is done and the general case.

First, with small but non-zero probability, ${\mathbf{M}}_{t}={\mathbf{I}}_{p}$ and Alg. 4 performs a single pass of simple block coordinate descent on $\bar{g}_{t}$. In this case, as $\bar{g}_{t}$ is strongly convex from (A), [52, 37] ensure that a single pass of block coordinate descent reduces the sub-optimality by at least a factor $1-\mu$, where $\mu>0$ is a constant independent of $t$. We provide an explicit $\mu$ in Appendix B.

In the general case, the function value decreases deterministically at each minimization step: $\bar{g}_{t}({\mathbf{D}}_{t})\leq\bar{g}_{t}({\mathbf{D}}_{t-1})$ . As a consequence, ${\mathbb{E}}[\bar{g}_{t}({\mathbf{D}}_{t})|\mathcal{F}_{t-\frac{1}{2}},{\mathbf{M}}_{t}\neq{\mathbf{I}}_{p}]\leq\bar{g}_{t}({\mathbf{D}}_{t-1})$ . Furthermore, $\bar{g}_{t}$ and hence $\bar{g}_{t}({\mathbf{D}}_{t}^{\star})$ are deterministic with respect to $\mathcal{F}_{t-\frac{1}{2}}$ , which implies ${\mathbb{E}}[\bar{g}_{t}({\mathbf{D}}_{t}^{\star})|\mathcal{F}_{t-\frac{1}{2}},{\mathbf{M}}_{t}\neq{\mathbf{I}}_{p}]=\bar{g}_{t}({\mathbf{D}}_{t}^{\star})$ . Defining $d\triangleq{\mathbb{P}}[{\mathbf{M}}_{t}={\mathbf{I}}_{p}]$ , we split the sub-optimality expectation and combine the analysis of both cases:

[TABLE]
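Spelled out, the combination of the two cases amounts to a law-of-total-expectation argument. The display below is a reconstruction sketch rather than the paper's exact equation: it assumes that ${\mathbf{M}}_{t}$ is drawn independently of $\mathcal{F}_{t-\frac{1}{2}}$, and it introduces the shorthands $\Delta_{t}$ and $\Delta_{t}^{-}$, which are not notation from the paper.

```latex
\begin{aligned}
&\Delta_t \triangleq \bar g_t(\mathbf{D}_t) - \bar g_t(\mathbf{D}_t^\star),
\qquad
\Delta_t^{-} \triangleq \bar g_t(\mathbf{D}_{t-1}) - \bar g_t(\mathbf{D}_t^\star), \\
&\mathbb{E}\big[\Delta_t \,\big|\, \mathcal{F}_{t-\frac12}\big]
 = d\,\mathbb{E}\big[\Delta_t \,\big|\, \mathcal{F}_{t-\frac12},\, \mathbf{M}_t = \mathbf{I}_p\big]
 + (1-d)\,\mathbb{E}\big[\Delta_t \,\big|\, \mathcal{F}_{t-\frac12},\, \mathbf{M}_t \neq \mathbf{I}_p\big] \\
&\phantom{\mathbb{E}\big[\Delta_t \,\big|\, \mathcal{F}_{t-\frac12}\big]}
 \le \big(d\,(1-\mu) + (1-d)\big)\,\Delta_t^{-}
 = (1 - d\mu)\,\Delta_t^{-}.
\end{aligned}
```

Since $d>0$ and $\mu>0$, the factor $1-d\mu$ is strictly smaller than one, which is the expected linear decrease required by (I).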

A-C2 The surrogates ${(g_{t})}_{t}$ verify (H)

We define $g_{t}^{\star}\in\mathcal{S}_{\rho,L}(f_{t},{\mathbf{D}}_{t-1})$ the surrogate used in omf at iteration $t$ , which depends on the exact computation of $\boldsymbol{\alpha}_{t}^{\star}$ , while the surrogate $g_{t}$ used in somf relies on approximated $\boldsymbol{\alpha}_{t}$ . Formally, using the loss function $\ell(\boldsymbol{\alpha},{\mathbf{G}},\boldsymbol{\beta})\triangleq\frac{1}{2}\boldsymbol{\alpha}^{\top}{\mathbf{G}}\boldsymbol{\alpha}-\boldsymbol{\alpha}^{\top}\boldsymbol{\beta}+\lambda\Omega(\boldsymbol{\alpha})$ , we recall the definitions

[TABLE]

The matrices ${\mathbf{G}}_{t}^{\star}$, $\boldsymbol{\beta}_{t}^{\star}$ are defined in (21) and ${\mathbf{G}}_{t}$, $\boldsymbol{\beta}_{t}$ in either the update rules (b) or (c). We define $\epsilon_{t}\triangleq\|g_{t}^{\star}-g_{t}\|_{\infty}$ to be the $\ell_{\infty}$ difference between the approximate surrogate of somf and the exact surrogate of omf, as illustrated in Figure 2. By definition, $g_{t}\in\mathcal{T}_{\rho,L}(f_{t},\theta_{t-1},\epsilon_{t})$. We first show that $\epsilon_{t}$ can be bounded by the Frobenius distance between the approximate parameters ${\mathbf{G}}_{t}$, $\boldsymbol{\beta}_{t}$ and the exact parameters ${\mathbf{G}}_{t}^{\star},\boldsymbol{\beta}_{t}^{\star}$. Using the Cauchy-Schwarz inequality, we first show that there exists a constant $C^{\prime}>0$ such that for all ${\mathbf{D}}\in\mathcal{C}$,

[TABLE]

Then, we show that the distance ${\|\boldsymbol{\alpha}_{t}-\boldsymbol{\alpha}_{t}^{*}\|}_{2}$ can itself be bounded: there exists $C^{\prime\prime}>0$ constant such that

[TABLE]

We combine both equations and take the supremum over ${\mathbf{D}}\in\mathcal{C}$, yielding

[TABLE]

where $C$ is constant. Detailed derivations of (61) to (63) rely on assumption (A) and are reported in Appendix B.

In a second step, we show that $\|{\mathbf{G}}_{t}^{\star}-{\mathbf{G}}_{t}\|_{F}$ and $\|\boldsymbol{\beta}_{t}^{\star}-\boldsymbol{\beta}_{t}\|_{2}$ vanish almost surely, sufficiently fast. We focus on bounding $\|\boldsymbol{\beta}_{t}-\boldsymbol{\beta}_{t}^{\star}\|_{2}$ and proceed similarly for $\|{\mathbf{G}}_{t}-{\mathbf{G}}_{t}^{\star}\|_{F}$ when the update rules (b) are used. For $t>0$, we write $i\triangleq i_{t}$. Then

[TABLE]

where $\gamma_{s,t}^{(i)}=\gamma_{c^{(i)}_{t}}\prod_{s<t,{\mathbf{x}}_{s}={\mathbf{x}}^{(i)}}(1-\gamma_{c^{(i)}_{s}})$ and $c^{(i)}_{t}=\left|\left\{s\leq t,{\mathbf{x}}_{s}={\mathbf{x}}^{(i)}\right\}\right|$ . We can then decompose $\boldsymbol{\beta}_{t}-\boldsymbol{\beta}_{t}^{\star}$ as

[TABLE]

The latter equation is composed of two terms: the first one captures the approximation made by using old dictionaries in the computation of ${(\boldsymbol{\beta}_{t})}_{t}$, while the second captures how the masking effect is averaged out as the number of epochs increases. Assumption (B) allows us to bound both terms at the same time. Setting $\eta\triangleq\frac{1}{2}\min\big{(}v-\frac{3}{4},(3u-2)-v\big{)}>0$, a tedious but elementary derivation indeed shows ${\mathbb{E}}[\|\boldsymbol{\beta}_{t}-\boldsymbol{\beta}_{t}^{\star}\|_{2}]\in\mathcal{O}(t^{2(u-1)-\eta})$ and $\epsilon_{t}\to 0$ almost surely (see Appendix B). The somf algorithm therefore meets assumption (H) and is a convergent samm algorithm. Proposition 1 follows. ∎

Appendix B Algebraic details

B-A Proof of Lemma 1

Proof.

We first focus on the deterministic case. Assume that ${(x_{t})}_{t}$ is not bounded. Then there exists a subsequence of ${(x_{t})}_{t}$ that diverges towards $+\infty$ . We assume without loss of generality that ${(x_{t})}_{t}\to\infty$ . Then, $x_{t}+x_{t-1}\to\infty$ and for all $\epsilon>0$ , using the asymptotic bounds on $u$ , there exists $t_{1}\geq t_{0}$ such that

[TABLE]

Setting $\epsilon$ small enough, we obtain that $x_{t}$ is bounded by a geometrically decreasing sequence after $t_{1}$ , and therefore converges to $0$ , which contradicts our hypothesis. This is enough to conclude.

In the random case, we consider a realization of ${(X_{t})}_{t}$ that is not bounded, and assume without loss of generality that it diverges to $+\infty$ . Following the reasoning above, there exist $\beta<1$ and $t_{1}>0$ such that for all $t>t_{1}$ , ${\mathbb{E}}[X_{t}|\mathcal{F}_{t^{\prime}}]\leq\beta X_{t-1}$ , where $\mathcal{F}_{t-1}\subseteq\mathcal{F}_{t^{\prime}}\subseteq\mathcal{F}_{t}$ . Taking the expectation conditioned on $\mathcal{F}_{t-1}$ , we obtain ${\mathbb{E}}[X_{t}|\mathcal{F}_{t-1}]\leq\beta X_{t-1}$ , as $X_{t-1}$ is deterministic conditioned on $\mathcal{F}_{t-1}$ . Therefore ${(X_{t})}_{t}$ is a supermartingale beyond a certain time. As ${\mathbb{E}}[X_{t}]<\infty$ , Doob’s forward convergence lemma on discrete martingales [53] ensures that ${(X_{t})}_{t}$ converges almost surely. Therefore the event $\{{(X_{t})}_{t}\>\text{is not bounded}\}$ cannot happen on a set with non-zero probability, as it would otherwise lead to a contradiction. The lemma follows.∎

B-B Taylor’s inequality for $L$ -Lipschitz continuous functions

This inequality is used in the proofs of Lemma 2 and Proposition 3. Let $f:\Theta\subset{\mathbb{R}}^{K}\to{\mathbb{R}}$ be a function with $L$ -Lipschitz gradient. That is, for all $\theta,\theta^{\prime}\in\Theta,{\|\nabla f(\theta)-\nabla f(\theta^{\prime})\|}_{2}\leq L{\|\theta-\theta^{\prime}\|}_{2}$ . Then, for all $\theta,\theta^{\prime}\in\Theta$ ,

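$$f(\theta^{\prime})\leq f(\theta)+\nabla f(\theta)^{\top}(\theta^{\prime}-\theta)+\frac{L}{2}{\|\theta-\theta^{\prime}\|}_{2}^{2}.$$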
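As a sanity check, the standard form of this bound, $f(\theta^{\prime})\leq f(\theta)+\nabla f(\theta)^{\top}(\theta^{\prime}-\theta)+\frac{L}{2}{\|\theta-\theta^{\prime}\|}_{2}^{2}$ , can be verified numerically. The sketch below is only an illustration under assumptions that are not part of the proof: a hypothetical quadratic $f(\theta)=\frac{1}{2}\theta^{\top}A\theta$ with a hand-picked $2\times 2$ matrix `A`, for which $L$ is the largest eigenvalue of `A`.

```python
import math
import random

# Hypothetical quadratic f(theta) = 0.5 * theta^T A theta with symmetric A.
A = [[3.0, 1.0], [1.0, 2.0]]
# The gradient A @ theta is Lipschitz with constant L = largest eigenvalue of A.
L = (5.0 + math.sqrt(5.0)) / 2.0


def f(theta):
    return 0.5 * sum(theta[i] * A[i][j] * theta[j]
                     for i in range(2) for j in range(2))


def grad(theta):
    return [sum(A[i][j] * theta[j] for j in range(2)) for i in range(2)]


def taylor_gap(theta, theta_p):
    """Right-hand side minus left-hand side of the bound; expected >= 0."""
    delta = [theta_p[i] - theta[i] for i in range(2)]
    linear = sum(g * dl for g, dl in zip(grad(theta), delta))
    quad = 0.5 * L * sum(dl * dl for dl in delta)
    return f(theta) + linear + quad - f(theta_p)


rng = random.Random(0)
gaps = [
    taylor_gap([rng.uniform(-1, 1), rng.uniform(-1, 1)],
               [rng.uniform(-1, 1), rng.uniform(-1, 1)])
    for _ in range(1000)
]
print(min(gaps))
```

For a quadratic the gap equals $\frac{1}{2}\delta^{\top}(L{\mathbf{I}}-A)\delta$ with $\delta=\theta^{\prime}-\theta$ , so it is non-negative up to floating-point rounding.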

B-C Lemma 3: Detailed control of $D_{t}$ in (44)

Injecting (40) and (42) in (43), we obtain

[TABLE]

From assumption (G), $\frac{w^{2}_{t-1}}{w^{2}_{t}}\to 1$ , and we have, from elementary comparisons, that $u(\tilde{D}_{t},\tilde{D}_{t-1})\in o(\tilde{D}_{t}+\tilde{D}_{t-1})$ if ${D_{t}\to\infty}$ . Using the deterministic result of Lemma 1, this ensures that $\tilde{D}_{t}$ is bounded.

B-D Detailed derivations in the proof of Proposition 1

Let us first exhibit a scalar $\mu>0$ , independent of $t$ , for which (I) is met.

B-D1 Geometric rate for single pass subsampled block coordinate descent

For ${\mathbf{D}}^{(j)}\in{\mathbb{R}}^{p\times k}$ any matrix with non-zero $j$ -th column ${\mathbf{d}}^{(j)}$ and zero elsewhere

[TABLE]

and hence the gradient of $\bar{g}_{t}$ has component Lipschitz constant $L_{j}=\bar{\mathbf{C}}_{t}[j,j]$ for component $j$ , as already noted in [15]. Using the terminology of [37], $\nabla\bar{g}_{t}$ has coordinate Lipschitz constant $L_{\mathrm{max}}\triangleq\max_{0\leq j<k}\bar{\mathbf{C}}_{t}[j,j]\leq\max_{t>0,0\leq j<k}\boldsymbol{\alpha}_{t}[j]^{2}\leq A^{2}$ , as $(\boldsymbol{\alpha}_{t})_{t}$ is bounded from (A). As a consequence, the gradient of $\bar{g}_{t}$ is also $L$ -Lipschitz continuous, where, as noted in [37], $L\leq\sqrt{k}L_{\mathrm{max}}$ . Moreover, $\bar{g}_{t}$ is strongly convex with strong convexity modulus $\rho>0$ by hypothesis (A). Then, [52] ensures that after one cycle over the $k$ blocks

[TABLE]
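The geometric decrease over cycles can be observed on a toy problem. The sketch below is illustrative only and rests on assumptions that are not in the text: an unconstrained strongly convex quadratic $f(d)=\frac{1}{2}d^{\top}Cd-b^{\top}d$ (no projection onto $\mathcal{C}$ ), a hypothetical $3\times 3$ matrix `C` and vector `b`, and the coordinate step size $1/C[j,j]$ suggested by the component Lipschitz constants above. It does not reproduce the exact rate given in [52].

```python
# Toy strongly convex quadratic f(d) = 0.5 * d^T C d - b^T d (hypothetical data).
C = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 1.5]]
b = [1.0, -1.0, 0.5]
k = len(b)


def objective(d):
    quad = sum(d[i] * C[i][j] * d[j] for i in range(k) for j in range(k))
    return 0.5 * quad - sum(b[i] * d[i] for i in range(k))


def one_cycle(d):
    """One pass over the k coordinates, with step 1 / C[j][j] on coordinate j."""
    d = list(d)
    for j in range(k):
        grad_j = sum(C[j][l] * d[l] for l in range(k)) - b[j]
        d[j] -= grad_j / C[j][j]
    return d


# Reference minimum value, approximated by running many cycles.
d_ref = [0.0] * k
for _ in range(200):
    d_ref = one_cycle(d_ref)
f_star = objective(d_ref)

# Suboptimality gap before the first cycle and after each of five cycles.
d = [0.0] * k
gaps = [objective(d) - f_star]
for _ in range(5):
    d = one_cycle(d)
    gaps.append(objective(d) - f_star)
print(gaps)
```

On this example the suboptimality gap shrinks by a roughly constant factor at each cycle, which is the qualitative behavior that the bound above quantifies.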

B-D2 Controlling $\epsilon_{t}$ from $({\mathbf{G}}_{t},\boldsymbol{\beta}_{t}),({\mathbf{G}}_{t}^{\star},\boldsymbol{\beta}_{t}^{\star})$ (Equations 61–62)

We detail the derivations that are required to show that (H) is met in the proof of somf convergence. We first show that $(\boldsymbol{\alpha}_{t})_{t}$ is bounded. We choose $D>0$ such that ${\|{\mathbf{d}}^{(j)}\|}_{2}\leq D$ for all $j\in[k]$ and ${\mathbf{D}}\in\mathcal{C}$ , and $X$ such that ${\|{\mathbf{x}}\|}_{2}\leq X$ for all ${\mathbf{x}}\in\mathcal{X}$ . From assumption (A), using the second-order growth condition, for all $t>0$ ,

[TABLE]

We have successively used the fact that $\Omega(0)=0$ , $\Omega(\boldsymbol{\alpha}_{t})\geq 0$ , and ${\|\boldsymbol{\beta}_{t}\|}_{2}\leq\sqrt{k}rDX$ , which can be shown by a simple induction on the number of epochs. For all $t>0$ , from the definition of $\boldsymbol{\alpha}_{t}$ and $\boldsymbol{\alpha}_{t}^{\star}$ , for all ${\mathbf{D}}\in\mathcal{C}$ :

[TABLE]

where we use the Cauchy-Schwarz inequality and elementary bounds on the Frobenius norm for the first inequality, and use ${\|\boldsymbol{\alpha}_{t}\|}_{2},{\|\boldsymbol{\alpha}_{t}^{\star}\|}_{2}\leq A$ , ${\|{\mathbf{x}}_{t}\|}_{2}\leq X$ for all $t>0$ and ${\|{\mathbf{d}}^{(j)}\|}_{2}\leq D$ for all $j\in[k]$ to obtain the second inequality, which is (61) in the main text.

We now turn to controlling ${\|\boldsymbol{\alpha}_{t}-\boldsymbol{\alpha}_{t}^{\star}\|}_{2}$ . We adapt the proof of Lemma B.6 from [36], which states the Lipschitz continuity of the minimizers of some parametrized functions. By definition,

[TABLE]

Assumption (A) ensures that ${\mathbf{G}}_{t}\succ\rho{\mathbf{I}}_{k}$ , therefore we can write the second-order growth condition

[TABLE]

$p$ takes a simple form and can be differentiated with respect to $\boldsymbol{\alpha}$ . For all $\boldsymbol{\alpha}\in{\mathbb{R}}^{k}$ such that ${\|\boldsymbol{\alpha}\|}_{2}\leq A$ ,

[TABLE]

Therefore $p$ is $L$ -Lipschitz on the ball of radius $A$ where $\boldsymbol{\alpha}_{t}$ and $\boldsymbol{\alpha}_{t}^{\star}$ live, and

[TABLE]

which is (62) in the main text. The bound (63) on $\epsilon_{t}$ immediately follows.

B-D3 Bounding ${\|\boldsymbol{\beta}_{t}-\boldsymbol{\beta}_{t}^{\star}\|}_{2}$ in equation (A-C2)

Taking the $\ell_{2}$ norm in (A-C2), we have ${\|\boldsymbol{\beta}_{t}-\boldsymbol{\beta}_{t}^{\star}\|}_{2}\leq BL_{t}+CR_{t}$ , where $B$ and $C$ are positive constants independent of $t$ and we introduce the terms

[TABLE]

Conditioning on the sequence of drawn indices

We recall that ${(i_{t})}_{t}$ is the sequence of indices that are used to draw ${({\mathbf{x}}_{t})}_{t}$ from ${\{{\mathbf{x}}^{(i)}\}}_{i}$ , namely such that ${\mathbf{x}}_{t}={\mathbf{x}}^{(i_{t})}$ . ${(i_{t})}_{t}$ is a sequence of i.i.d. random variables, whose law is uniform in $[1,n]$ . For each $i\in[n]$ , we define the increasing sequence ${(t_{b}^{(i)})}_{b>0}$ that records the iterations at which sample $(i)$ is drawn, i.e. such that $i_{t_{b}^{(i)}}=i$ for all $b>0$ . For $t>0$ , we recall that $c_{t}^{(i)}>0$ is the integer that counts the number of times sample $(i)$ has appeared in the algorithm, i.e. $c_{t}^{(i)}=\max\,\{b>0,t_{b}^{(i)}\leq t\}$ . These notations will help us understand the behavior of ${(L_{t})}_{t}$ and ${(R_{t})}_{t}$ .
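The bookkeeping above can be made concrete on a short, hand-written index sequence. The following sketch is purely illustrative: the helper names `draw_times` and `count_up_to` are hypothetical, samples are numbered from 0 (as in the $(0)$ superscripts used later), and iterations are numbered from 1.

```python
def draw_times(indices, n):
    """times[i] lists the iterations t (1-based) at which sample i is drawn,
    i.e. the increasing sequence (t_b^{(i)})_b."""
    times = [[] for _ in range(n)]
    for t, i in enumerate(indices, start=1):
        times[i].append(t)
    return times


def count_up_to(times_i, t):
    """c_t^{(i)}: number of draws of sample i among iterations 1..t."""
    return sum(1 for t_b in times_i if t_b <= t)


# Hand-written sequence (i_t)_t for n = 3 samples and 6 iterations.
indices = [0, 2, 0, 1, 0, 2]
times = draw_times(indices, n=3)
counts_at_4 = [count_up_to(times[i], 4) for i in range(3)]
print(times, counts_at_4)
```

Here sample 0 is drawn at iterations 1, 3 and 5, so $t_{2}^{(0)}=3$ and $c_{4}^{(0)}=2$ ; the counts over all samples sum to the current iteration number.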

Bounding $R_{t}$

The right term $R_{t}$ takes its values in sequences that are running averages of masking matrices. Formally, $R_{t}={\|\bar{\mathbf{M}}_{t}^{(i_{t})}-{\mathbf{I}}\|}_{F}$ , where we define for all $i\in[n]$ ,

[TABLE]

When sampling a sequence of indices $(i_{s})_{s>0}$ , the $n$ random matrix sequences ${[{(\bar{\mathbf{M}}_{t}^{(i)})}_{t>0}]}_{i\in[n]}$ follow the same probability law, as the sampling is uniform. We therefore focus on controlling ${(\bar{\mathbf{M}}_{t}^{(0)})}_{t}$ . For simplicity, we write $c_{t}\triangleq c_{t}^{(0)}$ . When ${\mathbb{E}}[\cdot]$ is the expectation over the sequence of indices $(i_{s})_{s}$ ,

[TABLE]

We have simply bounded the Frobenius norm by the $\ell_{1}$ norm in the first inequality and used the fact that all coefficients ${\mathbf{M}}_{t}[j,j]$ follow the same Bernoulli law for all $t>0$ , $j\in[p]$ . We then used Lemma B.7 from [10] for the last inequality. This lemma applies as ${\mathbf{M}}_{t}[0,0]$ follows the recursion (80). It remains to take the expectation of (B-D3), over all possible sampling trajectories $(i_{s})_{s>0}$ :

[TABLE]

The last inequality arises from the definition of $\eta\triangleq\frac{1}{2}\min\big{(}v-\frac{3}{4},(3u-2)-v\big{)}$ , as follows. First, $\eta>0$ as $u>\frac{11}{12}$ . Then, we successively have

[TABLE]

Lemma B.7 from [10] also ensures that ${\mathbf{M}}_{t}[0,0]\to 1$ almost surely when $t\to\infty$ . Therefore ${(\bar{\mathbf{M}}_{t}^{(0)}-{\mathbf{I}})}_{t}$ converges towards $0$ almost surely, given any sample sequence $(i_{s})_{s}$ . It thus converges almost surely when all random variables of the algorithm are considered. This is also true for ${(\bar{\mathbf{M}}_{t}^{(i)}-{\mathbf{I}})}_{t}$ for all $i\in[n]$ and hence for $R_{t}$ .
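The averaging-out of the masks can be illustrated by simulation. The sketch below relies on assumptions that are not spelled out in this appendix: each diagonal mask coefficient is drawn as a Bernoulli variable of parameter $1/r$ and rescaled by $r$ , the running average follows a recursion of the form $\bar{m}_{c}=(1-\gamma_{c})\bar{m}_{c-1}+\gamma_{c}\,r\,m_{c}$ , and $\gamma_{c}=c^{-v}$ . The values of `p`, `r`, `v` and the function name are hypothetical.

```python
import random


def averaged_mask_deviation(num_draws, p=20, r=4, v=0.76, seed=0):
    """Mean absolute deviation from 1 of the averaged diagonal mask
    coefficients after `num_draws` draws of a single sample."""
    rng = random.Random(seed)
    m_bar = [0.0] * p
    for c in range(1, num_draws + 1):
        gamma = c ** (-v)  # assumed weight sequence, gamma_1 = 1
        for j in range(p):
            # Bernoulli(1/r) mask coefficient, rescaled by r (assumption).
            m = 1.0 if rng.random() < 1.0 / r else 0.0
            m_bar[j] = (1.0 - gamma) * m_bar[j] + gamma * r * m
    return sum(abs(x - 1.0) for x in m_bar) / p


short_run = averaged_mask_deviation(100)
long_run = averaged_mask_deviation(20000)
print(short_run, long_run)
```

With these settings the deviation after 20000 draws is a few percent, in line with the vanishing of $R_{t}$ ; the rate itself is what Lemma B.7 from [10] controls.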

Bounding $L_{t}$

As above, we define $n$ sequences ${[{(L_{t}^{(i)})}_{t}]}_{i\in[n]}$ , such that $L_{t}=L_{t}^{(i_{t})}$ for all $t>0$ . Namely,

[TABLE]

Once again, the sequences $\big{[}{(L_{t}^{(i)})}_{t}\big{]}_{i}$ all follow the same distribution when sampling over sequences of indices $(i_{s})_{s}$ . We thus focus on bounding ${(L_{t}^{(0)})}_{t}$ . Once again, we drop the $(0)$ superscripts in the right expression for simplicity. We set $\nu\triangleq 3u-2-\eta$ . From assumption (B) and the definition of $\eta$ , we have $v<\nu<1$ . We split the sum in two parts, around index $d_{t}\triangleq c_{t}-\lfloor{(c_{t})}^{\nu}\rfloor$ , where $\lfloor\cdot\rfloor$ takes the integer part of a real number. For simplicity, we write $d\triangleq d_{t}$ and $c\triangleq c_{t}$ in the following.

[TABLE]

On the left side, we have bounded ${\|{\mathbf{D}}_{t}\|}_{F}$ by $\sqrt{k}D$ , where $D$ is defined in the previous section. The right part uses the bound on ${\|{\mathbf{D}}_{s}-{\mathbf{D}}_{t}\|}_{F}$ provided by Lemma 3, that applies here as (I) is met and (63) ensures that ${({\|g_{t}-g_{t}^{\star}\|}_{\infty})}_{t}$ is bounded.

We now study both $L_{t,1}^{(0)}$ and $L_{t,2}^{(0)}$ . First, for all $t>0$ ,

[TABLE]

where $C$ and $C^{\prime}$ are constants independent of $t$ . We have used $\nu>v$ for the third inequality, which ensures that $\exp\big{(}{\log(1-\frac{1}{c^{v}})c^{\nu}}\big{)}\in\mathcal{O}({c^{\nu-v}})$ . Basic asymptotic comparison provides the last inequality, as $c_{t}\to\infty$ almost surely and the right term decays exponentially in ${(c_{t})}_{t}$ , while the left decays polynomially. As a consequence, $L_{t,1}^{(0)}\to 0$ almost surely.
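The effect of splitting the sum at $d_{t}$ can be seen numerically on the averaging weights. The sketch below assumes, as a reading of the definition of $\gamma_{s,t}$ , that the weight of the $b$ -th draw after $c$ draws is $\gamma_{b}\prod_{b<b^{\prime}\leq c}(1-\gamma_{b^{\prime}})$ with $\gamma_{b}=b^{-v}$ . The values $u=0.99$ , $v=0.76$ (hence $\eta=0.005$ and $\nu=0.965$ ) are hypothetical choices satisfying $v<\nu<1$ .

```python
import math


def unrolled_weights(c, v):
    """weights[b] is the weight of the b-th draw (b = 1..c) in the running
    average after c draws, assuming gamma_b = b ** (-v)."""
    weights = [0.0] * (c + 1)  # index 0 is unused
    tail = 1.0  # product of (1 - gamma_b') over b < b' <= c
    for b in range(c, 0, -1):
        gamma_b = b ** (-v)
        weights[b] = gamma_b * tail
        tail *= 1.0 - gamma_b
    return weights


c, v, nu = 20000, 0.76, 0.965
d = c - math.floor(c ** nu)      # split index d_t
w = unrolled_weights(c, v)
total = sum(w[1:])               # weights sum to 1 since gamma_1 = 1
old_mass = sum(w[1:d + 1])       # total weight of draws older than d
print(d, total, old_mass)
```

The weight carried by the draws older than $d$ equals $\prod_{d<b\leq c}(1-\gamma_{b})$ , which is why the left part of the split becomes negligible, while the remaining recent draws involve dictionaries that are close to ${\mathbf{D}}_{t}$ .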

Secondly, the right term can be bounded as ${(w_{t})}_{t}$ decays sufficiently rapidly. Indeed, as $\sum_{b=1}^{c}\gamma_{t_{b},t}=1$ , we have

[TABLE]

from elementary comparisons. First, we use the definition of $\nu$ to draw

[TABLE]

where we use the fact that $\eta-1<0$ . We note that for all $b>0$ , $t_{b+1}-t_{b}$ follows a geometric law of parameter $\frac{1}{n}$ , and expectation $n$ . Therefore, as $c-d\to\infty$ when $t\to\infty$ , from the strong law of large numbers and linearity of the expectation

[TABLE]

As a consequence, $\frac{t_{c}-t_{d}}{c_{t}-d_{t}}(\frac{d_{t}}{t_{d}})^{u}\to n^{1-u}$ almost surely. This immediately shows $L_{t,2}^{(0)}\to 0$ and thus $L_{t}^{(0)}\to 0$ almost surely. As with $R_{t}$ , this implies that $L_{t}\to 0$ almost surely and therefore

[TABLE]
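The almost sure limit $\frac{t_{c}-t_{d}}{c_{t}-d_{t}}(\frac{d_{t}}{t_{d}})^{u}\to n^{1-u}$ can be checked by simulating the index sequence. This is a sketch under hypothetical parameter values ( $n=5$ , $u=0.99$ , $\nu=0.965$ ), tracking the draws of sample $(0)$ only; the function name is not from the paper.

```python
import math
import random


def ratio_statistic(num_iters, n=5, u=0.99, nu=0.965, seed=0):
    """Empirical value of (t_c - t_d) / (c - d) * (d / t_d) ** u for sample 0."""
    rng = random.Random(seed)
    # Iterations (1-based) at which sample 0 is drawn, under uniform sampling.
    times = [t for t in range(1, num_iters + 1) if rng.randrange(n) == 0]
    c = len(times)
    d = c - math.floor(c ** nu)
    t_c, t_d = times[c - 1], times[d - 1]
    return (t_c - t_d) / (c - d) * (d / t_d) ** u


value = ratio_statistic(100_000)
limit = 5 ** (1 - 0.99)
print(value, limit)
```

The first factor is an average of $c-d$ inter-draw gaps, each geometric with mean $n$ , and the second factor tends to $n^{-u}$ , so the product approaches $n^{1-u}$ .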

Finally, from the dominated convergence theorem, ${\mathbb{E}}[\frac{t_{c}-t_{d}}{c_{t}-d_{t}}(\frac{d_{t}}{t_{d}})^{u}]\to n^{1-u}$ for $t\to\infty$ . We can use the Cauchy-Schwarz inequality and write

[TABLE]

where $C^{\prime}$ is a constant independent of $t$ . Then

[TABLE]

Combined with (B-D3), this shows that ${\mathbb{E}}[{\|\boldsymbol{\beta}_{t}-\boldsymbol{\beta}_{t}^{\star}\|}_{2}]\in\mathcal{O}({(c_{t})}^{2(u-1)-\eta})$ . As $c_{t}$ follows a binomial distribution of parameter $(t,\frac{1}{n})$ , $\frac{c_{t}}{t}\to\frac{1}{n}$ almost surely when $t\to\infty$ . Therefore ${\mathbb{E}}[(\frac{c_{t}}{t})^{2(u-1)-\eta}]\to n^{\eta-2(u-1)}$ , and from the Cauchy-Schwarz inequality,

[TABLE]

We have reused the fact that converging sequences are bounded. This is enough to conclude.

Bibliography


[1] J. Mairal, “Sparse Modeling for Image and Vision Processing,” Foundations and Trends in Computer Graphics and Vision, vol. 8, no. 2-3, pp. 85–283, 2014.
[2] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in Advances in Neural Information Processing Systems, 2004, pp. 1329–1336.
[3] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[4] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global Vectors for Word Representation,” in Proc. Conf. EMNLP, vol. 14, 2014, pp. 1532–43.
[5] O. Levy and Y. Goldberg, “Neural word embedding as implicit matrix factorization,” in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.
[6] Y. Zhang, M. Roughan, W. Willinger, and L. Qiu, “Spatio-Temporal Compressive Sensing and Internet Traffic Matrices,” 2009.
[7] H. Kim and H. Park, “Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis,” Bioinformatics, vol. 23, no. 12, pp. 1495–1502, 2007.
[8] G. Varoquaux et al., “Multi-subject dictionary learning to segment an atlas of brain spontaneous activity,” in Proc. IPMI Conf., 2011, pp. 562–573.