Supervised Stochastic Gradient Algorithms for Multi-Trial Source Separation

Ronak Mehta; Mateus Piovezan Otto; Noah Stanis; Azadeh Yazdan-Shahmorad; Zaid Harchaoui

arXiv:2508.20618·cs.LG·August 29, 2025


TL;DR

This paper introduces a supervised stochastic gradient algorithm for multi-trial source separation, enhancing success rates and interpretability by integrating supervision and joint learning in ICA.

Contribution

It presents a novel supervised stochastic gradient method for ICA that combines proximal gradient updates with backpropagation for joint learning.

Findings

01 Increased success rate of non-convex optimization due to supervision

02 Improved interpretability of independent components

03 Validated on synthetic and real data experiments

Abstract

We develop a stochastic algorithm for independent component analysis that incorporates multi-trial supervision, which is available in many scientific contexts. The method blends a proximal gradient-type algorithm in the space of invertible matrices with joint learning of a prediction model through backpropagation. We illustrate the proposed algorithm on synthetic and real data experiments. In particular, owing to the additional supervision, we observe an increased success rate of the non-convex optimization and the improved interpretability of the independent components.

Equations

$$\mathbf{z}=\mathbf{A}\mathbf{s}.$$

$$\min_{\mathbf{W},\boldsymbol{\theta}}\ \frac{1}{N}\sum_{i=1}^{N}\Big[\ell_{0}(\mathbf{W},\mathbf{z}_{i})+\lambda\sum_{m=1}^{M}\ell_{m}(\mathbf{W}_{m\cdot}^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m})\Big],$$

$$\ell_{0}(\mathbf{W},\mathbf{z})=L(\mathbf{W})+\frac{1}{T}\sum_{c=1}^{C}\sum_{t=1}^{T}g([\mathbf{W}\mathbf{z}]_{c,t}),$$

$$\sum_{c=1}^{C}\sum_{t=1}^{T}g([\mathbf{W}\mathbf{z}]_{c,t})=\min_{\mathbf{u}\geq\operatorname{\mathbf{0}}_{C\times T}}\ \sum_{c=1}^{C}\sum_{t=1}^{T}\frac{1}{2}u_{c,t}[\mathbf{W}\mathbf{z}]_{c,t}^{2}+f(u_{c,t}),$$

$$F(\mathbf{W},\boldsymbol{\theta},\mathbf{U}):=L(\mathbf{W})+\frac{1}{N}\sum_{i=1}^{N}\Big[\sum_{c=1}^{C}\sum_{t=1}^{T}\frac{1}{2}U_{i,c,t}[\mathbf{W}\mathbf{z}_{i}]_{c,t}^{2}+f(U_{i,c,t})+\lambda\sum_{m=1}^{M}\ell_{m}(\mathbf{W}_{m\cdot}^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m})\Big],$$

$$\mathbf{U}^{(k)}\leftarrow\operatorname*{arg\,min}_{\mathbf{U}\geq\operatorname{\mathbf{0}}_{N\times C\times T}}F(\mathbf{W}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{U}).$$

$$\boldsymbol{\theta}_{m}^{(k)}\leftarrow\operatorname{GradientBasedUpdate}(\boldsymbol{\theta}_{m}^{(k-1)},\eta_{\mathrm{p}},\mathbf{g}^{(k-1)})$$

$$\mathbf{g}_{m}^{(k)}:=\nabla_{\boldsymbol{\theta}_{m}}F(\mathbf{W}^{(k-1)},\boldsymbol{\theta},\mathbf{U}^{(k)})\big|_{\boldsymbol{\theta}_{m}=\boldsymbol{\theta}^{(k-1)}_{m}}=\frac{1}{N}\sum_{i=1}^{N}\nabla_{\boldsymbol{\theta}_{m}}\ell_{m}((\mathbf{W}^{(k-1)}_{m\cdot})^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m})\big|_{\boldsymbol{\theta}_{m}=\boldsymbol{\theta}^{(k-1)}_{m}}.$$

$$\boldsymbol{\theta}_{m}^{(k)}=(1-\eta_{\mathrm{p}}\mu)\boldsymbol{\theta}_{m}^{(k-1)}-\eta_{\mathrm{p}}\lambda\frac{\mathbf{m}_{m}^{(k)}}{(\mathbf{v}_{m}^{(k)})^{1/2}+\epsilon}$$

$$\mathbf{W}^{(k,c)}\leftarrow\operatorname*{arg\,min}_{\mathbf{W}:\ \mathbf{W}_{j\cdot}=\mathbf{W}_{j\cdot}^{(k,c-1)},\ j\neq c}F^{(k,c-1)}(\mathbf{W},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)}),$$

$$F(\mathbf{W},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})=\underbrace{L(\mathbf{W})+\frac{1}{2}\sum_{c=1}^{C}\mathbf{W}_{c\cdot}^{\top}\mathbf{A}_{c}^{(k)}\mathbf{W}_{c\cdot}}_{\text{unsupervised component}}+\underbrace{\frac{\lambda}{N}\sum_{i=1}^{N}\sum_{m=1}^{M}\ell_{m}(\mathbf{W}_{m\cdot}^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m}^{(k)})}_{\text{supervised component}}+\mathrm{const}(\mathbf{W}),$$

$$\mathbf{A}_{c}^{(k)}:=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}U_{c,i,t}^{(k)}\mathbf{z}_{i,t}\mathbf{z}_{i,t}^{\top}.$$

$$F^{(k,c-1)}(\mathbf{W},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)}):=L(\mathbf{W})+\frac{1}{2}\sum_{c=1}^{C}\mathbf{W}_{c\cdot}^{\top}\mathbf{A}_{c}^{(k)}\mathbf{W}_{c\cdot}+\lambda\langle\mathbf{B}_{c}^{(k)},\mathbf{W}\rangle+\frac{1}{2\eta_{\mathrm{u}}}\lVert\mathbf{W}-\mathbf{W}^{(k,c-1)}\rVert_{\mathrm{F}}^{2},$$

$$\mathbf{B}_{c}^{(k)}:=\frac{1}{N}\sum_{i=1}^{N}\begin{bmatrix}\nabla_{\mathbf{s}}\ell_{1}(\mathbf{s},y_{i,1},\boldsymbol{\theta}_{1}^{(k)})\big|_{\mathbf{s}=(\mathbf{W}_{1\cdot}^{(k,c-1)})^{\top}\mathbf{z}_{i}}\\ \vdots\\ \nabla_{\mathbf{s}}\ell_{M}(\mathbf{s},y_{i,M},\boldsymbol{\theta}_{M}^{(k)})\big|_{\mathbf{s}=(\mathbf{W}_{M\cdot}^{(k,c-1)})^{\top}\mathbf{z}_{i}}\\ \operatorname{\mathbf{0}}_{(C-M)\times T}\end{bmatrix}\mathbf{z}_{i}^{\top},$$

$$\mathbf{W}^{(k,c)}=(\mathbf{e}_{1:c-1},\mathbf{r}_{c},\mathbf{e}_{c+1:C})^{\top}\mathbf{W}^{(k,c-1)}$$

$$L(\mathbf{W})=L((\mathbf{e}_{1:c-1},\mathbf{r}_{c},\mathbf{e}_{c+1:C})^{\top}\mathbf{W}^{(k,c-1)})=-\log\left|r_{c,c}\right|+L(\mathbf{W}^{(k,c-1)}),$$

$$\mathbf{r}_{c}:=\mathbf{K}^{-1}\Big(\frac{1}{r_{c,c}}\mathbf{e}_{c}+\mathbf{b}\Big),$$

$$\eta_{\mathrm{u}}\leq\Big[2\lambda\Big(\frac{1}{N}\sum_{i=1}^{N}\lVert\mathbf{z}_{i}\rVert_{2,2}^{2}\Big)\sum_{m=1}^{M}L_{m}^{2}\Big]^{-1}\quad\text{and}\quad\eta_{\mathrm{p}}\leq(L_{\boldsymbol{\theta}}+\mu)^{-1},$$

$$F_{\mu}(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})\leq F_{\mu}(\mathbf{W}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{U}^{(k-1)}).$$

$$\mathbf{U}^{(k)}\leftarrow\operatorname*{arg\,min}_{\mathbf{U}\geq\operatorname{\mathbf{0}}_{N\times C\times T}}\Big\{F(\mathbf{W}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{U})+\frac{1}{2\eta_{\mathrm{a}}}\sum_{i,c,t}(U_{i,c,t}-U_{i,c,t}^{(k-1)})^{2}\Big\},$$

$$\Psi(\mathbf{U},\boldsymbol{\theta},\mathbf{W})$$

$$\mathbf{g}_{m}^{(k)}$$

$$\mathbf{A}_{c}^{(k)}$$

$$\mathbf{B}_{c}^{(k)}$$

$$O(n(d+MCT+\tau C^{3})).$$

$$\mathrm{AD}(\mathbf{W}^{-1},\mathbf{A})=\sum_{j=1}^{C}\Big[\Big(\sum_{c=1}^{C}\frac{R_{jc}}{\max_{c'}R_{jc'}}-1\Big)+\Big(\sum_{c=1}^{C}\frac{R_{cj}}{\max_{c'}R_{c'j}}-1\Big)\Big],$$

$$h(\mathbf{r}):=\frac{1}{2}\mathbf{r}^{\top}\mathbf{K}\mathbf{r}-\log\left|r_{j}\right|-\langle\mathbf{b},\mathbf{r}\rangle,$$

$$\mathbf{r}^{\star}:=\mathbf{K}^{-1}\Big(\frac{1}{r_{j}^{\star}}\mathbf{e}_{c}+\mathbf{b}\Big),$$

$$\nabla h(\mathbf{r})=\mathbf{K}\mathbf{r}-\frac{1}{r_{j}}\mathbf{e}_{j}-\mathbf{b}=\mathbf{0}$$

$$\mathbf{r}=\mathbf{K}^{-1}\Big(\frac{1}{r_{j}}\mathbf{e}_{j}+\mathbf{b}\Big).$$

$$r_{j}^{2}$$


Taxonomy

Topics: Speech and Audio Processing · Blind Source Separation Techniques · Water Systems and Optimization

Full text


Ronak Mehta¹, Mateus Piovezan Otto¹, Noah Stanis², Azadeh Yazdan-Shahmorad², Zaid Harchaoui¹

¹ Department of Statistics, University of Washington
² Department of Bioengineering, University of Washington

(August 28, 2025)


1 Introduction

A fundamental inverse problem in statistical/machine learning, signal processing, and time series analysis is independent component analysis, or ICA (Hyvärinen et al., 2001; Cichocki and Amari, 2002; Comon and Jutten, 2010). Given a signal $\mathbf{z}\in\mathbb{R}^{C\times T}$ with $C$ components/channels and $T$ samples, the scientist pursues a source signal $\mathbf{s}\in\mathbb{R}^{C\times T}$ and an invertible mixing matrix $\mathbf{A}\in\mathbb{R}^{C\times C}$ such that 1) the rows of $\mathbf{s}$ are realizations of independent stochastic processes, and 2) the equality

$$\mathbf{z}=\mathbf{A}\mathbf{s} \tag{1}$$

is satisfied. Equivalently, we seek an invertible unmixing matrix $\mathbf{W}\in\mathbb{R}^{C\times C}$ which recovers $\mathbf{A}^{-1}$ up to permutations and non-zero scalings of the rows (which do not affect the independence criterion). Applications include identifying individual voices/tracks in audio or disentangling correlated electrical activity from the brain. Often viewed as a data exploration/preprocessing technique, the independent sources are interpreted as underlying “drivers” of the signal, whereas the mixing matrix summarizes its connectivity/correlation structure via a simple linear transformation. Despite its widespread use in the natural and data sciences, the fully unsupervised setting of ICA limits its interpretability (ostensibly its main benefit over complex, nonlinear separation functions), as the contribution of each source may not be easily understood in a scientifically meaningful way. As an example of such an understanding, one may consider the identities of the voices in audio or the biological/behavioral function of the brain signals mentioned above. Moreover, from a technical viewpoint, ICA algorithms often hinge upon solving non-convex optimization problems over the space of invertible matrices, for which even state-of-the-art algorithms may fail consistently. Given the motivation from both a methods and an applications perspective, we develop in this paper a stochastic algorithm for solving the Infomax variant (Bell and Sejnowski, 1995; Lee et al., 1999; Amari, 1999) of ICA that incorporates multi-trial supervision in a flexible, model-agnostic fashion.
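The permutation and scaling ambiguity described above can be made concrete with a small numerical sketch. The sizes, the Laplace-distributed sources, and the particular permutation and scaling are illustrative assumptions and are not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 3, 1000                       # channels and samples (illustrative sizes)

# Hypothetical sources: rows are independent Laplace (super-Gaussian) draws.
s = rng.laplace(size=(C, T))
A = np.eye(C) + 0.5 * rng.normal(size=(C, C))  # mixing matrix, assumed invertible
z = A @ s                                       # observed signal, z = A s

# Any row-permuted, row-rescaled version of A^{-1} unmixes equally well.
P = np.eye(C)[[2, 0, 1]]             # permutation matrix
D = np.diag([2.0, -0.5, 3.0])        # non-zero scalings
W = P @ D @ np.linalg.inv(A)

s_hat = W @ z                        # recovered sources
expected = P @ D @ s                 # true sources, permuted and rescaled
max_err = np.max(np.abs(s_hat - expected))
```

The rows of `s_hat` are still independent, which is why the criterion cannot distinguish `W` from the exact inverse of `A`.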

Precisely, consider a dataset of $N$ observations $(\mathbf{z}_{1},\mathbf{y}_{1}),\ldots,(\mathbf{z}_{N},\mathbf{y}_{N})$ , where each $\mathbf{z}_{i}\in\mathbb{R}^{C\times T}$ is a multivariate signal and each $\mathbf{y}_{i}=(y_{i,1},\ldots,y_{i,M})$ is a collection of discrete or continuous labels. We seek an unmixing matrix $\mathbf{W}$ that can be applied to all signals, using the given labels as additional guidance to uncover the sources. This formalization is heavily motivated by the neuroscience application mentioned above, as in this setting, signals are often collected via multiple trials with possibly different conditions, interventions, and behaviors of the individual being measured. The aforementioned interpretability goal may be realized if (some number of) independent sources have a direct correspondence to each supervision label. Accordingly, assuming that $M\leq C$ , we consider optimization problems of the form

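$$\min_{\mathbf{W},\boldsymbol{\theta}}\ \frac{1}{N}\sum_{i=1}^{N}\Big[\ell_{0}(\mathbf{W},\mathbf{z}_{i})+\lambda\sum_{m=1}^{M}\ell_{m}(\mathbf{W}_{m\cdot}^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m})\Big], \tag{2}$$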

where $\ell_{0}$ is the unsupervised loss that promotes independence between the sources, $\ell_{1},\ldots,\ell_{M}$ denote supervised losses that promote dependence between individual sources and labels, $\mathbf{W}_{m\cdot}$ denotes the $m$ -th row of $\mathbf{W}$ , $\boldsymbol{\theta}=(\boldsymbol{\theta}_{1},\ldots,\boldsymbol{\theta}_{M})$ denotes the parameters of predictive models (such as linear transformations or neural networks), and finally, $\lambda\geq 0$ is a balancing hyperparameter. In general, the objective may be non-convex in both $\mathbf{W}$ and $\boldsymbol{\theta}$ , and each $\ell_{m}$ is only assumed to be differentiable in its first and third arguments.

Contributions

We propose a stochastic optimization algorithm for (2) which combines a proximal gradient-type update in $\mathbf{W}$ (effectively handling the invertibility constraint) and a generic gradient-based update scheme of the user’s choice for $\boldsymbol{\theta}$. In particular, one may use the Adam/AdamW class of algorithms (Loshchilov and Hutter, 2019), popularly applied in neural network training, to update $\boldsymbol{\theta}$ (in our experiments, we find that adaptive updates improve performance even when the objective is convex in $\boldsymbol{\theta}$). We prove a monotonicity guarantee and, under additional assumptions, convergence to a stationary point of the objective. On experiments with synthetic data, we demonstrate the ability of supervision to decrease the failure rate of non-convex optimization trajectories, as illustrated in Fig. 1. On real neural data collected from non-human primates, we demonstrate the ability of the individual sources to retain scientific information related to the experimental protocol and the animal’s behavior during each trial. We derive the algorithm in Sec. 2, provide experiments on synthetic and real neural data in Sec. 3, and give concluding remarks in Sec. 4.

Related Work

Both direct gradient and natural/Riemannian gradient approaches have been employed for solving ICA under the Infomax principle, often facing convergence issues due to the likelihood-based loss $\ell_{0}$ including a log-determinant component (Montoya-Martínez et al., 2017). While ICA algorithms that use a weighted sum of unsupervised and supervised components in the objective have been explored previously, they either apply gradient-based methods to $\mathbf{W}$ that inherit the aforementioned convergence issues or are model-specific with respect to $\boldsymbol{\theta}$ , i.e., place strong assumptions on the supervised losses $\ell_{1},\ldots,\ell_{M}$ (Chen and He, 2002; Kotani et al., 2004; Takabatake et al., 2007). Other approaches for learning $\mathbf{W}$ exist, such as fixed-point/matrix decomposition schemes, which also may not adapt to arbitrary supervision models (Zou et al., 2022; Su et al., 2024). We maintain generic gradient-based optimization via backpropagation for $\boldsymbol{\theta}$ , which is undisputably effective for supervised learning problems. For the $\mathbf{W}$ update, we are particularly inspired by the majorization-minimization class of approaches (Ono, 2011; Ablin et al., 2019; Scheibler and Ono, 2020; Brendel et al., 2020; Brendel and Kellermann, 2021; Ikeshita et al., 2021; Ikeshita and Nakatani, 2022), of which Ablin et al. (2019) minimizes a per-iteration objective row-by-row. We generalize their per-iteration objective both by incorporating a proximity term to stabilize the $\mathbf{W}$ trajectory and a linearized supervision term, which can be computed via backpropagation. A proximal approach for the related problem of Gaussian independent vector analysis has been explored (Cosserat et al., 2023), but not for ICA with a generic likelihood and supervision model. 
On the applied side, the usage of independent component/vector analysis on neural and cognition data is thoroughly established (Lehmann et al., 2022; Moraes et al., 2023; Yang et al., 2023; Fouladivanda et al., 2023; Heurtebise et al., 2023; Keding et al., 2024; Laport et al., 2024; Vu et al., 2024; Gjølbye et al., 2024; Hu et al., 2025). Although we focus on interpreting linearly mixed sources in this paper, nonlinear deep learning-based variants of ICA have also gained interest recently (Nguyen et al., 2021; Li et al., 2022; Narisawa et al., 2021; Hermann et al., 2022; Romano et al., 2023).

2 Methods

Variational Objective

To specify the method, we first write a variational form of the objective (2), which introduces a third set of variables we denote as $\mathbf{U}\in\mathbb{R}^{N\times C\times T}$ . The algorithm jointly minimizes over $(\mathbf{W},\boldsymbol{\theta},\mathbf{U})$ , and is fully specified by first providing full batch updates for each variable, and then describing stochastic estimates thereof. To proceed, we require the following mild assumption on our unmixing loss function $\ell_{0}$ .

Assumption 1 (Super-Gaussian Likelihood).

For any invertible $\mathbf{W}\in\mathbb{R}^{C\times C}$ and $\mathbf{z}\in\mathbb{R}^{C\times T}$ , it holds that

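$$\ell_{0}(\mathbf{W},\mathbf{z})=L(\mathbf{W})+\frac{1}{T}\sum_{c=1}^{C}\sum_{t=1}^{T}g([\mathbf{W}\mathbf{z}]_{c,t}),$$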

where $x\mapsto e^{-g(x)}$ is integrable, $x\mapsto g(\sqrt{x})$ is increasing and concave on $(0,\infty)$ , and $L(\mathbf{W}):=-\log\left|\operatorname{det}\left(\mathbf{W}\right)\right|$ .

Under Asm. 1, a constant can be added to $g$ so that, without loss of generality, $e^{-g(\cdot)}$ is a probability density function for a super-Gaussian random variable by definition. This includes the Laplace density $g(x)\sim\left|x\right|$ or the Huber density $g(x)\sim h(x)$, where $h(x)=\frac{1}{2}x^{2}$ for $\left|x\right|\leq 1$ and $h(x)=\left|x\right|-1/2$ otherwise. Such a function would naturally arise when using the density transformation formula to express the likelihood of each entry $z_{c,t}$ of $\mathbf{z}$ using the likelihood of each entry of the candidate source signal $\mathbf{W}\mathbf{z}$. In turn, Palmer et al. (2005, Theorem 1) grants the variational form

$$\sum_{c=1}^{C}\sum_{t=1}^{T}g([\mathbf{W}\mathbf{z}]_{c,t})=\min_{\mathbf{u}\geq\operatorname{\mathbf{0}}_{C\times T}}\ \sum_{c=1}^{C}\sum_{t=1}^{T}\frac{1}{2}u_{c,t}[\mathbf{W}\mathbf{z}]_{c,t}^{2}+f(u_{c,t}), \tag{3}$$

where $\operatorname{\mathbf{0}}_{C\times T}$ denotes the matrix of zeros in $\mathbb{R}^{C\times T}$, and $f:[0,\infty)\rightarrow\mathbb{R}$ is a convex function whose particular form is unimportant for our purposes. Combining (2), (3), and Asm. 1, we derive the equivalent problem to (2) of minimizing

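$$F(\mathbf{W},\boldsymbol{\theta},\mathbf{U}):=L(\mathbf{W})+\frac{1}{N}\sum_{i=1}^{N}\Big[\sum_{c=1}^{C}\sum_{t=1}^{T}\frac{1}{2}U_{i,c,t}[\mathbf{W}\mathbf{z}_{i}]_{c,t}^{2}+f(U_{i,c,t})+\lambda\sum_{m=1}^{M}\ell_{m}(\mathbf{W}_{m\cdot}^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m})\Big] \tag{4}$$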

over $\mathbf{W}$ , $\boldsymbol{\theta}$ , and $\mathbf{U}$ , which ranges over the non-negative tensors within $\mathbb{R}^{N\times C\times T}$ . We proceed to describe the updates for the three sets of variables, defining the sequence $(\mathbf{U}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{W}^{(k)})_{k\geq 0}$ . They are performed in a block coordinate-wise manner, ordered by the auxiliary variables $\mathbf{U}^{(k)}$ first, the model parameters $\boldsymbol{\theta}^{(k)}$ second, and the unmixing matrix $\mathbf{W}^{(k)}$ last. The majority of the complexity lies in the $\mathbf{W}^{(k)}$ update, whereas the other variables can be updated using exact minimizations or gradient-based updates.
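As a minimal sketch of the unsupervised loss in Asm. 1, the following evaluates $L(\mathbf{W})+\frac{1}{T}\sum_{c,t}g([\mathbf{W}\mathbf{z}]_{c,t})$ for a single signal, assuming the Huber choice of $g$ mentioned above. The function names `huber` and `unsupervised_loss` are hypothetical.

```python
import numpy as np

def huber(x):
    """Huber function: x^2 / 2 for |x| <= 1 and |x| - 1/2 otherwise."""
    a = np.abs(x)
    return np.where(a <= 1.0, 0.5 * x**2, a - 0.5)

def unsupervised_loss(W, z):
    """l_0(W, z) = -log|det W| + (1/T) * sum_{c,t} g([W z]_{c,t}) with g = Huber."""
    _, logabsdet = np.linalg.slogdet(W)   # numerically stable log|det W|
    T = z.shape[1]
    return -logabsdet + huber(W @ z).sum() / T
```

Using `slogdet` rather than `log(abs(det(W)))` avoids overflow or underflow of the determinant for larger $C$.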

Auxiliary Variables and Model Parameters Update

For the auxiliary variables, we simply perform the exact minimization

$$\mathbf{U}^{(k)}\leftarrow\operatorname*{arg\,min}_{\mathbf{U}\geq\operatorname{\mathbf{0}}_{N\times C\times T}}F(\mathbf{W}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{U}). \tag{5}$$

Differentiating both sides of (3) with respect to each $[\mathbf{W}\mathbf{z}]_{c,t}$ yields that the optimum $\mathbf{U}^{(k)}$ is achieved by setting $U_{i,c,t}^{(k)}=g^{\prime}([\mathbf{W}\mathbf{z}_{i}]_{c,t})/[\mathbf{W}\mathbf{z}_{i}]_{c,t}$ with $\mathbf{W}=\mathbf{W}^{(k-1)}$, which can be done in a vectorized manner. As mentioned in Sec. 1, we apply a generic gradient-based update scheme for the model parameter $\boldsymbol{\theta}^{(k)}$. That is, for each $m=1,\ldots,M$, we have
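The vectorized auxiliary update can be sketched as follows, assuming the Huber choice of $g$. In that case $g'(s)/s$ equals $1$ when $|s|\leq 1$ and $1/|s|$ otherwise, so no division by zero arises. The function name `huber_weights` and the array layout are assumptions made for illustration.

```python
import numpy as np

def huber_weights(W, z):
    """Return U with U[i, c, t] = g'(s) / s, where s = [W z_i]_{c,t} and g is Huber.

    W has shape (C, C) and z has shape (N, C, T).
    """
    s = np.einsum("cd,idt->ict", W, z)        # candidate sources, shape (N, C, T)
    return 1.0 / np.maximum(1.0, np.abs(s))   # 1 if |s| <= 1, else 1 / |s|
```

For other super-Gaussian choices of $g$ (for example Laplace), the ratio is unbounded near zero and an implementation would typically need a small floor on $|s|$.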

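$$\boldsymbol{\theta}_{m}^{(k)}\leftarrow\operatorname{GradientBasedUpdate}(\boldsymbol{\theta}_{m}^{(k-1)},\eta_{\mathrm{p}},\mathbf{g}^{(k-1)})$$

for model parameter learning rate $\eta_{\mathrm{p}}>0$ and gradient formula

$$\mathbf{g}_{m}^{(k)}:=\nabla_{\boldsymbol{\theta}_{m}}F(\mathbf{W}^{(k-1)},\boldsymbol{\theta},\mathbf{U}^{(k)})\big|_{\boldsymbol{\theta}_{m}=\boldsymbol{\theta}^{(k-1)}_{m}}=\frac{1}{N}\sum_{i=1}^{N}\nabla_{\boldsymbol{\theta}_{m}}\ell_{m}((\mathbf{W}^{(k-1)}_{m\cdot})^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m})\big|_{\boldsymbol{\theta}_{m}=\boldsymbol{\theta}^{(k-1)}_{m}}. \tag{6}$$

For example, $\boldsymbol{\theta}^{(k)}_{m}=(1-\eta_{\mathrm{p}}\mu)\boldsymbol{\theta}^{(k-1)}_{m}-\eta_{\mathrm{p}}\mathbf{g}^{(k)}_{m}$ represents gradient descent with weight decay parameter $\mu>0$. Algorithms such as AdamW (Loshchilov and Hutter, 2019) can also be employed, with update

$$\boldsymbol{\theta}_{m}^{(k)}=(1-\eta_{\mathrm{p}}\mu)\boldsymbol{\theta}_{m}^{(k-1)}-\eta_{\mathrm{p}}\lambda\frac{\mathbf{m}_{m}^{(k)}}{(\mathbf{v}_{m}^{(k)})^{1/2}+\epsilon}$$

for momentum-based stochastic estimate $\mathbf{m}^{(k)}_{m}$ of (6), variance pre-conditioner $\mathbf{v}^{(k)}_{m}$, and tolerance parameter $\epsilon>0$. The division and square root operations are interpreted element-wise. We employ this update in the experiments shown in Sec. 3.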
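A minimal sketch of one AdamW-style step of the form above is given below. The exponential-moving-average coefficients `beta1` and `beta2`, the default values, and the omission of bias correction are assumptions made for brevity and are not specified in the text.

```python
import numpy as np

def adamw_step(theta, grad, m, v, lr=1e-3, mu=1e-2, lam=1.0,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One decoupled-weight-decay adaptive step (no bias correction, by assumption).

    theta, grad, m, v are arrays of the same shape; returns the updated triple.
    """
    m = beta1 * m + (1.0 - beta1) * grad       # momentum estimate of the gradient
    v = beta2 * v + (1.0 - beta2) * grad**2    # element-wise second-moment estimate
    theta = (1.0 - lr * mu) * theta - lr * lam * m / (np.sqrt(v) + eps)
    return theta, m, v
```

In practice one would more likely call an existing optimizer from a deep learning library; the sketch only makes the element-wise division and square root explicit.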

Unmixing Matrix Update

We consider a cyclic coordinate-wise procedure in which $\mathbf{W}^{(k)}$ will be defined by a sequence of intermediate values $\mathbf{W}^{(k,0)},\ldots,\mathbf{W}^{(k,C)}$ , such that $\mathbf{W}^{(k,0)}=\mathbf{W}^{(k-1)}$ , $\mathbf{W}^{(k)}=\mathbf{W}^{(k,C)}$ , and $\mathbf{W}^{(k,c)}$ will differ from $\mathbf{W}^{(k,c-1)}$ by updating the $c$ -th row based on minimizing an approximation of our original objective. In particular,

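$$\mathbf{W}^{(k,c)}\leftarrow\operatorname*{arg\,min}_{\mathbf{W}:\ \mathbf{W}_{j\cdot}=\mathbf{W}_{j\cdot}^{(k,c-1)},\ j\neq c}F^{(k,c-1)}(\mathbf{W},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)}), \tag{7}$$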

where $F^{(k,c-1)}$ approximates (4) in the region close to $\mathbf{W}^{(k,c-1)}$ . The definition of $F^{(k,c-1)}$ will enforce that $\mathbf{W}^{(k,c)}$ remains invertible. By describing this per-iteration objective and the means to optimize it, we fully specify the update of the unmixing matrix and the overall algorithm. Three technical ingredients form the basis for this objective function: a linearization of the supervised component, a proximity term to promote closeness to $\mathbf{W}^{(k,c-1)}$ , and a reparametrization that allows for the minimization to be solved in closed form.

To describe the first two components, we first write the original objective in a simplified manner using

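$$F(\mathbf{W},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})=\underbrace{L(\mathbf{W})+\frac{1}{2}\sum_{c=1}^{C}\mathbf{W}_{c\cdot}^{\top}\mathbf{A}_{c}^{(k)}\mathbf{W}_{c\cdot}}_{\text{unsupervised component}}+\underbrace{\frac{\lambda}{N}\sum_{i=1}^{N}\sum_{m=1}^{M}\ell_{m}(\mathbf{W}_{m\cdot}^{\top}\mathbf{z}_{i},y_{i,m},\boldsymbol{\theta}_{m}^{(k)})}_{\text{supervised component}}+\mathrm{const}(\mathbf{W}), \tag{8}$$

where $\mathrm{const}(\cdot)$ is independent of its input and $\mathbf{A}_{1}^{(k)},\dots,\mathbf{A}_{C}^{(k)}$ are $C$ square matrices defined by

$$\mathbf{A}_{c}^{(k)}:=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}U_{c,i,t}^{(k)}\mathbf{z}_{i,t}\mathbf{z}_{i,t}^{\top}. \tag{9}$$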
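The $C$ weighted second-moment matrices $\mathbf{A}_{1}^{(k)},\ldots,\mathbf{A}_{C}^{(k)}$ can be formed in one contraction. The sketch below assumes the auxiliary tensor is stored with layout `(N, C, T)`, matching $\mathbf{U}\in\mathbb{R}^{N\times C\times T}$, and that $\mathbf{z}_{i,t}\in\mathbb{R}^{C}$ is the $t$-th column of $\mathbf{z}_{i}$; the function name is hypothetical.

```python
import numpy as np

def weighted_covariances(U, z):
    """Return A of shape (C, C, C) with A[c] = (1/(N T)) sum_{i,t} U[i,c,t] z_{i,t} z_{i,t}^T.

    U and z both have shape (N, C, T).
    """
    N, _, T = z.shape
    # Index c selects the weight channel; d and e index the outer product of z_{i,t}.
    return np.einsum("ict,idt,iet->cde", U, z, z) / (N * T)
```

Each `A[c]` is symmetric and positive semi-definite whenever the weights are non-negative, which is the setting assumed by the row update that follows.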

The supervised portion of (8) can be linearized by its matrix derivative. Letting ${\left\langle\mathbf{A},\mathbf{B}\right\rangle}:=\operatorname{tr}(\mathbf{A}^{\top}\mathbf{B})$ for $\mathbf{A},\mathbf{B}\in\mathbb{R}^{C\times C}$ , we specify (7) by defining

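$$F^{(k,c-1)}(\mathbf{W},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)}):=L(\mathbf{W})+\frac{1}{2}\sum_{c=1}^{C}\mathbf{W}_{c\cdot}^{\top}\mathbf{A}_{c}^{(k)}\mathbf{W}_{c\cdot}+\lambda\langle\mathbf{B}_{c}^{(k)},\mathbf{W}\rangle+\frac{1}{2\eta_{\mathrm{u}}}\lVert\mathbf{W}-\mathbf{W}^{(k,c-1)}\rVert_{\mathrm{F}}^{2}, \tag{10}$$

for the square matrix

$$\mathbf{B}_{c}^{(k)}:=\frac{1}{N}\sum_{i=1}^{N}\begin{bmatrix}\nabla_{\mathbf{s}}\ell_{1}(\mathbf{s},y_{i,1},\boldsymbol{\theta}_{1}^{(k)})\big|_{\mathbf{s}=(\mathbf{W}_{1\cdot}^{(k,c-1)})^{\top}\mathbf{z}_{i}}\\ \vdots\\ \nabla_{\mathbf{s}}\ell_{M}(\mathbf{s},y_{i,M},\boldsymbol{\theta}_{M}^{(k)})\big|_{\mathbf{s}=(\mathbf{W}_{M\cdot}^{(k,c-1)})^{\top}\mathbf{z}_{i}}\\ \operatorname{\mathbf{0}}_{(C-M)\times T}\end{bmatrix}\mathbf{z}_{i}^{\top}, \tag{11}$$

unmixing learning rate $\eta_{\mathrm{u}}>0$, and Frobenius norm $\left\lVert\cdot\right\rVert_{\mathrm{F}}$. To motivate the third and final ingredient, notice the difficulty of optimizing the non-smooth, non-convex objective (10) resulting from the log-determinant term $L(\mathbf{W})$. To handle this, we follow a reparametrization trick similar to Ablin et al. (2019, Lemma 3). Let $\mathbf{e}_{l:m}\in\mathbb{R}^{C\times(m-l+1)}$ denote the matrix containing the $l$-th through $m$-th standard basis vectors along its columns, and notice that due to invertibility of $\mathbf{W}^{(k,c-1)}$, it holds that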

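$$\mathbf{W}^{(k,c)}=(\mathbf{e}_{1:c-1},\mathbf{r}_{c},\mathbf{e}_{c+1:C})^{\top}\mathbf{W}^{(k,c-1)} \tag{12}$$

for a vector satisfying $(\mathbf{W}^{(k,c)}_{c\cdot})^{\top}=\mathbf{r}_{c}^{\top}\mathbf{W}^{(k,c-1)}$, or equivalently, $\mathbf{r}_{c}=((\mathbf{W}^{(k,c-1)})^{\top})^{-1}\mathbf{W}^{(k,c)}_{c\cdot}\in\mathbb{R}^{C}$. By substituting $\mathbf{W}$ in (10) with the right-hand side of (12), we can solve the per-iteration problem (7) by optimizing for $\mathbf{r}_{c}$ directly over $\mathbb{R}^{C}$. To hint as to why this would be useful, observe that, since $L(\mathbf{W})=-\log\left|\operatorname{det}(\mathbf{W})\right|$ and the determinant of $(\mathbf{e}_{1:c-1},\mathbf{r}_{c},\mathbf{e}_{c+1:C})$ is $r_{c,c}$,

$$L(\mathbf{W})=L((\mathbf{e}_{1:c-1},\mathbf{r}_{c},\mathbf{e}_{c+1:C})^{\top}\mathbf{W}^{(k,c-1)})=-\log\left|r_{c,c}\right|+L(\mathbf{W}^{(k,c-1)}),$$

where $r_{c,c}\neq 0$ for any feasible $\mathbf{W}$, reducing this term to a univariate function in the decision variables. The exact solution to (7) is given by Prop. 1 below, where we also define the standard basis vector $\mathbf{e}_{c}:=\mathbf{e}_{c:c}$.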

Proposition 1.

Assume that $\mathbf{A}_{c}^{(k)}$ is positive definite for all $c\in[C]$ . Each $\mathbf{r}_{c}$ from (12) is given by

$$\mathbf{r}_{c}:=\mathbf{K}^{-1}\Big(\frac{1}{r_{c,c}}\mathbf{e}_{c}+\mathbf{b}\Big),$$

for $\mathbf{K}:=\mathbf{W}^{(k,c-1)}(\mathbf{A}_{c}^{(k)}+\eta_{\mathrm{u}}^{-1}\mathbf{I})(\mathbf{W}^{(k,c-1)})^{\top}$, $\mathbf{b}:=\mathbf{W}^{(k,c-1)}(\eta_{\mathrm{u}}^{-1}\mathbf{W}_{c\cdot}^{(k,c-1)}-(\mathbf{B}_{c}^{(k)})_{c\cdot})$, and $r_{c,c}:=\sqrt{\mathbf{K}^{-1}_{cc}+\tfrac{1}{4}(\mathbf{K}^{-1}\mathbf{b})_{c}^{2}}+\tfrac{1}{2}(\mathbf{K}^{-1}\mathbf{b})_{c}$.
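The closed-form row update of Prop. 1 can be sketched in a few lines. The function name `row_update`, the zero-based row index `c`, and the choice to return the intermediate quantities are assumptions for illustration; `b_row` stands for the $c$-th row of $\mathbf{B}_{c}^{(k)}$, and `A_c` is assumed positive definite as in the proposition.

```python
import numpy as np

def row_update(W_prev, A_c, b_row, c, eta_u):
    """Update row c of the unmixing matrix in closed form; other rows are unchanged."""
    C = W_prev.shape[0]
    K = W_prev @ (A_c + np.eye(C) / eta_u) @ W_prev.T
    b = W_prev @ (W_prev[c] / eta_u - b_row)
    K_inv = np.linalg.inv(K)
    Kb = K_inv @ b
    # Positive root of r^2 - (K^{-1} b)_c r - (K^{-1})_{cc} = 0.
    r_cc = np.sqrt(K_inv[c, c] + 0.25 * Kb[c] ** 2) + 0.5 * Kb[c]
    r = K_inv[:, c] / r_cc + Kb            # r = K^{-1}(e_c / r_cc + b)
    W_new = W_prev.copy()
    W_new[c] = W_prev.T @ r                # row c satisfies W_c^T = r^T W_prev
    return W_new, r, K, b
```

Because $r_{c,c}>0$ whenever $\mathbf{K}$ is positive definite, the determinant of the updated matrix is a non-zero multiple of the previous one, so invertibility is preserved. For larger $C$, solving linear systems with `np.linalg.solve` instead of forming `K_inv` would be the more stable design.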

As an important simplification, notice that the update given by Prop. 1 only depends on $\mathbf{B}_{c}^{(k)}$ through its $c$-th row, which, inspecting (11), in turn depends on $\mathbf{W}_{c\cdot}^{(k,c-1)}$. However, it always holds that $\mathbf{W}_{c\cdot}^{(k,c-1)}=\mathbf{W}_{c\cdot}^{(k,0)}=\mathbf{W}_{c\cdot}^{(k-1)}$, because the $c$-th row of $\mathbf{W}^{(k-1)}$ has not yet been updated. Thus, one can compute the matrix $\mathbf{B}_{1}^{(k)}$ only once and use the same one for all updates, which is reflected in Alg. 1.
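To illustrate the linearization matrix, the sketch below computes it for a hypothetical supervised model that is not taken from the paper: a linear readout with squared loss, $\ell_{m}(\mathbf{s},y,\boldsymbol{\theta}_{m})=\frac{1}{2}(\boldsymbol{\theta}_{m}^{\top}\mathbf{s}-y)^{2}$ with $\boldsymbol{\theta}_{m}\in\mathbb{R}^{T}$. For a general model, the gradient with respect to the source would come from backpropagation instead of the explicit formula used here.

```python
import numpy as np

def supervised_loss(W, z, y, theta):
    """(1/N) sum_i sum_m 0.5 * (theta_m^T (W_m z_i) - y_{i,m})^2 (hypothetical model)."""
    M = y.shape[1]
    s = np.einsum("mc,ict->imt", W[:M], z)           # supervised sources, (N, M, T)
    resid = np.einsum("imt,mt->im", s, theta) - y    # prediction residuals, (N, M)
    return 0.5 * np.mean(np.sum(resid**2, axis=1))

def supervised_gradient_matrix(W, z, y, theta):
    """C x C matrix whose first M rows are (1/N) sum_i grad_s l_m z_i^T; the rest are zero."""
    N, C, _ = z.shape
    M = y.shape[1]
    s = np.einsum("mc,ict->imt", W[:M], z)
    resid = np.einsum("imt,mt->im", s, theta) - y
    grad_s = resid[:, :, None] * theta[None]         # gradient w.r.t. each source, (N, M, T)
    B = np.zeros((C, C))
    B[:M] = np.einsum("imt,ict->mc", grad_s, z) / N
    return B
```

Since every row is evaluated at the rows of $\mathbf{W}^{(k-1)}$, a single call per outer iteration suffices, in line with the simplification noted above.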

Convergence Analysis

To justify the three updates above, we also derive the conditions under which they yield a monotonically non-increasing sequence of objective values, commonly sought in structured non-convex problems such as ICA and mixture modeling. We then convert this guarantee into an asymptotic convergence analysis of the sequence under assumptions on the optimization trajectory. For concreteness, we consider the gradient descent update $\boldsymbol{\theta}^{(k)}_{m}=(1-\eta_{\mathrm{p}}\mu)\boldsymbol{\theta}^{(k-1)}_{m}-\eta_{\mathrm{p}}\mathbf{g}^{(k)}_{m}$, which applies to the $\ell_{2}$-regularized objective $F_{\mu}(\mathbf{W},\boldsymbol{\theta},\mathbf{U})=F(\mathbf{W},\boldsymbol{\theta},\mathbf{U})+\frac{\mu}{2}\left\lVert\boldsymbol{\theta}\right\rVert_{2}^{2}$. Monotonicity can be achieved when the supervised objective is smooth with respect to the source and parameter.

Assumption 2.

Assume the following for any fixed target $\mathbf{y}=(y_{1},\ldots,y_{M})$, parameter $\boldsymbol{\theta}=(\boldsymbol{\theta}_{1},\ldots,\boldsymbol{\theta}_{M})$, and source component $\mathbf{s}\in\mathbb{R}^{T}$. The function $\mathbf{s}\mapsto\nabla_{\mathbf{s}}\ell_{m}(\mathbf{s},y_{m},\boldsymbol{\theta}_{m})$ is $L_{m}$-Lipschitz continuous w.r.t. $\left\lVert\cdot\right\rVert_{2}$, the function $\boldsymbol{\theta}_{m}\mapsto\nabla_{\boldsymbol{\theta}}\ell_{m}(\mathbf{s},y_{m},\boldsymbol{\theta}_{m})$ is $L_{\boldsymbol{\theta}}$-Lipschitz continuous w.r.t. $\left\lVert\cdot\right\rVert_{2}$, and $\mathbf{s}\mapsto\nabla_{\boldsymbol{\theta}}\ell_{m}(\mathbf{s},y_{m},\boldsymbol{\theta}_{m})$ is $\bar{L}$-Lipschitz continuous w.r.t. $\left\lVert\cdot\right\rVert_{2}$.

This is the only additional assumption required for the descent guarantee of Lem. 2, which is proven in Appx. A.

Lemma 2.

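Under Asm. 1, Asm. 2, and the conditions

$$\eta_{\mathrm{u}}\leq\Big[2\lambda\Big(\frac{1}{N}\sum_{i=1}^{N}\lVert\mathbf{z}_{i}\rVert_{2,2}^{2}\Big)\sum_{m=1}^{M}L_{m}^{2}\Big]^{-1}\quad\text{and}\quad\eta_{\mathrm{p}}\leq(L_{\boldsymbol{\theta}}+\mu)^{-1},$$

we have that for all $k\geq 1$, the inequality

$$F_{\mu}(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})\leq F_{\mu}(\mathbf{W}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{U}^{(k-1)}) \tag{13}$$

holds. Consequently, if $F$ is bounded from below, then $F_{\mu}(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})$ converges to a finite limit $F^{\star}$ as $k\rightarrow\infty$.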
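The step-size condition $\eta_{\mathrm{u}}\leq[2\lambda(\frac{1}{N}\sum_{i}\lVert\mathbf{z}_{i}\rVert_{2,2}^{2})\sum_{m}L_{m}^{2}]^{-1}$ of Lem. 2 can be evaluated directly from the data once the Lipschitz constants are known. The sketch below assumes those constants are supplied by the user, which in practice may require a model-specific bound; the function name is hypothetical.

```python
import numpy as np

def unmixing_step_size_bound(z, lam, lipschitz):
    """Largest eta_u allowed by the descent condition.

    z: signals of shape (N, C, T); lam: balancing hyperparameter lambda > 0;
    lipschitz: sequence (L_1, ..., L_M) of source-gradient Lipschitz constants.
    """
    spec = np.linalg.norm(z, ord=2, axis=(1, 2))    # spectral norm of each z_i
    return 1.0 / (2.0 * lam * np.mean(spec**2) * np.sum(np.square(lipschitz)))
```

Rescaling the signals so that the average squared spectral norm equals one leaves a bound that depends only on $\lambda$ and the Lipschitz constants.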

The result is proven by combining descent inequalities for each update, including each inner loop iteration of updates to the rows of the unmixing matrix. Practically, we see that $\eta_{\mathrm{u}}$ can be scaled using the hyperparameter $\lambda$ and the average spectral norm $\left\lVert\cdot\right\rVert_{2,2}$ of the input signals, which can be normalized during preprocessing. For convergence to a stationary point, we require a number of additional assumptions and a slight modification of the algorithm, which introduces a proximity term to the $\mathbf{U}^{(k)}$ update. First, we define $\left\lVert\mathbf{U}\right\rVert_{\mathrm{F}}^{2}:=\sum_{i,c,t}U_{i,c,t}^{2}$ for $\mathbf{U}\in\mathbb{R}^{N\times C\times T}$ , and we change (5) to read as

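$$\mathbf{U}^{(k)}\leftarrow\operatorname*{arg\,min}_{\mathbf{U}\geq\operatorname{\mathbf{0}}_{N\times C\times T}}\Big\{F(\mathbf{W}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{U})+\frac{1}{2\eta_{\mathrm{a}}}\sum_{i,c,t}(U_{i,c,t}-U_{i,c,t}^{(k-1)})^{2}\Big\}, \tag{14}$$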

which requires knowledge of the function $f$ in (4) in order to implement. We find in experimentation that the proximal variant of the update above performs indistinguishably from the exact minimization, which can be implemented without ever deriving $f$ for a particular likelihood. However, this variant leads to stronger theoretical guarantees. To gracefully handle the requirement that $\mathbf{U}\geq\operatorname{\mathbf{0}}_{N\times C\times T}$ when minimizing (4), we define the convex indicator function $\iota_{+}:\mathbb{R}^{N\times C\times T}\rightarrow\left\{0,+\infty\right\}$ such that $\iota_{+}(\mathbf{U})=0$ when $\mathbf{U}\geq\operatorname{\mathbf{0}}_{N\times C\times T}$ and $\iota_{+}(\mathbf{U})=+\infty$ otherwise, and denote

[TABLE]

Minimizing $\Psi$ over $(\mathbf{U},\boldsymbol{\theta},\mathbf{W})$ is a non-smooth, non-convex optimization problem. To define stationarity formally, we require the concept of a Fréchet subdifferential; we leave the technical details to Appx. B and invite the reader to interpret stationarity in the usual “zero gradient” sense in the result below.

Proposition 3.

Assume the conditions of Lem. 2 and additionally, Asms. 3, 4 and 5 from Appx. B. The sequence of iterates $(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})_{k\geq 0}$ produced by Alg. 1 using the update (14) converges to a stationary point of the objective (15).

The proof is given in Appx. B and relies on two broad steps. The first is to use an improvement of the inequality (13) to conclude that $(\mathbf{W}^{(k-1)}-\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k-1)}-\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k-1)}-\mathbf{U}^{(k)})$ converges to zero. The second is to compute a particular sequence of Fréchet subgradients whose norms can be upper bounded by norms of the gaps between successive iterates. The second step is non-trivial, as Alg. 1 combines standard proximal, proximal gradient, and cyclic coordinate-wise updates, each of which has different optimality conditions.

Full Batch and Stochastic Variants

We describe the computational complexity per iteration of both the full batch Alg. 1 and a natural stochastic/incremental variant, which leverages the finite-sum structure of the various quantities at play. The three memory bottlenecks arise from the storage of the auxiliary variables $\mathbf{U}^{(k)}$, the model parameter $\boldsymbol{\theta}\in\mathbb{R}^{d}$, and the matrices $\mathbf{A}^{(k)}_{1},\ldots,\mathbf{A}^{(k)}_{C}\in\mathbb{R}^{C\times C}$. Observing that $\mathbf{A}^{(k)}_{c-1}$ can be discarded on iterate $c$ of the inner loop, the total space complexity of the full batch algorithm is $O(NCT+C^{2}+d)$. For the time complexity, assuming that the gradient-based $\boldsymbol{\theta}^{(k)}$ update can occur in $O(Nd)$ time, the remaining bottleneck is the computation of $\{(\mathbf{A}^{(k)}_{c},\mathbf{B}_{c}^{(k)})\}_{c=1}^{C}$. Because $M\leq C$, this yields a time complexity of $O(N(d+C^{3}T))$.

While the space complexity simply reflects the memory needed to store the decision variables and data, the time complexity can be improved using sample average approximations of the matrices $(\mathbf{A}^{(k)}_{1},\mathbf{B}^{(k)}_{1}),\ldots,(\mathbf{A}^{(k)}_{C},\mathbf{B}^{(k)}_{C})$ . In particular, consider indices $\mathcal{I}=(i_{1},\ldots,i_{n})$ drawn uniformly without replacement from $[N]$ and $\mathcal{T}=(t_{1},\ldots,t_{\tau})$ drawn uniformly without replacement from $[T]$ . Letting $\mathbb{E}_{k-1}$ denote the conditional expectation over the randomness in these indices given $(\mathbf{U}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{W}^{(k-1)})$ , we have the three identities

[TABLE]

To derive a stochastic version of the algorithm, we not only use the integrands above as sample average approximations of their expectations, but we also edit line 6 to only compute $U^{(k)}_{i,c,t}$ for $i\in\mathcal{I}$ and $t\in\mathcal{T}$ , setting $U^{(k)}_{i,c,t}\leftarrow U^{(k-1)}_{i,c,t}$ otherwise. The time complexity of the resulting algorithm is given by

[TABLE]

The first term comes from the estimation of $\mathbf{g}^{(k)}$ , the second term comes from the estimation of $\mathbf{B}_{c}^{(k)}$ , and the third term results from the estimation of each $\mathbf{A}^{(k)}_{c}$ . Importantly, even though only $\tau$ timesteps are used for $\mathbf{B}_{c}^{(k)}$ , the full gradient with respect to the $\mathbb{R}^{C\times T}$ -valued input is needed before subsetting. After this, the remaining computation needed to compute the $C\times C$ matrix is $O(nMC^{2}\tau)$ , which is absorbed into the final term of (16) because $M\leq C$ . We evaluate the method in a variety of synthetic and real data settings in the next section.
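The index sampling and the lazy refresh of the auxiliary variables described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the rule $u=g^{\prime}(y)/y$ with $g=\log\cosh$ is an assumed stand-in for line 6 of Alg. 1, which is not reproduced here.

```python
import numpy as np

def sample_minibatch(N, T, n, tau, rng):
    """Draw n trial indices and tau time indices uniformly without replacement."""
    trial_idx = rng.choice(N, size=n, replace=False)
    time_idx = rng.choice(T, size=tau, replace=False)
    return trial_idx, time_idx

def logcosh_u_rule(y, eps=1e-8):
    """Assumed rule u = g'(y) / y for g = log cosh (hypothetical choice of g)."""
    y_safe = np.where(np.abs(y) < eps, eps, y)  # avoid division by zero
    return np.tanh(y_safe) / y_safe

def lazy_u_update(U, W, Z, trial_idx, time_idx, u_rule):
    """Refresh U[i, :, t] only for sampled (i, t); all other entries are kept."""
    C = Z.shape[1]
    block = np.ix_(trial_idx, np.arange(C), time_idx)
    Y_sub = np.einsum("cd,idt->ict", W, Z[block])  # unmixed minibatch, (n, C, tau)
    U_new = U.copy()
    U_new[block] = u_rule(Y_sub)
    return U_new
```

Only $nC\tau$ entries of $\mathbf{U}$ are touched per iteration, which is the source of the savings relative to the full batch update over all $NCT$ entries.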

3 Experimental Results

In our experiments, we aim to determine the benefits of incorporating both multiple trials and supervision into the learning process. Via both simulated and real data experiments, we also provide practical recommendations on hyperparameter tuning for Alg. 1. As for quantitative measurement, we recall from Sec. 1 that we are interested in recovering $\mathbf{A}$ via $\mathbf{W}^{-1}$ up to permutations and scalings of the rows. Thus, we measure success via the Amari distance criterion

[TABLE]

where $R_{jc}:=[\mathbf{W}\mathbf{A}]_{jc}$. This quantity is zero if and only if $\mathbf{W}\mathbf{A}=\mathbf{P}\mathbf{D}$ for a permutation matrix $\mathbf{P}\in\left\{0,1\right\}^{C\times C}$ and invertible diagonal matrix $\mathbf{D}\in\mathbb{R}^{C\times C}$, naturally inspiring the goal of making (17) as small as possible (see Fig. 2 for an illustration). Notice that the Amari distance can only be computed if the true mixing matrix $\mathbf{A}$ is known, making it appropriate only for simulation studies.
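For concreteness, one common form of the Amari distance is sketched below. The normalization is an assumption and may differ from the one in (17); the function name is hypothetical, and the function takes the unmixing matrix $\mathbf{W}$ directly, so that $R=\mathbf{W}\mathbf{A}$ as in the text.

```python
import numpy as np

def amari_distance(W, A):
    """Amari-type distance based on R = W A (assumed normalization).

    Returns 0 exactly when every row and every column of R has a single
    nonzero entry, i.e. when W A is a scaled permutation matrix.
    """
    R = np.abs(W @ A)
    C = R.shape[0]
    row_term = (R.sum(axis=1) / R.max(axis=1) - 1.0).sum()
    col_term = (R.sum(axis=0) / R.max(axis=0) - 1.0).sum()
    return (row_term + col_term) / (2.0 * C * (C - 1))
```

With this scaling the value lies in $[0,1]$, which makes distances comparable across different numbers of components $C$.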

For real data experiments, we instead evaluate the method according to the scientific objectives outlined in Sec. 1; we evaluate the ability of the estimated sources to retain an explainable dependence with the labels used for supervision. This is measured both quantitatively by prediction metrics such as accuracy and qualitatively using visualizations of the learned sources on held-out data. Code and additional details for reproducibility can be found at https://ronakdm.github.io/_pages/software.

Effect of Multiple Trials

We evaluate MultiICA against various baselines for single-trial (unsupervised) ICA using the Amari distance. We compare to the cumulant-based methods FOBI (Cardoso, 1989) and JADE (Cardoso and Souloumiac, 1993). Following a common non-Gaussian synthetic benchmark for ICA, we draw $\mathbf{s}_{1},\ldots,\mathbf{s}_{N}\in\mathbb{R}^{C\times T}$ with independent $\operatorname{Laplace}(1)$ entries and $\mathbf{A}$ with standard normal entries, for $(N,C,T)=(80,10,1000)$. First, for each trial, we compute the Amari distances $d_{1},\ldots,d_{N}$, where $d_{i}=\operatorname{AD}(\mathbf{W}^{-1}_{\mathrm{Base}}(\mathbf{z}_{i}),\mathbf{A})$ and $\mathbf{W}_{\mathrm{Base}}(\mathbf{z}_{i})$ denotes the output of the baseline consuming the data of the $i$-th trial. We then compare the mean and median Amari distance over trials to $\operatorname{AD}(\mathbf{W}^{-1}_{\mathrm{MultiICA}},\mathbf{A})$ for $(n,\tau,\eta_{\mathrm{u}})=(10,64,0.1)$, and average the results over 50 random seeds, as shown in the left panel of Fig. 3. To understand the degree to which the observed improvement of MultiICA over baselines is due to the larger sample size available to it (all $NT$ time points versus $T$ per trial for the baselines), we compute the Amari distance $\operatorname{AD}(\mathbf{W}^{-1}_{\mathrm{Base}}([\mathbf{z}_{1}|\cdots|\mathbf{z}_{N}]),\mathbf{A})$, where $[\mathbf{z}_{1}|\cdots|\mathbf{z}_{N}]\in\mathbb{R}^{C\times NT}$ is the concatenation of the trials along the time axis. As shown in the right panel of Fig. 3, MultiICA consistently estimates the unmixing matrix with Amari distance two orders of magnitude lower than FOBI, but is outperformed by JADE. Note, however, that both JADE and FOBI are full batch methods, and that the time complexity of JADE on concatenated data is $O(C^{5})$ per iteration with an $O(NTC^{4})$ initialization. In contrast, using batch sizes $n\leq N$ for the trials and $\tau\leq T$ for the samples, MultiICA converges to an approximate solution with an $O\left(n\tau C^{3}\right)$ cost per iteration.
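The synthetic setup above, including the concatenation of trials along the time axis used for the full-sample baselines, can be sketched as follows. The seed and variable names are illustrative, and $\operatorname{Laplace}(1)$ is assumed to mean a zero-mean Laplace distribution with unit scale.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
N, C, T = 80, 10, 1000

# Sources with independent Laplace entries (unit scale is an assumption).
S = rng.laplace(loc=0.0, scale=1.0, size=(N, C, T))
# Mixing matrix with standard normal entries.
A = rng.standard_normal((C, C))
# Observed trials z_i = A s_i, stored with shape (N, C, T).
Z = np.einsum("cd,idt->ict", A, S)

# Concatenate trials along time: [z_1 | ... | z_N] with shape (C, N * T).
Z_concat = np.transpose(Z, (1, 0, 2)).reshape(C, N * T)
```

A single-trial baseline would consume `Z[i]` (shape `(C, T)`), whereas the concatenated baseline consumes `Z_concat`.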

Effect of Supervision

Next, we isolate the effect of supervision and the relationship between the hyperparameters $(\lambda,\eta_{\mathrm{u}},\eta_{\mathrm{p}})$ used in Alg. 1. The sources are similarly constructed with independent $\operatorname{Laplace}(1)$ entries, for $(N,C,T)=(6000,10,1000)$. We generate $M=3$ regression targets using spectrogram features of the independent sources. That is, let $\phi:\mathbb{R}^{T}\rightarrow\mathbb{R}^{d}$ denote a differentiable feature map. Then, for $i=1,\ldots,N$ and $m=1,\ldots,M$, we compute labels $y_{i,m}={\left\langle\boldsymbol{\theta}_{m},\phi(\mathbf{s}_{i,m})\right\rangle}$ for $\boldsymbol{\theta}_{m}$ generated with standard normal entries. For $\mathbf{A}$, we generate a matrix using the eigenspaces of the Hilbert matrix, with condition number $\exp(\kappa)$ parametrized by the constant $\kappa>0$. The results for $\kappa\in\left\{5,7\right\}$ and $\lambda$ ranging between $3\cdot 10^{-5}$ and $1\cdot 10^{-3}$ are shown in Fig. 4 for 80 seeds and batch sizes $n=\tau=128$. In all cases, we set $\eta_{\mathrm{u}}=0.001$, which gave the most stable and fastest performance across various seeds, distributions, and settings. A crucial observation in Fig. 4 is that for $\lambda=0$, a non-trivial proportion of trajectories fail (e.g., converge to a local minimum) using only the unsupervised component of the objective for guidance. This effect catastrophically harms the mean performance across seeds. On the other hand, the final Amari distance actually increases when the supervision parameter increases, which is likely due to the bias of the jointly learned parameters $(\boldsymbol{\theta}^{(k)}_{m})_{k\geq 0}$. Thus, by applying a small amount of supervision, the mixing matrix can be recovered (up to scaling and permutation) with significantly higher probability.
Finally, based on both the synthetic examples and the upcoming real data example, we recommend that practitioners simply fix $(\lambda,\eta_{\mathrm{u}})=(0.00003,0.001)$ and tune the model learning rate $\eta_{\mathrm{p}}$, which is empirically seen to be 1-2 orders of magnitude smaller than $\eta_{\mathrm{u}}$. Notably, the sequence $\mathbf{W}^{(k)}$ may converge much faster as $k\rightarrow\infty$ than $\boldsymbol{\theta}^{(k)}$ during the joint learning process.
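A possible construction of the ill-conditioned mixing matrix and the regression labels is sketched below. Two choices are assumptions and are not taken from the paper: the spectrum is log-spaced between $1$ and $\exp(\kappa)$ on the eigenvectors of the Hilbert matrix, and a periodogram stands in for the spectrogram feature map $\phi$. All function names are hypothetical.

```python
import numpy as np

def hilbert_mixing_matrix(C, kappa):
    """Symmetric matrix on the Hilbert eigenbasis with condition number exp(kappa)."""
    idx = np.arange(C)
    H = 1.0 / (idx[:, None] + idx[None, :] + 1.0)     # Hilbert matrix
    _, Q = np.linalg.eigh(H)                          # orthonormal eigenvectors
    spectrum = np.exp(np.linspace(0.0, kappa, C))     # assumed log-spaced spectrum
    return (Q * spectrum) @ Q.T                       # Q diag(spectrum) Q^T

def periodogram_features(s):
    """Stand-in feature map phi: power of the real FFT along the last axis."""
    return np.abs(np.fft.rfft(s, axis=-1)) ** 2 / s.shape[-1]

def make_labels(S, theta):
    """Labels y[i, m] = <theta[m], phi(S[i, m])> for the first M sources."""
    M = theta.shape[0]
    feats = periodogram_features(S[:, :M, :])         # shape (N, M, d)
    return np.einsum("md,imd->im", theta, feats)
```

Here `theta` has shape `(M, d)` with `d = T // 2 + 1` for the periodogram; a different feature map would change `d` but not the structure of `make_labels`.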

Application to Neural Data

To demonstrate the applicability of the MultiICA method to neural data, we applied it to micro-electrocorticography ($\mu$ECoG) signals recorded from $C=8$ locations on the posterior parietal cortex of an adult male rhesus macaque performing a reaching task, both with and without external stimulation. The $\mu$ECoG signals were sampled at 1000 Hz for 2 seconds, resulting in $T=2000$ samples. Optogenetic stimulation was used to deactivate neurons that had been genetically modified to respond to light. The behavioral task involved reaching in one of four cardinal directions, with optogenetic deactivation applied in approximately 50% of the $N=394$ training trials. The reach direction and the stimulation protocol (whether stimulation was applied) generate $M=2$ discrete supervised targets on each trial. We use the same predictive model as in Fig. 4 and $\lambda=0.00003$, with batch sizes $(n,\tau)=(32,64)$. As described at the start of this section, because there is no known mixing matrix $\mathbf{A}$, we evaluate the quality of $\mathbf{W}$ in Fig. 5 by illustrating how the learned sources retain information about stimulation and reach direction.

Regarding the unsupervised and supervised losses, we observe the effect of the unmixing matrix converging much faster than the model parameters. Furthermore, while the unmixing matrix is the ultimate object of interest in our setting, if the supervised model learned by the algorithm is in fact useful for downstream applications, we recommend either taking the best model from multiple seeds (as in Fig. 5) or training another supervised model on top of the independent sources. Indeed, the actual classification performance of the learned model may vary significantly (see Fig. 5, bottom), even if the loss of the unmixing matrix does not. We find that 5 seeds are sufficient to get close to the best possible performance on average, which corresponds to the blue lines in Fig. 5. To illustrate the explainability aspect, we plot vector representations of the learned sources on the right panels of Fig. 5. These representations are the first two principal components of the flattened spectrograms of the source corresponding to the stimulation protocol target, showing an affine shift in the representation space associated with stimulation.
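The two-dimensional representation described above (first two principal components of flattened spectrograms) could be computed along the following lines. This is only a sketch: the window length, hop size, Hann window, and function names are hypothetical choices, and the input is assumed to be one estimated source per trial.

```python
import numpy as np

def spectrogram(x, window=128, hop=64):
    """Power spectrogram via a Hann-windowed short-time FFT (assumed settings)."""
    frames = np.lib.stride_tricks.sliding_window_view(x, window)[::hop]
    frames = frames * np.hanning(window)
    return np.abs(np.fft.rfft(frames, axis=-1)) ** 2   # (n_frames, window // 2 + 1)

def first_two_pcs(sources):
    """Project flattened spectrograms of `sources` (shape (N, T)) onto two PCs."""
    feats = np.stack([spectrogram(x).ravel() for x in sources])
    feats = feats - feats.mean(axis=0)                  # center before PCA
    _, _, Vt = np.linalg.svd(feats, full_matrices=False)
    return feats @ Vt[:2].T                             # shape (N, 2)
```

Each row of the output can then be plotted as a point colored by the trial's stimulation label to inspect whether stimulated and unstimulated trials separate.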

4 Conclusion

We proposed an ICA algorithm that incorporates supervision from multiple trials with auxiliary target variables to improve the learning trajectory of the unmixing matrix. The algorithm combines proximal gradient-type updates for the unmixing matrix with generic, backpropagation-based learning for a supervised model that is jointly learned with the unmixing matrix. We illustrated the method on synthetic and real data, and discussed practical hyperparameter selection for users. Future work includes theoretical convergence analysis and incorporating orthogonality constraints, as in FastICA.

Acknowledgments

This work was supported by NSF DMS-2023166, CCF-2019844, DMS-2134012, NIH. The authors are grateful to E. Shea-Brown and A. Shojaie for fruitful discussions.

Appendix A Descent Guarantees

A.1 Derivations

We derive the update given in Prop. 1, starting with a technical lemma related to objectives that result from linear combinations of quadratic and log-determinant terms.

Lemma 4.

Consider a function $h:\mathbb{R}^{C}\rightarrow\mathbb{R}$ of the form

[TABLE]

where $\mathbf{K}\in\mathbb{R}^{C\times C}$ is invertible, $\mathbf{b}\in\mathbb{R}^{C}$, and we define $h(\mathbf{r})=+\infty$ when $r_{j}=0$ for $j\in[C]$. Then, the minimizer of $h$, denoted $\mathbf{r}^{\star}$, is given by

[TABLE]

for $r_{j}^{\star}=\sqrt{\mathbf{K}^{-1}_{jj}+\tfrac{1}{4}\left(\mathbf{K}^{-1}\mathbf{b}\right)_{j}^{2}}+\tfrac{1}{2}(\mathbf{K}^{-1}\mathbf{b})_{j}$ .

Proof.

Optimizing the function $h$ over $r_{j}>0$ or $r_{j}<0$ will result in the same gradient formula and a strongly convex function. Indeed, we have that

[TABLE]

which implies the relationship

[TABLE]

Given $r_{j}$ , it is easy to compute the rest of $\mathbf{r}$ . By looking at the $j$ -th coordinate of the relationship above, we have that

[TABLE]

because we fixed the convention that $r_{j}>0$ . Plugging this back into the equation (18) completes the proof. ∎
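The closed form for $r_{j}^{\star}$ can be checked numerically. Since the display defining $h$ is not reproduced above, the sketch below assumes $h(\mathbf{r})=\tfrac{1}{2}\mathbf{r}^{\top}\mathbf{K}\mathbf{r}-\mathbf{b}^{\top}\mathbf{r}-\log|r_{j}|$ with $\mathbf{K}$ symmetric positive definite; this form is chosen because its stationarity condition $\mathbf{K}\mathbf{r}=\mathbf{b}+\mathbf{e}_{j}/r_{j}$ reproduces the stated expression for $r_{j}^{\star}$. The function names are hypothetical.

```python
import numpy as np

def lemma4_minimizer(K, b, j):
    """Minimizer of the assumed h, using the closed form for the j-th coordinate."""
    K_inv = np.linalg.inv(K)
    Kb = K_inv @ b
    # Positive root of r_j^2 - (K^{-1} b)_j r_j - (K^{-1})_{jj} = 0.
    r_j = np.sqrt(K_inv[j, j] + 0.25 * Kb[j] ** 2) + 0.5 * Kb[j]
    e_j = np.zeros(len(b))
    e_j[j] = 1.0
    # Remaining coordinates follow from K r = b + e_j / r_j.
    return K_inv @ (b + e_j / r_j)

def grad_h(r, K, b, j):
    """Gradient of the assumed h(r) = 0.5 r^T K r - b^T r - log|r_j|."""
    g = K @ r - b
    g[j] -= 1.0 / r[j]
    return g
```

Evaluating `grad_h` at the output of `lemma4_minimizer` should give a vector that is zero up to floating-point error, and the $j$-th coordinate is positive by the sign convention fixed in the proof.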

We may now prove Prop. 1, which forms the basis for line 10 of Alg. 1.

Proof.

First, expanding the Frobenius norm term, observe the relationship

[TABLE]

for $\mathbf{A}_{c}:=\mathbf{A}_{c}^{(k)}+\eta_{\mathrm{u}}^{-1}\mathbf{I}$ and $\mathbf{B}=\lambda\mathbf{B}^{(k)}_{c}-\eta_{\mathrm{u}}^{-1}\mathbf{W}^{(k,c-1)}$ . Then, evaluating (19) at $\overline{\mathbf{W}}^{(k,j-1)}:=(\mathbf{e}_{1:j-1},\mathbf{r},\mathbf{e}_{j+1:C})^{\top}\mathbf{W}^{(k,j-1)}$ , it is clear that

[TABLE]

and that

[TABLE]

Similarly, observe that

[TABLE]

Thus, using the definitions of $\mathbf{K}$ and $\mathbf{b}$ in the statement, $\mathbf{r}_{j}$ is selected by optimizing

[TABLE]

over $\mathbf{r}\in\mathbb{R}^{C}$. Apply Lem. 4 to complete the proof. ∎

A.2 Monotonicity

This subsection is dedicated to the proof of Lem. 2, which is restated at the end. As a broad outline, the proof proceeds via the sequence of inequalities

[TABLE]

where the colors indicate what changes at each step and the underbraces use the notation of the cyclic update in Alg. 1. The first inequality is immediate because

[TABLE]

The remainder of the work will be to apply the smoothness conditions so that the gradient-based updates result in descent for small enough learning rates.

We may easily see that for fixed $\mathbf{y}=(y_{1},\ldots,y_{M})$ and $\boldsymbol{\theta}$ and the function $G$ defined by

[TABLE]

it holds via the assumptions of Lem. 2 and the triangle inequality that

[TABLE]

Thus, we use the scaled Lipschitz constant $L_{\mathbf{W}}$ in the proof below.

See Lem. 2.

Proof.

The result is shown by proving the inequalities in (20). Note the equivalence

[TABLE]

for the function $H(\boldsymbol{\theta}):=\frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{M}\ell_{m}(\mathbf{W}^{(k-1)}\mathbf{z}_{i},y_{m},\boldsymbol{\theta}_{m})+\frac{\mu}{2}\left\lVert\boldsymbol{\theta}\right\rVert_{2}^{2}$ . By assumption, we have that $\nabla H$ is $(L_{\boldsymbol{\theta}}+\mu)$ -Lipschitz continuous. Furthermore, the update rule can be summarized as

[TABLE]

or in other words, $\nabla H(\boldsymbol{\theta}^{(k-1)})=\mathbf{g}^{(k)}+\mu\boldsymbol{\theta}^{(k-1)}$. Applying Nesterov (2018, Theorem 2.1.5, Eq. (2.1.9)), an inequality that does not require convexity of the objective despite the theorem's assumptions, we achieve

[TABLE]

where the last inequality follows under the condition that $\eta_{\mathrm{p}}\leq\frac{1}{L_{\boldsymbol{\theta}}+\mu}$ . For the second part of the proof, we show that for $c=1,\ldots,C$ , it holds that

[TABLE]

which can be chained to show the desired result. Because $\boldsymbol{\theta}^{(k)}$ is fixed, we show that $F(\mathbf{W}^{(k,c)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})\leq F(\mathbf{W}^{(k,c-1)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})$ . To do so, we use the smoothness constant computation (21) and apply Nesterov (2018, Theorem 2.1.5, Eq. (2.1.9)) once again to achieve

[TABLE]

where in the last inequality we used the definition of $\mathbf{W}^{(k,c)}$ as the minimizer of the objective (7), for which $\mathbf{W}^{(k,c-1)}$ is feasible. Thus, the inequality (22) is satisfied when $\eta_{\mathrm{u}}\leq\frac{1}{2\lambda L_{\mathbf{W}}}$ , completing the proof of monotonicity. If $F$ is bounded from below, then $F_{\mu}$ is as well; applying the monotone convergence theorem to the sequence $(F(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)}))_{k=1}^{\infty}$ proves the second claim and completes the proof. ∎
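The role of the step-size condition $\eta_{\mathrm{p}}\leq\frac{1}{L_{\boldsymbol{\theta}}+\mu}$ can be illustrated on a toy problem. The sketch below replaces the supervised loss with a least-squares objective (an assumption made purely for illustration, so that $L_{\boldsymbol{\theta}}$ is computable in closed form) and records the objective values along plain gradient steps.

```python
import numpy as np

def H(theta, X, y, mu):
    """Toy regularized objective: mean squared error / 2 plus (mu / 2) ||theta||^2."""
    resid = X @ theta - y
    return 0.5 * np.mean(resid ** 2) + 0.5 * mu * theta @ theta

def grad_H(theta, X, y, mu):
    return X.T @ (X @ theta - y) / len(y) + mu * theta

rng = np.random.default_rng(0)              # illustrative data
X = rng.standard_normal((200, 5))
y = rng.standard_normal(200)
mu = 0.1

# Lipschitz constant of the gradient of the unregularized part.
L_theta = np.linalg.eigvalsh(X.T @ X / len(y)).max()
eta_p = 1.0 / (L_theta + mu)                # largest step allowed by the condition

theta = rng.standard_normal(5)
values = [H(theta, X, y, mu)]
for _ in range(50):
    theta = theta - eta_p * grad_H(theta, X, y, mu)
    values.append(H(theta, X, y, mu))
```

Under this step size the recorded `values` are non-increasing, mirroring the first part of the monotonicity argument; a substantially larger step (for example, more than twice `eta_p` on this quadratic) can break the decrease.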

Appendix B Convergence to a Stationary Point

In this section, we analyze a modification to the algorithm (see (14)) and a set of technical tools to establish convergence to a stationary point of the objective. To outline the formal statement of Prop. 3 and its proof, we introduce some additional notation and definitions. For any extended real-valued function $\Psi:\mathbb{R}^{D}\rightarrow\mathbb{R}\cup\left\{+\infty\right\}$, finite at $\mathbf{x}\in\mathbb{R}^{D}$, we define the Fréchet subdifferential at $\mathbf{x}$ as the set

[TABLE]

which may be empty. An element of this set will be called a Fréchet subgradient. In our case, we consider $\mathbf{x}=(\mathbf{W},\boldsymbol{\theta},\mathbf{U})$ and $D=C^{2}+d+NCT$ (by flattening the tensors appropriately), and the objective (15). We say that a sequence $(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})_{k\geq 0}$ is stationary if there exists a sequence of Fréchet subgradients $G^{(k)}=(G^{(k)}_{\mathbf{U}},G^{(k)}_{\boldsymbol{\theta}},G^{(k)}_{\mathbf{W}})$ such that $\lVert G^{(k)}\rVert_{\mathrm{F}}\rightarrow 0$ as $k\rightarrow\infty$.

Let us outline the proof. First, given the result of Lem. 2, it is clear that $F_{\mu}(\mathbf{W}^{(k-1)},\boldsymbol{\theta}^{(k-1)},\mathbf{U}^{(k-1)})-F_{\mu}(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})\rightarrow 0$ as $k\rightarrow\infty$ due to monotonicity and boundedness of the sequence. To achieve a stationarity result, we first prove a descent inequality of the form

[TABLE]

which, due to the convergence of the function values, provides the asymptotic result

[TABLE]

as $k\rightarrow\infty$ . In the second step, we compute a particular Fréchet subgradient $(G^{(k)}_{\mathbf{U}},G^{(k)}_{\boldsymbol{\theta}},G^{(k)}_{\mathbf{W}})\in\partial\Psi(\mathbf{U}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{W}^{(k)})$ , and show that

[TABLE]

for a constant $C_{0}\geq 0$. Given that $I^{(k)}\rightarrow 0$, it holds that this subgradient will also converge to zero, providing convergence of the sequence $(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})$ to a stationary point. These two broad steps comprise the next two subsections, which culminate in Prop. 7. Before proceeding, we state three additional assumptions. The first, which is used in other analyses of non-convex optimization trajectories (Bolte et al., 2014), is the boundedness of the iterates.

Assumption 3.

There exists a constant $B>0$ such that

[TABLE]

The second assumption will be similar, although instead of boundedness of the iterates, we require that the $\mathbf{W}^{(k)}$ trajectory does not become “too close” to singularity. In this case, we can place a smoothness condition on the log-determinant function.

Assumption 4.

For a constant $\sigma>0$ , define the set $\mathcal{W}_{\sigma}$ , which is a subset of the invertible matrices, using the following condition. For any $\mathbf{W},\mathbf{W}^{\prime}\in\mathcal{W}_{\sigma}$ and $c=1,\ldots,C$ , it holds that

[TABLE]

Assume that there exists $\sigma>0$ such that $\mathcal{W}_{\sigma}$ is non-empty and that $\mathbf{W}^{(k,c)}\in\mathcal{W}_{\sigma}$ for $k=1,\ldots,K$ and $c=1,\ldots,C$ .

Another interpretation is that we confine the matrices to a sub-level set of the negative log-determinant function. Here, $\sigma$ can also be interpreted as (a reparametrization of) a minimum singular value constant. The final assumption is simply a smoothness assumption on the function $f$ , which first appears in the variational form (3) and consequently the full objective (4).

Assumption 5.

Assume that $f$ is differentiable, and that for any $u,v\geq 0$, it holds that

[TABLE]

B.1 Descent Inequalities

Here, we reuse the proof of Lem. 2, but retain the additional non-positive terms that were dropped as slack for cancellation. While the analyses for $\mathbf{U}^{(k)}$ and $\boldsymbol{\theta}^{(k)}$ are nearly identical to the previous results, we must chain the inequalities for the inner loop of the $\mathbf{W}^{(k)}$ update so that they sum to the desired result.

Lemma 5.

Under the conditions of Lem. 2, the following three inequalities hold:

[TABLE]

Proof.

The first inequality follows immediately from (14). By repeating the argument in the proof of Lem. 2, we may achieve the two inequalities

[TABLE]

for every step of the inner loop. Using $\eta_{\mathrm{p}}\leq\frac{1}{L_{\boldsymbol{\theta}}+\mu}$ above grants the second inequality in the statement of the result. For the third inequality, using that $\eta_{\mathrm{u}}\leq\frac{1}{2\lambda L_{\mathbf{W}}}$ , we have that

[TABLE]

where the equalities follow from the fact that $\mathbf{W}^{(k,c)}_{j\cdot}=\mathbf{W}^{(k,c-1)}_{j\cdot}$ for $j\neq c$ due to the row constraints and that $\mathbf{W}^{(k,c)}_{c\cdot}=\mathbf{W}^{(k)}_{c\cdot}$ and $\mathbf{W}^{(k,c-1)}_{c\cdot}=\mathbf{W}^{(k-1)}_{c\cdot}$ due to the order of the updates. By chaining this inequality for $c=C,\ldots,1$ , we have that

[TABLE]

∎

By applying the three inequalities from Lem. 5, we achieve the descent guarantee described as the first step in the proof outline. The convergence (23) follows from the rearrangement

[TABLE]

where the right-hand side converges to zero.

B.2 Subdifferential Inequalities

We now establish (24), in two parts. We introduce some additional notation in this section. Decompose the objective further via

[TABLE]

for

[TABLE]

For the remaining results, we first provide a formula for a Fréchet subgradient. We then prove the inequality (24) to achieve the desired result. In the first part, we identify the Fréchet subgradient that can be used in (24).

Lemma 6.

Define

[TABLE]

for

[TABLE]

where the second line is interpreted as zero when $c>M$ . Then,

[TABLE]

Proof.

By (14), it holds by the optimality of $\mathbf{U}^{(k)}$ that

[TABLE]

for some $\mathbf{S}^{(k)}\in\partial\iota_{+}(\mathbf{U}^{(k)})$ . It also holds that

[TABLE]

which, combined with the above, yields

[TABLE]

By the definition of $\boldsymbol{\theta}^{(k)}$ , it holds that

[TABLE]

It holds trivially that $\nabla_{\boldsymbol{\theta}}F_{\mu}(\mathbf{W}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{U}^{(k)})+\mathbf{0}_{d}\in\partial_{\boldsymbol{\theta}}\Psi(\mathbf{U}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{W}^{(k)})$ and substituting the expression for $\mathbf{0}_{d}$ then proves that $G^{(k)}_{\boldsymbol{\theta}}\in\partial_{\boldsymbol{\theta}}\Psi(\mathbf{U}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{W}^{(k)})$. Finally, for any vector $\mathbf{w}\in\mathbb{R}^{C}$, we use $\mathbf{W}^{(k,c-1)}(\mathbf{w})$ to indicate the matrix $\mathbf{W}^{(k,c-1)}$ with its $c$-th row replaced by $\mathbf{w}$. Due to the definition

[TABLE]

we have the identity

[TABLE]

Then, because

[TABLE]

we have that

[TABLE]

is equal to $\nabla_{\mathbf{W}_{c\cdot}}\Psi(\mathbf{U}^{(k)},\boldsymbol{\theta}^{(k)},\mathbf{W}^{(k)})$ . To simplify the expression for the unsupervised loss, note that because $\mathbf{W}^{(k)}_{c\cdot}=\mathbf{W}^{(k,c)}_{c\cdot}$ , it holds that

[TABLE]

For the supervised loss, defining the expression below as zero for $c>M$ , because $\mathbf{W}^{(k,c-1)}_{c\cdot}=\mathbf{W}^{(k-1)}_{c\cdot}$ , we have that

[TABLE]

By combining all three steps with the subdifferential calculus rule

[TABLE]

we complete the proof. ∎

To complete the analysis, we upper bound the norm of the subgradient

[TABLE]

This is where the key assumptions are used.

Proposition 7.

Assume the conditions of Lem. 2 and additionally, Asms. 3, 4 and 5. Let $L=(L_{\mathbf{U}},L_{\boldsymbol{\theta}},L_{\mathbf{W}})$, $\eta=(\eta_{\mathrm{a}},\eta_{\mathrm{p}},\eta_{\mathrm{u}})$, and $\mathbf{Z}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{N})$. Then, there exists a constant

[TABLE]

such that for $k=1,\ldots,K$ , it holds that

[TABLE]

Proof.

Starting from (28), we bound each norm using the formulas from Lem. 6. In the following, we use $C_{i}\equiv C_{i}(L,\eta,\mathbf{Z},\mu,\sigma)\geq 0$ as a constant. Recall the function $f$ from (3). Under Asm. 3 and Asm. 5, it holds that

[TABLE]

Next, under Asm. 3 and the smoothness conditions of Lem. 2, we have

[TABLE]

where in the bound on the third term, we use the mixed smoothness constant $\bar{L}$ from Asm. 2. Finally, under Asm. 3 and Asm. 4, we have

[TABLE]

Combine the three bounds and the inequality $(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2})$ to achieve the desired result. ∎

Bibliography


Ablin et al. (2019) P. Ablin, A. Gramfort, J.-F. Cardoso, and F. Bach. Stochastic algorithms with descent guarantees for ICA. In AISTATS, 2019.
Amari (1999) S.-I. Amari. Natural gradient learning for over- and under-complete bases in ICA. Neural Computation, 1999.
Bell and Sejnowski (1995) A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 1995.
Bolte et al. (2014) J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 2014.
Brendel and Kellermann (2021) A. Brendel and W. Kellermann. Accelerating Auxiliary Function-Based Independent Vector Analysis. In ICASSP, 2021.
Brendel et al. (2020) A. Brendel, T. Haubner, and W. Kellermann. Spatially Guided Independent Vector Analysis. In ICASSP, 2020.
Cardoso (1989) J.-F. Cardoso. Source Separation using Higher Order Moments. In ICASSP, 1989.
Cardoso and Souloumiac (1993) J.-F. Cardoso and A. Souloumiac. Blind Beamforming for Non Gaussian Signals. In IEEE Proceedings F (Radar and Signal Processing), 1993.
