Faster independent component analysis by preconditioning with Hessian   approximations

Pierre Ablin; Jean-Fran\c{c}ois Cardoso; Alexandre Gramfort

arXiv:1706.08171·stat.ML·August 1, 2018·IEEE Trans. Signal Process.

Faster independent component analysis by preconditioning with Hessian approximations

Pierre Ablin, Jean-Fran\c{c}ois Cardoso, Alexandre Gramfort

PDF

2 Repos

TL;DR

This paper introduces Picard, a preconditioned L-BFGS algorithm for ICA that uses Hessian approximations, significantly improving speed and accuracy on large, real-world datasets.

Contribution

The paper presents a novel preconditioning approach for ICA using Hessian approximations, enhancing convergence speed and robustness on real data.

Findings

01

Picard outperforms existing ICA algorithms in speed and accuracy.

02

The method is particularly effective on real datasets where ICA assumptions are relaxed.

03

Extensive numerical experiments validate the superiority of the proposed approach.

Abstract

Independent Component Analysis (ICA) is a technique for unsupervised exploration of multi-channel data that is widely used in observational sciences. In its classic form, ICA relies on modeling the data as linear mixtures of non-Gaussian independent sources. The maximization of the corresponding likelihood is a challenging problem if it has to be completed quickly and accurately on large sets of real data. We introduce the Preconditioned ICA for Real Data (Picard) algorithm, which is a relative L-BFGS algorithm preconditioned with sparse Hessian approximations. Extensive numerical comparisons to several algorithms of the same class demonstrate the superior performance of the proposed technique, especially on real data, for which the ICA model does not necessarily hold.

Equations29

p (X ∣ A) = t = 1 \prod T \frac{1}{∣ det ( A )∣} i = 1 \prod N p_{i} ([A^{- 1} x]_{i} (t)) .

p (X ∣ A) = t = 1 \prod T \frac{1}{∣ det ( A )∣} i = 1 \prod N p_{i} ([A^{- 1} x]_{i} (t)) .

L (W) = - lo g ∣ det (W)∣ - \hat{E} [i = 1 \sum N lo g (p_{i} (y_{i} (t))],

L (W) = - lo g ∣ det (W)∣ - \hat{E} [i = 1 \sum N lo g (p_{i} (y_{i} (t))],

L ((I + E) W) = L (W) + ⟨ G ∣ E ⟩ + \frac{1}{2} ⟨ E ∣ H ∣ E ⟩ + O (∣∣ E ∣ ∣^{3}) .

L ((I + E) W) = L (W) + ⟨ G ∣ E ⟩ + \frac{1}{2} ⟨ E ∣ H ∣ E ⟩ + O (∣∣ E ∣ ∣^{3}) .

lo g ∣ det (I + E)∣

lo g ∣ det (I + E)∣

lo g (p_{i} (y + e))

G_{ij} = \hat{E} [ψ_{i} (y_{i}) y_{j}] - δ_{ij} or G (Y) = \frac{1}{T} ψ (Y) Y^{⊤} - I d

G_{ij} = \hat{E} [ψ_{i} (y_{i}) y_{j}] - δ_{ij} or G (Y) = \frac{1}{T} ψ (Y) Y^{⊤} - I d

H_{ij k l} = δ_{i l} δ_{j k} + δ_{ik} \hat{E} [ψ_{i}^{'} (y_{i}) y_{j} y_{l}] .

H_{ij k l} = δ_{i l} δ_{j k} + δ_{ik} \hat{E} [ψ_{i}^{'} (y_{i}) y_{j} y_{l}] .

⎩ ⎨ ⎧ \hat{h}_{ij l} \hat{h}_{ij} \hat{h}_{i} \overset{σ}{^}_{i}^{2} = \hat{E} [ψ_{i}^{'} (y_{i}) y_{j} y_{l}], for 1 \leq i, j, l \leq N = \hat{E} [ψ_{i}^{'} (y_{i}) y_{j}^{2}], for 1 \leq i, j \leq N = \hat{E} [ψ_{i}^{'} (y_{i})], for 1 \leq i \leq N = \hat{E} [y_{i}^{2}], for 1 \leq i \leq N .

⎩ ⎨ ⎧ \hat{h}_{ij l} \hat{h}_{ij} \hat{h}_{i} \overset{σ}{^}_{i}^{2} = \hat{E} [ψ_{i}^{'} (y_{i}) y_{j} y_{l}], for 1 \leq i, j, l \leq N = \hat{E} [ψ_{i}^{'} (y_{i}) y_{j}^{2}], for 1 \leq i, j \leq N = \hat{E} [ψ_{i}^{'} (y_{i})], for 1 \leq i \leq N = \hat{E} [y_{i}^{2}], for 1 \leq i \leq N .

\tilde{H}_{ij k l}^{2} = δ_{i l} δ_{j k} + δ_{ik} δ_{j l} \hat{h}_{ij} .

\tilde{H}_{ij k l}^{2} = δ_{i l} δ_{j k} + δ_{ik} δ_{j l} \hat{h}_{ij} .

{\tilde{H}_{ij k l}^{1} \tilde{H}_{iiii}^{1} = δ_{i l} δ_{j k} + δ_{ik} δ_{j l} \hat{h}_{i} \overset{σ}{^}_{j}^{2}, for i \neq = j = 1 + \hat{h}_{ii} .

{\tilde{H}_{ij k l}^{1} \tilde{H}_{iiii}^{1} = δ_{i l} δ_{j k} + δ_{ik} δ_{j l} \hat{h}_{i} \overset{σ}{^}_{j}^{2}, for i \neq = j = 1 + \hat{h}_{ii} .

[\tilde{H}^{- 1} G]_{ij} = \frac{a _{j i} G _{ij} - G _{j i}}{a _{ij} a _{j i} - 1} for i \neq = j .

[\tilde{H}^{- 1} G]_{ij} = \frac{a _{j i} G _{ij} - G _{j i}}{a _{ij} a _{j i} - 1} for i \neq = j .

λ_{ij} = \frac{1}{2} (a_{ij} + a_{j i} - (a_{ij} - a_{j i})^{2} + 4),

λ_{ij} = \frac{1}{2} (a_{ij} + a_{j i} - (a_{ij} - a_{j i})^{2} + 4),

W \leftarrow (I - α G) W,

W \leftarrow (I - α G) W,

(H M)_{ij} = k, l \sum H_{ij k l} M_{k l} = M_{j i} + \hat{E} [ψ_{i}^{'} (y_{i}) y_{j} l \sum M_{i l} y_{l}]

(H M)_{ij} = k, l \sum H_{ij k l} M_{k l} = M_{j i} + \hat{E} [ψ_{i}^{'} (y_{i}) y_{j} l \sum M_{i l} y_{l}]

H M = M^{⊤} + \frac{1}{T} [ψ^{'} (Y) \cdot (M Y)] Y^{⊤}

H M = M^{⊤} + \frac{1}{T} [ψ^{'} (Y) \cdot (M Y)] Y^{⊤}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

MethodsIndependent Component Analysis

Full text

Faster independent component analysis by preconditioning with Hessian approximations

Pierre Ablin , Jean-Francois Cardoso , and Alexandre Gramfort P. Ablin works at Inria, Parietal Team, Université Paris-Saclay, Saclay, France; e-mail: [email protected]. F. Cardoso is with the Institut d’Astrophysique de Paris, CNRS, Paris, France; e-mail: [email protected]. Gramfort is with Inria, Parietal Team, Université Paris-Saclay, Saclay, France; e-mail: [email protected]

Abstract

Independent Component Analysis (ICA) is a technique for unsupervised exploration of multi-channel data that is widely used in observational sciences. In its classic form, ICA relies on modeling the data as linear mixtures of non-Gaussian independent sources. The maximization of the corresponding likelihood is a challenging problem if it has to be completed quickly and accurately on large sets of real data. We introduce the Preconditioned ICA for Real Data (Picard) algorithm, which is a relative L-BFGS algorithm preconditioned with sparse Hessian approximations. Extensive numerical comparisons to several algorithms of the same class demonstrate the superior performance of the proposed technique, especially on real data, for which the ICA model does not necessarily hold.

**Keywords : ** Independent Component Analysis, Blind source separation, quasi-Newton methods, maximum likelihood estimation, second order methods, preconditioning.

1 Introduction

Independent Component Analysis (ICA) [1, 2] is a multivariate data exploration tool massively used across scientific disciplines such as neuroscience [3, 4, 5, 6], astronomy [7, 8, 9], chemistry [10, 11] or biology [12, 13]. The underlying assumption of ICA is that the data are obtained by combining latent components which are statistically independent. The linear ICA problem addresses the case where latent variables and observations are linked by a linear transform. Then, ICA boils down to estimating a linear transform of the input signals into ‘source signals’ which are as independent as possible.

The strength and wide applicability of ICA come from its limited number of assumptions. For ICA to become a well-posed problem it is only required that all sources except one are non-Gaussian and statistically independent [1]. The generality of this concept explains the usefulness of ICA in many domains.

An early and popular ICA algorithm is Infomax [14]. It is widely used in neuroscience and is distributed in most neural processing toolboxes (e.g. EEGLAB [15] or MNE [16]). It can be shown to be a maximum likelihood estimator [17] based on a non Gaussian component model. However, Infomax maximizes the likelihood using a stochastic gradient algorithm which may require some hand-tuning and often fails to converge [18], or only converges slowly.

Since speed is important in data exploration, various methods have been proposed for a faster maximization of the Infomax likelihood by using curvature information, that is by exploiting not only the gradient of the likelihood as in Infomax but also its second derivatives. We briefly review some of the methods found in the literature.

The most natural way of using curvature is to use the complete set of second derivatives (the Hessian) to set up the Newton method but it faces several difficulties: the Hessian is a large object, costly to evaluate and to invert for large data sets. It also has to be regularized since the ICA likelihood is not convex. The cost issue is addressed in [19] by using a truncated Newton algorithm: an approximate Newton direction is found by an early stopping (truncation) of its computation via a conjugate gradient method. Further, each step in this computation is quickly computed by a ‘Hessian-free’ formula. Another approach to exploit curvature is to use approximations of the Hessian, obtained by assuming that the current signals are independent (see e.g. [20, 21] or section 2). For instance, a simple quasi-Newton method is proposed in [22] and in AMICA [23], and a trust-region algorithm in [24].

We have re-implemented and compared these methods (see section 5) and found that the Hessian approximations do yield a low cost per iteration but that they are not accurate enough on real data (which cannot be expected to follow the ICA model at high accuracy, e.g. in presence of some correlated sources). The approach investigated in this article overcomes this problem by using an optimization algorithm which ‘learns’ curvature from the past iterations of the solver (L-BFGS [25]), and accelerates it by preconditioning with Hessian approximations.

This article is organized as follows. In section 2, we recall the expression of the gradient and Hessian of the log-likelihood. We show how simple Hessian approximations can be obtained and regularized. That allows the L-BFGS method to be preconditioned at low cost yielding the Preconditioned ICA for Real Data (Picard) algorithm described in section 3. In section 4, we detail related algorithms mentioned in the introduction. Finally, section 5 illustrates the superior behavior of the Picard algorithm by extensive numerical experiments on synthetic signals, on multiple electroencephalography (EEG) datasets, on functional Magnetic Resonance Imaging (fMRI) data and on natural images.

2 Likelihood and derivatives

Notation

The Frobenius matrix scalar product is denoted by $\langle M\lvert M^{\prime}\rangle=\operatorname{Tr}(M^{\top}M^{\prime})=\sum_{i,j}M_{ij}M^{\prime}_{ij}$ , and $\|M\|=\sqrt{\langle M\lvert M\rangle}$ is the associated Frobenius matrix norm. Let $B$ be a fourth order tensor of size $N\times N\times N\times N$ . Its application to a $N\times N$ matrix $M$ is denoted $BM$ , a $N\times N$ matrix with entries $(BM)_{ij}=\sum_{k,l}B_{ijkl}M_{kl}$ . We also denote $\langle M^{\prime}\lvert B\rvert M\rangle=\langle M^{\prime}\lvert BM\rangle=\sum_{i,j,k,l}B_{ijkl}M^{\prime}_{ij}M_{kl}$ . The Kronecker symbol $\delta_{ij}$ equals $1$ for $i=j$ and [math] otherwise.

The complexity of an operation is said to go as $\Theta(f(N,T))$ for a real function $f$ if there exist two constants $0<c_{1}<c_{2}$ such that the cost of that operation is in the interval $[c_{1}f(N,T),c_{2}f(N,T)]$ for all $T,N$ .

2.1 Non Gaussian likelihood for ICA

The ICA likelihood for a data set $X=[x_{1},..,x_{N}]^{\top}\in\mathbb{R}^{N\times T}$ of $N$ signals $x_{1},\dots,x_{N}$ of length $T$ is based on the linear model $X=AS$ where the $N\times N$ mixing matrix $A$ is unknown and the source matrix $S$ has $N$ statistically independent zero-mean rows. If each row of $S$ is modeled as i.i.d. with $p_{i}(\cdot)$ denoting the common distribution of the samples of the $i$ th source, the likelihood of $A$ is then [20]:

[TABLE]

It is convenient to work with the negative averaged log-likelihood parametrized by the unmixing matrix $W=A^{-1}$ , that is, $\mathcal{L}(W)=-\frac{1}{T}\log p(X\lvert W^{-1})$ . It is given by:

[TABLE]

where $\hat{E}$ denotes the empirical mean (sample average) and where, implicitly, $Y=WX$ . Our aim is to minimize $\mathcal{L}(W)$ with respect to $W$ which amounts to solving the ICA problem in the maximum likelihood sense.

We note from the start that this optimization problem is not convex for a simple reason: if $W^{*}$ minimizes the objective function, any permutation of the columns of $W^{*}$ gives another equivalent minimizer.

In this paper, we focus on fast and accurate minimization of $\mathcal{L}(W)$ for a given source model, that is, working with fixed predetermined densities $p_{i}$ . It corresponds to the standard Infomax model commonly used in practice. In particular, our experiments use $-\log(p_{i}(\cdot))=2\log(\cosh(\cdot/2))+\mathrm{cst}$ , which is the density model in standard Infomax.

In the following, the ICA mixture model is said to hold if the signals actually are a mixture of independent components. We stress that on real data, the ICA mixture model is not expected to hold exactly.

2.2 Relative variations of the objective function

The variation of $\mathcal{L}(W)$ with respect to a relative variation of $W$ is described, up to second order, by the Taylor expansion of $\mathcal{L}((I+\mathcal{E})W)$ in terms of a ‘small’ $N\times N$ matrix $\mathcal{E}$ :

[TABLE]

The first order term is controlled by the $N\times N$ matrix $G$ , called the relative gradient [26] and the second-order term depends on the $N\times N\times N\times N$ tensor $H$ , called the relative Hessian [22]. Both these quantities can be obtained from the second order expansions of $\log\det(\cdot)$ and $\log p_{i}(\cdot)$ :

[TABLE]

where $\psi_{i}=-\frac{p_{i}^{\prime}}{p_{i}}$ is called the score function (equal to $\tanh(\cdot/2)$ for the standard Infomax density). Collecting and re-arranging terms yields at first-order the classic expression

[TABLE]

and, at second order, the relative Hessian:

[TABLE]

Note that the relative Hessian is sparse. Indeed, it has only of the order of $N^{3}$ non-zero coefficients: $\delta_{il}\delta_{jk}\neq 0$ for $i=l$ and $j=k$ which corresponds to $N^{2}$ coefficients, and $\delta_{ik}\neq 0$ for $i=k$ which happens $N^{3}$ times. This means that for a practical application with 100 sources the Hessian easily fits in memory. However, its computation requires the evaluation of the terms $\hat{E}[\psi_{i}^{\prime}(y_{i})y_{j}y_{l}]$ , resulting in a $\Theta(N^{3}\times T)$ complexity. This fact and the necessity of regularizing the Hessian (which is not necessarily positive definite) in Newton methods motivate the consideration of Hessian approximations which are faster to compute and easier to regularize.

2.3 Hessian Approximations

The Hessian approximations are discussed on the basis of the following moments:

[TABLE]

Hence, the true relative Hessian is $H_{ijkl}=\delta_{il}\delta_{jk}+\delta_{ik}\hat{h}_{ijl}$ . A first approximation of $H$ consists in replacing $\hat{h}_{ijl}$ by $\delta_{jl}\hat{h}_{ij}$ . We denote that approximation by $\tilde{H}^{2}$ :

[TABLE]

A second approximation, denoted $\tilde{H}^{1}$ , goes one step further and replaces $\hat{h}_{ij}$ by $\hat{h}_{i}\hat{\sigma}_{j}^{2}$ for $i\neq j$ :

[TABLE]

Those two approximations are illustrated on Fig 1. A key feature is that both approximations are block diagonal. Denoting $\tilde{H}$ for either $\tilde{H}^{1}$ or $\tilde{H}^{2}$ , we note that, for $i\neq j$ , the only non-zero coefficients in $\tilde{H}_{ijkl}$ are for $(k,l)=(i,j)$ and $(k,l)=(j,i)$ . The coefficients $\tilde{H}_{ijji}$ are equal to $1$ .

When the signals are independent, $\hat{h}_{ijl}=\delta_{jl}\hat{h}_{ij}=\delta_{jl}\hat{h}_{i}\hat{\sigma}_{j}^{2}$ asymptotically in $T$ for $i\neq j$ . This, together with $\hat{h}_{iii}=\hat{h}_{ii}$ , means that the two approximations asymptotically match the true Hessian if the signals are independent. In particular, if an iterative algorithm converges to a solution on a problem where the ICA mixture model holds, the Hessian approximations get very close to the true Hessian of the objective function.

Away from convergence or if the ICA mixture model does not hold, one cannot expect those approximations to be very accurate. This is why we use them only as a preconditioners for our algorithm. They enjoy two properties which are critical in that respect, being fast to compute and easy to invert. Indeed, computing $\tilde{H}^{1}$ or $\tilde{H}^{2}$ is less costly than computing $H$ . Evaluating $\tilde{H}^{2}$ requires the computation of $\hat{h}_{ij}$ for all $(i,j)$ , which is of order $\Theta(N^{2}\times T)$ . Obtaining $\tilde{H}^{1}$ is even faster, because it only requires the computation of the $\hat{h}_{i}$ and $\hat{\sigma}_{i}$ , that is, $\Theta(N^{1}\times T)$ . Furthermore, the $\tilde{H}$ approximations are not only sparse: their block diagonal structure also allows $\tilde{H}^{-1}G$ to be computed quickly in close form. Indeed, defining $a_{ij}=\tilde{H}_{ijij}$ , elementary linear algebra shows that

[TABLE]

Hence, computing $\tilde{H}^{-1}G$ has complexity $\Theta(N^{2})$ .

2.4 Regularization of Hessian Approximations

Like the true Hessian, the Hessian approximations have no reason to be positive definite. This means that we have to set up a regularization procedure.

That can be done at little cost since the two Hessian approximations can be diagonalized in close form by diagonalizing each of the $2\times 2$ blocks. The smallest eigenvalue for the block $(i,j)$ is readily found to be:

[TABLE]

with, again, $a_{ij}=\tilde{H}_{ijij}$ , for either $\tilde{H}=\tilde{H}^{1}$ or $\tilde{H}=\tilde{H}^{2}$ .

Based on this, we propose the simple regularization procedure detailed in Algorithm 1: the blocks with positive and large enough eigenvalues are left untouched, while the other blocks have their spectrum shifted so that their smallest eigenvalue is equal to a prescribed minimum value $\lambda_{min}$ .

3 Preconditioned ICA for Real Data

Quasi-Newton methods attempt to estimate the local curvature of the objective function without explicitly computing its Hessian [27]. Indeed, popular methods such as DFP [28, 29, 30] or BFGS [31, 29, 32, 33] build an approximation of the Hessian using solely function and gradient evaluations performed during optimization.

The popular L-BFGS [25] algorithm is used in many practical applications and obtains good results on a wide variety of problems. Rather than storing all the updates of the Hessian approximation leading to a dense matrix potentially too big to fit in memory like BFGS does, L-BFGS only stores the last $m$ updates and then relies on a recursive inversion algorithm. The integer $m$ is referred to as the memory parameter. The algorithm starts from an initial guess for the Hessian which is easily invertible, and builds on it by adding low rank updates from the $m$ previous iterates and gradient evaluations. For lack of a better choice, vanilla L-BFGS uses a multiple of the identity matrix for the initial Hessian guess. However, if some Hessian approximation is available, it could be used instead. This is tantamount to preconditioning the problem with the said Hessian approximation [34].

The Hessian approximations $\tilde{H}$ provide us with a very effective preconditioning, as shown below in Sec. 5, resulting in the ‘Preconditioned ICA for Real Data’ (Picard) algorithm. Picard exploits the Hessian approximations to initialize the recursive formula of L-BFGS. It is summarized in algorithms 2 and 3. We use the same notations as in [27]: $y_{i}=G_{i}-G_{i-1}$ , $s_{i}=a_{i}p_{i}$ (this is the “relative” update of the unmixing matrix between two iterations) and $\rho_{i}=1/\langle s_{i}\lvert y_{i}\rangle$ . As in standard L-BFGS algorithm, the search direction $p_{k}$ is computed using recursive algorithm with two for loops, however the initial guess for the Hessian is here set to $\tilde{H}_{k}$ .

Line search

The algorithm relies on a line search procedure which aims at finding a good step size $\alpha$ at each iteration. In theory, the line search procedure has to enforce Wolfe conditions [35, 27] in order to guarantee convergence. The line search procedure proposed by Moré and Thuente [36] is generally considered to be an efficient way to enforce such conditions. It is based upon cubic interpolation of the objective function in the direction of interest. Yet, for each candidate step size, one must compute the values of the objective function and of the gradient, which can be costly.

A simpler line search strategy is backtracking. If, for $\alpha=1$ , the objective function is decreased, then that value is retained, otherwise the step size is divided by a factor of $2$ and the process is repeated. This method only requires one evaluation of the likelihood at each step size, but it does not enforce Wolfe conditions.

In practice, backtracking is stopped when $\alpha$ becomes too small, which is an indication that the objective function has a pathological behavior in the search direction, since we rather expect values of the order of the “Newton value” $\alpha=1$ . In the case of too many backtracking steps, resulting in too small a step size, the algorithm would not move much, and might get stuck for a long time in that problematic zone. Therefore, after a fixed number of failed backtracking step, the L-BFGS descent direction is deemed inefficient and we fall back to descending along the relative gradient direction, and reset the memory (we found that to happen quite infrequently in our experiments).

4 Related work

We compare our approach to the algorithms mentioned in section 1. Some classical ICA algorithms such as FastICA [37], JADE [38] or Kernel ICA [39] are not included in the comparison because they do not optimize the same criterion.

4.1 Gradient descent

The gradient is readily available and directly gives an update rule for a gradient descent algorithm:

[TABLE]

where $\alpha>0$ is a step size found by line search or an annealing policy. In the experiments, we used an oracle line-search: at each step, we find a very good step size using a costly line-search, but do not take into account the time taken, as if the sequence of best step sizes were readily available. This algorithm is referred to as ”Oracle gradient descent”.

4.2 Infomax

We now give a brief explanation on how the Infomax [14] algorithm actually runs. It is a stochastic version of rule (11): at each iteration of the algorithm, a relative gradient $G^{\prime}$ is computed from a ‘mini-batch’ of $T^{\prime}\ll T$ randomly selected samples and a relative update $W\leftarrow(I-\alpha G^{\prime})W$ is performed.

The stochasticity of Infomax has benefits and drawbacks. For a thorough review about what stochasticity brings, see [40]. In summary, on the good side, stochasticity accelerates the first few passes on the full data because the objective starts decreasing after only one mini batch has been used, while for a full batch algorithm like the one presented above, it takes a full pass on the whole data to start making progress. Furthermore, if the number of samples is very large, computing the gradient using the whole dataset might be too costly, and then resorting to stochastic techniques is one way of coping with the issue.

Stochasticity, however, also comes with some disadvantages. The first one is that a plain stochastic gradient method with fixed batch size needs a very careful annealing policy for the learning rate to converge to a local minimum of the objective function. In practice, across iterations, the true gradient computed with the full set will not go to [math], but instead will reach a plateau.

This is directly linked to the choice of the step size. If it is too small the algorithm will not make much progress, and if it is too large, the algorithm will become unstable. In fact, the level of the plateau reached by the gradient is proportional to the step size [40]. Line search techniques are also unpractical, because one has only access to noisy realizations of the gradient and of the objective if one works only on a mini-batch of samples. In practice, the standard Infomax implementation relies on heuristics. It starts from a given step size $\alpha_{0}$ , and decreases it by a factor $\rho$ if the angle between two successive search directions is greater than some constant $\theta$ . That makes $3$ parameters that have to be set correctly, which may be problematic [18].

4.3 Truncated Newton’s method

As explained above, direct Newton’s method is quite costly. The so-called truncated Newton method [27] manages to obtain directions similar to Newton’s method at a fraction of the cost. The idea is to compute $H^{-1}G$ using the conjugate gradient method [27] which does not require the construction of $H$ but only a mean of computing a Hessian-vector product $HM$ . In the ICA problem, there is an efficient way to do so, a Hessian free product. Indeed, using expression (5) of the true Hessian, one finds:

[TABLE]

or in a matrix form:

[TABLE]

where $\cdot$ is the element-wise matrix product. This computation comes at roughly the same cost as a gradient evaluation, its complexity being $\Theta(N^{2}\times T)$ .

The idea of truncated Newton’s method is that instead of perfectly computing $H^{-1}G$ using conjugate gradient, one can stop the iterations before termination, at a given error level, and still obtain a useful approximation of Newton’s direction.

This method is applied to ICA in [19] where the authors also use a stochastic framework with variable batch size, speeding up the algorithm during the first steps. We did not implement such a strategy in order to have a fairer comparison.

One way of incorporating Hessian approximations in this method (not implemented in [19]) is to use them once again as preconditioners for the linear conjugate gradient. We found that this idea roughly halves the number of conjugate gradient iterations for a given error in solving $H^{-1}G$ .

A difficulty arising with this method is the Hessian regularization. Because it avoids the computation of the Hessian, finding its smallest eigenvalue is not straightforward, and heuristics have to be used, like in [19]. However, we do not want these hand tuned parameters to bias the algorithm comparison. Hence, in our implementation of the algorithm, we compute $H$ and its smallest eigenvalue $\lambda_{m}$ but we do not include the associated cost in the timing of the algorithm. Then, we regularize $H$ by adding $-2\lambda_{m}Id$ to it if $\lambda_{m}<0$ .

These steps are summarized in algorithm 4 in which the step marked with a $(*)$ is not counted in the final timing.

4.4 Simple Quasi-Newton method

The simplest way to take advantage of the Hessian approximations is to use them as replacement of $H$ in Newton algorithm. The descent direction is then given by $-\tilde{H}^{-1}G$ . We will refer to this as the simple quasi-Newton method, which is detailed in Algorithm 5. Note that any Hessian approximation can be used as long as it is guaranteed to be positive definite. This optimization algorithm is used in [22] with $\tilde{H}^{2}$ (however, the regularization technique differs from our implementation), and in [41] with $\tilde{H}^{1}$ .

In the experiments, we refer to this algorithm as “Simple quasi-Newton H2” and “Simple quasi-Newton H1” where we respectively use $\tilde{H}^{2}$ and $\tilde{H}^{1}$ as approximations.

4.5 Trust-region method

Another way to proceed is to use a trust-region algorithm [27], rather than a line-search strategy. It is the idea proposed in [24], where $\tilde{H}^{2}$ is used to build a local quadratic approximation of the objective, and then minimization is done with a trust region update. In the experiments, we denote this algorithm by “Trust region ICA”. For the experiments, we used a direct translation of the author’s code in Python, which produces the same iterations as the original code.

5 Experiments

We went to great lengths to ensure a fair comparison of the above algorithms. We reimplemented each algorithm using the Python programming language. Since the most costly operations are by far those scaling with the number of samples (i.e. evaluations of the likelihood, score and its derivative, gradient, Hessian approximations and Hessian free products), we made sure that all implementations call the same functions, thereby ensuring that differences in convergence speed are caused by algorithmic differences rather than by details of implementation.

Our implementations of Picard, the simple quasi-Newton method, the truncated Newton method and the trust region method are available online111https://github.com/pierreablin/faster-ica.

5.1 Experimental setup

All the following experiments have been performed on the same computer using only one core of an Intel Core i7-6600U @ 2.6 GHz. For optimized numerical code we relied on Numpy [42] using Intel MKL as the linear algebra backend library, and the numexpr package222 https://github.com/pydata/numexpr to optimize CPU cache. It was particularly efficient in computing $\log\cosh(y_{i}(t)/2)$ and $\tanh(y_{i}(t)/2)\enspace\forall i,t$ .

For each ICA experiment, we keep track of the gradient infinite norm (defined as $\max_{ij}\lvert G_{ij}\rvert$ ) across time and across iterations. The algorithms are stopped if a certain predefined number of iterations is exceeded or if the gradient norm reaches a small threshold (typically $10^{-8}$ ).

Each experiment is repeated a certain number of times to increase the robustness of the conclusions. We end up with several gradient norm curves. On the convergence plots, we only display the median of these curves to maintain readability: half experiments finished faster and the other half finished slower than the plotted curve.

Besides the algorithms mentioned above, we have run a vanilla version of the L-BFGS algorithm, and Picard algorithm using $\tilde{H}^{1}$ and $\tilde{H}^{2}$ .

5.2 Preprocessing

A standard preprocessing for ICA is applied in all our experiments, as follows. Any given input matrix $X$ is first mean-corrected by subtracting to each row its mean value. Next, the data are whitened by multiplication by a matrix inverse square root of the empirical covariance matrix $C=\frac{1}{T}XX^{\top}$ . After whitening, the covariance matrix of the transformed signals is the identity matrix. In other words, the signals are decorrelated and scaled to unit variance.

5.3 Simulation study

In this first study, we present results obtained on synthetic data. The general setup is the following: we choose the number of sources $N$ , the number of samples $T$ and a probability density for each source. For each of the $N$ densities, we draw $T$ independent samples. Then, a random mixing matrix whose entries are normally distributed with zero mean and unit variance is created. The synthetic signals are obtained by multiplying the source signals by the mixing matrix and preprocessed as described in section (5.2).

We repeat the experiments $100$ times, changing for each run the seed generating the random signals.

We considerd $3$ different setups:

•

Experiment A: $T=10000$ independent samples of $N=50$ independent sources. All sources are drawn with the same density $p(x)=\frac{1}{2}\exp(-\lvert x\rvert)$ .

•

Experiment B: $T=10000$ independent samples of $N=15$ independent sources. The 5 first sources have density $p(x)=\exp(-\lvert x\rvert)$ , the 5 next sources are Gaussian, and the 5 last sources have density $p(x)\propto\exp(-\lvert x^{3}\rvert)$ .

•

Experiment C: $T=5000$ independent samples of $N=40$ independent sources. The $i$ th source has density $p_{i}=\alpha_{i}\mathcal{N}(0,1)+(1-\alpha_{i})\mathcal{N}(0,\sigma^{2})$ where $\sigma=0.1$ and $\alpha_{i}$ is a sequence of linearly spaced values between $\alpha_{1}=0.5$ and $\alpha_{n}=1$ .

In experiment A, the ICA assumption holds perfectly, and each source has a super Gaussian density, for which the choice $\psi=\tanh(\cdot/2)$ is appropriate. In experiment B, the first 5 sources can be recovered by the algorithms for the same reason. However, the next 5 sources cannot because they are Gaussian, and the last 5 sources cannot be recovered either because they are sub-Gaussian. Finally, in experiment C, the mixture is identifiable but, because of the limited number of samples, the most Gaussian sources cannot be distinguished from an actual Gaussian signal. The results of the three experiments are shown in Figure 2.

5.4 Experiments on EEG data

Our algorithms were also run on $13$ publicly available333https://sccn.ucsd.edu/wiki/BSSComparison EEG datasets [6]. Each recording contains $n=71$ signals, and has been down-sampled by 4 for a final length of $T\simeq 75000$ samples. EEG measures the changes of electric potential induced by brain activity. For such data, the ICA assumption does not perfectly hold. In addition, brain signals are contaminated by a lot of noise and artifacts. Still, it has been shown that ICA succeeds at extracting meaningful and biologically plausible sources from these mixtures [43, 3, 6]. Results are displayed on the top row of Figure 3.

5.5 Experiments on fMRI data

For those experiments, we used preprocessed fMRI data from the ADHD-200 consortium [44]. We perform group ICA using the CanICA framework [45] in Nilearn [46]. From the fMRI data of several patients, CanICA builds a signal matrix to which classical ICA is applied. We use that matrix to benchmark the algorithms. The problem is of size $N=40$ and $T\simeq 60000$ . See middle row of Figure 3.

5.6 Experiments on natural images

Given a grayscale image, we extract $T$ square patches (contiguous squares of pixels) of size $(s,s)$ . Each patch is vectorized, yielding an $s^{2}\times T$ data matrix. One may compute an ICA of this data set and see the columns of the mixing matrix $W^{-1}$ as features, or dictionary atoms, learned from random patches.

The ICA algorithms are run on a set of 100 natural images of open country444http://cvcl.mit.edu/database.htm [47], using $T=30000$ patches of side $8\times 8$ pixels, resulting in a $64\times 30000$ data matrix. The patches are all centered and scaled so that their mean and variance equal 0 and 1 respectively before whitening as in section 5.2. Results are shown at the bottom of Figure 3.

5.7 Discussion

On the first synthetic experiment, where the ICA mixture model holds, second order algorithms are all seen to perform well, converging in a handful of iterations. For this problem, the fastest algorithms are the simple quasi-Newton methods, which means that Picard does not improve significantly over the Hessian approximations $\tilde{H}^{1}$ or $\tilde{H}^{2}$ . This is expected since the Hessian approximations are very efficient when the ICA mixture model holds.

On the two other simulations, the ICA model is not identifiable because of the Gaussian signals. First order methods perform poorly. We can observe that for algorithms relying only on the Hessian approximations (simple quasi-Newton and trust-region ICA), the convergence speed is reduced. On the contrary, Picard and truncated Newton manage to keep a very quick convergence. On those synthetic problems, it is not clear whether or not the greater accuracy of $\tilde{H}^{2}$ over $\tilde{H}^{1}$ justifies the added computation cost.

On EEG and fMRI data, Picard still converges quickly, in a fraction of the time taken by the other algorithms. For this problem, using $\tilde{H}^{2}$ for preconditioning leads to faster convergence than $\tilde{H}^{1}$ . The results are even more striking on images, where Picard, standard L-BFGS and truncated Newton converge in a few seconds while the other algorithms show a very slow linear convergence pattern.

On all experiments, truncated Newton’s method converges in fewer iterations than Picard. This happens because it follows a direction very close to Newton’s true direction, which is the direction each second order algorithm tries to mimic when the current iterate is close to the optimum. However, if we compare algorithms in terms of time, the picture is different: the reduced number of iterations does not make up for the added cost compared to Picard.

5.8 Complexity comparison of truncated Newton and preconditioned L-BFGS

Truncated Newton’s method uses the full information about the curvature. In our experiments, we observe that while this method converges in fewer iterations than Picard, it is slower in terms of CPU time. The speed of truncated Newton depends on many parameters (stopping policy for the conjugate gradient, regularization of $H$ and of $\tilde{H}$ ), so we propose a complexity comparison of this algorithm and Picard, to understand if the former might sometimes be faster than the latter.

Operations carried by the algorithms fall into two categories. First, there are operations that do not scale with the number of samples $T$ , but only with the number of sources $N$ . Regularizing the Hessian, computing $\tilde{H}^{-1}G$ and the L-BFGS inner loop are such operations. The remaining operations scale linearly with $T$ . Computing the score, its derivative, or evaluating the likelihood are all $\theta(N\times T)$ operations. The most costly operations are in $\Theta(N^{2}\times T)$ . They are: computing the gradient, computing $\tilde{H^{2}}$ , and for the truncated Newton method, computing a Hessian free product.

For the following study, let us reason in the context where $T$ is large in front of $N^{2}$ , as it is the case for most real data applications. In that context, we do not count operations not scaling with $T$ . This is a reasonable assumption on real datasets: on the EEG problem, these operations make for less than $1\%$ of the total timing for Picard.

To keep the analysis simple, let us also assume that the operations in $\Theta(N\times T)$ are negligible in front of those in $\Theta(N^{2}\times T)$ . When computing a gradient, the coefficients of $\tilde{H}^{2}$ or a Hessian free product, the costly operation in $\Theta(N^{2}\times T)$ is numerically the same: it is the computation of a matrix product of the form $Y_{1}Y_{2}^{\top}$ where $Y_{1}$ and $Y_{2}$ have the same shape as $Y$ . With that in mind, we assume that each of these operations take the same time, $t_{G}$ .

In order to produce a descent direction, Picard only needs the current gradient and Hessian approximations; the remaining operations do not scale with $T$ . This means that each descent direction takes about $2\times t_{G}$ to be found. This complexity is exactly the same as the simple quasi-Newton method. On the other hand, truncated Newton requires the two same operations, as well as one Hessian-free product for each inner iteration of the conjugate gradient. If we denote by $N_{cg}$ the number of inner loops for the conjugate gradient, we find that truncated Newton’s method takes $(2+N_{cg})\times t_{G}$ to find the descent direction.

Now, in our experiments, we can see that truncated Newton converges in about half as many iterations as Picard. Hence, for truncated Newton to be competitive, each of its iterations should take no longer than twice a Picard iteration. That would require $N_{cg}\leq 2$ but in practice, we observed that many more conjugate gradient iterations are needed (usually more than $10$ ) to provide a satisfying Newton direction. On the other hand, if the conjugate gradient algorithm is allowed to perform only $N_{cg}=2$ inner loops at each iteration, it results in a direction which is far from Newton’s direction, drastically increasing the number of iterations required for convergence.

This analysis leads us to think that the truncated Newton’s method as described in section 4.3 cannot be faster than Picard.

5.9 Study of the control parameters of Picard

Picard has four control parameters: binary choice between two Hessian approximations, memory size $m$ for L-BFGS, number $n_{ls}$ of tries allowed for the backtracking line-search and regularization constant $\lambda_{min}$ . Our experiments indicate that $\tilde{H}^{2}$ is overall a better preconditioner for the algorithm, although the difference with $\tilde{H}^{1}$ can be small.

Through experiments, we found that the memory size had barely no effect in the range $3\leq m\leq 15$ . For a smaller value of $m$ , the algorithm does not have enough information to build a good Hessian approximation. If $m$ is too large, the Hessian built by the algorithm is biased by the landscape explored too far in the past.

The number of tries for the line-search has a tangible effect on convergence speed. Similarly, the optimal regularization constant depends on the difficulty of the problem. However, on the variety of different signals processed in our experiments (synthetic, EEG, fMRI and image), we used the same parameters $m=7$ , $n_{ls}=10$ and $\lambda_{min}=10^{-2}$ . As reported, those values yielded uniformly good performance.

6 Conclusion

While ICA is massively used across scientific domains, computation time for inference can be a bottleneck in many applications. The purpose of this work was to design a fast and accurate algorithm for maximum-likelihood ICA.

For this optimization problem, there are computationally cheap approximations of the Hessian. This leads to simple quasi-Newton algorithms that have a cost per iteration only twice as high as a gradient descent, while offering far better descent directions. Yet, such approximations can be far from the true Hessian on real datasets. As a consequence, practical convergence is not as fast as one can expect from a second order method. Another approach is to use a truncated Newton algorithm, which yields directions closer to Newton’s algorithm, but at a much higher cost per iteration.

In this work, we introduced the Preconditioned ICA for Real Data (Picard) algorithm, which combines both ideas. We use the Hessian approximations as preconditioners for the L-BFGS method. The algorithm refines the Hessian approximations to better take into account the true curvature. The cost per iteration of Picard is similar to the simple quasi-Newton methods, while providing far better descent directions. This was demonstrated, through careful implementation of various literature methods and extensive experiments over synthetic, EEG, fMRI and image data, where we showed clear gains in running time compared to the state-of-the-art.

Future work

The algorithm presented in this article is developed with fixed score functions. It would be of interest to extend it to an adaptive score framework for the recovery of a broader class of sources. An option is to alternate steps of mixing matrix estimation and steps of density estimation, as is done in AMICA for instance. In preliminary experiments, such an approach was found to impair the convergence speed of the algorithm. More evolved methods have to be considered.

Second, the regularization technique presented here is based on a trial and error heuristic which has worked uniformly well on each studied dataset. Still, since the eigenvalues of the Hessian are driven by the statistics of the signals, a careful study might lead to more informed regularization strategies.

Acknowledgments

This work was supported by the Center for Data Science, funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02, and the European Research Council (ERC SLAB-YStG-676943).

Acknowledgment

This work was supported by the Center for Data Science, funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02, and the European Research Council (ERC SLAB-YStG-676943).

Bibliography47

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] P. Comon, “Independent component analysis, a new concept?” Signal Processing , vol. 36, no. 3, pp. 287 – 314, 1994.
2[2] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks , vol. 13, no. 4, pp. 411–430, 2000.
3[3] S. Makeig, T.-P. Jung, A. J. Bell, D. Ghahremani, and T. J. Sejnowski, “Blind separation of auditory event-related brain responses into independent components,” Proceedings of the National Academy of Sciences (PNAS) , vol. 94, no. 20, pp. 10 979–10 984, 1997.
4[4] V. Calhoun, T. Adali, G. Pearlson, and J. Pekar, “A method for making group inferences from functional MRI data using independent component analysis,” Human Brain Mapping , vol. 14, no. 3, pp. 140–151, 2001.
5[5] C. F. Beckmann and S. M. Smith, “Probabilistic independent component analysis for functional magnetic resonance imaging,” IEEE Transactions on Medical Imaging , vol. 23, no. 2, pp. 137–152, Feb 2004.
6[6] A. Delorme, J. Palmer, J. Onton, R. Oostenveld, and S. Makeig, “Independent EEG sources are dipolar,” Plo S one , vol. 7, no. 2, p. e 30135, 2012.
7[7] D. Nuzillard and A. Bijaoui, “Blind source separation and analysis of multispectral astronomical images,” Astronomy and Astrophysics Supplement Series , vol. 147, no. 1, pp. 129–138, 2000.
8[8] D. Maino, A. Farusi, C. Baccigalupi, F. Perrotta, A. Banday, L. Bedini, C. Burigana, G. De Zotti, K. Górski, and E. Salerno, “All-sky astrophysical component separation with fast independent component analysis (FASTICA),” Monthly Notices of the Royal Astronomical Society , vol. 334, no. 1, pp. 53–68, 2002.