Blind nonnegative source separation using biological neural networks

Cengiz Pehlevan; Sreyas Mohan; Dmitri B. Chklovskii

arXiv:1706.00382·q-bio.NC·October 20, 2017


TL;DR

This paper introduces a biologically plausible neural network approach for blind nonnegative source separation, formulating it as a similarity matching problem with local learning rules, suitable for online streaming data.

Contribution

It presents a novel formulation of blind nonnegative source separation as a similarity matching problem with biologically plausible neural networks and local learning rules.

Findings

01

Neural networks derived from the similarity matching objective perform blind nonnegative source separation.

02

The approach is suitable for online streaming data scenarios.

03

Synaptic weight updates follow biologically plausible local learning rules.

Abstract

Blind source separation, i.e. extraction of independent sources from a mixture, is an important problem for both artificial and natural signal processing. Here, we address a special case of this problem when sources (but not the mixing matrix) are known to be nonnegative, for example, due to the physical nature of the sources. We search for the solution to this problem that can be implemented using biologically plausible neural networks. Specifically, we consider the online setting where the dataset is streamed to a neural network. The novelty of our approach is that we formulate blind nonnegative source separation as a similarity matching problem and derive neural networks from the similarity matching objective. Importantly, synaptic weights in our networks are updated according to biologically plausible local learning rules.

Equations

$${\bf x}_{t}={\bf A}{\bf s}_{t},$$

$${\bf C}_{s}:=\left<\left({\bf s}-\left<{\bf s}\right>\right)\left({\bf s}-\left<{\bf s}\right>\right)^{\top}\right>={\bf I}_{d},$$

$$\frac{1}{t}\delta{\bf H}\,\delta{\bf H}^{\top}={\bf U}^{H}{\bf\Lambda}^{H}{{\bf U}^{H}}^{\top},$$

$${\bf H}^{\top}{\bf H}={\bf S}^{\top}{\bf S}.$$

$$\left({\bf FA}\right)\left({\bf FA}\right)^{\top}={\bf U}^{H}{\bf\Lambda}^{H}{{\bf U}^{H}}^{\top}.$$

$$\left({\bf FA}\right)^{\top}\left({\bf FA}\right)={\bf I}_{d}.$$

$${\bf Y}^{*}=\underset{{\bf Y},\,{\bf Y}\geq 0}{\arg\min}\left\|{\bf H}^{\top}{\bf H}-{\bf Y}^{\top}{\bf Y}\right\|_{F}^{2},$$

$${\bf Y}^{*}=\underset{{\bf Y},\,{\bf Y}\geq 0}{\arg\min}\left\|{\bf S}^{\top}{\bf S}-{\bf Y}^{\top}{\bf Y}\right\|_{F}^{2}.$$

$${\bf Y}\longleftarrow\max\left({\bf Y}+\eta\left({\bf Y}{\bf H}^{\top}{\bf H}-{\bf Y}{\bf Y}^{\top}{\bf Y}\right),0\right),$$

$$\underset{\delta{\bf H}}{\max}\;{\rm Tr}\left(\delta{\bf X}^{\top}\delta{\bf X}\,\delta{\bf H}^{\top}\delta{\bf H}\right)\quad{\rm s.t.}\quad\delta{\bf H}^{\top}\delta{\bf H}\preceq t\,{\bf I}_{t},$$

$$\delta{\bf H}^{*}={\bf U}^{H}\sqrt{t}\,{{\bf\Lambda}^{H}}^{\prime}{{\bf V}^{X}}^{\top},$$

$$\underset{\delta{\bf H}}{\min}\;\underset{\delta{\bf G}}{\max}\;{\rm Tr}\left(-\delta{\bf X}^{\top}\delta{\bf X}\,\delta{\bf H}^{\top}\delta{\bf H}+\delta{\bf G}^{\top}\delta{\bf G}\left(\delta{\bf H}^{\top}\delta{\bf H}-t\,{\bf I}_{t}\right)\right),$$

$$\left\{\delta{\bf h}_{t},\delta{\bf g}_{t}\right\}\longleftarrow\underset{\delta{\bf h}_{t}}{\arg\min}\;\underset{\delta{\bf g}_{t}}{\arg\max}\;{\rm Tr}\left(-\delta{\bf X}^{\top}\delta{\bf X}\,\delta{\bf H}^{\top}\delta{\bf H}+\delta{\bf H}^{\top}\delta{\bf H}\,\delta{\bf G}^{\top}\delta{\bf G}-t\,\delta{\bf G}^{\top}\delta{\bf G}\right).$$

$$\left\{\delta{\bf h}_{t},\delta{\bf g}_{t}\right\}\longleftarrow\underset{\delta{\bf h}_{t}}{\arg\min}\;\underset{\delta{\bf g}_{t}}{\arg\max}\;L\left(\delta{\bf h}_{t},\delta{\bf g}_{t}\right),$$

$$\begin{aligned}L={}&-2\,\delta{\bf x}_{t}^{\top}\left(\sum_{t^{\prime}=1}^{t-1}\delta{\bf x}_{t^{\prime}}\delta{\bf h}_{t^{\prime}}^{\top}\right)\delta{\bf h}_{t}-t\left\|\delta{\bf g}_{t}\right\|_{2}^{2}+2\,\delta{\bf g}_{t}^{\top}\left(\sum_{t^{\prime}=1}^{t-1}\delta{\bf g}_{t^{\prime}}\delta{\bf h}_{t^{\prime}}^{\top}\right)\delta{\bf h}_{t}\\&+\left(\left\|\delta{\bf g}_{t}\right\|_{2}^{2}-\left\|\delta{\bf x}_{t}\right\|_{2}^{2}\right)\left\|\delta{\bf h}_{t}\right\|_{2}^{2}.\end{aligned}$$

$\delta{\bf g}_{t}^{*}$

$\delta{\bf h}_{t}^{*}$

${\bf W}_{t}^{HX}$

${\bf W}_{t}^{GH}$

$\frac{d\,\delta{\bf h}_{t}}{d\gamma}$

$\frac{d\,\delta{\bf g}_{t}}{d\gamma}$

${\bf W}_{t+1}^{HX}$

${\bf W}_{t+1}^{HG}$

${\bf W}_{t+1}^{GH}$

$$\delta{\bf h}_{t}={\bf F}_{t}\,\delta{\bf x}_{t},$$

$${\bf F}_{t+1}={\bf F}_{t}+\delta{\bf F}_{t}\left(t,\delta{\bf h}_{t},\delta{\bf g}_{t},{\bf F}_{t}\right).$$

$$\bar{\bf x}_{t}=\frac{1}{t}\sum_{t^{\prime}=1}^{t}{\bf x}_{t^{\prime}}=\left(1-\frac{1}{t}\right)\bar{\bf x}_{t-1}+\frac{1}{t}{\bf x}_{t}.$$

$\bar{\bf h}_{t}$

$${\bf h}_{t}={\bf F}_{t}{\bf x}_{t},$$

$$\bar{\bf h}_{t}=\left(1-\frac{1}{t}\right)\bar{\bf h}_{t-1}+\left(1-\frac{1}{t}\right)\delta{\bf F}_{t-1}\bar{\bf x}_{t-1}+\frac{1}{t}{\bf h}_{t}.$$

$$\bar{\bf h}_{t}=\left(1-\frac{1}{t}\right)\bar{\bf h}_{t-1}+\frac{1}{t}{\bf h}_{t}.$$

$$\bar{\bf g}_{t}=\left(1-\frac{1}{t}\right)\bar{\bf g}_{t-1}+\frac{1}{t}{\bf g}_{t}.$$


Full text

Blind nonnegative source separation using biological neural networks

Cengiz Pehlevan

Center for Computational Biology, Flatiron Institute, New York, NY

Sreyas Mohan

Center for Computational Biology, Flatiron Institute, New York, NY

IIT Madras, Chennai, India

Dmitri B. Chklovskii

Center for Computational Biology, Flatiron Institute, New York, NY

NYU Medical School, New York, NY


1 Introduction

Extraction of latent causes, or sources, from complex stimuli is essential for making sense of the world. Such stimuli could be mixtures of sounds, mixtures of odors, or natural images. If supervision, or ground truth, about the causes is lacking, the problem is known as blind source separation.

The blind source separation problem can be solved by assuming a generative model, wherein the observed stimuli are linear combinations of independent sources, an approach known as Independent Component Analysis (ICA) (Jutten and Herault, 1991; Comon, 1994; Bell and Sejnowski, 1995; Hyvärinen and Oja, 2000). Formally, the stimulus at time $t$ is expressed as a $k$ -component vector

$${\bf x}_{t}={\bf A}{\bf s}_{t},\tag{1}$$

where ${\bf A}$ is an unknown but time-independent $k\times d$ mixing matrix and ${\bf s}_{t}$ represents the signals of $d$ sources at time $t$ . In this paper we assume that $k\geq d$ .

The goal of ICA is to infer source signals, ${\bf s}_{t}$ , from the stimuli ${\bf x}_{t}$ . Whereas many ICA algorithms have been developed by the signal processing community (Comon and Jutten, 2010), most of them cannot be implemented by biologically plausible neural networks. Yet, our brains can solve the blind source separation problem effortlessly (Bronkhorst, 2000; Asari et al., 2006; Narayan et al., 2007; Bee and Micheyl, 2008; McDermott, 2009; Mesgarani and Chang, 2012; Golumbic et al., 2013; Isomura et al., 2015). Therefore, discovering a biologically plausible ICA algorithm is an important problem.

For an algorithm to be implementable by biological neural networks it must satisfy (at least) the following requirements. First, it must operate in the online (or streaming) setting. In other words, the input dataset is not available as a whole but is streamed one data vector at a time and the corresponding output must be computed before the next data vector arrives. Second, the output of most neurons in the brain (either a firing rate, or the synaptic vesicle release rate) is nonnegative. Third, the weights of synapses in a neural network must be updated using local learning rules, i.e. depend on the activity of only the corresponding pre- and postsynaptic neurons.

Given the nonnegative nature of neuronal output we consider a special case of ICA where sources are assumed to be nonnegative, termed Nonnegative Independent Component Analysis (NICA), (Plumbley, 2001, 2002). Of course, to recover the sources, one can use standard ICA algorithms that don’t rely on the nonnegativity of sources, such as fastICA (Hyvärinen and Oja, 1997; Hyvarinen, 1999; Hyvärinen and Oja, 2000). Neural learning rules have been proposed for ICA, e.g. (Linsker, 1997; Eagleman et al., 2001; Isomura and Toyoizumi, 2016) and references within. However, taking into account nonnegativity may lead to simpler and more efficient algorithms (Plumbley, 2001, 2003; Plumbley and Oja, 2004; Oja and Plumbley, 2004; Yuan and Oja, 2004; Zheng et al., 2006; Ouedraogo et al., 2010; Li et al., 2016).

While most of the existing NICA algorithms have not met the biological plausibility requirements, in terms of online setting and local learning rules, there are two notable exceptions. First, Plumbley (2001) successfully simulated a neural network on a small dataset, yet no theoretical analysis was given. Second, Plumbley (2003) and Plumbley and Oja (2004) proposed a nonnegative PCA algorithm for a streaming setting; however, its neural implementation requires nonlocal learning rules. Further, this algorithm requires prewhitened data (see also below), yet no streaming whitening algorithm was given.

Here, we propose a biologically plausible NICA algorithm. The novelty of our approach is that the algorithm is derived from the similarity matching principle which postulates that neural circuits map more similar inputs to more similar outputs (Pehlevan et al., 2015). Previous work proposed various objective functions to find similarity matching neural representations and solved these optimization problems with biologically plausible neural networks (Pehlevan et al., 2015; Pehlevan and Chklovskii, 2015a; Pehlevan and Chklovskii, 2014; Hu et al., 2014; Pehlevan and Chklovskii, 2015b). Here we apply these networks to NICA.

The rest of the paper is organized as follows: In Section 2, we show that blind source separation, after a generalized prewhitening step, can be posed as a nonnegative similarity matching (NSM) problem (Pehlevan and Chklovskii, 2014). In Section 3, using results from (Pehlevan and Chklovskii, 2015a; Pehlevan and Chklovskii, 2014) we show that both the generalized prewhitening step and the NSM step can be solved online by neural networks with local learning rules. Stacking these two networks leads to the two-layer NICA network. In Section 4, we compare the performance of our algorithm to other ICA and NICA algorithms for various datasets.

2 Offline NICA via NSM

In this section, we first review Plumbley’s analysis of NICA and then reformulate NICA as an NSM problem.

2.1 Review of Plumbley’s analysis

When source signals are nonnegative, the source separation problem simplifies. It can be solved in two straightforward steps: noncentered prewhitening and orthonormal rotation (Plumbley, 2002).

Noncentered prewhitening transforms ${\bf x}$ to ${\bf h}:={\bf F}{\bf x}$, where ${\bf h}\in{\mathbb{R}}^{d}$ and ${\bf F}$ is a $d\times k$ whitening matrix [Footnote 1: In his analysis, Plumbley (2002) assumed $k=d$ (mixture channels are the same as source channels), but this assumption can be relaxed as shown.], such that ${\bf C}_{\bf h}:=\left<\left({\bf h}-\left<{\bf h}\right>\right)\left({\bf h}-\left<{\bf h}\right>\right)^{\top}\right>={\bf I}_{d}$, where angled brackets denote an average over the source distribution and ${\bf I}_{d}$ is the $d\times d$ identity matrix. Note that the mean of ${\bf x}$ is not removed in the transformation; otherwise one would not be able to use the constraint that the sources are nonnegative (Plumbley, 2003).

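Assuming that sources have unit variance [Footnote 2: Without loss of generality, since a scalar factor that multiplies a source can always be absorbed into the corresponding column of the mixing matrix.],

$${\bf C}_{s}:=\left<\left({\bf s}-\left<{\bf s}\right>\right)\left({\bf s}-\left<{\bf s}\right>\right)^{\top}\right>={\bf I}_{d},\tag{2}$$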

the combined effect of source mixing and prewhitening ${\bf FA}$ ( ${\bf h}={\bf Fx}={\bf FAs}$ ) is an orthonormal rotation. To see this, note that, by definition, ${\bf C}_{\bf h}=\left({\bf FA}\right){\bf C}_{s}\left({\bf FA}\right)^{\top}=\left({\bf FA}\right)\left({\bf FA}\right)^{\top}$ and ${\bf C}_{\bf h}={\bf I}_{d}$ .

The second step of NICA relies on the following observation (Plumbley, 2002):

Theorem 1 (Plumbley).

Suppose sources are independent, nonnegative and well-grounded, i.e. Prob $\left(s_{i}<\delta\right)>0$ for any $\delta>0$ . Consider an orthonormal transformation ${\bf y}={\bf Q}{\bf s}$ . Then ${\bf Q}$ is a permutation matrix with probability 1, if and only if ${\bf y}$ is nonnegative.

In the second step, we look for an orthonormal ${\bf Q}$ such that ${\bf y}={\bf Q}{\bf h}$ is nonnegative. When found, Plumbley’s theorem guarantees that ${\bf Q}{\bf F}{\bf x}$ is a permutation of the sources. Several algorithms have been developed based on this observation (Plumbley, 2003; Plumbley and Oja, 2004; Oja and Plumbley, 2004; Yuan and Oja, 2004).
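As a rough numerical illustration of Theorem 1, the sketch below compares a permutation with a generic rotation applied to nonnegative, well-grounded sources. The uniform source distribution, the 30-degree angle, and all variable names are assumptions chosen for the example; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 2, 1000

# Nonnegative, well-grounded sources: uniform on [0, 1] has mass near zero.
S = rng.uniform(0.0, 1.0, size=(d, t))

# A permutation matrix is orthonormal and keeps every output nonnegative.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# A 30-degree rotation is orthonormal but is not a permutation.
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

perm_nonneg = bool((P @ S >= 0).all())       # all outputs stay nonnegative
rot_has_negative = bool((Q @ S < 0).any())   # some outputs become negative
print(perm_nonneg, rot_has_negative)
```

The rotation produces negative outputs whenever one source is close to zero while the other is not, which is the situation the well-groundedness condition guarantees will occur.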

Note that only the sources ${\bf s}$ but not necessarily the mixing matrix ${\bf A}$ must be nonnegative. Therefore, NICA allows generative models, where features not only add up but also cancel each other, as in the presence of a shadow in an image (Plumbley, 2002). In this respect, NICA is more general than Nonnegative Matrix Factorization (NMF) (Lee and Seung, 1999; Paatero and Tapper, 1994) where both the sources and the mixing matrix are required to be nonnegative.

2.2 NICA as NSM

Next we reformulate NICA as an NSM problem. This reformulation will allow us to derive an online neural network for NICA in Section 3. Our main departure from Plumbley’s analysis is to work with similarity matrices rather than covariance matrices and a finite number of samples rather than the full probability distribution of the sources.

First, let us switch to the matrix notation, where data matrices are formed by concatenating data column vectors, e.g. ${\bf X}=[{\bf x}_{1},{\bf x}_{2},...,{\bf x}_{t}]$ so that ${\bf X}\in{\mathbb{R}}^{k\times t}$ , and ${\bf S}=[{\bf s}_{1},{\bf s}_{2},...,{\bf s}_{t}]$ so that ${\bf S}\in{\mathbb{R}}^{d\times t}$ . In this notation, we introduce a time-centering operation $\delta$ such that, for example, time-centered stimuli are $\delta{\bf X}:={\bf X}-\bar{\bf X}$ where $\bar{\bf X}:={\bf X}\frac{1}{t}{\bf 1}{\bf 1}^{\top}$ and ${\bf 1}$ is a $t$ dimensional column vector whose elements are all 1’s.
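The time-centering operation can be checked numerically. The following minimal sketch, with arbitrary example dimensions, builds $\bar{\bf X}={\bf X}\frac{1}{t}{\bf 1}{\bf 1}^{\top}$ explicitly and compares it with subtracting each row's mean.

```python
import numpy as np

rng = np.random.default_rng(0)
k, t = 3, 8                      # example sizes, chosen arbitrarily
X = rng.uniform(size=(k, t))     # columns are data vectors x_1, ..., x_t

ones = np.ones((t, 1))
X_bar = X @ (ones @ ones.T) / t  # \bar{X} = X (1/t) 1 1^T
dX = X - X_bar                   # time-centered data, \delta X

# Equivalent view: every column of X_bar holds the per-channel time average.
row_means = X.mean(axis=1, keepdims=True)
```

Each row of `dX` sums to zero, so the operation removes the temporal mean of every channel.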

Our goal is to recover ${\bf S}$ from ${\bf X}={\bf A}{\bf S}$ , where ${\bf A}$ is unknown. We make the following two assumptions: First, sources are nonnegative and decorrelated, $\frac{1}{t}\delta{\bf S}\,\delta{\bf S}^{\top}={\bf I}_{d}$ . Note that while general ICA and NICA problems are stated with the independence assumption on the sources, for our purposes, it is sufficient that they are only decorrelated. Second, the mixing matrix, ${\bf A}\in{\mathbb{R}}^{k\times d}$ $(k\geq d)$ , is full-rank.

We propose that the source matrix, ${\bf S}$ , can be recovered from ${\bf X}$ in the following two steps, also illustrated in Fig. 1:

1. Generalized Prewhitening: Transform ${\bf X}$ to ${\bf H}={\bf F}{\bf X}$, where ${\bf F}$ is $l\times k$ with $l\geq d$, so that $\frac{1}{t}\delta{\bf H}\,\delta{\bf H}^{\top}$ has $d$ unit eigenvalues and $l-d$ zero eigenvalues. When $l=d$, ${\bf H}$ is whitened; otherwise channels of ${\bf H}$ are correlated. Such prewhitening is useful because it implies ${\bf H}^{\top}{\bf H}={\bf S}^{\top}{\bf S}$ according to the following theorem.

Theorem 2.

If ${\bf F}\in{\mathbb{R}}^{l\times k}(l\geq k)$ is such that ${\bf H}={\bf F}{\bf X}$ obeys

$$\frac{1}{t}\delta{\bf H}\,\delta{\bf H}^{\top}={\bf U}^{H}{\bf\Lambda}^{H}{{\bf U}^{H}}^{\top},\tag{3}$$

an eigenvalue decomposition with ${\bf\Lambda}^{H}={\rm diag}\big{(}\underset{d}{\underbrace{1,\ldots,1}},\underset{l-d}{\underbrace{0,\ldots,0}}\big{)}$ , then

$${\bf H}^{\top}{\bf H}={\bf S}^{\top}{\bf S}.\tag{4}$$

Proof.

To see why (3) is sufficient, first note that $\frac{1}{t}\delta{\bf H}\,\delta{\bf H}^{\top}=\left({\bf FA}\right)\left({\bf FA}\right)^{\top}$ . This follows from the definition of ${\bf H}$ and (2). If (3) holds, then

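$$\left({\bf FA}\right)\left({\bf FA}\right)^{\top}={\bf U}^{H}{\bf\Lambda}^{H}{{\bf U}^{H}}^{\top}.\tag{5}$$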

In turn, this is sufficient to prove that $\left({\bf FA}\right)^{\top}\left({\bf FA}\right)={\bf I}_{d}$ . To see that, assume an SVD decomposition of $\left({\bf FA}\right)={\bf U}^{FA}{\bf\Lambda}^{FA}{{\bf V}^{FA}}^{\top}$ . (5) implies that ${\bf\Lambda}^{FA}{{\bf\Lambda}^{FA}}^{\top}={\bf\Lambda}^{H}$ , i.e. that the $d$ diagonal elements of ${\bf\Lambda}^{FA}\in{\mathbb{R}}^{l\times d}$ are all 1’s. Then,

$$\left({\bf FA}\right)^{\top}\left({\bf FA}\right)={\bf I}_{d}.\tag{6}$$

This gives us the desired results ${\bf H}^{\top}{\bf H}={\bf S}^{\top}\left({\bf FA}\right)^{\top}\left({\bf FA}\right){\bf S}={\bf S}^{\top}{\bf S}$ . ∎

Remark 1.

If $l>d$, the channels of ${\bf H}$ are correlated, except in the special case ${\bf U}^{H}={\bf I}_{d}$. The whitening used in Plumbley’s analysis (Plumbley, 2002) requires $l=d$.

2. NSM: Solve the following optimization problem:

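$${\bf Y}^{*}=\underset{{\bf Y},\,{\bf Y}\geq 0}{\arg\min}\left\|{\bf H}^{\top}{\bf H}-{\bf Y}^{\top}{\bf Y}\right\|_{F}^{2},\tag{7}$$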

where the optimization is performed over nonnegative ${\bf Y}:=\left[{\bf y}_{1},\ldots,{\bf y}_{t}\right]$, i.e. ${\bf y}_{i}\in{\mathbb{R}}_{+}^{d}$. We call (7) the NSM cost function (Pehlevan and Chklovskii, 2014). Because inner products quantify similarities, we call ${\bf H}^{\top}{\bf H}$ and ${\bf Y}^{\top}{\bf Y}$ the input and output similarity matrices, i.e. their elements hold the pairwise similarities between input vectors and between output vectors, respectively. Then, the cost function (7) preserves the input similarity structure as much as possible under the nonnegativity constraint. Variants of (7) have been considered in the applied math literature under the name “nonnegative symmetric matrix factorization” for clustering applications, e.g. (Kuang et al., 2012, 2015).

Now we make our key observation. Using Theorem 2, we can rewrite (7) as

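$${\bf Y}^{*}=\underset{{\bf Y},\,{\bf Y}\geq 0}{\arg\min}\left\|{\bf S}^{\top}{\bf S}-{\bf Y}^{\top}{\bf Y}\right\|_{F}^{2}.\tag{8}$$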

Since both ${\bf S}$ and ${\bf Y}$ are nonnegative, rank- $d$ matrices, ${\bf Y}^{*}={\bf P}{\bf S}$ , where ${\bf P}$ is a permutation matrix, is a solution to this optimization problem and the sources are successfully recovered.
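The chain of identities in Theorem 2 can be checked numerically. The sketch below is an illustration under stated assumptions: the sources are synthetic and are linearly transformed so that the decorrelation condition holds exactly (which may make some entries slightly negative; Theorem 2 itself does not use nonnegativity), and the whitening matrix is built offline from an eigendecomposition with $l=d$. The helper `center` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, t = 3, 5, 400

def center(M):
    """Time-centering: subtract the temporal mean of every channel."""
    return M - M.mean(axis=1, keepdims=True)

# Synthetic sources, transformed so that (1/t) dS dS^T = I_d holds exactly.
S0 = rng.uniform(size=(d, t))
C = center(S0) @ center(S0).T / t
w, V = np.linalg.eigh(C)
S = (V @ np.diag(w ** -0.5) @ V.T) @ S0

A = rng.normal(size=(k, d))      # full-rank mixing matrix, unknown to F
X = A @ S

# Offline whitening matrix from the top-d eigenpairs of (1/t) dX dX^T.
lam, U = np.linalg.eigh(center(X) @ center(X).T / t)
lam, U = lam[-d:], U[:, -d:]     # eigh returns eigenvalues in ascending order
F = np.diag(lam ** -0.5) @ U.T   # d x k
H = F @ X                        # noncentered prewhitened data

FA = F @ A
gram_gap = np.abs(H.T @ H - S.T @ S).max()
```

Under these assumptions `FA.T @ FA` is numerically the identity and `gram_gap` is at rounding-error level, so any permutation of the rows of `S` has zero NSM cost with respect to `H`.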

Uniqueness of the solutions (up to permutations) is hard to establish. While both sufficient conditions, and necessary and sufficient conditions for uniqueness exist, these are non-trivial to verify and usually the verification is NP-complete (Donoho and Stodden, 2003; Laurberg et al., 2008; Huang et al., 2014). A review of related uniqueness results can be found in (Huang et al., 2014). A necessary condition for uniqueness given in (Huang et al., 2014) states that, if the factorization of ${\bf S}^{\top}{\bf S}$ to ${\bf Y}^{\top}{\bf Y}$ is unique (up to permutations), then each row of ${\bf S}$ contains at least one element that is equal to $0$. This necessary condition is similar to Plumbley’s well-groundedness requirement used in proving Theorem 1.

The NSM problem can be solved by projected gradient descent,

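$${\bf Y}\longleftarrow\max\left({\bf Y}+\eta\left({\bf Y}{\bf H}^{\top}{\bf H}-{\bf Y}{\bf Y}^{\top}{\bf Y}\right),0\right),\tag{9}$$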

where the $\max$ operation is applied elementwise, and $\eta$ is the size of the gradient step. Other algorithms can be found in (Kuang et al., 2012, 2015; Huang et al., 2014).
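A minimal sketch of the projected gradient iteration follows. The step-size heuristic (a fraction of the inverse spectral norm of ${\bf H}^{\top}{\bf H}$), the small random initialization, the iteration count, and the function names are assumptions made for this example; the paper does not specify them here.

```python
import numpy as np

def nsm_cost(G, Y):
    """NSM cost ||G - Y^T Y||_F^2 for an input similarity matrix G = H^T H."""
    return np.linalg.norm(G - Y.T @ Y, "fro") ** 2

def nsm_projected_gradient(H, d, n_steps=2000, seed=0):
    """Iterate Y <- max(Y + eta (Y H^T H - Y Y^T Y), 0)."""
    rng = np.random.default_rng(seed)
    G = H.T @ H
    eta = 0.1 / np.linalg.norm(G, 2)              # assumed step-size heuristic
    Y = 0.1 * rng.uniform(size=(d, H.shape[1]))   # small nonnegative start
    for _ in range(n_steps):
        # Gradient step on the cost, then elementwise projection onto Y >= 0.
        Y = np.maximum(Y + eta * (Y @ G - Y @ Y.T @ Y), 0.0)
    return Y

H = np.random.default_rng(1).uniform(size=(3, 50))
G = H.T @ H
Y0 = 0.1 * np.random.default_rng(0).uniform(size=(3, 50))  # same start as above
Y = nsm_projected_gradient(H, d=3)
print(nsm_cost(G, Y0), nsm_cost(G, Y))
```

The iteration is offline: it needs the full similarity matrix, which is why the later sections replace it with an online algorithm.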

3 Derivation of NICA neural networks from similarity matching objectives

Our analysis in the previous section revealed that the NICA problem can be solved in two steps: generalized prewhitening and nonnegative similarity matching. Here, we derive neural networks for each of these steps and stack them to give a biologically plausible two-layer neural network that operates in a streaming setting.

In a departure from the previous section, the number of output channels is reduced to the number of sources at the prewhitening stage rather than the later NSM stage ( $l=d$ ). This assumption simplifies our analysis significantly. The full problem is addressed in Appendix B.

3.1 Noncentered prewhitening in a streaming input setting

To derive a neurally plausible online algorithm for prewhitening, we pose generalized prewhitening, Eq. (3), as an optimization problem. Online minimization of this optimization problem gives an algorithm that can be mapped to the operation of a biologically plausible neural network.

Generalized prewhitening solves a constrained similarity alignment problem:

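$$\underset{\delta{\bf H}}{\max}\;{\rm Tr}\left(\delta{\bf X}^{\top}\delta{\bf X}\,\delta{\bf H}^{\top}\delta{\bf H}\right)\quad{\rm s.t.}\quad\delta{\bf H}^{\top}\delta{\bf H}\preceq t\,{\bf I}_{t},\tag{10}$$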

where $\delta{\bf X}$ is the $k\times t$ centered mixture of $d$ independent sources and $\delta{\bf H}$ is a $d\times t$ matrix, constrained such that $t{\bf I}_{t}-{\delta\bf H}^{\top}{\delta\bf H}$ is positive semidefinite. The solution of this objective aligns similarity matrices $\delta{\bf X}^{\top}\delta{\bf X}$ and $\delta{\bf H}^{\top}\delta{\bf H}$ so that their right singular vectors are the same (Pehlevan and Chklovskii, 2015a). Then, the objective under the trace diagonalizes and its value is the sum of eigenvalue pair products. Since the eigenvalues of $\delta{\bf H}^{\top}\delta{\bf H}$ are upper bounded by $t$ , the objective (10) is maximized by setting the eigenvalues of $\delta{\bf H}^{\top}\delta{\bf H}$ that pair with the top $d$ eigenvalues of $\delta{\bf X}^{\top}\delta{\bf X}$ to $t$ , and the rest to zero. Hence, the optimal $\delta{\bf H}$ satisfies the generalized prewhitening condition (3)(Pehlevan and Chklovskii, 2015a). More formally,

Theorem 3 (Modified from (Pehlevan and Chklovskii, 2015a)).

Suppose an eigen-decomposition of $\delta{\bf X}^{\top}\delta{\bf X}$ is $\delta{\bf X}^{\top}\delta{\bf X}={\bf V}^{X}{\bf\Lambda}^{X}{{\bf V}^{X}}^{\top}$ , where eigenvalues are sorted in decreasing order. Then, all optimal $\delta{\bf H}$ of (10) have an SVD decomposition of the form

$$\delta{\bf H}^{*}={\bf U}^{H}\sqrt{t}\,{{\bf\Lambda}^{H}}^{\prime}{{\bf V}^{X}}^{\top},\tag{11}$$

where ${{\bf\Lambda}^{H}}^{\prime}$ is $d\times t$ with $d$ ones on top of the diagonal and zeros on the rest of the diagonal.

The theorem implies that, first, $\frac{1}{t}\delta{\bf H}^{*}\,{\delta{\bf H}^{*}}^{\top}={\bf I}_{d}$ , and hence $\delta{\bf H}$ satisfies the generalized prewhitening condition (3). Second, ${\bf F}$ , the linear mapping between $\delta{\bf H}^{*}$ and ${\delta\bf X}$ , can be constructed using an SVD decomposition of $\delta{\bf X}$ and (11).
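The following sketch illustrates Theorem 3 on random data. It assumes ${\bf U}^{H}={\bf I}_{d}$, uses Gaussian inputs, and compares the optimum against one random feasible competitor; these choices and the names `alignment`, `dH_opt`, and `dH_rand` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, t = 5, 3, 200
X = rng.normal(size=(k, t))
dX = X - X.mean(axis=1, keepdims=True)   # time-centered input

def alignment(dH):
    """Tr(dX^T dX dH^T dH), computed as ||dX dH^T||_F^2."""
    return np.linalg.norm(dX @ dH.T, "fro") ** 2

# Right singular vectors of dX are eigenvectors of dX^T dX (sorted descending).
_, s, Vt = np.linalg.svd(dX, full_matrices=False)
dH_opt = np.sqrt(t) * Vt[:d]             # optimal form with U^H = I_d

# A random feasible competitor: orthonormal rows scaled by sqrt(t).
Qr, _ = np.linalg.qr(rng.normal(size=(t, d)))
dH_rand = np.sqrt(t) * Qr.T

opt_value = alignment(dH_opt)
rand_value = alignment(dH_rand)
```

Here `opt_value` equals $t$ times the sum of the top $d$ eigenvalues of $\delta{\bf X}^{\top}\delta{\bf X}$, and $\frac{1}{t}\delta{\bf H}^{*}{\delta{\bf H}^{*}}^{\top}$ is the identity, as the theorem implies.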

The constraint in (10) can be introduced into the objective function using as a Lagrange multiplier the Gramian, $\delta{\bf G}^{\top}\delta{\bf G}$, of a matrix $\delta{\bf G}\in{\mathbb{R}}^{m\times t}$ with $m\geq d$:

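$$\underset{\delta{\bf H}}{\min}\;\underset{\delta{\bf G}}{\max}\;{\rm Tr}\left(-\delta{\bf X}^{\top}\delta{\bf X}\,\delta{\bf H}^{\top}\delta{\bf H}+\delta{\bf G}^{\top}\delta{\bf G}\left(\delta{\bf H}^{\top}\delta{\bf H}-t\,{\bf I}_{t}\right)\right).\tag{12}$$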

This optimization problem (Pehlevan and Chklovskii, 2015a) will be used to derive an online neural algorithm.

Whereas the optimization problem (12) is formulated in the offline setting, i.e. outputs are computed only after receiving all inputs, to derive a biologically plausible algorithm, we need to formulate the optimization problem in the online setting, i.e. the algorithm receives inputs sequentially, one at a time, and computes an output before the next input arrives. Therefore, we optimize (12) only for the data already received and only with respect to the current output:

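$$\left\{\delta{\bf h}_{t},\delta{\bf g}_{t}\right\}\longleftarrow\underset{\delta{\bf h}_{t}}{\arg\min}\;\underset{\delta{\bf g}_{t}}{\arg\max}\;{\rm Tr}\left(-\delta{\bf X}^{\top}\delta{\bf X}\,\delta{\bf H}^{\top}\delta{\bf H}+\delta{\bf H}^{\top}\delta{\bf H}\,\delta{\bf G}^{\top}\delta{\bf G}-t\,\delta{\bf G}^{\top}\delta{\bf G}\right).\tag{13}$$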

By keeping only those terms that depend on $\delta{\bf h}_{t}$ or $\delta{\bf g}_{t}$ , we get the following objective:

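$$\left\{\delta{\bf h}_{t},\delta{\bf g}_{t}\right\}\longleftarrow\underset{\delta{\bf h}_{t}}{\arg\min}\;\underset{\delta{\bf g}_{t}}{\arg\max}\;L\left(\delta{\bf h}_{t},\delta{\bf g}_{t}\right),\tag{14}$$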

where

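$$\begin{aligned}L={}&-2\,\delta{\bf x}_{t}^{\top}\left(\sum_{t^{\prime}=1}^{t-1}\delta{\bf x}_{t^{\prime}}\delta{\bf h}_{t^{\prime}}^{\top}\right)\delta{\bf h}_{t}-t\left\|\delta{\bf g}_{t}\right\|_{2}^{2}+2\,\delta{\bf g}_{t}^{\top}\left(\sum_{t^{\prime}=1}^{t-1}\delta{\bf g}_{t^{\prime}}\delta{\bf h}_{t^{\prime}}^{\top}\right)\delta{\bf h}_{t}\\&+\left(\left\|\delta{\bf g}_{t}\right\|_{2}^{2}-\left\|\delta{\bf x}_{t}\right\|_{2}^{2}\right)\left\|\delta{\bf h}_{t}\right\|_{2}^{2}.\end{aligned}$$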

In the large-$t$ limit, the first three terms dominate over the last term, which we ignore. The remaining objective is strictly concave in $\delta{\bf g}_{t}$ and convex in $\delta{\bf h}_{t}$. We assume that the matrix $\frac{1}{t}{\sum\limits_{t^{\prime}=1}^{t-1}\delta{\bf h}_{t^{\prime}}}{\delta{\bf g}^{\top}_{t^{\prime}}}$ is full-rank. Then, the objective has a unique saddle point:

[TABLE]

where,

[TABLE]

Hence, ${\bf F}_{t}:=\left({\bf W}^{HG}_{t}{\bf W}^{GH}_{t}\right)^{-1}{\bf W}^{HX}_{t}$ can be interpreted as the current estimate of the prewhitening matrix, ${\bf F}$ .

We solve (14) with a gradient descent-ascent

[TABLE]

where $\gamma$ measures “time” within a single time step of $t$ . Biologically, this is justified if the activity dynamics converges faster than the correlation time of the input data. The dynamics (3.1) can be proved to converge to the saddle point of the objective (3.1), see Appendix A.

Equation (3.1) describes the dynamics of a single-layer neural network with two populations, Fig. 2. ${\bf W}^{HX}_{t}$ represents the weights of feedforward synaptic connections, while ${\bf W}^{HG}_{t}$ and ${\bf W}^{GH}_{t}$ represent the weights of synaptic connections between the two populations. Remarkably, synaptic weights appear in the online algorithm despite their absence in the optimization problem formulations (12) and (13). Furthermore, $\delta\bf{h}_{t}$ neurons can be associated with principal neurons of a biological circuit and $\delta\bf{g}_{t}$ neurons with interneurons.
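The two-population dynamics can be sketched as a discretized gradient descent-ascent. The update equations below are an assumption: they are the descent direction in $\delta{\bf h}_{t}$ and ascent direction in $\delta{\bf g}_{t}$ of the leading terms of the online objective, with constant factors dropped, and may differ in scaling from the paper's own equations. The random orthonormal weights, the Euler step size, and the iteration count are also illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k = d = m = 3

def random_orthonormal(n):
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

W_hx = random_orthonormal(k)   # feedforward weights, W^{HX}
W_hg = random_orthonormal(d)   # interneuron -> principal weights, W^{HG}
W_gh = W_hg.T                  # principal -> interneuron weights, W^{GH}

def run_dynamics(dx, step=0.1, n_steps=600):
    """Euler-discretized descent in dh and ascent in dg (assumed form)."""
    dh = np.zeros(d)
    dg = np.zeros(m)
    for _ in range(n_steps):
        dh_new = dh + step * (W_hx @ dx - W_hg @ dg)
        dg_new = dg + step * (W_gh @ dh - dg)
        dh, dg = dh_new, dg_new
    return dh, dg

dx = rng.normal(size=k)        # one centered input vector
dh, dg = run_dynamics(dx)

# Fixed point predicted by F_t = (W^HG W^GH)^{-1} W^HX.
F = np.linalg.solve(W_hg @ W_gh, W_hx)
```

With these weights the iteration settles at `dh = F @ dx` and `dg = W_gh @ dh`, matching the saddle point that defines the current whitening estimate ${\bf F}_{t}$.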

Finally, using the definition of synaptic weight matrices (3.1), we can formulate recursive update rules:

[TABLE]

Equations (3.1) and (3.1) define a neural algorithm that proceeds in two phases. After each stimulus presentation, first, (3.1) is iterated until convergence by the dynamics of neuronal activities. Second, synaptic weights are updated according to local, anti-Hebbian (for synapses from interneurons) and Hebbian (for all other synapses) rules (3.1). Biologically, synaptic weights are updated on a slower timescale than neuronal activity dynamics.

Our algorithm can be viewed as a special case of the algorithm proposed in (Plumbley, 1996, 1994). Plumbley analyzed the convergence of synaptic weights (Plumbley, 1994) in a stochastic setting by a linear stability analysis of the stationary point of synaptic weight updates. His results are directly applicable to our algorithm, and show that, if the synaptic weights of our algorithm converge to a stationary state, they whiten the input.

Importantly, unlike (Plumbley, 1996, 1994) which proposed the algorithm heuristically, we derived it by posing and solving an optimization problem.

3.1.1 Computing $\bar{\bf H}$

The optimization problem (12) and the corresponding neural algorithm, Eqs. (3.1) and (3.1) almost achieve what is needed for noncentered prewhitening, but we still need to find $\bar{\bf H}$ , since for the NSM step we need ${\bf H}=\delta{\bf H}+\bar{\bf H}$ . We now discuss how $\bar{\bf H}$ can be learned along with $\delta{\bf H}$ using the same online algorithm.

Our online algorithm for centered-whitening is of the following form. First, a neural dynamics stage outputs a linear transformation of the input:

$$\delta{\bf h}_{t}={\bf F}_{t}\,\delta{\bf x}_{t},\tag{20}$$

and, second, synaptic weights and, hence, ${\bf F}_{t}$ are updated:

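$${\bf F}_{t+1}={\bf F}_{t}+\delta{\bf F}_{t}\left(t,\delta{\bf h}_{t},\delta{\bf g}_{t},{\bf F}_{t}\right).\tag{21}$$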

We can supplement this algorithm with a running estimate of $\bar{\bf h}$ . Let the running estimate of average stimulus activity be

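$$\bar{\bf x}_{t}=\frac{1}{t}\sum_{t^{\prime}=1}^{t}{\bf x}_{t^{\prime}}=\left(1-\frac{1}{t}\right)\bar{\bf x}_{t-1}+\frac{1}{t}{\bf x}_{t}.\tag{22}$$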

Then,

[TABLE]

Alternatively, (20) and (23) can be combined into a single step:

$${\bf h}_{t}={\bf F}_{t}{\bf x}_{t},\tag{24}$$

where the network receives uncentered stimuli and prewhitens them. Note that assignment (24) can still be done by iterating (3.1), except now the input is ${\bf x}_{t}$ rather than $\delta{\bf x}_{t}$. However, synaptic weights are still updated using $\delta{\bf x}_{t}={\bf x}_{t}-\bar{\bf x}_{t}$, $\delta{\bf h}_{t}={\bf h}_{t}-\bar{\bf h}_{t}$ and $\delta{\bf g}_{t}={\bf g}_{t}-\bar{\bf g}_{t}$. Therefore we keep recursive estimates of the means. Substituting (22) into (24) and using (21)

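$$\bar{\bf h}_{t}=\left(1-\frac{1}{t}\right)\bar{\bf h}_{t-1}+\left(1-\frac{1}{t}\right)\delta{\bf F}_{t-1}\bar{\bf x}_{t-1}+\frac{1}{t}{\bf h}_{t}.\tag{25}$$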

The term $\left(1-\frac{1}{t}\right)\delta{\bf F}_{t-1}\bar{\bf x}_{t-1}$ requires non-local calculations. Assuming that in the large- $t$ limit updates to ${\bf F}$ are small, we can ignore this term and obtain a recursion:

$$\bar{\bf h}_{t}=\left(1-\frac{1}{t}\right)\bar{\bf h}_{t-1}+\frac{1}{t}{\bf h}_{t}.\tag{26}$$

Finally, a similar argument can be given for $\bar{\bf g}_{t}$ . We keep a recursive estimate of $\bar{\bf g}_{t}$ :

$$\bar{\bf g}_{t}=\left(1-\frac{1}{t}\right)\bar{\bf g}_{t-1}+\frac{1}{t}{\bf g}_{t}.\tag{27}$$
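The recursive mean estimates all share the same form, so a single hypothetical helper covers $\bar{\bf x}_{t}$, $\bar{\bf h}_{t}$, and $\bar{\bf g}_{t}$. The sketch below applies it to a stream of synthetic vectors and compares the result with the batch average; the data and the name `update_mean` are assumptions for illustration.

```python
import numpy as np

def update_mean(prev_mean, x, t):
    """One step of mean_t = (1 - 1/t) * mean_{t-1} + (1/t) * x_t, with t >= 1."""
    return (1.0 - 1.0 / t) * prev_mean + x / t

rng = np.random.default_rng(0)
X = rng.uniform(size=(4, 100))   # synthetic stream of 100 four-dimensional vectors

mean = np.zeros(4)               # the initial value is erased at t = 1
for t in range(1, X.shape[1] + 1):
    mean = update_mean(mean, X[:, t - 1], t)
```

For a fixed stream the recursion reproduces the batch mean exactly. For $\bar{\bf h}_{t}$ the match is only approximate, because the term involving $\delta{\bf F}_{t-1}$ is dropped in the large-$t$ limit.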

To summarize, when a new stimulus ${\bf x}_{t}$ is observed, the algorithm operates in two steps. In the first step, the following two-population neural dynamics runs until convergence to a fixed point:

[TABLE]

The convergence proof for neural dynamics (3.1) given in Appendix A also applies here. Besides the synaptic weight, each neuron remembers its own average activity and each synapse remembers average incoming activity. In the second step of the algorithm, the average activities are updated by:

[TABLE]

Synaptic weight matrices are updated recursively by

[TABLE]

Once the synaptic updates are done, the new stimulus, ${\bf x}_{t+1}$ , is processed. We note again that all the synaptic update rules are local, and hence are biologically plausible.

3.2 Online NSM

Next, we derive the second-layer network which solves the NSM optimization problem (7) in an online setting (Pehlevan and Chklovskii, 2014).

The online optimization problem is:

[TABLE]

Proceeding as before, let’s rewrite (31) keeping only terms that depend on ${\bf y}_{t}$ :

[TABLE]

In the large- $t$ limit, the last two terms can be ignored and the remainder is a quadratic form in ${\bf y}_{t}$ . We minimize it using coordinate descent (Wright, 2015) which is both fast and neurally plausible. In this approach, neurons are updated one-by-one by performing an exact minimization of (32) with respect to $y_{t,i}$ until convergence:

[TABLE]

where

[TABLE]

For the next time step ( $t+1$ ), we can update the synaptic weights recursively, giving us the following local learning rules:

[TABLE]

Interestingly, these update rules are local and are identical to the single-neuron Oja rule (Oja, 1982), except that the learning rate is given explicitly in terms of cumulative activity $1/D_{t,i}$ and the lateral connections are anti-Hebbian.

After the arrival of each data vector, the operation of the complete two-layer network algorithm, Fig. 2, is as follows. First, the dynamics of the prewhitening network runs until convergence. Then the output of the prewhitening network is fed to the NSM network, and the NSM network dynamics runs until convergence to a fixed point. Synaptic weights are updated in both networks for processing the next data vector.

3.2.1 NICA is a stationary state of online NSM

Here we show that the solution to the NICA problem is a stationary synaptic weights state of the online NSM algorithm. In the stationary state the expected updates to synaptic weights are zero, i.e.

[TABLE]

where we dropped the $t$ index, and brackets denote averages over the source distribution.

Suppose the stimuli obey the NICA generative model, Eq. (1), and the observed mixture, ${\bf x}_{t}$, is whitened with the exact (generalized) prewhitening matrix ${\bf F}$ described in Theorem 2. Then, the input to the network at time $t$ is ${\bf h}_{t}={\bf F}{\bf x}_{t}={\bf F}{\bf A}{\bf s}_{t}$. Our claim is that there exist synaptic weight configurations for which 1) for any mixed input, ${\bf x}_{t}$, the output of the network is the source vector, i.e. ${\bf y}_{t}={\bf P}{\bf s}_{t}$, where ${\bf P}$ is a permutation matrix, and 2) this synaptic configuration is a stationary state.

We prove our claim by constructing these synaptic weights. For each permutation matrix, we first relabel the outputs such that the $i^{\rm th}$ output recovers the $i^{\rm th}$ source and hence ${\bf P}$ becomes the identity matrix. Then, the weights are:

[TABLE]

Given mixture ${\bf x}_{t}$, NSM neural dynamics with these weights converge to $y_{t,i}=s_{t,i}$, which is the unique fixed point [Footnote 3: Proof: The net input to neuron $i$ at the claimed fixed point is $\sum_{j}W^{YH}_{ij}h_{t,j}-\sum_{j\neq i}W^{YY}_{ij}s_{t,j}$. Plugging in (37) and ${\bf h}_{t}={\bf F}{\bf A}{\bf s}_{t}$, and using (6), one gets that the net input is $s_{t,i}$, which is also the output since sources are nonnegative. This fixed point is unique and globally stable because the NSM neural dynamics is a coordinate descent on a strictly convex cost given by $\frac{1}{2}{\bf y}_{t}^{\top}\left\langle{\bf s}{\bf s}^{\top}\right\rangle{\bf y}_{t}-{\bf h}_{t}^{\top}\left\langle{\bf h}{\bf s}^{\top}\right\rangle{\bf y}_{t}$.]. Furthermore, these weights define a stationary state as defined in (36) assuming a fixed learning rate. To see this, substitute weights from (37) into the last two equations of (3.2) and average over the source distribution. The fixed learning rate assumption is valid in the large-$t$ limit when changes to $D_{t,i}$ become small (${\mathcal{O}(1/t)}$, see (Pehlevan et al., 2015)).
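The fixed-point argument can be illustrated numerically. The sketch below is built on assumptions: the stationary weights are taken to be $W^{YH}_{ij}=\left\langle s_{i}h_{j}\right\rangle/\left\langle s_{i}^{2}\right\rangle$ and $W^{YY}_{ij}=\left\langle s_{i}s_{j}\right\rangle/\left\langle s_{i}^{2}\right\rangle$, a form inferred from the quadratic cost stated in the proof rather than copied from the paper's weight equation, averages are replaced by sample averages, and a random orthonormal matrix `Q` stands in for ${\bf F}{\bf A}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 1000

# Nonnegative sources with unit variance: uniform on [0, sqrt(12)].
S = rng.uniform(0.0, np.sqrt(12.0), size=(d, n))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # stand-in for FA (orthonormal)

M = S @ S.T / n                  # sample estimate of <s s^T>
D = np.diag(M)                   # <s_i^2>
W_yh = (M @ Q.T) / D[:, None]    # assumed form: <s_i h_j> / <s_i^2>
W_yy = M / D[:, None]            # assumed form: <s_i s_j> / <s_i^2>

def nsm_dynamics(h, n_sweeps=200):
    """Coordinate descent: each neuron rectifies its net input in turn."""
    y = np.zeros(d)
    for _ in range(n_sweeps):
        for i in range(d):
            lateral = W_yy[i] @ y - W_yy[i, i] * y[i]   # sum over j != i
            y[i] = max(W_yh[i] @ h - lateral, 0.0)
    return y

s = S[:, 0]                      # one source vector
y = nsm_dynamics(Q @ s)          # network input is h = FA s
```

Under these assumptions the dynamics return `y` equal to `s` up to numerical precision, which is the behavior the claim describes for the relabeled outputs.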

4 Numerical simulations

Here we present numerical simulations of our two-layered neural network using various datasets and compare the results to those of other algorithms.

In all our simulations, $d=k=l=m$ , except in Fig. 5B where $d=k>l=m$ . Our networks were initialized as follows:

1. In the prewhitening network, ${\bf W}^{HX}$ and ${\bf W}^{HG}$ were chosen to be random orthonormal matrices. ${\bf W}^{GH}$ was initialized as ${{\bf W}^{HG}}^{\top}$ because of its definition in Eq. (3.1) and the fact that this choice guarantees the convergence of the neural dynamics (3.1.1) (see Appendix A).

2. In the NSM network, ${\bf W}^{YH}$ was initialized to a random orthonormal matrix and ${\bf W}^{YY}$ was set to zero.
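The initialization described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: random orthonormal matrices are drawn here via a QR decomposition of a Gaussian matrix (the text does not say how they were generated), and the helper `random_orthonormal` and the layer sizes are hypothetical.

```python
import numpy as np

def random_orthonormal(rows, cols, rng):
    """Return a rows x cols matrix with orthonormal rows (rows <= cols)
    or orthonormal columns (rows > cols), via QR of a Gaussian matrix."""
    big, small = max(rows, cols), min(rows, cols)
    q, _ = np.linalg.qr(rng.standard_normal((big, small)))  # big x small
    return q if rows >= cols else q.T

rng = np.random.default_rng(0)
k = l = m = 5                          # hypothetical sizes (d = k = l = m case)

W_hx = random_orthonormal(l, k, rng)   # prewhitening feedforward weights
W_hg = random_orthonormal(l, m, rng)   # interneuron-to-principal weights
W_gh = W_hg.T.copy()                   # initialized as the transpose of W_hg
W_yh = random_orthonormal(l, l, rng)   # NSM feedforward weights
W_yy = np.zeros((l, l))                # NSM lateral weights start at zero
```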

The learning rates were chosen as follows:

1. For the prewhitening network, we generalized the time-dependent learning rate (3.1.1) to

[TABLE]

and performed a grid search over $a\in\{10,10^{2},10^{3},10^{4}\}$ and $b\in\{10^{-2},10^{-1},1\}$ to find the combination with best performance. Our performance measures are introduced below.

2. For the NSM network, we generalized the activity-dependent learning rate (3.2) to

[TABLE]

and performed a grid search over $\tilde{a}\in\{10,10^{2},10^{3},10^{4}\}$ and $\tilde{b}\in\{0.8,0.9,0.95,0.99,0.995,0.999,0.9999,1\}$ to find the combination with best performance. The $\tilde{b}$ parameter introduces “forgetting” to the system (Pehlevan et al., 2015). We hypothesized that forgetting would be beneficial in the two-layer setting because the prewhitening layer output changes over time and the NSM layer has to adapt. Further, for comparison purposes, we also implemented this algorithm with a time-dependent learning rate of the form (38) and performed a grid search with $a\in\{10^{2},10^{3},10^{4}\}$ and $b\in\{10^{-2},10^{-1},1\}$ to find the combination with best performance.
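The grid searches described above amount to evaluating every pair of parameters and keeping the best one. The sketch below shows that loop for the prewhitening grid quoted in the text (12 combinations). The function `toy_error` is a hypothetical stand-in for running a network and reporting its final error; it is not taken from the paper.

```python
import itertools
import numpy as np

def grid_search(evaluate, grid_a, grid_b):
    """Return (best_error, a, b) over the Cartesian product of the grids."""
    results = [(evaluate(a, b), a, b)
               for a, b in itertools.product(grid_a, grid_b)]
    return min(results)

# Grids quoted in the text for the prewhitening network.
grid_a = [10, 1e2, 1e3, 1e4]
grid_b = [1e-2, 1e-1, 1]

# Hypothetical stand-in for "run the network and report its final error".
def toy_error(a, b):
    return (np.log10(a) - 3) ** 2 + (np.log10(b) + 1) ** 2

best_error, best_a, best_b = grid_search(toy_error, grid_a, grid_b)
```

In an actual experiment, `evaluate` would run the full simulation with the given learning rate parameters and return the performance measure used for model selection.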

For the NSM network, to speed up our simulations we implemented a procedure from (Plumbley and Oja, 2004). At each iteration we checked whether there was any output neuron that had not fired up until that iteration. If so, we flipped the sign of its feedforward inputs. In practice, the flipping only occurred within the first $\sim$ 10 iterations.
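A minimal sketch of this sign-flipping check is given below. The function name, the in-place update, and the reading of "feedforward inputs" as the corresponding row of the feedforward weight matrix are assumptions made for illustration.

```python
import numpy as np

def flip_silent_neurons(W_yh, y, has_fired):
    """Flip feedforward weights of output neurons that have never fired.

    W_yh: (n_out, n_in) feedforward weights, modified in place.
    y: (n_out,) current nonnegative outputs.
    has_fired: (n_out,) boolean firing history, modified in place.
    """
    has_fired |= y > 0                 # record neurons active at this iteration
    W_yh[~has_fired] *= -1.0           # flip rows of neurons that are still silent
    return W_yh, has_fired

# Usage: neuron 0 fires, neuron 1 stays silent and gets its inputs flipped.
W = np.array([[1.0, -2.0], [0.5, 0.5]])
fired = np.zeros(2, dtype=bool)
W, fired = flip_silent_neurons(W, np.array([0.3, 0.0]), fired)
```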

For comparison, we implemented five other algorithms. The first is the offline algorithm (9); the other four were chosen to represent major classes of online algorithms:

1. Offline projected gradient descent: We simulated the projected gradient descent algorithm (9). We used variable stepsizes of the form (38) and performed a grid search with $a\in\{10^{4},10^{5},10^{6}\}$ and $b\in\{10^{-3},10^{-2},10^{-1}\}$ to find the combination with best performance. We initialized elements of the matrix, ${\bf Y}$ , by drawing a Gaussian random variable with zero mean and unit variance and rectifying it. The input dataset was whitened offline before passing to projected gradient descent.

2. fastICA: fastICA (Hyvärinen and Oja, 1997; Hyvarinen, 1999; Hyvärinen and Oja, 2000) is a popular ICA algorithm which does not assume nonnegativity of sources. We implemented an online version of fastICA (Hyvärinen and Oja, 1998) using the same parameters except for feedforward weights. We used the time-dependent learning rate (38) and performed a grid search with $a\in\{10,10^{2},10^{3},10^{4}\}$ and $b\in\{10^{-2},10^{-1},1\}$ to find the combination with best performance. fastICA requires whitened and centered input (Hyvärinen and Oja, 1998) and computes a decoding matrix that maps mixtures back to sources. We ran the algorithm with whitened and centered input. To recover nonnegative sources, we applied the decoding matrix to noncentered but whitened input.

3. Infomax ICA: Bell and Sejnowski (1995) proposed a blind source separation algorithm that maximizes the mutual information between inputs and outputs, namely the Infomax principle (Linsker, 1988). We simulated an online version due to Amari et al. (1996). We chose cubic neural nonlinearities compatible with sub-Gaussian input sources. This differs from our fastICA implementation where the nonlinearity is also learned online. Infomax ICA computes a decoding matrix using centered, but not whitened, data. To recover nonnegative sources, we applied the decoding matrix to noncentered inputs. Finally, we rescaled the sources so that their variance is 1. We experimented with several learning rate parameters to find the best performance.

4. Linsker’s network: Linsker (1997) proposed a neural network with local learning rules for Infomax ICA. We simulated this algorithm with cubic neural nonlinearities and preprocessing and decoding done as in our Infomax ICA implementation.

5. Nonnegative PCA: The Nonnegative PCA algorithm (Plumbley and Oja, 2004) solves the NICA task and makes explicit use of the nonnegativity of sources. We used the online version given in (Plumbley and Oja, 2004). To speed up our simulations we implemented a procedure from (Plumbley and Oja, 2004). At each iteration we checked whether there was any output neuron that had not fired up until that iteration. If so, we flipped the sign of its feedforward inputs. For this algorithm, we again used the time-dependent learning rate of (38) and performed a grid search with $a\in\{10,10^{2},10^{3},10^{4}\}$ and $b\in\{10^{-2},10^{-1},1\}$ to find the combination with best performance. Nonnegative PCA assumes whitened, but not centered, input (Plumbley and Oja, 2004).

Next, we present the results of our simulations on three datasets.

4.1 Mixture of random uniform sources

The source samples were i.i.d.: each was set to zero with probability 0.5 and sampled uniformly from the interval $[0,\sqrt{48/5}]$ with probability 0.5, which gives each source unit variance. The dimensions of the source vectors were $d\in\{3,5,7,10\}$ . The mixing matrices are given in Appendix C. $10^{5}$ source vectors were generated for each run. For a sample of the original and mixed signals, see Fig. 3A.
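The choice of upper limit $\sqrt{48/5}$ can be checked directly: it is the value for which this zero-inflated uniform distribution has variance 1, consistent with the unit source covariance assumed by the NICA model. The sketch below computes the variance analytically and compares it with samples; the function name and random seed are illustrative.

```python
import numpy as np

def sample_sparse_uniform(d, n, rng, upper=np.sqrt(48 / 5)):
    """Each entry is 0 w.p. 0.5, else uniform on [0, upper]."""
    active = rng.random((d, n)) < 0.5
    return active * rng.uniform(0.0, upper, (d, n))

# Analytic moments: E[s] = 0.5 * L / 2 and E[s^2] = 0.5 * L^2 / 3.
L = np.sqrt(48 / 5)
analytic_var = 0.5 * L**2 / 3 - (0.5 * L / 2) ** 2   # 1.6 - 0.6

rng = np.random.default_rng(0)
S = sample_sparse_uniform(3, 10**5, rng)             # d = 3, 10^5 samples
empirical_var = S.var(axis=1)
```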

The inputs to fastICA and Nonnegative PCA algorithms were prewhitened offline, and in the case of fastICA they were also centered. We ran our NSM network both as a single layer algorithm with prewhitening done offline, and as a part of our two-layer algorithm with whitening done online.

To quantify the performance of tested algorithms, we used the mean-squared-error:

[TABLE]

where ${\bf P}$ is a permutation matrix that is chosen to minimize the mean-squared-error at $t=10^{5}$ . The learning rate parameters of all networks were optimized by a grid search using $E_{10^{5}}$ as the performance metric.
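As an illustration of the permutation search, the sketch below computes a mean-squared error minimized over row permutations of the output. The normalization by $d\,t$ is an assumption (the exact definition is given in the equation above), and the brute-force search over all $d!$ permutations is only practical for small $d$; an assignment solver would be a better fit for $d=10$.

```python
import itertools
import numpy as np

def permuted_mse(S, Y):
    """Smallest mean squared error between S and a row-permutation of Y.

    S, Y: (d, t) arrays of true sources and network outputs.
    Brute force over d! permutations, so only practical for small d.
    """
    d, t = S.shape
    best_err, best_perm = np.inf, None
    for perm in itertools.permutations(range(d)):
        err = np.sum((S - Y[list(perm)]) ** 2) / (d * t)
        if err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm

# Usage: outputs are a shuffled, slightly noisy copy of the sources.
rng = np.random.default_rng(0)
S = rng.random((3, 1000))
Y = S[[2, 0, 1]] + 0.01 * rng.standard_normal((3, 1000))
err, perm = permuted_mse(S, Y)
```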

In Fig. 3B, we show the performances of all algorithms we implemented. Our algorithms perform as well as or better than the others, especially as the dimensionality of the input increases. Offline whitening is better than online whitening; however, as dimensionality increases, online whitening becomes competitive with offline whitening. In fact, our two-layer and single-layer networks perform better than online fastICA and Nonnegative PCA, for which whitening was done offline.

We also simulated a fully offline algorithm by taking projected gradient descent steps until the residual error plateaued (Fig. 3B). The performance of the offline algorithm quantifies two important metrics. First, it establishes the loss in performance due to online (as opposed to offline) processing. Second, it establishes the lowest error that could be achieved by the NSM method for the given dataset. The lowest error is not necessarily zero due to the finite size of the dataset. This method is not perfect because the projected gradient descent may get stuck in a local minimum of Eq. (7).

We also tested whether the learned synaptic weights of our network match our theoretical predictions. In Fig. 4A, we show examples of learned feedforward and recurrent synaptic weights at $t=10^{5}$ , and what is expected from our theory (37). We observed an almost perfect match between the two. In Fig. 4B, we quantify the convergence of simulated synaptic weights to the theoretical prediction by plotting a normalized error metric defined by $E_{t}=\left\|{\bf W}_{t,{\rm simulation}}-{\bf W}_{{\rm theory}}\right\|_{F}^{2}/\left\|{\bf W}_{{\rm theory}}\right\|_{F}^{2}$ .

4.2 Mixture of random uniform and exponential sources

Our algorithm can demix sources sampled from different statistical distributions. To illustrate this point, we generated a dataset with two uniform and three exponential source channels. The uniform sources were sampled as before. The exponential sources were either zero (with probability 0.5) or sampled from an exponential distribution, scaled so that the variance of the channel is 1. In Fig. 5A, we show that the algorithm successfully recovers the sources.

To test the denoising capabilities of our algorithm, we created a dataset where source signals are accompanied by background noise. The sources to be recovered were three exponential channels, which were sampled as before. The background noise consisted of two uniform channels, which were sampled as before, except scaled to have variance 0.1. To denoise the resulting five-dimensional mixture, the prewhitening layer reduced its five input dimensions to three. Then, the NSM layer successfully recovered the sources, Fig. 5B. Hence, the prewhitening layer can act as a denoising stage.
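The source and noise channels of this section can be generated as sketched below. The exponential scale $\theta=2/\sqrt{3}$ follows from requiring unit channel variance: with probability 0.5 of being zero, the variance is $\theta^{2}-\theta^{2}/4=\tfrac{3}{4}\theta^{2}$. Scaling the noise by $\sqrt{0.1}$ is one natural reading of "scaled to have variance 0.1"; function names and the seed are illustrative.

```python
import numpy as np

def sample_sparse_exponential(d, n, rng):
    """Each entry is 0 w.p. 0.5, else exponential with scale theta.

    Var = 0.5 * 2 * theta^2 - (0.5 * theta)^2 = 0.75 * theta^2,
    so theta = 2 / sqrt(3) gives unit variance.
    """
    theta = 2.0 / np.sqrt(3.0)
    active = rng.random((d, n)) < 0.5
    return active * rng.exponential(theta, (d, n))

def sample_sparse_uniform(d, n, rng, upper=np.sqrt(48 / 5)):
    """Each entry is 0 w.p. 0.5, else uniform on [0, upper] (unit variance)."""
    active = rng.random((d, n)) < 0.5
    return active * rng.uniform(0.0, upper, (d, n))

rng = np.random.default_rng(0)
n = 10**5
signal = sample_sparse_exponential(3, n, rng)             # variance 1
noise = np.sqrt(0.1) * sample_sparse_uniform(2, n, rng)   # variance 0.1
channels = np.vstack([signal, noise])                     # 5 x n, before mixing
```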

4.3 Mixture of natural scenes

Next, we consider recovering images from their mixtures, Fig. 6A, where each image is treated as one source. Four image patches of size $252\times 252$ pixels were chosen from a set of images of natural scenes which were previously used in (Hyvärinen and Hoyer, 2000; Plumbley and Oja, 2004). The preprocessing was as in (Plumbley and Oja, 2004): 1) images were downsampled by a factor of 4 to obtain $63\times 63$ patches, 2) pixel intensities were shifted to have a minimum of zero, and 3) pixel intensities were scaled to have unit variance. Hence, in this dataset, there are $d=4$ sources, corresponding to the image patches, and a total of $63\times 63=3969$ samples. These samples were presented to the algorithm 5000 times, with a randomly permuted order in each presentation. The $4\times 4$ mixing matrix, which was generated randomly, is given in Appendix C.
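The three preprocessing steps can be sketched as follows. The downsampling method is not specified in the text, so plain decimation (keeping every fourth pixel) is an assumption here, and random integer arrays stand in for the natural image patches.

```python
import numpy as np

def preprocess_patch(patch, factor=4):
    """Downsample, shift to a minimum of zero, and scale to unit variance."""
    small = patch[::factor, ::factor].astype(float)   # assumption: plain decimation
    small -= small.min()                              # minimum intensity becomes 0
    small /= small.std()                              # unit variance
    return small.ravel()                              # one sample per pixel

rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(4, 252, 252))    # stand-ins for natural images
S = np.stack([preprocess_patch(p) for p in patches])  # sources, shape (4, 3969)
```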

In Fig. 6B, we show the performances of all algorithms we implemented in this task. Our algorithms perform much better than fastICA and Nonnegative PCA.

5 Discussion

In this paper we presented a new neural algorithm for blind nonnegative source separation. We started by assuming the nonnegative ICA generative model (Plumbley, 2001, 2002) where inputs are linear mixtures of independent and nonnegative sources. We showed that the sources can be recovered from inputs by two sequential steps, 1) generalized whitening and 2) NSM. In fact, our argument requires sources to be only uncorrelated, and not necessarily independent. Each of the two steps can be performed online with single-layer neural networks with local learning rules (Pehlevan and Chklovskii, 2014; Pehlevan and Chklovskii, 2015a). Stacking these two networks yields a two-layer neural network for blind nonnegative source separation (Fig. 2). Numerical simulations show that our neural network algorithm performs well.

Because our network is derived from optimization principles, its biologically realistic features can be given meaning. The network is multi-layered, because each layer performs a different optimization. Lateral connections create competition between principal neurons forcing them to differentiate their outputs. Interneurons clamp the activity dimensions of principal neurons (Pehlevan and Chklovskii, 2015a). Rectifying neural nonlinearity is related to nonnegativity of sources. Synaptic plasticity (Malenka and Bear, 2004), implemented by local Hebbian and anti-Hebbian learning rules, achieves online learning. While Hebbian learning is famously observed in neural circuits (Bliss and Lømo, 1973; Bliss and Gardner-Medwin, 1973), our network also makes heavy use of anti-Hebbian learning, which can be interpreted as the long-term potentiation of inhibitory postsynaptic potentials. Experiments show that such long-term potentiation can arise from pairing action potentials in inhibitory neurons with subthreshold depolarization of postsynaptic pyramidal neurons (Komatsu, 1994; Maffei et al., 2006). However, plasticity in inhibitory synapses does not have to be Hebbian, i.e. require correlation between pre- and postsynaptic activity (Kullmann et al., 2012).

For improved biological realism, the network should respond to a continuous stimulus stream by continuous and simultaneous changes to its outputs and synaptic weights. Presumably, this requires neural time scales to be faster and synaptic time scales to be slower than that of changes in stimuli. To explore this possibility, we simulated some of our datasets with a limited number of neural activity updates (not shown) and found that $\sim$ 10 updates per neuron are sufficient for successful recovery of sources without significant loss in performance. With a neural time scale of 10 ms, this should take about 100 ms, which is sufficiently fast given that, for example, the temporal autocorrelation time scale of natural image sequences is about 500 ms (David et al., 2004; Bull, 2014).

It is interesting to compare the two-layered architecture we present to the multilayer neural networks of deep learning approaches (LeCun et al., 2015). 1) For each data presentation, our network performs recurrent dynamics to produce an output, while deep networks have a feedforward architecture. 2) The first layer of our network has multiple neuron types, principal and interneurons, and only principal neurons project to the next layer. In deep learning, all neurons in a layer project to the next layer. 3) Our network operates with local learning rules, while deep learning uses backpropagation, which is not local. 4) We derived the architecture, the dynamics, and the learning rules of our network from a principled cost function. In deep learning, the architecture and the dynamics of a neural network are designed by hand; only the learning rule is derived from a cost function. 5) Finally, in building a neural algorithm, we started with a generative model of inputs, from which we inferred algorithmic steps to recover latent sources. These algorithmic steps guided us in deciding which single-layer networks to stack. In deep learning, no such generative model is assumed and network architecture design is more of an art. We believe starting from a generative model might lead to a more systematic way of network design. In fact, the question of which generative model is appropriate for deep networks is already being asked (Patel et al., 2016).

Acknowledgments

We thank Andrea Giovannucci, Eftychios Pnevmatikakis, Anirvan Sengupta and Sebastian Seung for useful discussions. DC is grateful to the IARPA MICRONS program for support.

Appendix A Convergence of the gradient descent-ascent dynamics

Here we prove that the neural dynamics (3.1) converges to the saddle point of the objective function (3.1). Here we assume that ${\bf W}^{HG}$ is full-rank and $l=d$ . First, note that the optimum of (3.1) is also the fixed point of (3.1). Since the neural dynamics (3.1) is linear, the fixed point is globally convergent if and only if the eigenvalues of the matrix

[TABLE]

have negative real parts.

The eigenvalue equation is

[TABLE]

which implies

[TABLE]

Using these relations, we can solve for all the $d+m$ eigenvalues. There are two cases:

1. $\lambda=-1$ . This implies that ${\bf W}^{GH}{\bf x}_{1}=0$ and ${\bf W}^{HG}{\bf x}_{2}={\bf x}_{1}$ . Since ${\bf W}^{GH}$ is $m\times d$ and full-rank with $m\geq d$ , the first condition gives ${\bf x}_{1}=0$ , so ${\bf x}_{2}$ lies in the null-space of ${\bf W}^{HG}$ , which is $m-d$ dimensional. Hence one has $m-d$ degenerate $\lambda=-1$ eigenvalues.

2. $\lambda\neq-1$ . Substituting for ${\bf x}_{2}$ in the first equation of (51), this implies that ${\bf W}^{HG}{\bf W}^{GH}{\bf x}_{1}=-\lambda\left(\lambda+1\right){\bf x}_{1}$ . Hence, ${\bf x}_{1}$ is an eigenvector of ${\bf W}^{HG}{\bf W}^{GH}$ . For each eigenvalue $\lambda_{w}$ of ${\bf W}^{HG}{\bf W}^{GH}$ , there are two corresponding eigenvalues $\lambda=-\frac{1}{2}\pm\sqrt{\frac{1}{4}-\lambda_{w}}$ . ${\bf x}_{2}$ can be solved uniquely from the first equation in (51).

Hence, there are $m-d$ degenerate $\lambda=-1$ eigenvalues and $d$ pairs of conjugate eigenvalues $\lambda=-\frac{1}{2}\pm\sqrt{\frac{1}{4}-\lambda_{w}}$ , one pair for each eigenvalue $\lambda_{w}$ of ${\bf W}^{HG}{\bf W}^{GH}$ . Since $\{\lambda_{w}\}$ are real and positive (we assume ${\bf W}^{HG}$ is full-rank and by definition ${\bf W}^{HG}={{\bf W}^{GH}}^{\top}$ ), the real parts of all $\{\lambda\}$ are negative and hence the neural dynamics (3.1) is globally convergent.
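The eigenvalue count above can be verified numerically. In the sketch below, the stability matrix is assumed to be the block matrix with blocks $0$, $-{\bf W}^{HG}$, ${\bf W}^{GH}$ and $-{\bf I}$, which is the form consistent with the two eigenvalue relations used in the case analysis (and the $\alpha=0$ case of the matrix in Appendix B). The sizes and the random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5                                   # hypothetical sizes with m >= d
W_hg = rng.standard_normal((d, m))            # full rank with probability 1
W_gh = W_hg.T                                 # W_gh is the transpose of W_hg

# Assumed stability matrix of the linear neural dynamics.
M = np.block([[np.zeros((d, d)), -W_hg],
              [W_gh, -np.eye(m)]])
eigs = np.linalg.eigvals(M)

# Predicted spectrum: -1/2 +/- sqrt(1/4 - lambda_w) for each eigenvalue
# lambda_w of W_hg W_gh, plus m - d copies of -1.
lam_w = np.linalg.eigvalsh(W_hg @ W_gh).astype(complex)
root = np.sqrt(0.25 - lam_w)
predicted = np.concatenate([-0.5 + root, -0.5 - root, -np.ones(m - d)])
```

Comparing `eigs` with `predicted` (up to ordering) and checking that all real parts are negative reproduces the conclusion of the proof for this random instance.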

Appendix B Modified objective function and neural network for generalized prewhitening

While deriving our online neural algorithm, we assumed that the number of output channels is reduced to the number of sources at the prewhitening stage ( $l=d$ ). However, our offline analysis did not need such a reduction; one could keep $l\geq d$ for generalized prewhitening. Here we provide an online neural algorithm that allows $l\geq d$ .

First, we point out why the prewhitening algorithm given in the main text is not adequate for $l>d$ . In Appendix A, we proved that the neural dynamics described by (3.1) converges to the saddle point of the objective function (3.1). This proof assumes that ${\bf W}^{HG}$ is full-rank. However, if $l>d$ , this assumption breaks down as the network learns, because a perfectly prewhitened $\delta{\bf H}$ has rank $d$ (low-rank) and a perfectly prewhitening network would have ${\bf W}^{HG}=\delta{\bf H}\delta{\bf G}^{\top}$ , which would also be low-rank. We simulated this network with $l>d$ and observed that the condition number of ${\bf W}^{HG}{\bf W}^{GH}$ increased with $t$ and the neural dynamics took longer to converge. Even though the algorithm still functioned well for practical purposes, we present a modification that fully resolves the problem.

We propose a modified offline objective function (Pehlevan and Chklovskii, 2015a) and a corresponding neural network. Consider the following:

[TABLE]

where $\delta{\bf X}$ is a $k\times t$ centered mixture of $d$ independent sources, $\delta{\bf H}$ is now an $l\times t$ matrix with $l\geq d$ , $\delta{\bf G}$ is an $m\times t$ matrix with $m\geq d$ and $\alpha$ is a positive parameter. Notice the additional $\alpha$ -dependent term compared to (12). If $\alpha$ is less than the lowest eigenvalue of $\frac{1}{t}{\bf\delta X}{\bf\delta X}^{\top}$ , the optimal $\delta{\bf H}$ is a linear transform of ${\bf X}$ and satisfies the generalized prewhitening condition (3) (Pehlevan and Chklovskii, 2015a). More precisely,

Theorem 4 (Modified from (Pehlevan and Chklovskii, 2015a)).

Suppose an eigen-decomposition of $\delta{\bf X}^{\top}\delta{\bf X}$ is $\delta{\bf X}^{\top}\delta{\bf X}={\bf V}^{X}{\bf\Lambda}^{X}{{\bf V}^{X}}^{\top}$ , where eigenvalues are sorted in order of magnitude. If $\alpha$ is less than the lowest eigenvalue of $\frac{1}{t}{\bf\delta X}{\bf\delta X}^{\top}$ , all optimal $\delta{\bf H}$ of (12) have an SVD decomposition of the form

[TABLE]

where ${{\bf\Lambda}^{H}}^{\prime}$ is $l\times t$ with ones in the top $d$ diagonal entries and zeros elsewhere.

Using this cost function, we will derive a neural algorithm which does not suffer from the described convergence issues, even if $l>d$ . On the other hand, we now need to choose the parameter $\alpha$ , and for that we need to know the spectral properties of ${\delta\bf X}$ .

To derive an online algorithm, we repeat the steps taken before:

[TABLE]

where

[TABLE]

In the large- $t$ limit, the first four terms dominate over the last term, which we ignore. The remaining objective is strictly concave in $\delta{\bf g}_{t}$ and strictly convex in $\delta{\bf h}_{t}$ . Note that (3.1) was only convex in $\delta{\bf h}_{t}$ but not strictly convex. The objective has a unique saddle point, even if $\frac{1}{t}{\sum\limits_{t^{\prime}=1}^{t-1}\delta{\bf h}_{t^{\prime}}}{\delta{\bf g}^{\top}_{t^{\prime}}}$ is not full-rank:

[TABLE]

where ${\bf W}$ matrices are defined as before and ${\bf I}$ is the identity matrix.

We solve (54) with gradient descent-ascent

[TABLE]

where $\gamma$ is time measured within a single time step of $t$ . The dynamics (3.1) can be proved to converge to the saddle point (B) by modifying the proof in Appendix A.$^{4}$ Synaptic weight updates are the same as before (3.1). Finally, this network can be modified to also compute $\bar{\bf H}$ following the steps before.

$^{4}$ The fixed point is globally convergent if and only if the eigenvalues of the matrix

$\displaystyle\left[\begin{array}[]{c c}-\alpha{\bf I}&-{\bf W}^{HG}\\ {\bf W}^{GH}&-{\bf I}\end{array}\right]$ (60)

have negative real parts. One can show that $l-d$ eigenvalues are $-\alpha$ , $m-d$ eigenvalues are $-1$ , and for each positive eigenvalue $\lambda_{w}$ of ${\bf W}^{HG}{\bf W}^{GH}$ one gets a pair $-\frac{1+\alpha}{2}\pm\sqrt{\frac{\left(1+\alpha\right)^{2}}{4}-\alpha-\lambda_{w}}$ . All eigenvalues have negative real parts.
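The spectrum stated for matrix (60) can be checked on a random low-rank instance. The sketch below builds a rank-$d$ ${\bf W}^{HG}$ with $l,m>d$, which is the regime where the original algorithm's full-rank assumption fails; the sizes, the value of $\alpha$ and the random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
l, m, d, alpha = 6, 5, 3, 0.2                 # hypothetical sizes with l, m > d

# Rank-d W_hg (l x m), mimicking a network that has learned to prewhiten.
W_hg = rng.standard_normal((l, d)) @ rng.standard_normal((d, m))
W_gh = W_hg.T

# Matrix (60).
M = np.block([[-alpha * np.eye(l), -W_hg],
              [W_gh, -np.eye(m)]])
eigs = np.linalg.eigvals(M)

# The d positive eigenvalues of W_hg W_gh (the remaining l - d are zero).
lam_w = np.linalg.eigvalsh(W_hg @ W_gh)[-d:].astype(complex)
root = np.sqrt((1 + alpha) ** 2 / 4 - alpha - lam_w)
predicted = np.concatenate([
    -(1 + alpha) / 2 + root,                  # one pair per positive lambda_w
    -(1 + alpha) / 2 - root,
    -alpha * np.ones(l - d),                  # l - d eigenvalues equal to -alpha
    -np.ones(m - d),                          # m - d eigenvalues equal to -1
])
```

Even though ${\bf W}^{HG}$ is rank-deficient here, every eigenvalue has a negative real part, which is the point of adding the $\alpha$ -dependent term.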

Appendix C Mixing matrices for numerical simulations

For the random source dataset, the $d=3$ mixing matrix was:

[TABLE]

To save space, we do not list the mixing matrices for the $d\in\{5,7,10\}$ cases; they are available from the authors upon request.

For the natural scene dataset, the mixing matrix was

[TABLE]

Appendix D Learning rate parameters for numerical simulations

For Figs. 3, 4, 5 and 6 the following parameters were found to be best performing as a result of our grid search:

[TABLE]

Bibliography

Amari, S., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind signal separation. Advances in neural information processing systems, 8:757–763.
Asari, H., Pearlmutter, B. A., and Zador, A. M. (2006). Sparse representations for the cocktail party problem. Journal of Neuroscience, 26(28):7477–7490.
Bee, M. A. and Micheyl, C. (2008). The cocktail party problem: what is it? how can it be solved? and why should animal behaviorists study it? Journal of comparative psychology, 122(3):235.
Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159.
Bliss, T. V. and Gardner-Medwin, A. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the unanaesthetized rabbit following stimulation of the perforant path. The Journal of physiology, 232(2):357.
Bliss, T. V. and Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. The Journal of physiology, 232(2):331–356.
Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica, 86(1):117–128.
Bull, D. R. (2014). Communicating pictures: A course in Image and Video Coding. Academic Press.
