Learning Nonlinear Mixtures: Identifiability and Algorithm

Bo Yang; Xiao Fu; Nicholas D. Sidiropoulos; Kejun Huang

arXiv:1901.01568·cs.LG·February 24, 2021


TL;DR

This paper introduces a new method for identifying and learning nonlinear mixture models using neural networks, providing theoretical guarantees and demonstrating effectiveness on synthetic and real data.

Contribution

It proposes a novel identifiability criterion for nonlinear mixture models and develops a neural network-based algorithm with proven guarantees.

Findings

01. Effective identification of nonlinear mixtures demonstrated.

02. Neural network implementation achieves good performance.

03. Theoretical guarantees support the method's validity.


Equations

$\bm{x}_{j}=\bm{A}\bm{s}_{j},~j\in[N]$

$\min_{\bm{B}\in\mathbb{R}^{M\times r},~\bm{H}\in\mathbb{R}^{r\times N}}~\text{Vol}(\bm{B})\quad\text{s.t.}~\bm{X}=\bm{B}\bm{H},~\bm{H}\geq\bm{0},~\bm{H}^{\textup{\sf T}}\bm{1}=\bm{1}$

$\bm{B}=\bm{A}\bm{\Pi},~\bm{H}=\bm{\Pi}^{\textup{\sf T}}\bm{S}$

$\bm{x}_{j}=\bm{\phi}(\bm{A}\bm{s}_{j}),~j\in[N]$

$\bm{\phi}(\bm{x})=[\phi_{1}(\bm{x}(1)),\cdots,\phi_{M}(\bm{x}(M))]^{\textup{\sf T}}$

$\bm{y}_{j}=\bm{f}(\bm{\phi}(\bm{A}\bm{s}_{j})),~j\in[N]$

$\sum_{i=1}^{M}\psi_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s})=1,~\forall\bm{s}\in\text{int}~\Delta_{r}$

$\sum_{i=1}^{M}k_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s})=1,~\forall\bm{s}\in\text{int}~\Delta_{r}$

$\bm{T}_{\bm{k}}(\bm{X})=\left(\bm{I}+\frac{1}{1-\bm{1}_{M}^{\textup{\sf T}}\bm{b}}\bm{b}\bm{1}_{M}^{\textup{\sf T}}\right)\bm{D}\bm{X}=\bm{W}\bm{X}$

$f_{i}(x)=1/M,~i\in[M]$

$\min_{\bm{B}\in\mathbb{R}^{M\times r},~\bm{H}\in\mathbb{R}^{r\times N}}~\text{Vol}(\bm{B})\quad\text{s.t.}~\bm{Y}=\bm{B}\bm{H},~\bm{H}\geq\bm{0},~\bm{H}^{\textup{\sf T}}\bm{1}=\bm{1}$

$\sum_{i=1}^{M}d_{i}\widetilde{f}_{i}(\phi_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s}))=\bm{1}^{\textup{\sf T}}\bm{D}\bm{A}\bm{s}=\bm{1}^{\textup{\sf T}}\bm{s}=1,~\forall\bm{s}\in\text{int}~\Delta_{r}$

$\bm{A}^{\textup{\sf T}}\bm{d}=\bm{1}_{r},\quad\|\bm{d}\|_{0}=m$

$\text{find}~f_{1},\cdots,f_{M}\quad\text{s.t.}~f_{i}\circ\phi_{i}~\text{all convex (or all concave)},~\forall i\in[M],\quad\sum_{i=1}^{M}f_{i}(\bm{x}_{j}(i))=1,~\forall j\in[N]$

$k_{i}^{\prime\prime}(x)=f_{i}^{\prime\prime}(\phi_{i}(x))[\phi_{i}^{\prime}(x)]^{2}+f_{i}^{\prime}(\phi_{i}(x))\phi_{i}^{\prime\prime}(x)$

$\text{find}~f_{1},\cdots,f_{M}\quad\text{s.t.}~f_{i}~\text{is invertible},~\forall i\in[M],\quad\sum_{i=1}^{M}f_{i}(\bm{x}_{j}(i))=1,~\forall j\in[N]$

$\mathcal{F}=\left\{f~\middle|~f(x)=\sum_{k=1}^{K}\alpha_{k}\sigma(\beta_{k}x+\gamma_{k})+\delta_{k},~\alpha_{k}>0,~\beta_{k}>0,~\forall k\in[K]\right\}$

$\min_{\{\alpha_{k}^{i},\beta_{k}^{i},\gamma_{k}^{i},\delta_{k}^{i}\}}~\sum_{j=1}^{N}\left(\sum_{i=1}^{M}f_{i}(\bm{x}_{j}(i))-1\right)^{2}\quad\text{s.t.}~\alpha_{k}^{i}>0,~\beta_{k}^{i}>0,~\forall k\in[K],~i\in[M]$

$\text{cone}\{\bm{x}_{1},\cdots,\bm{x}_{N}\}=\left\{\bm{x}~\middle|~\bm{x}=\sum_{j=1}^{N}\bm{x}_{j}\theta_{j},~\theta_{j}\geq 0,~\forall j\in[N]\right\}$

$\text{conv}\{\bm{x}_{1},\cdots,\bm{x}_{N}\}=\left\{\bm{x}~\middle|~\bm{x}=\sum_{j=1}^{N}\bm{x}_{j}\theta_{j},~\sum_{j=1}^{N}\theta_{j}=1,~\theta_{j}\geq 0,~\forall j\in[N]\right\}$

$\zeta(s_{1},s_{2},\cdots,s_{r-1}):=\sum_{i=1}^{M}\psi_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s})=1,~\bm{s}\in\text{int}~\Delta_{r}$


Full text


Bo Yang 1, Xiao Fu 2, Nicholas D. Sidiropoulos 3, Kejun Huang 4

1Department of Electrical and Computer Engineering

University of Minnesota, Minneapolis, MN 55455, USA


2School of Electrical Engineering and Computer Science

Oregon State University, Corvallis, OR 97330, USA


3Department of Electrical and Computer Engineering

University of Virginia, Charlottesville, VA 22904, USA


4Department of Computer and Information Science and Engineering

University of Florida, Gainesville, FL 32611, USA


Abstract

Linear mixture models have proven very useful in a plethora of applications, e.g., topic modeling, clustering, and source separation. As a critical aspect of linear mixture models, identifiability of the model parameters is well studied under frameworks such as independent component analysis and constrained matrix factorization. Nevertheless, when the linear mixtures are distorted by unknown nonlinear functions, a setting that is well motivated and more realistic in many cases, the identifiability issues are much less studied. This work proposes an identification criterion for a nonlinear mixture model that is well grounded in many real-world applications, and offers identifiability guarantees. A practical implementation based on a judiciously designed neural network is proposed to realize the criterion, together with an effective learning algorithm. Numerical results on synthetic and real data corroborate the effectiveness of the proposed method.

I Introduction

Linear mixture models (LMMs) have found numerous applications in machine learning and signal processing, e.g., topic mining, clustering, and source separation. When LMM is used for applications that are essentially parameter estimation (e.g., topic mining and community detection), it is critical to ensure that the generative model is uniquely identifiable. This is also found critical in many data mining problems [23, 32], as interpretability naturally relates to model uniqueness. However, LMM is not identifiable in general – even in the best case without noise: an LMM boils down to a matrix factorization (MF) model that is known to be unidentifiable, unless additional constraints on the factors are imposed.

Identifiability research for LMMs has a long and fruitful history in the confluence of machine learning, statistics, and signal processing. The arguably most notable line of work is independent component analysis (ICA) [12, 25], which is motivated by speech source separation. Statistical independence of latent parameters (i.e., different sources) is utilized to establish identifiability. LMM unmixing with correlated latent parameters has also been extensively studied, e.g., in the context of nonnegative matrix factorization (NMF) [15, 29, 2, 24, 18, 30, 32, 31], bounded component analysis (BCA) [13], and some other types of constrained MF models [19, 3].

Despite the relatively good understanding of the identifiability issues of different LMMs, the model is considered oversimplified in many applications. In many cases the observed data cannot be assumed to be approximately linear mixtures of some basis vectors, since nonlinear distortions arise for many reasons, e.g., multiplicative noise, sensor clipping, and quantization. A natural question then is: under a reasonable nonlinear mixture model, can we identify the latent parameters of interest uniquely?

This question turns out to be highly nontrivial: most of the analytical tools in the linear mixture case do not apply. One exception is statistical independence of random variables, which is not affected by nonlinear distortion. Based on this observation, many works [34, 1, 26, 27] tackle nonlinear mixture model identification from a nonlinear ICA viewpoint. This line of work is very elegant, but it only answers our research question partially. Furthermore, statistical independence is considered restrictive, which is one of the main motivations for the extensive study of correlated components / sources as mentioned above.

Contributions.

In this work, we study the nonlinear mixture model learning problem under a new setting that is rather different from ICA. Specifically, we study a nonlinear mixture model where the observed data vectors are convex combinations of a set of basis vectors followed by a nonlinear distortion. This kind of mixture model finds applications in MRI sensing, hyperspectral imaging, and statistical learning, and is thus well motivated. Our detailed contributions are as follows.

1. Identification criterion. We propose a model identification criterion for the considered problem and provide sufficient conditions under which the model is identifiable. Our proof is a novel integration of functional equations [16, 28] and a generalization of LMM identifiability results, a fortuitous union that fits the considered nonlinear model well.

2. Neural network-based implementation. We propose a neural network-based formulation to implement the proposed criterion. The employed neural network is judiciously designed so that the constraints specified by the proposed identification criterion can be satisfied.

3. Numerical validation. We reformulate the criterion into an easy-to-implement form and employ a trust-region algorithm to solve the problem efficiently. We also test the algorithm on both synthetic and real data to show the effectiveness of the approach.

Another salient feature of our method is that it turns the unsupervised parameter estimation problem into a supervised regression problem, which requires little new algorithmic design – see Section III-E for more information.

Notation.

Bold capital letters represent matrices, while bold lowercase letters denote vectors, which are assumed to be column vectors, unless transposed with $(.)^{\textup{\sf T}}$ . Plain lowercase letters denote scalars. $\bm{X}$ and $\bm{x}$ refer to the observed data, and $\bm{A}$ , $\bm{a}_{i}$ , $\bm{S}$ , $\bm{s}_{i}$ refer to the underlying latent parameters. Symbol $\bm{\phi}$ denotes the unknown nonlinear function in data generation, and $\bm{f}$ denotes the learning function, which tries to counteract the nonlinear effects in $\bm{\phi}$ . Symbol $\bm{Y}$ represents the data transformed by the learning function $\bm{f}$ , i.e. $\bm{Y}=\bm{f}(\bm{X})$ , and $\bm{k}$ denotes the composite function of $\bm{f}$ and $\bm{\phi}$ . Symbol $[N]$ denotes the set of integers $\{1,\cdots,N\}$ . The vector-valued functions we consider in this work are all element-wise, and we use the notation $\bm{f}=[f_{1},\cdots,f_{M}]^{\textup{\sf T}}$ to mean that $[\bm{f}(\bm{x})](i)=f_{i}(\bm{x}(i))$ for $\bm{x}\in\mathbb{R}^{M}$ and $i\in[M]$ . The symbol $\|\cdot\|_{0}$ denotes the $\ell_{0}$ norm, i.e. the number of nonzeros, of a vector or matrix. The symbol $\text{cone}(\bm{X})$ denotes the set formed by conical combination of columns of $\bm{X}$ . Finally, $\bm{0}$ ( $\bm{1}$ ) denotes a vector (or matrix) of all 0’s (1’s).

II Preliminaries

We briefly review existing parameter identification results that are related to this work. Relevant concepts in convex geometry can be found in the appendix.

To facilitate discussion, we use $\Delta_{M}:=\left\{\bm{x}|\bm{x}\in\mathbb{R}^{M},~{}\bm{x}\geq\bm{0},~{}\bm{1}^{\textup{\sf T}}\bm{x}=1\right\}$ to denote the $(M-1)$ probability simplex. The LMM is defined as

$$\bm{x}_{j}=\bm{A}\bm{s}_{j},\quad j\in[N],\tag{1}$$

where $\bm{A}\in\mathbb{R}^{M\times r}$ is often a tall matrix, i.e., $M>r$ , and $\bm{s}_{j}\in\Delta_{r}$ . Alternatively, we will also write $\bm{X}=\bm{A}\bm{S}$ by collecting all $\bm{x}_{j}$ ’s into $\bm{X}$ , and $\bm{s}_{j}$ ’s into $\bm{S}$ .
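As a rough illustration of the LMM, the following sketch draws synthetic data from it. The sizes, the random seed, and the uniform and Dirichlet sampling choices are assumptions made only for this example, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, r, N = 6, 3, 200                          # illustrative sizes (assumed), with M > r

A = rng.uniform(0.1, 1.0, size=(M, r))       # tall basis matrix (hypothetical values)
S = rng.dirichlet(np.ones(r), size=N).T      # each column s_j lies on the simplex
X = A @ S                                    # x_j = A s_j for every j

# Sanity checks: columns of S are nonnegative and sum to one; X has rank r.
print(S.min() >= 0, np.allclose(S.sum(axis=0), 1.0))
print(np.linalg.matrix_rank(X))
```

Because every column of `X` is a convex combination of the columns of `A`, the data lie in a simplex whose vertices are those columns, which is the geometric picture behind the volume-based criterion discussed next.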

In order to characterize identifiability of (1), let us introduce the following definition.

Definition 1

(Sufficiently scattered, [18, 23]) Let matrix $\bm{S}\in\mathbb{R}_{+}^{r\times N}$ , where $\mathbb{R}_{+}^{r\times N}$ is the nonnegative subset of $\mathbb{R}^{r\times N}$ . Matrix $\bm{S}$ is said to be sufficiently scattered (SS) if $\text{cone}(\bm{S})$ satisfies:

(a) $\mathcal{C}\subseteq\text{cone}(\bm{S})$ , where $\mathcal{C}$ is a second order cone: $\mathcal{C}=\{\bm{x}\in\mathbb{R}^{r}|\bm{1}^{\textup{\sf T}}\bm{x}\geq\sqrt{r-1}\|\bm{x}\|_{2}\},$

(b) $\text{cone}(\bm{S})\not\subseteq\text{cone}(\bm{Q})$ , for any unitary matrix $\bm{Q}\in\mathbb{R}^{r\times r}$ that is not a permutation matrix.

Roughly speaking, this condition requires that the columns of $\bm{S}$ are spread out on the probability simplex. This condition is in fact fairly relaxed, as discussed in [22].
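To make the second-order cone in condition (a) more concrete, here is a minimal sketch that tests whether a single vector belongs to $\mathcal{C}$. The helper name `in_second_order_cone` is hypothetical, and the snippet does not check the full SS condition, which concerns the whole cone generated by $\bm{S}$.

```python
import numpy as np

def in_second_order_cone(x):
    """Return True if 1^T x >= sqrt(r - 1) * ||x||_2, i.e. x lies in C."""
    x = np.asarray(x, dtype=float)
    r = x.size
    return bool(x.sum() >= np.sqrt(r - 1) * np.linalg.norm(x))

center = np.ones(3) / 3                      # centroid of the simplex
vertex = np.array([1.0, 0.0, 0.0])           # a vertex of the simplex
print(in_second_order_cone(center), in_second_order_cone(vertex))
```

The centroid lies inside $\mathcal{C}$ while a vertex does not, so columns of $\bm{S}$ clustered near the centroid alone cannot generate a cone that contains $\mathcal{C}$; the columns have to reach toward the boundary of the simplex.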

To recover factors $\bm{A}$ and $\bm{S}$ from data $\bm{X}=[\bm{x}_{1},\cdots,\bm{x}_{N}]$ , the following so-called Volume Minimization (VolMin, [18]) criterion is often employed:

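$$\min_{\bm{B}\in\mathbb{R}^{M\times r},~\bm{H}\in\mathbb{R}^{r\times N}}~\text{Vol}(\bm{B})\quad\text{s.t.}\quad\bm{X}=\bm{B}\bm{H},~~\bm{H}\geq\bm{0},~~\bm{H}^{\textup{\sf T}}\bm{1}=\bm{1},\tag{2}$$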

where it is assumed that $r$ is known. The term $\text{Vol}(\bm{B})$ is a measure of the volume of the simplex formed by using columns of $\bm{B}$ as vertices, see [6]. This criterion suggests that we want to find $\bm{B}$ and $\bm{H}$ that satisfy the LMM, and we pick the solution with minimal volume, hence the name VolMin.

Based on this VolMin criterion, the following theorem establishes identifiability of model (1).

Theorem 1

([18]) Let the matrices $\bm{A}$ and $\bm{S}$ satisfy $\text{rank}(\bm{A})=\text{rank}(\bm{S})=r$ . Suppose $\bm{S}$ satisfies the SS condition. Under the generative model (1), the VolMin criterion (2) uniquely identifies both $\bm{A}$ and $\bm{S}$ up to a permutation. Specifically, any optimal solution to (2) takes the form

$$\bm{B}=\bm{A}\bm{\Pi},\quad\bm{H}=\bm{\Pi}^{\textup{\sf T}}\bm{S},$$

where $\bm{\Pi}$ is a permutation matrix.

A proof of this result can be found in [18]. We mention that by Theorem 1, given that $\bm{S}$ satisfies SS, the only remaining indeterminacy is a permutation of the columns (rows) of $\bm{A}$ (resp. $\bm{S}$ ), which is unavoidable – but also inconsequential in most applications.

Several algorithms for solving the VolMin problem (2) have been developed, and we will use the so-called minimum-volume enclosing simplex (MVES) algorithm: given data $\bm{X}$ and the rank parameter $r$ , the MVES algorithm returns a solution $(\widehat{\bm{B}},\widehat{\bm{H}})$ of (2). Due to page limitations, we refer readers to [8] for more on MVES.

III The nonlinear mixture model

III-A The model

We introduce a new data model to handle nonlinear effects in various applications. Specifically, the data model is

$$\bm{x}_{j}=\bm{\phi}(\bm{A}\bm{s}_{j}),\quad j\in[N],\tag{3}$$

where $\bm{A}\in\mathbb{R}^{M\times r}$ satisfies $\bm{A}\geq\bm{0}$ , and $\bm{s}_{j}\in\Delta_{r},~{}\forall j\in[N]$ . The function $\bm{\phi}$ is a nonlinear mapping $\bm{\phi}:\mathbb{R}^{M}\rightarrow\mathbb{R}^{M}$ , and we consider element-wise nonlinearity, i.e., $\bm{\phi}=[\phi_{1},\phi_{2},\cdots,\phi_{M}]^{\textup{\sf T}}$ , so that

$$\bm{\phi}(\bm{x})=[\phi_{1}(\bm{x}(1)),\cdots,\phi_{M}(\bm{x}(M))]^{\textup{\sf T}},\tag{4}$$

where $\bm{x}=[\bm{x}(1),\cdots,\bm{x}(M)]^{\textup{\sf T}}$ . For notational brevity, we use the shorthand ${\bm{X}}=\bm{\phi}(\bm{A}\bm{S})$ to denote (3), where $\bm{\phi}$ is applied to each column of $\bm{A}\bm{S}$ .

Model (3) is well motivated. It can be viewed as a generalization of (1), which is used in various applications. In hyperspectral unmixing (HU), each $\bm{x}_{j}$ is a hyperspectral pixel, each column of $\bm{A}$ represents the frequency signature of a certain material (e.g. soil, vegetation, water), and each $\bm{s}_{j}$ denotes the proportion of materials in that pixel $\bm{x}_{j}$ , see e.g. [5, 31]. In magnetic resonance imaging (MRI), LMM is used due to the so called “partial volume effect” [9, 35, 33], which gives rise to the condition $\bm{s}_{j}\in\Delta_{r}$ . Both these applications are of great importance in their respective research fields, where considerable work has been done based on (1). Yet, it is widely recognized that in many real world scenarios, the LMM in (1) is oversimplified, see [14]. For example, in HU and MRI, the measurements $\bm{x}_{j}$ ’s are obtained by sensors, which have inherent nonlinearity due to physical limitations of the measuring devices. By explicitly modeling this nonlinearity, we expect methods that are based on (3) to give improved results in these tasks.

To model the data faithfully, (3) adds the mapping $\bm{\phi}$ to (1), which makes (3) flexible enough to cover many important applications, as discussed above. However, the additional $\bm{\phi}$ considerably complicates the recovery of $\bm{A}$ and $\bm{S}$ . Before pursuing a general result, let us make some simple observations. First, for many nonlinear $\bm{\phi}$ , it is not possible to recover $\bm{A}$ and $\bm{S}$ , e.g., $\bm{\phi}(\bm{x})=\bm{0},~{}\forall~{}\bm{x}$ . Hence one of the tasks is to impose on $\bm{\phi}$ reasonable and practical conditions under which recovery is possible. Second, if $\bm{\phi}$ is linear, by the element-wise assumption, we have $\bm{X}=\bm{D}\bm{A}\bm{S}$ , where $\bm{D}$ is a diagonal matrix. From here, we can see that there are scaling ambiguities on the rows of $\bm{A}$ , even for the simplest $\bm{\phi}$ . In light of this, a crucial question about model (3) is which parts (or aspects) of $\bm{A}$ and $\bm{S}$ can be identified, and to what extent?
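The two observations can be illustrated with a small sketch. The per-row distortion `tanh(c_i * t)` and all sizes are hypothetical choices for this example; the paper does not prescribe a particular $\bm{\phi}$ here.

```python
import numpy as np

rng = np.random.default_rng(1)
M, r, N = 5, 3, 100
A = rng.uniform(0.1, 1.0, size=(M, r))       # nonnegative mixing matrix
S = rng.dirichlet(np.ones(r), size=N).T      # columns in the simplex

# Hypothetical element-wise distortion: phi_i(t) = tanh(c_i * t), one gain per row.
c = rng.uniform(0.5, 2.0, size=M)
X = np.tanh(c[:, None] * (A @ S))            # X = phi(A S), applied column by column

# Linear special case phi_i(t) = d_i * t: the data only reveal the product D A.
d = rng.uniform(0.5, 2.0, size=M)
X_lin = d[:, None] * (A @ S)                 # X = D A S
A_alt = d[:, None] * A                       # row-scaled mixing matrix
print(np.allclose(X_lin, A_alt @ S))
```

The pair `(A_alt, S)` reproduces `X_lin` exactly, so the rows of $\bm{A}$ cannot be told apart from their rescaled versions, while $\bm{S}$ is unaffected by the ambiguity.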

III-B Functional equations on a simplex

We aim at identifying parameters from (3) in an unsupervised fashion. Towards that end, we will try to learn an adjustable function $\bm{f}$ , and denote

$$\bm{y}_{j}=\bm{f}(\bm{\phi}(\bm{A}\bm{s}_{j})),\quad j\in[N].\tag{5}$$

The remaining question is how to devise a learning method such that the resulting $\bm{f}$ will ‘counteract’ the nonlinear effect brought by $\bm{\phi}$ . If this can be done, we can then employ methods designed for LMM (1) to separate the latent factors. Towards this goal, we first introduce a technical lemma.

Consider the following functional equation concerning functions $\psi_{1},\cdots,\psi_{M}$ and variables $\bm{s}\in\text{int}~{}\Delta_{r}$

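$$\sum_{i=1}^{M}\psi_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s})=1,\quad\forall\bm{s}\in\text{int}~{}\Delta_{r},\tag{6}$$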

where $\text{int}~{}\Delta_{r}$ denotes the interior of $\Delta_{r}$ . To facilitate presentation, let $\bm{A}:=[\bm{a}_{1},\bm{a}_{2},\cdots,\bm{a}_{M}]^{\textup{\sf T}}\in\mathbb{R}^{M\times r}$ .

Lemma 1

Suppose (6) holds, and $M\geq r\geq 3$ . Let us further assume that

(a) the functions $\psi_{1},\cdots,\psi_{M}$ are twice differentiable, and are all convex (or all concave) in the domain $(0,1)$ ; and

(b) $\bm{A}$ is nonnegative and has two positive columns.

Then the functions $\psi_{1},\cdots,\psi_{M}$ are all affine.

The proof can be found in the appendix.

III-C Nonlinear mixture model identification

To proceed, let us suppose that the learning function $\bm{f}:\mathbb{R}^{M}\rightarrow\mathbb{R}^{M}$ in (5) is also element-wise, i.e., $\bm{f}=[f_{1},f_{2},\cdots,f_{M}]^{\textup{\sf T}}$ , where $f_{i}$ ’s are univariate functions. Denote $\bm{k}=[k_{1},k_{2},\cdots,k_{M}]^{\textup{\sf T}}:\mathbb{R}^{M}\rightarrow\mathbb{R}^{M}$ , where $k_{i}=f_{i}\circ{\phi}_{i}$ , and $\circ$ denotes function composition. Let us make the following assumptions about the generative model (3).

(A1) The functions $\phi_{1},\cdots,\phi_{M}$ are all invertible, and twice differentiable.

(A2) The matrix $\bm{A}\in\mathbb{R}^{M\times r}$ in (3) satisfies $\bm{A}\geq\bm{0}$ , has two positive columns, and is incoherent (see Def. 2). The dimensions satisfy $M\geq r\geq 3$ .

(A3) The columns of $\bm{S}$ satisfy $\bm{s}_{j}\in\text{int}~{}\Delta_{r},~{}\forall j\in[N]$ . Moreover, $\bm{s}_{j}$ ’s are sampled from a Dirichlet distribution with parameters $\bm{\mu}=[\mu_{1},\mu_{2},\cdots,\mu_{r}]$ .

For brevity, let us define a matrix function that has $\bm{k}$ acting on the columns of its matrix argument, $\bm{T_{k}}(\bm{X})=[\bm{k}(\bm{x}_{1}),\bm{k}(\bm{x}_{2}),\cdots,\bm{k}(\bm{x}_{N})]$ for $\bm{X}=[\bm{x}_{1},\bm{x}_{2},\cdots,\bm{x}_{N}]$ . We are ready to state the following results.

Theorem 2

(Main results) Under assumptions (A1), (A2), and (A3), suppose that after performing a certain training procedure (see Section III-E) on $f_{1},f_{2},\cdots,f_{M}$ , the output satisfies

$$\sum_{i=1}^{M}k_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s})=1,\quad\forall\bm{s}\in\text{int}~{}\Delta_{r}.\tag{7}$$

Furthermore, assume that the composite functions $k_{i}$ ’s are all convex (or all concave). Then the following hold:

(a) The functions $k_{1},k_{2},\cdots,k_{M}$ are affine;

(b) The functions $\phi_{1}^{-1},\cdots,\phi_{M}^{-1}$ are identified up to an affine transformation, i.e. $f_{i}(x)=d_{i}\phi_{i}^{-1}(x)+b_{i},~{}\forall i\in[M]$ , where $d_{i}$ ’s and $b_{i}$ ’s are constants.

The proof can be found in the appendix. A remark about function $\bm{T}_{\bm{k}}$ is in order.

Remark 1

According to (a) in Theorem 2, we can write

$$\bm{T}_{\bm{k}}(\bm{X})=\bm{D}\bm{X}+\bm{b}\bm{1}_{N}^{\textup{\sf T}},\tag{8}$$

where $\bm{D}=\text{diag}(d_{1},\cdots,d_{M})$ , and $\bm{b}=[b_{1},\cdots,b_{M}]^{\textup{\sf T}}$ , and $d_{i}$ and $b_{i}$ are coefficients for the affine function $k_{i}$ . Equation (8) suggests that $\bm{T}_{\bm{k}}$ is an affine function in $\bm{X}$ . However, we would like $\bm{T}_{\bm{k}}$ to be linear in $\bm{X}$ , instead of affine, as later we show that it is possible to identify parameters in LMM under invertible linear transformation (Lemma 2).

Fortunately, for signal model (3) satisfying (A1), (A2) and (A3), we can see that $\bm{T}_{\bm{k}}(\bm{X})$ is indeed a linear function of $\bm{X}$ . Let us consider a matrix $\bm{X}\in\mathbb{R}^{M\times N}$ . Due to equation (7), we have $\bm{1}_{M}^{\textup{\sf T}}\bm{T}_{\bm{k}}(\bm{X})=\bm{1}_{M}^{\textup{\sf T}}\bm{D}\bm{X}+\bm{1}_{M}^{\textup{\sf T}}\bm{b}\bm{1}_{N}^{\textup{\sf T}}=\bm{1}_{N}^{\textup{\sf T}}$ , which means $\bm{1}_{N}^{\textup{\sf T}}=\bm{1}_{M}^{\textup{\sf T}}\bm{D}\bm{X}/(1-\bm{1}_{M}^{\textup{\sf T}}\bm{b})$ . Plugging this into the above equation, we have

$$\bm{T}_{\bm{k}}(\bm{X})=\bm{D}\bm{X}+\bm{b}\bm{1}_{N}^{\textup{\sf T}}=\left(\bm{I}+\frac{1}{1-\bm{1}_{M}^{\textup{\sf T}}\bm{b}}\bm{b}\bm{1}_{M}^{\textup{\sf T}}\right)\bm{D}\bm{X}=\bm{W}\bm{X},\tag{9}$$

where we define $\bm{W}:=\left(\bm{I}+\frac{1}{1-\bm{1}^{\textup{\sf T}}_{M}\bm{b}}\bm{b}\bm{1}_{M}^{\textup{\sf T}}\right)\bm{D}$ , and $\bm{1}_{M}$ is an all-one vector of length $M$ . The above equation shows that $\bm{T}_{\bm{k}}$ is linear in $\bm{X}$ . A subtle point is that the above calculation is invalid when $1=\bm{1}^{\textup{\sf T}}_{M}\bm{b}$ holds exactly, but this is extremely unlikely since $\bm{b}$ is produced by a numerical algorithm.
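The step from the affine form to the linear form can be checked numerically. In the sketch below, `Z` plays the role of the matrix written $\bm{X}$ in the remark, and the specific way of choosing `d` and `b` so that every column sums to one is an assumption made only to build a consistent example.

```python
import numpy as np

rng = np.random.default_rng(2)
M, r, N = 6, 3, 50
A = rng.uniform(0.1, 1.0, size=(M, r))
S = rng.dirichlet(np.ones(r), size=N).T
Z = A @ S                                    # argument on which k acts affinely

b = rng.normal(scale=0.1, size=M)            # offsets b_i (hypothetical values)
c = b.sum()                                  # 1^T b, must differ from 1
# Choose d so that A^T d = (1 - c) 1_r, which makes every column of T sum to one.
d = np.linalg.pinv(A.T) @ ((1.0 - c) * np.ones(r))
D = np.diag(d)

T = D @ Z + np.outer(b, np.ones(N))          # affine form D Z + b 1_N^T
W = (np.eye(M) + np.outer(b, np.ones(M)) / (1.0 - c)) @ D
print(np.allclose(T.sum(axis=0), 1.0), np.allclose(T, W @ Z))
```

Both checks pass because the sum-to-one property lets the constant offset $\bm{b}\bm{1}_{N}^{\textup{\sf T}}$ be rewritten as a multiple of $\bm{b}\bm{1}_{M}^{\textup{\sf T}}\bm{D}\bm{Z}$, which is linear in the data.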

We will propose a method to make (7) (approximately) hold in Section III-E. Let us briefly discuss the roles of the assumptions. For (A1), the invertibility condition is important, as one in general cannot hope to recover the unknown parameters if they undergo non-invertible transformations. The twice differentiable condition on $\phi_{i}$ ’s is to make $k_{i}$ ’s twice differentiable, when suitable $f_{i}$ ’s are learned. This is also natural, as it requires the nonlinear functions in data generation to be smooth.

Assumption (A2) is the same as in Lemma 1, except for the additional incoherence assumption. The incoherence assumption is important, as it ensures that solutions satisfying (7) exist; see the detailed discussion in Section III-D. The condition that $\bm{A}$ should have two positive columns may seem strange, but it is easily satisfied if, say, $\bm{A}$ is generated from an absolutely continuous distribution supported on the nonnegative orthant. For (A3), the Dirichlet distribution is assumed because it gives samples on the probability simplex. In addition, this assumption ensures that the columns of $\bm{S}$ cover the entire interior of $\Delta_{r}$ as $N\rightarrow+\infty$ , which plays a role when characterizing the asymptotic identification guarantee of the proposed method as in Corollary 1.

Given the generative model (3), Theorem 2 essentially asserts that if we require $\bm{1}^{\textup{\sf T}}\bm{y}=1$ for all input $\bm{s}$ , then the learned functions $f_{1},\cdots,f_{M}$ will remove the nonlinearity in functions $\phi_{1},\cdots,\phi_{M}$ . But our main goal is identifying parameters in the latent LMM; $\bm{T}_{\bm{k}}$ being linear is not enough. To see this more clearly, suppose we get a solution for $f_{i}$ ’s of this form

$$f_{i}(x)=1/M,\quad i\in[M].\tag{10}$$

In this case, $k_{i}$ ’s are all constant functions, and hence convex. Moreover, for this solution (10), we have $\bm{k}(\bm{A}\bm{s})=\bm{D}\bm{A}\bm{s}+\bm{b}$ , where $\bm{D}=\bm{0}$ and $\bm{b}=(1/M)\bm{1}$ ; meaning that $\bm{f}$ maps all input $\bm{x}=\bm{\phi}(\bm{A}\bm{s})$ to the single point $\bm{y}=(1/M)\bm{1}$ , which does satisfy (7).

The problem we identify here is important: we need additional constraints on $\bm{y}$ beyond $\bm{1}^{\textup{\sf T}}\bm{y}=1$ , so that $\bm{y}$ preserves information about the original data $\bm{x}$ , as only then can we hope to identify $\bm{A}$ and $\bm{s}$ from $\bm{y}$ . We propose a method to remedy this in Section III-E.

To proceed with parameter estimation, let us provide the following lemma, concerning parameter identifiability of LMM (1) under a linear transformation.

Lemma 2

Consider the LMM model $\bm{X}=\bm{A}\bm{S}$ , where $\bm{A}\in\mathbb{R}^{M\times r}$ and $\bm{S}\in\mathbb{R}^{r\times N}$ satisfies the SS condition, and $\text{rank}(\bm{A})=\text{rank}(\bm{S})=r$ . Let $\bm{Y}=\bm{W}\bm{X}$ , where $\bm{W}\in\mathbb{R}^{M\times M}$ is nonsingular. Then we can identify $\widetilde{\bm{A}}=\bm{W}\bm{A}$ and $\bm{S}$ up to column permutation by solving

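$$\min_{\bm{B}\in\mathbb{R}^{M\times r},~\bm{H}\in\mathbb{R}^{r\times N}}~\text{Vol}(\bm{B})\quad\text{s.t.}\quad\bm{Y}=\bm{B}\bm{H},~~\bm{H}\geq\bm{0},~~\bm{H}^{\textup{\sf T}}\bm{1}=\bm{1}.\tag{11}$$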

That is, suppose $(\bm{B}^{*},\bm{H}^{*})$ is an optimal solution of the above problem, then $\bm{B}^{*}=\widetilde{\bm{A}}\bm{\Pi}$ and $\bm{H}^{*}=\bm{\Pi}^{\textup{\sf T}}\bm{S}$ , where $\bm{\Pi}$ is a permutation matrix.

This lemma is a direct consequence of Theorem 1. It suggests that when the original model $\bm{X}=\bm{A}\bm{S}$ is identifiable, then after an invertible linear transformation $\bm{W}$ we can still identify $\bm{S}$ using VolMin, but it is not possible to identify $\bm{A}$ due to the linear transformation $\bm{W}$ . This lemma also suggests that we can employ an algorithm designed to tackle LMM to identify $\bm{S}$ , once the nonlinear effects in (3) have been removed, and only an unknown linear transformation is left.

III-D Feasibility of (7)

Results in Theorem 2 hinge on equation (7). One may wonder: given the conditions outlined in assumptions (A1), (A2), and (A3), does there exist $\bm{f}$ such that (7) holds? This amounts to studying the feasibility of (7), which is not obvious. For instance, consider the natural guess $\{\widehat{f}_{i}=\phi^{-1}_{i},~{}\forall i\}$ , for which we have $\bm{T}_{\bm{k}}(\bm{X})=\bm{X}$ ; but $\sum_{i=1}^{M}k_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s})=\sum_{i=1}^{M}\bm{a}_{i}^{\textup{\sf T}}\bm{s}=1,~{}\forall\bm{s}\in\text{int}~{}\Delta_{r}$ does not hold without imposing more restrictive assumptions on $\bm{A}$ or $\bm{S}$ . This means that, for this natural guess, (7) does not hold in general.

To study this feasibility issue, we note that if there exists a diagonal matrix $\bm{D}$ , such that $\bm{1}^{\textup{\sf T}}\bm{D}\bm{A}=\bm{1}^{\textup{\sf T}}$ , then letting $\widetilde{f}_{i}=\phi_{i}^{-1}$ , we have

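$$\sum_{i=1}^{M}d_{i}\widetilde{f}_{i}(\phi_{i}(\bm{a}_{i}^{\textup{\sf T}}\bm{s}))=\bm{1}^{\textup{\sf T}}\bm{D}\bm{A}\bm{s}=\bm{1}^{\textup{\sf T}}\bm{s}=1,\quad\forall\bm{s}\in\text{int}~{}\Delta_{r},\tag{12}$$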

where $d_{i}$ is the $i$ -th diagonal element of $\bm{D}$ . Hence, the functions $\left\{\widehat{f}_{i}(\cdot)=d_{i}\widetilde{f}_{i}(\cdot),~{}i\in[M]\right\}$ satisfy (7). An additional requirement is that $\{d_{i}\neq 0,\forall i\}$ , otherwise we can get a trivial solution, as explained in the above section.
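The feasibility argument can be verified on synthetic data. The sketch below assumes a known, hypothetical distortion $\phi_{i}(t)=\tanh(t)$ so that its inverse is available in closed form, and it uses the minimum-norm solution of $\bm{A}^{\textup{\sf T}}\bm{d}=\bm{1}_{r}$ as one possible choice of the diagonal of $\bm{D}$.

```python
import numpy as np

rng = np.random.default_rng(3)
M, r, N = 6, 3, 100
A = rng.uniform(0.1, 1.0, size=(M, r))
S = rng.dirichlet(np.ones(r), size=N).T

# Hypothetical invertible distortion and its inverse: phi_i(t) = tanh(t).
X = np.tanh(A @ S)
phi_inv = np.arctanh

# One solution of 1^T D A = 1^T, i.e. A^T d = 1_r (minimum-norm choice).
d = np.linalg.pinv(A.T) @ np.ones(r)

Y = d[:, None] * phi_inv(X)                  # f_hat_i(x) = d_i * phi_i^{-1}(x)
print(np.allclose(Y.sum(axis=0), 1.0))       # scaled inverses: columns sum to one
print(np.allclose(phi_inv(X).sum(axis=0), 1.0))  # natural guess without D
```

With the row scaling the columns of `Y` sum to one, whereas the unscaled natural guess does not satisfy the constraint for a generic $\bm{A}$. A minimum-norm `d` is not guaranteed to have all entries nonzero, which is the point addressed by the proposition that follows.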

Building on the above observation, the feasibility problem of (7) boils down to establishing existence of a nonsingular diagonal matrix $\bm{D}$ (i.e. $d_{i}\neq 0,\forall i$ ), such that $\bm{1}^{\textup{\sf T}}\bm{D}\bm{A}=\bm{1}^{\textup{\sf T}}$ , for matrix $\bm{A}$ that satisfies assumption (A2). We present Proposition 1, which shows that with a mild incoherence condition (see Definition 2) on $\bm{A}$ , such desired $\bm{D}$ indeed exists. We start by providing the following definition of incoherence.

Definition 2

(Incoherence) A tall and full-rank matrix $\bm{A}\in\mathbb{R}^{m\times r}$ is said to be incoherent if $\bm{e}_{j}\notin\text{Range}(\bm{A}),~{}\forall j\in[m]$ , where $\bm{e}_{j}$ denotes the $j$ -th standard basis vector of $\mathbb{R}^{m}$ .

Note that here incoherence is defined in the same spirit as the incoherence found in well-known compressed sensing literature, see e.g. [7].

We are now ready to state the following proposition. Here we write $\bm{A}^{\textup{\sf T}}\bm{d}=\bm{1}_{r}$ instead of $\bm{1}^{\textup{\sf T}}\bm{D}\bm{A}=\bm{1}^{\textup{\sf T}}$ for conciseness: existence of nonsingular diagonal $\bm{D}$ is the same as existence of fully dense $\bm{d}$ .

Proposition 1

For a tall, full rank, and incoherent matrix $\bm{A}\in\mathbb{R}^{m\times r}$ , there exists a vector $\bm{d}\in\mathbb{R}^{m}$ , such that

$$\bm{A}^{\textup{\sf T}}\bm{d}=\bm{1}_{r},\tag{13a}$$

$$\|\bm{d}\|_{0}=m.\tag{13b}$$

Note that by assumption, $\bm{A}$ is tall and full rank, so there are infinitely many vectors $\bm{d}$ that satisfy (13a). However, it is not obvious whether a fully dense $\bm{d}$ (i.e., one satisfying (13b)) always exists such that (13a) holds for any $\bm{A}$ that is tall and full rank.

The proof of Proposition 1 can be found in the appendix.
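One way to see the proposition at work numerically is sketched below. The helper `dense_solution` is hypothetical and is not the construction used in the proof: it starts from the minimum-norm solution of $\bm{A}^{\textup{\sf T}}\bm{d}=\bm{1}_{r}$ and, if some entry is numerically zero, adds a random vector from the null space of $\bm{A}^{\textup{\sf T}}$, which leaves the linear constraint intact.

```python
import numpy as np

def dense_solution(A, tol=1e-8, seed=0):
    """Return d with A^T d = 1_r, nudged along null(A^T) until no entry is ~0."""
    m, r = A.shape
    d = np.linalg.pinv(A.T) @ np.ones(r)     # minimum-norm particular solution
    # Rows r..m-1 of Vt span the null space of A^T when A has full column rank.
    _, _, Vt = np.linalg.svd(A.T)
    null_basis = Vt[r:].T                    # shape (m, m - r)
    rng = np.random.default_rng(seed)
    for _ in range(100):
        if np.all(np.abs(d) > tol):
            return d
        d = d + null_basis @ rng.normal(size=m - r)
    raise ValueError("no fully dense solution found")

# Hypothetical incoherent example whose minimum-norm solution has a zero entry.
A = np.array([[1.0, -1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d_min = np.linalg.pinv(A.T) @ np.ones(2)     # first entry is numerically zero
d = dense_solution(A)
print(np.allclose(A.T @ d, 1.0), np.all(np.abs(d) > 1e-8))
```

For an incoherent matrix, no coordinate of $\bm{d}$ is pinned down by the constraint, so a generic null-space perturbation removes any zero entries. If some $\bm{e}_{j}$ were in the range of $\bm{A}$, the corresponding entry of $\bm{d}$ would be fixed by (13a) and could not be moved this way.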

Remark 2

We establish that for an incoherent $\bm{A}$ , there always exist solutions to make (7) hold. Moreover, we point out that even for some $\bm{A}$ that is not incoherent, solutions for (7) might also exist. For example, if one or more columns of $\bm{A}$ are some columns of an identity matrix, then $\bm{A}$ is not incoherent. However, if we have $\bm{1}^{\textup{\sf T}}\bm{A}=\bm{1}^{\textup{\sf T}}$ – which is true when all columns of $\bm{A}$ are some columns of an identity matrix – then we see that $\{f_{i}=\phi_{i}^{-1},~{}\forall i\}$ is a feasible solution.

III-E Learning algorithm

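Theorem 2 suggests the following optimization formulation to learn the desired $\bm{f}$ :

$$\begin{aligned}\text{find}\quad&f_{1},\cdots,f_{M}\\ \text{s.t.}\quad&f_{i}\circ\phi_{i}~\text{all convex (or all concave)},~\forall i\in[M],\\ &\sum_{i=1}^{M}f_{i}(\bm{x}_{j}(i))=1,~\forall j\in[N].\end{aligned}\tag{14}$$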

For this formulation we have the following claim.

Corollary 1

For problem (14), suppose the data $\bm{X}=[\bm{x}_{1},\cdots,\bm{x}_{N}]\in\mathbb{R}^{M\times N}$ admit model (3) and assumptions (A1), (A2), (A3) hold. Then, as $N\rightarrow+\infty$ , the optimal solutions to (14) satisfy (7), and the resulting $\{k_{i}=f_{i}\circ\phi_{i},~{}\forall i\in[M]\}$ are all affine.

This corollary follows from the distributional assumption (A3) on $\bm{s}_{j}$ . As $N\rightarrow+\infty$ , the $\bm{s}_{j}$ ’s will cover the entire interior of $\Delta_{r}$ with probability 1. Then the constraints in (14) become the same as the conditions in Theorem 2. Corollary 1 thus guarantees the nonlinear function identification property of formulation (14) in an asymptotic sense. In the following, we approximate problem (14) to make it amenable to numerical algorithms. In Section IV, we give numerical examples showing that the proposed method works well even with finite $N$ .

Problem formulation (14) suggests that we need to find functions $f_{1},\cdots,f_{M}$ such that the output sums to one. To enforce the constraint that the $k_{i}$ ’s are all convex (or all concave), we note

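$$k_{i}^{\prime\prime}(x)=f_{i}^{\prime\prime}(\phi_{i}(x))[\phi_{i}^{\prime}(x)]^{2}+f_{i}^{\prime}(\phi_{i}(x))\phi_{i}^{\prime\prime}(x).\tag{15}$$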

To make sure $k_{i}$ is convex (or concave), we need $k_{i}^{\prime\prime}(x)\geq 0$ (or $k_{i}^{\prime\prime}(x)\leq 0$ ), which requires us to know the sign of $\phi_{i}^{\prime\prime}(x)$ . For instance, suppose $\phi_{i}^{\prime\prime}(x)\leq 0$ ; then we can pick a parametric family for the $f_{i}$ ’s such that $f_{i}^{\prime\prime}(x)\geq 0$ and $f_{i}^{\prime}(x)\leq 0$ . In that case $f_{i}^{\prime\prime}(\phi_{i}(x))[\phi_{i}^{\prime}(x)]^{2}\geq 0$ and $f_{i}^{\prime}(\phi_{i}(x))\phi_{i}^{\prime\prime}(x)\geq 0$ , so $k_{i}^{\prime\prime}(x)\geq 0$ , i.e. $k_{i}$ is convex. Similarly, we can constrain the $f_{i}$ ’s for all $i\in[M]$ to make sure the $k_{i}$ ’s are all convex (or all concave). To simplify implementation, we adopt an approximation: we only require the $f_{i}$ ’s to be invertible in this work. This leads to the following optimization problem.
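For completeness, the second derivative of the composite function follows from applying the chain rule twice. The short derivation below only uses the definition $k_{i}=f_{i}\circ\phi_{i}$ and the twice-differentiability of both functions.

```latex
\begin{aligned}
k_{i}(x) &= f_{i}(\phi_{i}(x)), \\
k_{i}^{\prime}(x) &= f_{i}^{\prime}(\phi_{i}(x))\,\phi_{i}^{\prime}(x), \\
k_{i}^{\prime\prime}(x) &= f_{i}^{\prime\prime}(\phi_{i}(x))\,[\phi_{i}^{\prime}(x)]^{2}
  + f_{i}^{\prime}(\phi_{i}(x))\,\phi_{i}^{\prime\prime}(x).
\end{aligned}
```

Since $[\phi_{i}^{\prime}(x)]^{2}\geq 0$, the first term always carries the sign of $f_{i}^{\prime\prime}$, while the second term carries the sign of the product $f_{i}^{\prime}\,\phi_{i}^{\prime\prime}$. This is why controlling the curvature of $k_{i}$ through $f_{i}$ alone requires knowing the sign of $\phi_{i}^{\prime\prime}$.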

[TABLE]

In other words, we aim at learning invertible functions that add to one. The invertibility condition is crucial; otherwise, we can obtain trivial solutions, as explained before.

To parametrize the functions $f_{i}$, we adopt neural networks (NNs) with one hidden layer, due to their universal approximation capability [21, 4]. In particular, we employ the following parametric function family

[TABLE]

where $K$ is the number of neurons, $\{\alpha_{k},\beta_{k},\gamma_{k},\delta_{k}\}_{k=1}^{K}$ are the learnable parameters of this NN, and $\sigma$ denotes the nonlinearity. Importantly, the constraints on $\alpha_{k}$ and $\beta_{k}$ are to ensure invertibility, as stated below.

Lemma 3

In (17), if $\sigma^{\prime}(x)>0,~{}\forall x$ , the functions in $\mathcal{F}$ are all invertible.

The above lemma can be easily seen to be true. By definition, we have $f^{\prime}(x)=\sum_{k=1}^{K}\alpha_{k}\beta_{k}\sigma^{\prime}(\beta_{k}x+\gamma_{k})$. For $\sigma^{\prime}(x)>0$, we have $f^{\prime}(x)>0$ if $\alpha_{k}>0,~\beta_{k}>0,~\forall k\in[K]$, so $f$ is strictly increasing and hence invertible. Note that the requirement $\sigma^{\prime}(x)>0$ is easily satisfied by commonly used neurons, e.g., $\text{tanh}(\cdot)$ and the sigmoid function. For this reason, we pick $\sigma$ as $\text{tanh}(\cdot)$ in this work.
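The monotonicity argument can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper name `make_monotone_net`, the parameter values, and the use of a single scalar offset `delta` are assumptions made for illustration.

```python
import math


def make_monotone_net(alpha, beta, gamma, delta):
    """Build f(x) = sum_k alpha[k] * tanh(beta[k] * x + gamma[k]) + delta.

    Assumption: one scalar offset `delta`; the layout is illustrative only.
    Positivity of alpha and beta is what makes f strictly increasing.
    """
    if any(a <= 0 for a in alpha) or any(b <= 0 for b in beta):
        raise ValueError("alpha and beta must be strictly positive")

    def f(x):
        return sum(a * math.tanh(b * x + g)
                   for a, b, g in zip(alpha, beta, gamma)) + delta

    def f_prime(x):
        # d/du tanh(u) = 1 - tanh(u)^2, which is positive for finite u.
        return sum(a * b * (1.0 - math.tanh(b * x + g) ** 2)
                   for a, b, g in zip(alpha, beta, gamma))

    return f, f_prime


# Arbitrary illustrative parameters with K = 3 neurons.
f, f_prime = make_monotone_net([0.5, 1.2, 0.3], [1.0, 0.4, 2.0],
                               [-0.5, 0.0, 1.0], 0.1)
grid = [-3.0 + 0.1 * i for i in range(61)]
values = [f(x) for x in grid]
is_increasing = all(v0 < v1 for v0, v1 in zip(values, values[1:]))
min_slope = min(f_prime(x) for x in grid)
print(is_increasing, min_slope > 0)
```

Because the check is done on a finite grid, it only illustrates the lemma; the proof itself relies on the sign of each summand in $f^{\prime}$.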

Utilizing the parametric family $\mathcal{F}$ in (17), we arrive at the following optimization problem

[TABLE]

This is a nonlinear least-squares regression problem, with bound constraints. We employ a trust-region algorithm [11] for optimization.
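To make the least-squares structure concrete, the sketch below assembles one residual per sample. It assumes the residual for sample $j$ is $1-\sum_{i}f_{i}(x_{ij})$, which is the natural least-squares form of the sum-to-one constraint; the exact objective in the display above is not reproduced here, and the names `net`, `residuals`, `cost`, `within_bounds` and the lower bound `1e-6` are hypothetical.

```python
import math


def net(x, params):
    """One-hidden-layer tanh network; params = (list of (alpha, beta, gamma), delta)."""
    units, delta = params
    return sum(a * math.tanh(b * x + g) for a, b, g in units) + delta


def residuals(X, all_params):
    """Residual r_j = 1 - sum_i f_i(X[i][j]) for each sample j.

    X is an M-by-N nested list; all_params[i] parametrizes f_i.
    """
    M, N = len(X), len(X[0])
    return [1.0 - sum(net(X[i][j], all_params[i]) for i in range(M))
            for j in range(N)]


def cost(X, all_params):
    """Sum of squared residuals, the quantity a least-squares solver minimizes."""
    return sum(r * r for r in residuals(X, all_params))


def within_bounds(all_params, lower=1e-6):
    """Check the bound constraints alpha >= lower and beta >= lower (lower is assumed)."""
    return all(a >= lower and b >= lower
               for units, _ in all_params for a, b, _ in units)


# Tiny example: M = 2 features, N = 2 samples, one neuron per function.
X = [[0.0, 0.0], [0.0, 0.0]]
params = [([(1.0, 1.0, 0.0)], 0.0), ([(1.0, 1.0, 0.0)], 0.0)]
print(residuals(X, params), cost(X, params), within_bounds(params))
```

A residual vector of this shape, together with elementwise bounds on the parameters, is the input expected by bound-constrained trust-region least-squares solvers.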

After obtaining parameters $\{\widehat{\alpha}_{k}^{i},\widehat{\beta}_{k}^{i},\widehat{\gamma}_{k}^{i},\widehat{\delta}_{k}^{i}\}$ via (III-E), we obtain $\widehat{f}_{i}(x)=\sum_{k=1}^{K}\widehat{\alpha}_{k}^{i}\sigma(\widehat{\beta}_{k}^{i}x+\widehat{\gamma}_{k}^{i})+\widehat{\delta}_{k}^{i}$ , and form the transformed data $\bm{Y}=\bm{\widehat{f}}(\bm{X})$ . Theorem 2 predicts that $\bm{Y}\approx\bm{W}\bm{A}\bm{S}$ for some nonsingular matrix $\bm{W}$ . From Lemma 2, we see that we can employ an algorithm for LMM to identify $\bm{S}$ . For this purpose, we employ the classical MVES algorithm [8] for LMM, and obtain an estimate $\widehat{\bm{S}}$ .

The overall procedure is summarized in Algorithm 1. We emphasize again that the method is unsupervised: The only data is $\bm{X}$ , not $\{\bm{x}_{j},y_{j}\}_{j=1}^{N}$ (feature-label pairs) as in, e.g., the generalized additive models [20, Ch. 9] setting, or recent works on nonlinear estimation [36, 10].

IV Numerical experiments

IV-A Synthetic data study

We start by providing a qualitative assessment of the proposed theory and algorithm. For this purpose, we will visualize the learned functions to see if nonlinearity in data generation is indeed resolved. We randomly generate $\bm{S}$ according to a Dirichlet distribution – such that the generated $\bm{s}_{j}$ ’s are nonnegative and sum to one. The dimensions are $M=r=4$ and $N=1000$ . The parameter of this Dirichlet distribution is set to $\bm{\mu}=[0.1,0.1,0.1,0.1]$ , so that the generated $\bm{s}_{j}$ ’s are well spread on the probability simplex, hence SS is likely to be satisfied. For this experiment, we take $\bm{A}$ to be $\bm{A}=2\bm{I}_{4}$ . The four nonlinear functions in data generation are $\phi_{1}(x)=x$ , $\phi_{2}(x)=\sqrt{x}$ , $\phi_{3}(x)=\sqrt[4]{x}$ , and $\phi_{4}(x)=\log(x+1)$ . Note that these functions are not revealed to the learning algorithm, and are only used to visualize the results after learning is completed. For learning, each function $f_{i}$ is parametrized by a constrained one-hidden-layer NN defined in (17), with $K=20$ neurons. The learned functions $f_{1}\cdots f_{4}$ and the composite functions $f_{1}\circ\phi_{1}\cdots f_{4}\circ\phi_{4}$ are shown in Figure 1.
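The data generation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: Dirichlet samples are drawn by normalizing independent Gamma variates, samples are stored row-wise ($N\times r$ rather than $r\times N$), and the "ideal" functions $f_{i}=\phi_{i}^{-1}/2$ are used only to show that one valid solution makes the outputs sum to one.

```python
import math
import random

# The four generating nonlinearities and their inverses.
PHI = [lambda t: t, math.sqrt, lambda t: t ** 0.25, math.log1p]
PHI_INV = [lambda x: x, lambda x: x ** 2, lambda x: x ** 4, math.expm1]


def sample_dirichlet(mu, rng):
    """Draw one point on the probability simplex by normalizing Gamma(mu_i, 1) draws."""
    g = [rng.gammavariate(m, 1.0) for m in mu]
    total = sum(g)
    return [v / total for v in g]


def generate(n, mu, scale, rng):
    """Return S (n rows on the simplex) and X with X[j][i] = phi_i(scale * S[j][i]).

    `scale` plays the role of A = 2 * I, so each feature depends on one source.
    """
    S = [sample_dirichlet(mu, rng) for _ in range(n)]
    X = [[PHI[i](scale * s[i]) for i in range(len(mu))] for s in S]
    return S, X


rng = random.Random(0)  # fixed seed, chosen arbitrarily
S, X = generate(1000, [0.1] * 4, 2.0, rng)

# Ideal functions f_i(x) = phi_i^{-1}(x) / 2 make every composite f_i(phi_i(t)) = t / 2.
recovered = [[PHI_INV[i](x[i]) / 2.0 for i in range(4)] for x in X]
max_sum_err = max(abs(sum(row) - 1.0) for row in recovered)
max_entry_err = max(abs(r - s)
                    for row, s_row in zip(recovered, S)
                    for r, s in zip(row, s_row))
print(max_sum_err, max_entry_err)
```

In this special case the composites are exactly linear, which is the affine behavior the learned composite functions are expected to show.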

One can immediately see that the learned functions indeed undo the nonlinearity introduced in data generation: the learned $f_{1}$ is linear since $\phi_{1}$ is linear, and the other learned functions all resemble the inverses of the corresponding $\phi_{i}$’s. Moreover, one can clearly see that the composite functions all look affine.

Next, we test the parameter estimation performance. For this experiment, we generate data with five different nonlinear functions:

(a) $e^{x}$ ,

(b) $x+x^{2}$ ,

(c) $\log(e^{x}+1)$ ,

(d) $\log(x+1)$ ,

(e) $x+\text{tanh}(x)$ .

For each case, one of the five functions is used for all coordinates (features), i.e., $\phi_{1}=\cdots=\phi_{M}$. The parameter settings are $M=10$, $N=1000$, and $r=4$. We generate $\bm{A}\in\mathbb{R}^{10\times 4}$ by sampling each entry from a standard normal distribution and taking the absolute value, followed by a column normalization step. $\bm{S}$ is generated as in the first experiment. For this experiment, the $f_{i}$ functions are constrained to be the same: a constrained one-hidden-layer NN defined in (17), with $K=40$ for all cases, to avoid unrealistic parameter tuning. In other words, all the NNs share the same parameters. Since problem (III-E) is nonconvex, different initializations could lead to different results. For this reason, formulation (III-E) is optimized five times with different random initializations, and the result with the smallest cost function value is used for the subsequent steps of Algorithm 1. The performance metric we employ is the mean squared error (MSE): $\text{MSE}=\frac{\|\widehat{\bm{S}}-\bm{S}\|_{F}^{2}}{rN}$.
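The MSE metric can be computed directly from its definition. The sketch below uses nested lists for the $r\times N$ matrices and assumes the rows of the estimate are already aligned with those of the ground truth; how the row-permutation ambiguity is resolved is not specified in the text, and the function name `mse` is illustrative.

```python
def mse(S_hat, S):
    """Return ||S_hat - S||_F^2 / (r * N) for r-by-N matrices given as nested lists."""
    r, N = len(S), len(S[0])
    if len(S_hat) != r or any(len(row) != N for row in S_hat):
        raise ValueError("S_hat and S must have the same shape")
    squared_error = sum((a - b) ** 2
                        for row_hat, row in zip(S_hat, S)
                        for a, b in zip(row_hat, row))
    return squared_error / (r * N)


# Tiny example with r = 2 sources and N = 2 samples.
S_true = [[1.0, 0.0], [0.0, 1.0]]
S_est = [[0.5, 0.0], [0.5, 1.0]]
print(mse(S_est, S_true))  # two entries are off by 0.5, so 0.5 / 4 = 0.125
```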

Since our method is the first to deal with this nonlinear model, the only baseline we employ is MVES without considering nonlinear effects. The motivation is to see whether it is indeed possible to estimate the parameters under unknown nonlinear functions, using only the nonlinearly distorted data $\bm{X}$. For each setting, $100$ trials with different randomly generated data (see appendix for details) are performed, and the empirical cumulative distribution function (CDF) of the resulting MSEs is reported in Figure 2.
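An empirical CDF of the kind reported here can be computed as below. The MSE values in the example are made-up placeholders, not results from the paper, and `empirical_cdf` is a hypothetical helper name.

```python
import math


def empirical_cdf(values):
    """Return the sorted values and, for each, the fraction of samples at or below it."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(k + 1) / n for k in range(n)]


# Placeholder per-trial MSEs (illustrative numbers only).
trial_mses = [1e-4, 1e-2, 1e-3, 1e-4]
log_mses = [math.log10(v) for v in trial_mses]
xs, probs = empirical_cdf(log_mses)
for x, p in zip(xs, probs):
    print(f"log10(MSE) <= {x:.2f}: {p:.2f}")
```

Plotting `probs` against `xs` as a step function gives a curve on a $\log_{10}(\text{MSE})$ axis; repeated values simply produce a taller step at that point.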

From Figure 2, one can see that the proposed method yields significant improvements over applying MVES directly in all cases. Note that the x-axis in Figure 2 is $\log_{10}(\text{MSE})$, hence our method yields several orders of magnitude improvement in accuracy over the baseline. There are a few trials where the proposed method yields relatively larger errors, which is likely caused by numerical difficulties in optimizing NNs.

IV-B Case study with a hyperspectral image

We next perform an experiment on hyperspectral unmixing (HU). Unlike normal RGB images, a pixel in a hyperspectral image contains information on hundreds of spectral bands. With this more detailed spectral information, it is reasonable to assume that different materials have distinct spectral signatures. Physically, each pixel represents a convex combination of the materials that are present in the corresponding geographical region. However, it is known that the collected measurements may undergo nonlinear distortion. The HU task involves separating the materials of a ground region.

The image employed in this experiment is the Moffett Field scene captured in California, a standard benchmark for testing HU algorithms. The region has three main materials: water, soil, and vegetation. This scene is known for the existence of nonlinearly mixed pixels, which usually pose a challenge to LMM-based HU algorithms such as MVES. The size of the image is $50\times 50$, hence we have 2500 pixels. Each pixel is measured on 224 spectral bands. Following commonly applied preprocessing steps [17], we remove the water-absorbing bands and end up with a matrix $\bm{X}$ of size $200\times 2500$, so that each of the remaining 200 spectral bands serves as a feature for that pixel. The algorithms are supposed to identify which materials are present in each pixel, and in what proportions.

To apply our method, we use the same $f_{i}$ on each of the $200$ features, as above, and fix $K=40$. We compare our method with MVES, since MVES is one of the best performing methods for HU. After obtaining the estimated $\bm{S}$, we inspect each row of $\bm{S}$ to determine which of them corresponds to the water, soil, and vegetation portions of the image. The difference between the two sets of results is most visible in the estimated soil distribution (a particular row of the estimated $\bm{S}$), as shown in Figure 3: MVES outputs large values in the water region, whereas the proposed method outputs much smaller values there, which is much more aligned with reality.

We further plot the estimated $\bm{S}$ in the known water region (the top $15\times 50$ part of Figure 3), as shown in Figure 4. We take this part because it is clear that there is only one material (water) in this region, so the ground truth for each column of $\bm{S}$ is any permutation of $[1,0,0]^{\textup{\sf T}}$. Since the columns of $\bm{S}$ live in a two-dimensional simplex, we project all the points into a 2D space, with the vertices of the triangle corresponding to the original vertices in the 3D space. Note that Figure 3 shows a single estimated row of $\widehat{\bm{S}}$ for easy visualization, while Figure 4 presents results from all rows, for the part that corresponds to the top $15\times 50$ region. From this figure, we see that the results of the proposed method coalesce around the coordinate vector $[0,0,1]^{\textup{\sf T}}$, which means that the proposed method is quite certain that there is only one material in this region (which is true), while MVES is much less confident, as its points are much farther away from any coordinate vector. The estimated $\bm{S}$ also indicates that MVES fails to clearly separate the soil and water spectral signatures (columns of $\bm{A}$), whereas our method performs much better.
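The projection from the simplex to the plane can be sketched with barycentric coordinates. The triangle layout in `VERTICES` (an equilateral triangle with unit side) and the function name `simplex_to_2d` are assumptions made for illustration; the text does not state which 2D embedding was used for Figure 4.

```python
import math

# Assumed 2D positions of the three simplex vertices e_1, e_2, e_3.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]


def simplex_to_2d(s):
    """Map s = (s1, s2, s3), nonnegative and summing to one, to a point in the triangle."""
    if len(s) != 3:
        raise ValueError("expected a 3-dimensional abundance vector")
    x = sum(w * v[0] for w, v in zip(s, VERTICES))
    y = sum(w * v[1] for w, v in zip(s, VERTICES))
    return x, y


print(simplex_to_2d([0.0, 0.0, 1.0]))    # lands on the third vertex
print(simplex_to_2d([1 / 3, 1 / 3, 1 / 3]))  # lands at the centroid
```

Under this mapping, abundance vectors concentrated on one material appear near a corner of the triangle, while mixed pixels appear in its interior.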

V Conclusion

This work serves as a first attempt to unravel latent structures in data when the observations are distorted by unknown nonlinear effects. It is an important problem to consider in practice, but a concrete study was sorely missing prior to this work. Much to one’s surprise, this seemingly impossible mission of figuring out unknown nonlinearities can actually be accomplished up to affine transformations, as we showed in this paper. A learning algorithm based on artificial neural networks is proposed to rectify the unknown nonlinear functions. Our carefully designed numerical experiments show a clear advantage in terms of inverting nonlinear distortions and identifying latent factors in LMMs altered by unknown nonlinear effects.

Appendix: “Learning Nonlinear Mixtures: Identifiability and Algorithm”

Some definitions in convex geometry

Definition 3

(Convex cone) The convex cone of $\{\bm{x}_{1},\cdots,\bm{x}_{N}\}$ is defined as

[TABLE]

Definition 4

(Convex hull) The convex hull of $\{\bm{x}_{1},\cdots,\bm{x}_{N}\}$ is defined as

[TABLE]

Definition 5

(Simplex) A convex hull $\text{conv}\{\bm{x}_{1},\cdots,\bm{x}_{N}\}$ is called a simplex if $\bm{x}_{1},\cdots,\bm{x}_{N}$ are affinely independent, i.e., $\bm{x}_{1}-\bm{x}_{N},\cdots,\bm{x}_{N-1}-\bm{x}_{N}$ are linearly independent.

A probability simplex is a special simplex, with all vertex vectors being the coordinate vectors, i.e., $\forall i\in[N],~\bm{x}_{i}=\bm{e}_{j}$ for some $j$, where $\bm{e}_{j}$ has $1$ at its $j$-th coordinate and $0$ at all other coordinates.

Proofs

Proof of Lemma 1: Assume without loss of generality that the two nonzero columns are the first and second column. Let us denote

[TABLE]

Note that $\zeta$ is a function of $(r-1)$ variables $s_{1},\cdots,s_{r-1}$ , since $\bm{1}^{\textup{\sf T}}\bm{s}=1$ . Equation (21) suggests that $\zeta$ is a constant function on $\Delta_{r}$ . Taking derivative with respect to (w.r.t.) $s_{1}$ and $s_{2}$ , we get

[TABLE]

and

[TABLE]

By the assumption on $\bm{A}$ , we have $\bm{a}_{i}(1)\bm{a}_{i}(2)>0,~{}\forall i$ . The assumption that $\psi_{i}$ ’s are all convex (or concave) translates to $\psi_{i}^{\prime\prime}\geq 0$ (or $\psi_{i}^{\prime\prime}\leq 0$ ), for all $i\in[M]$ . From (23), we conclude that $\psi_{i}^{\prime\prime}=0,~{}\forall i$ , which suggests that all the $\psi_{i}$ ’s are affine. $\blacksquare$

While we prove the above lemma for our use in this work, more results concerning functional equations can be found in several books on this topic, see e.g. [28, 16].

Proof of Theorem 2: Given assumptions (A2) and equation (7), (a) is a direct consequence of Lemma 2.

For (b), we note that from (a), $k_{i}(t)=d_{i}t+b_{i}$ for some constants $d_{i}$ and $b_{i}$. Let $x=\phi_{i}(t)$, then $t=\phi_{i}^{-1}(x)$. Plugging into $f_{i}(\phi_{i}(t))=d_{i}t+b_{i}$, we obtain $f_{i}(x)=d_{i}\phi_{i}^{-1}(x)+b_{i}$.

To prove Proposition 1, we need Lemma 4 and Lemma 5, which are stated here; their proofs follow.

Lemma 4

Suppose $\bm{A}\in\mathbb{R}^{m\times r}$ is full rank and incoherent, i.e. $\bm{e}_{i}\notin\text{Range}(\bm{A}),\forall~{}i\in[m]$ . Then $\widehat{\bm{A}}=\left[\begin{array}[]{c}\bm{A}\\ \bm{1}_{r}^{\textup{\sf T}}\end{array}\right]$ is incoherent.

This lemma asserts that if a matrix $\bm{A}$ is incoherent, then appending a row of all 1’s preserves incoherence.

Lemma 5

For a tall and full rank matrix $\bm{A}\in\mathbb{R}^{m\times r}$ , where $\bm{A}$ is incoherent, there exists a $\bm{d}\in\mathbb{R}^{m}$ , such that

[TABLE]

Proof of Lemma 4: The incoherence condition means that there is no $\bm{y}\in\mathbb{R}^{r}$ such that $\bm{A}\bm{y}=\bm{e}_{i}$ for any $i\in[m]$. Suppose there is a $\widehat{\bm{y}}\in\mathbb{R}^{r}$ such that $\widehat{\bm{A}}\widehat{\bm{y}}=\bm{e}_{j}$ for some $j\in[m+1]$. There are two cases:

1. $1\leq j\leq m$: This means we have $\widehat{\bm{y}}$ such that $\bm{A}\widehat{\bm{y}}=\bm{e}_{j}$ for some $j\in[m]$, a contradiction to the assumption that $\bm{A}$ is incoherent.

2. $j=m+1$: This means that $\bm{A}\widehat{\bm{y}}=\bm{0}_{m}$ for some $\widehat{\bm{y}}\neq\bm{0}_{r}$ (since $\bm{1}_{r}^{\textup{\sf T}}\widehat{\bm{y}}=1$), a contradiction to the assumption that $\bm{A}$ is full rank.

Hence $\widehat{\bm{A}}$ is incoherent if $\bm{A}$ is full rank and incoherent. $\blacksquare$

Proof of Lemma 5: Let the columns of $\bm{U}\in\mathbb{R}^{m\times(m-r)}$ form a basis of the null space of $\bm{A}^{\textup{\sf T}}$, i.e.

[TABLE]

By assumption, $\bm{A}$ is incoherent, hence $\bm{e}_{j}\notin\text{Range}(\bm{A}),~{}\forall j\in[m]$ . For any $j$ , we have the decomposition

[TABLE]

where $\widehat{\bm{e}}_{j}\in\text{Range}(\bm{A})$ and $\overline{\bm{e}}_{j}\in\text{Range}(\bm{U})$ . Since $\bm{e}_{j}\notin\text{Range}(\bm{A})$ , we have $\bm{e}_{j}^{\textup{\sf T}}\bm{U}=\overline{\bm{e}}_{j}^{\textup{\sf T}}\bm{U}\neq\bm{0}_{m-r},~{}\forall j\in[m]$ , which means $\bm{U}$ does not have a row that is all-zero.

Let $\mathcal{I}_{1},\cdots,\mathcal{I}_{m-r}$ be the index sets of nonzero entries in each column of $\bm{U}$ , then we have $\cup_{j=1}^{m-r}\mathcal{I}_{j}=[m]$ since $\bm{U}$ does not have an all-zero row. Let us present the following useful fact.

Fact 1

Let $\bm{x},\bm{y}\in\mathbb{R}^{m}$ , with sets $\mathcal{I}_{\bm{x}}$ and $\mathcal{I}_{\bm{y}}$ being the sets of indices of nonzero entries, then we can find a vector $\bm{z}\in\text{Span}\{\bm{x},\bm{y}\}$ , such that $\mathcal{I}_{\bm{z}}=\mathcal{I}_{\bm{x}}\cup\mathcal{I}_{\bm{y}}$ .

Proof: Let $a=\frac{1}{\max_{j}|\bm{x}_{j}|}$ and $b=\frac{2}{\min_{j:\bm{y}_{j}\neq 0}|\bm{y}_{j}|}$ . The denominator of $b$ is the minimum of absolute value of the nonzero entries of $\bm{y}$ . Consider the vector

[TABLE]

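By the choice of $a$ and $b$, we have $\max_{j}|a\bm{x}_{j}|=1$ and $\min_{j:\bm{y}_{j}\neq 0}|b\bm{y}_{j}|=2$. Hence for any $j$ where $\bm{x}_{j}\neq 0$ and $\bm{y}_{j}\neq 0$, we have $|a\bm{x}_{j}|\leq 1<2\leq|b\bm{y}_{j}|$ and therefore $a\bm{x}_{j}+b\bm{y}_{j}\neq 0$. If exactly one of $\bm{x}_{j}$ and $\bm{y}_{j}$ is nonzero, then $a\bm{x}_{j}+b\bm{y}_{j}$ is trivially nonzero, and if both are zero then so is $a\bm{x}_{j}+b\bm{y}_{j}$. This shows that there exists a $\bm{z}\in\text{Span}\{\bm{x},\bm{y}\}$ such that $\mathcal{I}_{\bm{z}}=\mathcal{I}_{\bm{x}}\cup\mathcal{I}_{\bm{y}}$. $\blacksquare$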

We can now utilize Fact 1 to show that there exists a fully dense $\bm{d}\in\text{Range}(\bm{U})$ . Consider the first two columns of $\bm{U}$ : $\bm{U}_{1}$ and $\bm{U}_{2}$ . From Fact 1, we can find a vector ${\bm{u}}\in\text{Span}\{\bm{U}_{1},\bm{U}_{2}\}$ , such that $\mathcal{I}_{\bm{u}}=\mathcal{I}_{1}\cup\mathcal{I}_{2}$ . Now consider $\bm{u}$ and $\bm{U}_{3}$ , invoking Fact 1 again, we can find a vector $\overline{\bm{u}}\in\text{Span}\{\bm{u},\bm{U}_{3}\}$ , such that $\mathcal{I}_{\overline{\bm{u}}}=\mathcal{I}_{\bm{u}}\cup\mathcal{I}_{3}=\mathcal{I}_{1}\cup\mathcal{I}_{2}\cup\mathcal{I}_{3}$ . Continuing this process, we can find a vector $\bm{d}\in\text{Span}\{\bm{U}_{1},\cdots,\bm{U}_{m-r}\}=\text{Range}(\bm{U})$ , such that $\mathcal{I}_{\bm{d}}=\cup_{j=1}^{m-r}\mathcal{I}_{j}=[m]$ ; meaning that $\bm{d}\in\text{Range}(\bm{U})$ and is fully dense. Since $\bm{d}\in\text{Range}(\bm{U})$ , we have $\bm{A}^{\textup{\sf T}}\bm{d}=\bm{0}_{r}$ . $\blacksquare$
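The construction in Fact 1 and its repeated application can be sketched as follows. The function names and the example columns are illustrative assumptions; the example columns are not derived from any particular $\bm{A}$, and the inputs are assumed to be nonzero vectors.

```python
from functools import reduce


def support(v):
    """Indices of the nonzero entries of v."""
    return {j for j, vj in enumerate(v) if vj != 0}


def combine(x, y):
    """Fact 1: return z = a*x + b*y whose support is the union of both supports.

    Scaling gives |a*x_j| <= 1 and |b*y_j| >= 2 on the support of y,
    so no entry can cancel. Both x and y are assumed to be nonzero.
    """
    a = 1.0 / max(abs(xj) for xj in x)
    b = 2.0 / min(abs(yj) for yj in y if yj != 0)
    return [a * xj + b * yj for xj, yj in zip(x, y)]


def dense_combination(columns):
    """Fold Fact 1 over the columns, as in the proof of Lemma 5."""
    return reduce(combine, columns)


# A plain sum x + y would cancel the first entry here; the scaled sum does not.
print(support(combine([1.0, 1.0], [-1.0, 0.0])))

# Three sparse columns whose supports jointly cover all four coordinates.
columns = [[1.0, -1.0, 0.0, 0.0], [0.0, 2.0, -2.0, 0.0], [0.0, 0.0, 3.0, -3.0]]
d = dense_combination(columns)
print(d, support(d))
```

Since `d` is a linear combination of the given columns, it stays in their span; when the columns span the null space of $\bm{A}^{\textup{\sf T}}$, this yields a fully dense vector in that null space.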

Proof of Proposition 1: Consider a matrix $\bm{A}\in\mathbb{R}^{m\times r}$ that is tall, full rank, and incoherent. We can rewrite (13a) as

[TABLE]

Let us denote $\widehat{\bm{A}}^{\textup{\sf T}}=\left[\begin{array}[]{cc}\bm{A}^{\textup{\sf T}}&\bm{1}_{r}\end{array}\right]$ . Then we can see that

$\widehat{\bm{A}}\in\mathbb{R}^{(m+1)\times r}$ is tall and full rank,
$\widehat{\bm{A}}$ is incoherent by Lemma 4.

We see that $\widehat{\bm{A}}$ satisfies all the conditions in Lemma 5, hence there exists a ${\bm{d}}\in\mathbb{R}^{m+1}$ such that $\widehat{\bm{A}}^{\textup{\sf T}}{\bm{d}}=\bm{0}_{r}$ , and $\|{\bm{d}}\|_{0}=m+1$ . Since $\bm{d}$ is fully dense, we construct a $\widehat{\bm{d}}\in\mathbb{R}^{m+1}$ as

[TABLE]

By this construction, we have $\widehat{\bm{d}}(m+1)=-1$ . In addition, $\widehat{\bm{A}}^{\textup{\sf T}}\widehat{\bm{d}}=\bm{0}_{r}$ as it is merely a scaled version of $\bm{d}$ . Let $\overline{\bm{d}}=\widehat{\bm{d}}(1:m)\in\mathbb{R}^{m}$ , then we have

[TABLE]

Hence we have shown the existence of a $\bm{d}$ that satisfies both (13a) and (13b) for any $\bm{A}$ that satisfies the conditions in Proposition 1. $\blacksquare$

Bibliography


[1] Sophie Achard and Christian Jutten. Identifiability of post-nonlinear mixtures. IEEE Signal Processing Letters, 12(5):423–426, 2005.
[2] Anima Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
[3] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 143–151. ACM, 2015.
[4] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[5] José M Bioucas-Dias, Antonio Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul Gader, and Jocelyn Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379, 2012.
[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
[8] Tsung-Han Chan, Chong-Yung Chi, Yu-Min Huang, and Wing-Kin Ma. A convex analysis-based minimum-volume enclosing simplex algorithm for hyperspectral unmixing. IEEE Transactions on Signal Processing, 57(11):4418–4432, 2009.