Non-negative matrix factorization based on generalized dual divergence

Karthik Devarajan

arXiv:1905.07034 · stat.ML · May 20, 2019

TL;DR

This paper introduces a comprehensive theoretical framework for non-negative matrix factorization using generalized dual divergence, encompassing various models and noise structures, with proven convergence and adaptable extensions.

Contribution

It develops a unified framework for NMF based on generalized dual divergence, including convergence proofs and potential for extensions like penalties and tensors.

Findings

01 Framework generalizes existing NMF methods

02 Convergence of algorithms is proven

03 Provides a goodness-of-fit measure

Abstract

A theoretical framework for non-negative matrix factorization based on generalized dual Kullback-Leibler divergence, which includes members of the exponential family of models, is proposed. A family of algorithms is developed using this framework and its convergence proven using the Expectation-Maximization algorithm. The proposed approach generalizes some existing methods for different noise structures and contrasts with the recently proposed quasi-likelihood approach, thus providing a useful alternative for non-negative matrix factorizations. A measure to evaluate the goodness-of-fit of the resulting factorization is described. This framework can be adapted to include penalty, kernel and discriminant functions as well as tensors.

Equations

$$K(f||g)\equiv\int\log\left(\frac{f(x)}{g(x)}\right)dF(x),$$

$$D_{\beta}(\mu_{1}||\mu_{2})=\frac{1}{\beta(\beta-1)}\left\{\mu_{1}^{\beta}-\beta\mu_{1}\mu_{2}^{\beta-1}+(\beta-1)\mu_{2}^{\beta}\right\},\quad\beta\in\Re\backslash\{0,1\}.$$

$$D^{d}_{\beta}(\mu_{2}||\mu_{1})=\frac{1}{\beta(\beta-1)}\left\{\mu_{2}^{\beta}-\beta\mu_{2}\mu_{1}^{\beta-1}+(\beta-1)\mu_{1}^{\beta}\right\},\quad\beta\in\Re\backslash\{0,1\}.$$

$$\frac{dD^{d}_{\beta}(\mu_{2}||\mu_{1})}{d\mu_{2}}=\frac{\mu_{2}^{\beta-1}-\mu_{1}^{\beta-1}}{\beta-1}$$

$$\frac{d^{2}D^{d}_{\beta}(\mu_{2}||\mu_{1})}{d\mu_{2}^{2}}=\mu_{2}^{\beta-2},$$

$$D^{d}_{\beta}(k\mu_{2}||k\mu_{1})=k^{\beta}D^{d}_{\beta}(\mu_{2}||\mu_{1}).$$

$$V=WH+\epsilon$$

$$L_{2}(V||WH)=\sum_{ij}\left(V_{ij}-(WH)_{ij}\right)^{2},$$

$$D(V||WH)=\sum_{ij}\left(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}}-V_{ij}+(WH)_{ij}\right),$$

$$D^{d}(WH||V)=\sum_{ij}\left((WH)_{ij}\log\frac{(WH)_{ij}}{V_{ij}}-(WH)_{ij}+V_{ij}\right).$$

$$D^{d}(WH||V)=\sum_{i,j}\left\{\log\left(\frac{V_{ij}}{(WH)_{ij}}\right)+\frac{(WH)_{ij}}{V_{ij}}-1\right\}$$

$$D^{d}(WH||V)=\sum_{i,j}\left\{\frac{(V_{ij}-(WH)_{ij})^{2}}{V_{ij}^{2}(WH)_{ij}}\right\}$$

$$D^{d}_{\alpha}(WH||V)=\sum_{i=1}^{p}\sum_{j=1}^{n}\frac{\left\{[(WH)_{ij}]^{2-\alpha}-(2-\alpha)[(WH)_{ij}]V_{ij}^{1-\alpha}+(1-\alpha)V_{ij}^{2-\alpha}\right\}}{(1-\alpha)(2-\alpha)},\quad\alpha\in\Re\backslash\{1,2\}.$$

$$D^{d}_{\alpha}(WH||V)=\left\{\begin{array}{ll}\displaystyle\sum_{i,j}\left\{[(WH)_{ij}]^{2-\alpha}-(2-\alpha)[(WH)_{ij}]V^{1-\alpha}_{ij}+(1-\alpha)V^{2-\alpha}_{ij}\right\},&\alpha\in(-\infty,1)\cup(2,\infty),\\ \displaystyle\sum_{i,j}\left\{-[(WH)_{ij}]^{2-\alpha}+(2-\alpha)[(WH)_{ij}]V^{1-\alpha}_{ij}-(1-\alpha)V^{2-\alpha}_{ij}\right\},&1<\alpha<2,\\ \displaystyle\sum_{i,j}\left\{(WH)_{ij}\log\left(\frac{(WH)_{ij}}{V_{ij}}\right)-(WH)_{ij}+V_{ij}\right\},&\alpha=1,\\ \displaystyle\sum_{i,j}\left\{\log\left(\frac{V_{ij}}{(WH)_{ij}}\right)+\frac{(WH)_{ij}}{V_{ij}}-1\right\},&\alpha=2.\end{array}\right.$$

$$H^{t+1}_{aj}=H^{t}_{aj}\left[\frac{\sum_{i}\left(\frac{1}{\sum_{b}W_{ib}H^{t}_{bj}}\right)^{\alpha-1}W_{ia}}{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}\right]^{1/(\alpha-1)}$$

$$W^{t+1}_{ia}=W^{t}_{ia}\left[\frac{\sum_{j}\left(\frac{1}{\sum_{b}W^{t}_{ib}H_{bj}}\right)^{\alpha-1}H_{aj}}{\sum_{j}H_{aj}V_{ij}^{1-\alpha}}\right]^{1/(\alpha-1)}.$$

$$F(H_{aj})=(1-\alpha)\sum_{i}V_{ij}^{2-\alpha}-(2-\alpha)\sum_{i}\left\{V_{ij}^{1-\alpha}\left(\sum_{a}W_{ia}H_{aj}\right)\right\}+\sum_{i}\left[\sum_{a}W_{ia}H_{aj}\right]^{2-\alpha},$$

$$G(H_{aj},H^{t}_{aj})$$

$$\left(\sum_{a}W_{ia}H_{aj}\right)^{2-\alpha}\leq\sum_{a}\gamma_{a}\left(\frac{W_{ia}H_{aj}}{\gamma_{a}}\right)^{2-\alpha}$$

$$\frac{dG(H_{aj},H^{t}_{aj})}{dH_{aj}}=-(2-\alpha)\sum_{i}W_{ia}V_{ij}^{1-\alpha}+(2-\alpha)\sum_{i}\left\{(W_{ia}H_{aj})^{1-\alpha}W_{ia}\left(\frac{W_{ia}H^{t}_{aj}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)^{\alpha-1}\right\}=0.$$

$$F(H_{aj})=-(1-\alpha)\sum_{i}V_{ij}^{2-\alpha}+(2-\alpha)\sum_{i}\left\{V_{ij}^{1-\alpha}\sum_{a}W_{ia}H_{aj}\right\}-\sum_{i}\left[\sum_{a}W_{ia}H_{aj}\right]^{2-\alpha},$$

$$G(H_{aj},H^{t}_{aj})$$

$$H^{t+1}_{aj}=H^{t}_{aj}\exp\left\{\frac{\sum_{i}W_{ia}\log\left(\frac{V_{ij}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)}{\sum_{i}W_{ia}}\right\}$$

$$W^{t+1}_{ia}=W^{t}_{ia}\exp\left\{\frac{\sum_{j}H_{aj}\log\left(\frac{V_{ij}}{\sum_{b}W^{t}_{ib}H_{bj}}\right)}{\sum_{j}H_{aj}}\right\}.$$

$$H^{t+1}_{aj}=\lim_{\alpha\to 1}H^{t}_{aj}\left[\frac{\sum_{i}\left(\frac{1}{\sum_{b}W_{ib}H^{t}_{bj}}\right)^{\alpha-1}W_{ia}}{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}\right]^{1/(\alpha-1)}.$$

$$H^{t}_{aj}\psi(\alpha)=H^{t}_{aj}\left(\frac{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}{\sum_{i}W_{ia}\left(\sum_{b}W_{ib}H^{t}_{bj}\right)^{1-\alpha}}\right)^{1/(1-\alpha)}.$$

$$\log H^{t+1}_{aj}=\log H^{t}_{aj}+\lim_{\alpha\to 1}\log\psi(\alpha)=\log H^{t}_{aj}+\lim_{\alpha\to 1}\frac{1}{1-\alpha}\left\{\log\left(\frac{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}{\sum_{i}W_{ia}\left(\sum_{b}W_{ib}H^{t}_{bj}\right)^{1-\alpha}}\right)\right\}.$$

$$\log H^{t+1}_{aj}=\log H^{t}_{aj}+\frac{\sum_{i}W_{ia}\log\left(\frac{V_{ij}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)}{\sum_{i}W_{ia}}.$$

$$H^{t+1}_{aj}=H^{t}_{aj}\exp\left\{\frac{\sum_{i}W_{ia}\log\left(\frac{V_{ij}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)}{\sum_{i}W_{ia}}\right\}.$$

$$R^{2}$$

Taxonomy

Topics: Blind Source Separation Techniques · Statistical and numerical algorithms · Face and Expression Recognition

Full text

*Department of Biostatistics & Bioinformatics, Fox Chase Cancer Center, Temple University Health System, Philadelphia, PA*

[email protected]

Keywords: nonnegative matrix factorization, Kullback-Leibler divergence, dual divergence, EM algorithm, high dimensional data, tensor

1 Kullback-Leibler divergence and its dual

The Kullback-Leibler ( $KL$ ) information divergence between two distributions $F$ and $G$ with density (mass) functions $f$ and $g$ is

$$K(f||g)\equiv\int\log\left(\frac{f(x)}{g(x)}\right)dF(x),\tag{1.1}$$

given that $F$ is absolutely continuous with respect to $G$ , $F\preceq G$ . The discrimination information function in equation (1.1) is a measure commonly used to compare two distributions, and was introduced in Kullback and Leibler (1951). $KL$ information divergence, also referred to as relative entropy or cross-entropy, is the fundamental information measure with many desirable properties for developing probability and statistical methodologies. Similarly, the measure $K(g||f)$ is known as dual Kullback-Leibler divergence between $F$ and $G$ . In light of the definition above, $K(f||g)$ and $K(g||f)$ are also known as directed divergences. These quantities are nonnegative definite and are zero if and only if $f(x)=g(x)$ almost everywhere (Kullback, 1959; Ebrahimi and Soofi, 2004). One issue pertaining to $K(f||g)$ is that, apart from some exceptional cases such as $F=N(\mu_{1},\sigma^{2})$ and $G=N(\mu_{2},\sigma^{2})$ , $K(f||g)$ is not symmetric in $F$ and $G$ where the latter is the reference distribution, i.e., $K(f||g)\neq K(g||f)$ . This lack of symmetry may be of no concern or even desirable in situations where a natural or ideal reference is at hand; e.g., when $G$ is uniform, a natural reference distribution for a problem. However, this is generally not the case for most problems and choice of reference is dependent on the particular application of interest.

Let $\mu_{1}$ and $\mu_{2}$ be the means of random variables corresponding to the probability models $F$ and $G$ with respective densities $f$ and $g$ . Then $\beta$ -divergence, $D_{\beta}(\mu_{1}||\mu_{2})$ , expressed in terms of the means $\mu_{1}$ and $\mu_{2}$ can be written as

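$$D_{\beta}(\mu_{1}||\mu_{2})=\frac{1}{\beta(\beta-1)}\left\{\mu_{1}^{\beta}-\beta\mu_{1}\mu_{2}^{\beta-1}+(\beta-1)\mu_{2}^{\beta}\right\},\quad\beta\in\Re\backslash\{0,1\}.\tag{1.2}$$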

$\beta$ -divergence between two densities $f$ and $g$ was introduced by Basu et al. (1998) and Eguchi & Kano (2001). It has been used by Févotte & Idier (2011) for non-negative matrix factorizations (NMF) where $\beta$ -divergence between two objects is considered. In our case, the means $\mu_{1}$ and $\mu_{2}$ represent these objects and we will follow this notation in the remainder of this section. It is well known that $\beta$ -divergence in equation (1.2) includes members of the exponential family of models such as the Gaussian $(\beta=2)$ , Poisson $(\beta\rightarrow 1)$ , gamma $(\beta\rightarrow 0)$ and inverse Gaussian $(\beta=-1)$ models as special cases. Within this context, $\beta$ -divergence can be interpreted as generalized $KL$ divergence indexed by the parameter $\beta$ (Devarajan & Cheung, 2016). For example, when $\beta=2$ we obtain the Gaussian likelihood $\frac{1}{2}{(\mu_{1}-\mu_{2})}^{2}$ , and, in the limit $\beta\rightarrow 0$ , we obtain the gamma likelihood $\frac{\mu_{1}}{\mu_{2}}-\log\frac{\mu_{1}}{\mu_{2}}-1$ . In the limit $\beta\rightarrow 1$ , we obtain the Poisson likelihood $\mu_{1}\log\frac{\mu_{1}}{\mu_{2}}-\mu_{1}+\mu_{2}$ used in Lee & Seung (2001). These quantities are commonly referred to as Euclidean distance (ED), Itakura-Saito (IS) divergence and $KL$ divergence, respectively, in the NMF literature (Févotte & Idier, 2011; Devarajan & Cheung, 2014; Lee & Seung, 2001). However, it should be noted that our use of the term $KL$ divergence has a broader connotation similar to that in Devarajan & Cheung (2014, 2016) and is based on its original definition outlined in Kullback (1951).
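The special cases of equation (1.2) can be checked numerically. The following is a minimal sketch, not the author's code: the function name `beta_divergence` and the tolerance `tol` are illustrative assumptions, and the $\beta\rightarrow 0$ branch is written in the non-negative form $\frac{\mu_{1}}{\mu_{2}}-\log\frac{\mu_{1}}{\mu_{2}}-1$ obtained by taking the limit of (1.2).

```python
import math

def beta_divergence(mu1, mu2, beta, tol=1e-12):
    """beta-divergence D_beta(mu1 || mu2) of equation (1.2) for positive scalars.

    The name and the tolerance are illustrative choices, not from the paper.
    """
    if mu1 <= 0 or mu2 <= 0:
        raise ValueError("mu1 and mu2 must be positive")
    if abs(beta - 1.0) < tol:
        # Poisson limit (beta -> 1): the KL divergence of Lee & Seung (2001)
        return mu1 * math.log(mu1 / mu2) - mu1 + mu2
    if abs(beta) < tol:
        # gamma limit (beta -> 0): Itakura-Saito divergence
        return mu1 / mu2 - math.log(mu1 / mu2) - 1.0
    return (mu1 ** beta - beta * mu1 * mu2 ** (beta - 1)
            + (beta - 1) * mu2 ** beta) / (beta * (beta - 1))

# beta = 2 reproduces half the squared Euclidean distance
print(beta_divergence(3.0, 1.0, 2.0), 0.5 * (3.0 - 1.0) ** 2)
# the general formula approaches the limiting branches as beta -> 1 and beta -> 0
print(beta_divergence(3.0, 1.0, 1.0 + 1e-6), beta_divergence(3.0, 1.0, 1.0))
print(beta_divergence(3.0, 1.0, 1e-6), beta_divergence(3.0, 1.0, 0.0))
```

Handling the two limits as explicit branches avoids the 0/0 form of (1.2) at $\beta=1$ and $\beta=0$.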

We define the generalized dual $KL$ divergence of order $\beta$ by reversing the roles of $\mu_{1}$ and $\mu_{2}$ in equation (1.2). It is given by

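$$D^{d}_{\beta}(\mu_{2}||\mu_{1})=\frac{1}{\beta(\beta-1)}\left\{\mu_{2}^{\beta}-\beta\mu_{2}\mu_{1}^{\beta-1}+(\beta-1)\mu_{1}^{\beta}\right\},\quad\beta\in\Re\backslash\{0,1\}.\tag{1.3}$$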

where the superscript $d$ is used to denote this dual form which also includes, as special cases, members of the exponential family of models as outlined above. When $\beta=2$ we obtain the Gaussian likelihood $\frac{1}{2}{(\mu_{2}-\mu_{1})}^{2}$ which is identical to ED, and, in the limit $\beta\rightarrow 0$ , we obtain $-\log\frac{\mu_{2}}{\mu_{1}}+\frac{\mu_{2}}{\mu_{1}}-1$ which can be viewed as the dual version of IS divergence. Consider $D^{d}_{\beta}(\mu_{2}||\mu_{1})$ as a function of $\mu_{2}$ with $\mu_{1}$ fixed. Following Févotte & Idier (2011), we find that the first and second derivatives of $D^{d}_{\beta}(\mu_{2}||\mu_{1})$ with respect to $\mu_{2}$ given by

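$$\frac{dD^{d}_{\beta}(\mu_{2}||\mu_{1})}{d\mu_{2}}=\frac{\mu_{2}^{\beta-1}-\mu_{1}^{\beta-1}}{\beta-1}\tag{1.4}$$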

and

$$\frac{d^{2}D^{d}_{\beta}(\mu_{2}||\mu_{1})}{d\mu_{2}^{2}}=\mu_{2}^{\beta-2},\tag{1.5}$$

respectively, are continuous in $\beta$ . It is evident from equations (1.4) and (1.5) that $D^{d}_{\beta}(\mu_{2}||\mu_{1})$ has a unique minimum at $\mu_{2}=\mu_{1}$ and that it is convex in $\mu_{2}$ for all $\beta\in\Re$ , since the second derivative $\mu_{2}^{\beta-2}$ is positive for $\mu_{2}>0$ (see Figure 1). This contrasts significantly with $\beta$ -divergence which is convex in $\mu_{2}$ only for $\beta\in[1,2]$ (Févotte & Idier, 2011). For a scalar $k>0$ , $D^{d}_{\beta}(\mu_{2}||\mu_{1})$ also satisfies the scale property of $D_{\beta}(\mu_{1}||\mu_{2})$ , i.e.,

$$D^{d}_{\beta}(k\mu_{2}||k\mu_{1})=k^{\beta}D^{d}_{\beta}(\mu_{2}||\mu_{1}).$$

Scale invariance is attained for the case $\beta=0$ in equation (1.3) (dual version of IS divergence).
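The convexity and scale properties above can be verified numerically. The sketch below is illustrative only: `dual_divergence` and `second_difference` are hypothetical helper names, and the second derivative is estimated by a central finite difference rather than computed analytically.

```python
import math

def dual_divergence(mu2, mu1, beta, tol=1e-12):
    """Generalized dual KL divergence D^d_beta(mu2 || mu1) of equation (1.3)."""
    if abs(beta - 1.0) < tol:   # Poisson limit
        return mu2 * math.log(mu2 / mu1) - mu2 + mu1
    if abs(beta) < tol:         # dual version of IS divergence
        return mu2 / mu1 - math.log(mu2 / mu1) - 1.0
    return (mu2 ** beta - beta * mu2 * mu1 ** (beta - 1)
            + (beta - 1) * mu1 ** beta) / (beta * (beta - 1))

def second_difference(mu2, mu1, beta, h=1e-4):
    """Central finite-difference estimate of the second derivative in mu2."""
    return (dual_divergence(mu2 + h, mu1, beta)
            - 2.0 * dual_divergence(mu2, mu1, beta)
            + dual_divergence(mu2 - h, mu1, beta)) / h ** 2

for beta in (-1.0, 0.5, 3.0):
    # curvature should match mu2^(beta - 2), which is positive for every beta
    print(beta, second_difference(2.0, 1.0, beta), 2.0 ** (beta - 2.0))
    # scale property with k = 3
    print(dual_divergence(6.0, 3.0, beta), 3.0 ** beta * dual_divergence(2.0, 1.0, beta))
```

With `beta=0.0` the two scaled values coincide, which is the scale invariance noted for the dual IS divergence.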

2 Motivating NMF using generalized dual divergence

Lee and Seung (1999, 2001) developed NMF algorithms for decomposing a $p\times n$ non-negative matrix $V$ into the product of lower dimensional non-negative matrices $W_{p\times k}$ and $H_{k\times n}$ such that $V\sim WH$ , where $k<\frac{np}{n+p}$ is the factorization rank. In order to find an approximation for the input matrix $V$ , cost functions that quantify the quality of the approximation need to be constructed using some measure of divergence between $V$ and the reconstructed matrix $WH$ . This problem can be formulated in the form of the linear model

$$V=WH+\epsilon\tag{2.1}$$

where $\epsilon$ is noise. Lee & Seung’s algorithms were based on ED,

$$L_{2}(V||WH)=\sum_{ij}\left(V_{ij}-(WH)_{ij}\right)^{2},\tag{2.2}$$

and the directed divergence measure,

$$D(V||WH)=\sum_{ij}\left(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}}-V_{ij}+(WH)_{ij}\right),\tag{2.3}$$

which correspond to the addition of Gaussian and Poisson noise, respectively, in (2.1). As noted earlier, the quantity in equation (2.2) can be derived as $KL$ divergence between two Gaussian random variables with means $\mu_{1}$ and $\mu_{2}$ (and equal variance) and the quantity in equation (2.3) can be derived as $KL$ divergence between two Poisson random variables with means $\mu_{1}$ and $\mu_{2}$ (see also Devarajan & Cheung, 2016). Unlike $L_{2}(V||WH)$ which is symmetric, $D(V||WH)\neq D(WH||V)$ , so Lee and Seung (2001) referred to $D(V||WH)$ as the divergence of $V$ from $WH$ . In order to distinguish between the two directed divergences, $D(V||WH)$ and $D(WH||V)$ , we use the slight change in notation, $D^{d}(WH||V)$ , introduced in equation (1.3). Recently, Devarajan et al. (2015b) derived an algorithm for NMF using the directed divergence $D^{d}(WH||V)$ for the Poisson model given by

$$D^{d}(WH||V)=\sum_{ij}\left((WH)_{ij}\log\frac{(WH)_{ij}}{V_{ij}}-(WH)_{ij}+V_{ij}\right).\tag{2.4}$$

This quantity can be derived as dual $KL$ divergence between two Poisson random variables with means $\mu_{1}$ and $\mu_{2}$ as $\beta\rightarrow 1$ in equation (1.3). Similarly, Devarajan & Cheung (2014) developed NMF algorithms for signal-dependent noise using

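$$D^{d}(WH||V)=\sum_{i,j}\left\{\log\left(\frac{V_{ij}}{(WH)_{ij}}\right)+\frac{(WH)_{ij}}{V_{ij}}-1\right\}\tag{2.5}$$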

for the gamma model and

$$D^{d}(WH||V)=\sum_{i,j}\left\{\frac{(V_{ij}-(WH)_{ij})^{2}}{V_{ij}^{2}(WH)_{ij}}\right\}\tag{2.6}$$

for the inverse Gaussian model, quantities that can be derived based on dual $KL$ divergence for the respective models when $\beta\rightarrow 0$ and $\beta=-1$ in equation (1.3). Furthermore, Dhillon & Sra (2006) and Cichocki et al. (2009) have proposed NMF algorithms using some special cases of dual divergence.

Since the seminal work of Lee & Seung (2001), a variety of generalized divergence measures have been utilized for NMF in different applications. Examples include Cheung & Tresch (2005), Dhillon & Sra (2006), Kompass (2007), Cichocki et al. (2006, 2008, 2009, 2011), Févotte & Idier (2011) and Devarajan et al. (2015a,b; 2016). The works of Cheung & Tresch (2005), Cichocki et al. (2006), Févotte & Idier (2011) and Devarajan & Cheung (2016) are particularly relevant to the context of this paper. Cheung & Tresch (2005) rely directly on the likelihood approach while Cichocki et al. (2006) and Févotte & Idier (2011) utilize $\beta$ -divergence in equation (1.2). Recently, Devarajan & Cheung (2016) proposed a quasi-likelihood approach to NMF based on a unifying theoretical framework using the theory of generalized linear models. It includes all members of the exponential family of models and enables the use of link functions for modeling nonlinear effects. An underlying feature of all these approaches is that they are based on a generalization of $KL$ divergence in some form or another, unified by the approach in Devarajan & Cheung (2016). Although NMF algorithms for various special cases of generalized dual divergence in (1.3) exist as outlined earlier, a unifying approach that integrates different models and algorithms into a single framework has been lacking.

Within the context of NMF, we can express generalized dual $KL$ divergence of order $\alpha$ between the input matrix $V$ and reconstructed matrix $WH$ as

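$$D^{d}_{\alpha}(WH||V)=\sum_{i=1}^{p}\sum_{j=1}^{n}\frac{\left\{[(WH)_{ij}]^{2-\alpha}-(2-\alpha)[(WH)_{ij}]V_{ij}^{1-\alpha}+(1-\alpha)V_{ij}^{2-\alpha}\right\}}{(1-\alpha)(2-\alpha)},\quad\alpha\in\Re\backslash\{1,2\}.\tag{2.7}$$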

using equation (1.3) and the re-parametrization $\beta=2-\alpha$ . It is evident from (2.7) that $D^{d}_{\alpha}(WH||V)$ represents a continuum of divergence measures indexed by the parameter $\alpha$ . More importantly, it embeds the dual KL divergence of well-known models like the Gaussian ( $\alpha=0$ ), Poisson ( $\alpha\rightarrow 1$ ), gamma ( $\alpha\rightarrow 2$ ) and inverse Gaussian ( $\alpha=3$ ) models. When $1<\alpha<2$ , it includes the compound Poisson (CP) model which is continuous for $V_{ij}>0$ but allows exact zeros. By appropriately incorporating a dispersion parameter in (2.7), $D^{d}_{\alpha}(WH||V)$ includes the quasi-Poisson model which is useful for modeling over- or under-dispersion as $\alpha\rightarrow 1$ . Furthermore, it includes the extreme stable ( $\alpha\leq 0$ ) and positive stable models ( $\alpha>2$ ) (Tweedie, 1981; Jorgensen, 1987).
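As an illustration of how the continuum in (2.7) embeds the named models, the sketch below evaluates the divergence entrywise on small matrices stored as lists of rows. The function name `dual_divergence_alpha` and the argument `R` (standing for the reconstruction $WH$) are assumptions made for this example; the limiting cases $\alpha=1$ and $\alpha=2$ use equations (2.4) and (2.5).

```python
import math

def dual_divergence_alpha(R, V, alpha, tol=1e-12):
    """D^d_alpha(R || V) of equation (2.7), where R plays the role of WH.

    All entries of R and V are assumed strictly positive.
    """
    total = 0.0
    for r_row, v_row in zip(R, V):
        for r, v in zip(r_row, v_row):
            if abs(alpha - 1.0) < tol:      # Poisson model, equation (2.4)
                total += r * math.log(r / v) - r + v
            elif abs(alpha - 2.0) < tol:    # gamma model, equation (2.5)
                total += math.log(v / r) + r / v - 1.0
            else:                           # general case, equation (2.7)
                total += (r ** (2 - alpha) - (2 - alpha) * r * v ** (1 - alpha)
                          + (1 - alpha) * v ** (2 - alpha)) / ((1 - alpha) * (2 - alpha))
    return total

V = [[1.0, 2.0], [3.0, 4.0]]
R = [[1.5, 2.0], [2.0, 5.0]]
# alpha = 0 (Gaussian): half the residual sum of squares
print(dual_divergence_alpha(R, V, 0.0))
# alpha = 3 (inverse Gaussian): equals (2.6) up to the constant factor 1/2
print(dual_divergence_alpha(R, V, 3.0))
```

Note that (2.7) at $\alpha=3$ carries a factor $\frac{1}{2}$ relative to (2.6); the two agree up to this constant, which does not affect the minimizer.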

Although $\beta$ -divergence includes members of the exponential family of models, it is evident from the work of Févotte & Idier (2011) that a unified NMF algorithm is not feasible due to the non-convexity of the objective function (1.2) for certain ranges of the parameter $\beta$ . It turns out that this is not the case with generalized dual $KL$ divergence (2.7) and that a unified algorithm is indeed possible as shown in the following section. Here, we develop such an algorithm for NMF indexed by the parameter $\alpha$ by minimizing the cost function in equation (2.7). Such an approach generalizes the work of Devarajan & Cheung (2014) and Devarajan et al. (2015b) and embeds algorithms for members of the exponential family of models as special cases within a unifying statistical framework.

3 A unified NMF algorithm based on dual divergence

We derive a unified NMF algorithm where $\epsilon$ in equation (2.1) is a member of the class of models included in (2.7). One can ignore $\frac{1}{(1-\alpha)(2-\alpha)}$ in (2.7) and define the function

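$$D^{d}_{\alpha}(WH||V)=\left\{\begin{array}{ll}\displaystyle\sum_{i,j}\left\{[(WH)_{ij}]^{2-\alpha}-(2-\alpha)[(WH)_{ij}]V^{1-\alpha}_{ij}+(1-\alpha)V^{2-\alpha}_{ij}\right\},&\alpha\in(-\infty,1)\cup(2,\infty),\\ \displaystyle\sum_{i,j}\left\{-[(WH)_{ij}]^{2-\alpha}+(2-\alpha)[(WH)_{ij}]V^{1-\alpha}_{ij}-(1-\alpha)V^{2-\alpha}_{ij}\right\},&1<\alpha<2,\\ \displaystyle\sum_{i,j}\left\{(WH)_{ij}\log\left(\frac{(WH)_{ij}}{V_{ij}}\right)-(WH)_{ij}+V_{ij}\right\},&\alpha=1,\\ \displaystyle\sum_{i,j}\left\{\log\left(\frac{V_{ij}}{(WH)_{ij}}\right)+\frac{(WH)_{ij}}{V_{ij}}-1\right\},&\alpha=2.\end{array}\right.\tag{3.1}$$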

Thus, for any information measure which is proportional to $D^{d}_{\alpha}(WH||V)$ we obtain equation (3.1). In the case of signal-dependent data such as those observed in various signal processing applications, the divergence in equation (3.1) offers a flexible choice in decomposing a high-dimensional matrix.

Theorem 1.

For $\alpha\in\Re\backslash\{1\}$ , the measure $D^{d}_{\alpha}(WH||V)$ in equation (3.1) is non-increasing under the multiplicative update rules for $W$ and $H$ given by

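$$H^{t+1}_{aj}=H^{t}_{aj}\left[\frac{\sum_{i}\left(\frac{1}{\sum_{b}W_{ib}H^{t}_{bj}}\right)^{\alpha-1}W_{ia}}{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}\right]^{1/(\alpha-1)}\tag{3.2}$$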

and

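$$W^{t+1}_{ia}=W^{t}_{ia}\left[\frac{\sum_{j}\left(\frac{1}{\sum_{b}W^{t}_{ib}H_{bj}}\right)^{\alpha-1}H_{aj}}{\sum_{j}H_{aj}V_{ij}^{1-\alpha}}\right]^{1/(\alpha-1)}.\tag{3.3}$$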

This measure is also invariant under these updates if and only if $W$ and $H$ are at a stationary point of the divergence.

Proof. We provide a more general proof of the monotonicity of updates based on splitting the domain $\Re\backslash\{1\}$ of the parameter $\alpha$ into three disjoint regions and considering them separately. The update rules for $W$ and $H$ obtained under all cases, however, are the same. A detailed proof of the monotonicity of updates and update rules for the special cases $\alpha=2$ and $\alpha=3$ are provided in Devarajan & Cheung (2014). In §3.1, we prove monotonicity of updates and derive update rules for the special case $\alpha=1$ .

First, we derive the update for $H$ and prove its monotonicity when $\alpha>2$ or $\alpha<1$ . Then we show how similar arguments can be used to prove the result for $1<\alpha<2$ . We will make use of an auxiliary function similar to the one used in the EM algorithm (Dempster et al., 1977; Lee & Seung, 2001; Devarajan & Cheung, 2016). Note that for $h$ real, $G(h,h^{\prime})$ is an auxiliary function for $F(h)$ if $G(h,h^{\prime})\geq F(h)$ and $G(h,h)=F(h)$ where $G$ and $F$ are scalar valued functions. Also, if $G$ is an auxiliary function, then $F$ is non-increasing under the update $h^{t+1}=\arg\displaystyle\min_{h}G(h,h^{t})$ . Using the first equation in (3.1), we define

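$$F(H_{aj})=(1-\alpha)\sum_{i}V_{ij}^{2-\alpha}-(2-\alpha)\sum_{i}\left\{V_{ij}^{1-\alpha}\left(\sum_{a}W_{ia}H_{aj}\right)\right\}+\sum_{i}\left[\sum_{a}W_{ia}H_{aj}\right]^{2-\alpha},$$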

where $H_{aj}$ denotes the ${aj}^{th}$ entry of $H$ . Then the auxiliary function for $F(H_{aj})$ is

[TABLE]

It is straightforward to show that $G(H_{aj},H_{aj})=F(H_{aj})$ . To show that $G(H_{aj},H^{t}_{aj})\geq F(H_{aj})$ , we use the convexity of $x^{2-\alpha}$ when $\alpha>2$ or $\alpha<1$ and the fact that for any convex function $f,f\left(\sum^{n}_{i=1}r_{i}x_{i}\right)\leq\sum^{n}_{i=1}r_{i}f(x_{i})$ for rational nonnegative numbers $r_{1},\cdots,r_{n}$ such that $\sum^{n}_{i=1}r_{i}=1$ . We then obtain

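$$\left(\sum_{a}W_{ia}H_{aj}\right)^{2-\alpha}\leq\sum_{a}\gamma_{a}\left(\frac{W_{ia}H_{aj}}{\gamma_{a}}\right)^{2-\alpha}$$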

where $\gamma_{a}=\dfrac{W_{ia}H^{t}_{aj}}{\sum_{b}W_{ib}H^{t}_{bj}}$ . From this inequality it follows that $F(H_{aj})\leq G(H_{aj},H^{t}_{aj})$ . The minimizer of $G(H_{aj},H^{t}_{aj})$ with respect to $H_{aj}$ is obtained by solving

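$$\frac{dG(H_{aj},H^{t}_{aj})}{dH_{aj}}=-(2-\alpha)\sum_{i}W_{ia}V_{ij}^{1-\alpha}+(2-\alpha)\sum_{i}\left\{(W_{ia}H_{aj})^{1-\alpha}W_{ia}\left(\frac{W_{ia}H^{t}_{aj}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)^{\alpha-1}\right\}=0.$$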

The update rule for $H$ thus takes the form given in (3.2). For $1<\alpha<2$ , using the second equation in (3.1) we define

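$$F(H_{aj})=-(1-\alpha)\sum_{i}V_{ij}^{2-\alpha}+(2-\alpha)\sum_{i}\left\{V_{ij}^{1-\alpha}\sum_{a}W_{ia}H_{aj}\right\}-\sum_{i}\left[\sum_{a}W_{ia}H_{aj}\right]^{2-\alpha},$$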

and the auxiliary function for $F(H_{aj})$ as

[TABLE]

It is easy to see that $G(H_{aj},H_{aj})=F(H_{aj})$ . By using the convexity of $-x^{2-\alpha}$ for $1<\alpha<2$ , we can show that $F(H_{aj})\leq G(H_{aj},H^{t}_{aj})$ and proceed to obtain the update rule for $H$ as described above. The update rule for this case is exactly as that specified for the case $\alpha>2$ or $\alpha<1$ . By using symmetry of the decomposition $V\sim WH$ and by reversing the arguments on $W$ , one can easily obtain the update rule for $W$ given in (3.3) in the same manner as $H$ .

For a given $\alpha$ , we start with random initial values for $W$ and $H$ and iterate until convergence, i.e., until $|D^{d,(i)}_{\alpha}(WH||V)-D^{d,(i-1)}_{\alpha}(WH||V)|<\delta$ where $\delta$ is a pre-specified threshold between [math] and $1$ and $i$ denotes the iteration number.
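The iteration described above can be sketched in plain Python for small matrices. This is an illustrative implementation under stated assumptions, not the author's code: the function names, the uniform initialization range, the default `delta`, the `max_iter` cap and the seed are all choices made for this example. The update for $H$ is written in the equivalent form $H^{t}_{aj}\left(\sum_{i}W_{ia}V_{ij}^{1-\alpha}/\sum_{i}W_{ia}(WH)_{ij}^{1-\alpha}\right)^{1/(1-\alpha)}$ of (3.2), and the update for $W$ reuses it through the symmetry $V^{T}\sim H^{T}W^{T}$. The sketch covers $\alpha\notin\{1,2\}$ and assumes strictly positive $V$.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def divergence(V, R, alpha):
    """D^d_alpha(R || V) from equation (2.7), for alpha not in {1, 2}."""
    c = (1.0 - alpha) * (2.0 - alpha)
    return sum((r ** (2 - alpha) - (2 - alpha) * r * v ** (1 - alpha)
                + (1 - alpha) * v ** (2 - alpha)) / c
               for r_row, v_row in zip(R, V) for r, v in zip(r_row, v_row))

def update_H(V, W, H, alpha):
    """One pass of update (3.2), written with exponent s = 1 - alpha."""
    s = 1.0 - alpha
    R = matmul(W, H)
    p, n, k = len(V), len(V[0]), len(H)
    return [[H[a][j] * (sum(W[i][a] * V[i][j] ** s for i in range(p))
                        / sum(W[i][a] * R[i][j] ** s for i in range(p))) ** (1.0 / s)
             for j in range(n)] for a in range(k)]

def update_W(V, W, H, alpha):
    """Update (3.3), obtained by applying update_H to V^T ~ H^T W^T."""
    return transpose(update_H(transpose(V), transpose(H), transpose(W), alpha))

def nmf_dual(V, rank, alpha, delta=1e-8, max_iter=500, seed=0):
    """Alternate (3.2) and (3.3) until the change in divergence falls below delta."""
    rng = random.Random(seed)
    p, n = len(V), len(V[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(p)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(rank)]
    history = [divergence(V, matmul(W, H), alpha)]
    for _ in range(max_iter):
        H = update_H(V, W, H, alpha)
        W = update_W(V, W, H, alpha)
        history.append(divergence(V, matmul(W, H), alpha))
        if abs(history[-2] - history[-1]) < delta:
            break
    return W, H, history

rng = random.Random(1)
V = [[rng.uniform(0.5, 2.0) for _ in range(5)] for _ in range(4)]
W, H, history = nmf_dual(V, rank=2, alpha=1.5)
print(history[0], history[-1])  # the recorded divergence should not increase
```

Because the updates are multiplicative and the initial entries are positive, $W$ and $H$ stay positive throughout; the recorded `history` gives a direct check of the monotonicity stated in Theorem 1.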

3.1 Special Cases

As noted before, $D(WH||V)=D(V||WH)=\sum_{ij}{(V_{ij}-(WH)_{ij})}^{2}$ for the Gaussian model corresponding to $\alpha=0$ . Hence the NMF algorithm for the Gaussian model based on dual KL divergence is identical to the standard algorithm based on Euclidean distance outlined in Lee & Seung (2001) (Devarajan & Cheung, 2014). When $\alpha\rightarrow 2$ and $\alpha=3$ in equation (2.7), we obtain dual $KL$ divergence for the gamma and inverse Gaussian models in equations (2.5) and (2.6), respectively. As noted earlier, NMF algorithms for these two models have been described in Devarajan & Cheung (2014) where monotonicity of updates was proved and update rules were derived for each model. Even though the gamma model is obtained as the limiting case $\alpha\rightarrow 2$ in (3.1), closed form update rules for $W$ and $H$ can be obtained using $\alpha=2$ in the generalized update rules in equations (3.2) and (3.3). The Poisson special case is discussed below.

3.1.1 Poisson Model

When $\alpha\rightarrow 1$ in equation (2.7), we obtain dual $KL$ divergence for the Poisson model given in equation (2.4). Devarajan et al. (2015b) provide an algorithm for this model involving multiplicative updates for $W$ and $H$ but without a formal proof. These update rules are obtained from (3.2) and (3.3) in the limit $\alpha\rightarrow 1$ and are derived in Theorem 2 below.

Theorem 2.

The measure in equation (2.4) is non-increasing under the multiplicative update rules for $W$ and $H$ given by

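$$H^{t+1}_{aj}=H^{t}_{aj}\exp\left\{\frac{\sum_{i}W_{ia}\log\left(\frac{V_{ij}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)}{\sum_{i}W_{ia}}\right\}\tag{3.6}$$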

and

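$$W^{t+1}_{ia}=W^{t}_{ia}\exp\left\{\frac{\sum_{j}H_{aj}\log\left(\frac{V_{ij}}{\sum_{b}W^{t}_{ib}H_{bj}}\right)}{\sum_{j}H_{aj}}\right\}.\tag{3.7}$$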

This measure is also invariant under these updates if and only if $W$ and $H$ are at a stationary point of the divergence.

Proof. Using (3.2), the update rule for $H$ for the Poisson model can be written as

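$$H^{t+1}_{aj}=\lim_{\alpha\to 1}H^{t}_{aj}\left[\frac{\sum_{i}\left(\frac{1}{\sum_{b}W_{ib}H^{t}_{bj}}\right)^{\alpha-1}W_{ia}}{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}\right]^{1/(\alpha-1)}.\tag{3.8}$$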

The right hand side of (3.8) can be re-written as a function of $\alpha$ as

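$$H^{t}_{aj}\psi(\alpha)=H^{t}_{aj}\left(\frac{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}{\sum_{i}W_{ia}\left(\sum_{b}W_{ib}H^{t}_{bj}\right)^{1-\alpha}}\right)^{1/(1-\alpha)}.\tag{3.9}$$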

Using (3.9) in (3.8) and taking logarithm on both sides, we get

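$$\log H^{t+1}_{aj}=\log H^{t}_{aj}+\lim_{\alpha\to 1}\log\psi(\alpha)=\log H^{t}_{aj}+\lim_{\alpha\to 1}\frac{1}{1-\alpha}\left\{\log\left(\frac{\sum_{i}W_{ia}V_{ij}^{1-\alpha}}{\sum_{i}W_{ia}\left(\sum_{b}W_{ib}H^{t}_{bj}\right)^{1-\alpha}}\right)\right\}.$$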

Applying l’Hospital’s rule to compute the limit, we obtain

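$$\log H^{t+1}_{aj}=\log H^{t}_{aj}+\frac{\sum_{i}W_{ia}\log\left(\frac{V_{ij}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)}{\sum_{i}W_{ia}}.$$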

Hence

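$$H^{t+1}_{aj}=H^{t}_{aj}\exp\left\{\frac{\sum_{i}W_{ia}\log\left(\frac{V_{ij}}{\sum_{b}W_{ib}H^{t}_{bj}}\right)}{\sum_{i}W_{ia}}\right\}.$$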

Similarly, the update rule for $W$ can be obtained as specified in (3.7). Monotonicity of these updates follows directly from the monotonicity of generalized updates in equations (3.2) and (3.3) established in Theorem 1 when $\alpha\rightarrow 1$ .
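The limit argument in the proof can be checked numerically: evaluating the general update (3.2) at a value of $\alpha$ very close to 1 should reproduce the Poisson update (3.6). The sketch below is illustrative only; the helper names and the small test matrices are assumptions made for this example.

```python
import math

def reconstruct(W, H):
    """Matrix product WH for matrices stored as lists of rows."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)] for row in W]

def poisson_dual_divergence(V, R):
    """Dual KL divergence of equation (2.4) between R = WH and V."""
    return sum(r * math.log(r / v) - r + v
               for r_row, v_row in zip(R, V) for r, v in zip(r_row, v_row))

def poisson_update_H(V, W, H):
    """Update (3.6): H_aj is scaled by a W-weighted geometric mean of V_ij / (WH)_ij."""
    R = reconstruct(W, H)
    p = len(V)
    return [[H[a][j] * math.exp(
                 sum(W[i][a] * math.log(V[i][j] / R[i][j]) for i in range(p))
                 / sum(W[i][a] for i in range(p)))
             for j in range(len(H[0]))] for a in range(len(H))]

def general_update_H(V, W, H, alpha):
    """Update (3.2) for alpha != 1, in the equivalent form (3.9)."""
    s = 1.0 - alpha
    R = reconstruct(W, H)
    p = len(V)
    return [[H[a][j] * (sum(W[i][a] * V[i][j] ** s for i in range(p))
                        / sum(W[i][a] * R[i][j] ** s for i in range(p))) ** (1.0 / s)
             for j in range(len(H[0]))] for a in range(len(H))]

V = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0]]
W = [[0.5, 1.0], [1.5, 0.2]]
H = [[1.0, 0.5, 2.0], [0.3, 1.2, 0.7]]
print(poisson_update_H(V, W, H))
print(general_update_H(V, W, H, 1.0 + 1e-6))  # should agree to several digits
```

Comparing `poisson_dual_divergence` before and after one call to `poisson_update_H` gives a quick empirical check of the monotonicity claimed in Theorem 2.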

4 Measuring Goodness-of-fit

The updates derived in equations (3.2), (3.3), (3.6) and (3.7) ensure monotonicity of updates for a given run of the NMF algorithm for pre-specified $\alpha$ and rank $r$ , based on random initial values for $W$ and $H$ . However, NMF algorithms are typically prone to the problem of local minima and, thus, require running the algorithm with multiple random restarts. The factorization from the run that produces the best reconstruction, quantified by minimum reconstruction error across multiple runs, can be used for assessing goodness-of-fit. Following Devarajan & Cheung (2014, 2016), we propose a unified measure for this purpose based on model-specific minimum reconstruction error, $RE$ . It quantifies the variation explained by the continuum of statistical models contained in equation (3.1). For a given rank $r$ the proportion of explained variation, $R^{2}$ , is dependent on the particular model, determined by $\alpha$ , used in the factorization and is computed as

[TABLE]

where $RE$ is the numerator on the right hand side of equation (4.1), $D^{d}(WH||V)$ is as specified in equation (3.1) and $WH$ represents the reconstructed matrix. For rank $r$ , the $(i,j)^{th}$ entry of $WH$ is $(WH)_{ij}=\sum_{a=1}^{r}W_{ia}H_{aj}$ ; in the denominator, each entry is replaced by the grand mean of all entries of the input matrix $V$ , $\bar{V}=\dfrac{1}{np}\left\{\sum_{i=1}^{p}\sum_{j=1}^{n}V_{ij}\right\}$ . Note that when $\alpha=0$ , these quantities can be interpreted as the residual and total sum of squares, respectively, associated with the Gaussian model. For the nonlinear models indexed by $\alpha$ in equation (3.1), $R^{2}$ measures the proportion of empirical uncertainty explained by the inclusion of $W$ and $H$ (Cameron & Windmeijer, 1997; Devarajan & Cheung, 2014; 2016).
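The description above can be turned into a small computation. The sketch below assumes that $R^{2}$ has the deviance-ratio form $1-RE/D^{d}_{\alpha}(\bar{V}||V)$, which is consistent with the residual and total sum of squares interpretation at $\alpha=0$ but is an assumed reading, since the displayed equation (4.1) is not reproduced in this text. The names `kernel` and `explained_variation` are illustrative, and only $\alpha\notin\{1,2\}$ is handled.

```python
def kernel(r, v, alpha):
    """Per-entry generalized dual divergence from equation (2.7), alpha not in {1, 2}."""
    return (r ** (2 - alpha) - (2 - alpha) * r * v ** (1 - alpha)
            + (1 - alpha) * v ** (2 - alpha)) / ((1 - alpha) * (2 - alpha))

def explained_variation(V, R, alpha):
    """Assumed form of R^2: 1 - D(R || V) / D(Vbar || V), with Vbar the grand mean of V."""
    entries = [v for row in V for v in row]
    v_bar = sum(entries) / len(entries)
    # reconstruction error RE: divergence between the reconstruction R = WH and V
    re = sum(kernel(r, v, alpha) for r_row, v_row in zip(R, V) for r, v in zip(r_row, v_row))
    # baseline: every entry of the reconstruction replaced by the grand mean
    total = sum(kernel(v_bar, v, alpha) for v in entries)
    return 1.0 - re / total

V = [[1.0, 2.0], [3.0, 4.0]]
R = [[1.5, 2.0], [2.0, 5.0]]
print(explained_variation(V, R, 0.0))  # Gaussian case: 1 - RSS / TSS
print(explained_variation(V, R, 3.0))  # inverse Gaussian case
```

Under this reading a perfect reconstruction gives $R^{2}=1$ and a reconstruction equal to the grand mean gives $R^{2}=0$; the normalizing constant of (2.7) cancels in the ratio.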

5 Applications

Several special cases of the proposed unifying framework have been utilized for NMFs involving a variety of applications. For instance, Devarajan & Cheung (2014) derived algorithms based on dual divergence for gamma and inverse Gaussian models - using equations (2.5) and (2.6), respectively - for handling signal-dependent noise structures and demonstrated their application in electromyography studies for extraction of muscle synergies. These methods explained more variation ( $R^{2}$ ) in the data at the appropriate number of synergies identified for each data set in a study involving frog motor behaviors under different experimental conditions. Similarly, Devarajan et al. (2015b) proposed an algorithm for the Poisson model based on dual divergence in equation (2.4) for unsupervised dimension reduction of discrete multivariate data. Two benchmark data sets - the Reuters news groups data and the Saccharomyces Genome Database (Shahnaz et al., 2006; Chagoyen et al., 2006) - were utilized for this purpose. In both cases, the algorithm based on dual divergence resulted in the best reconstruction compared to other competing methods. The proposed approach consolidates the above methods as well as a spectrum of other methods into a unifying framework and, thus, provides a flexible alternative for exploratory analysis of high dimensional data generated by diverse mechanisms that are exclusive to different applications.

6 Conclusions

In summary, this paper presented a unified approach to NMF based on generalized dual $KL$ divergence along with a rigorous proof of convergence. The proposed approach is broadly applicable to the exponential family of models and is particularly useful in applications where there is a priori knowledge or empirical evidence of signal-dependence in noise. Furthermore, it unifies various existing algorithms and contrasts with the recently proposed quasi-likelihood approach, thus providing a complementary view of NMF. The basic principle underlying this framework is broadly extensible to the use of penalty, kernel and discriminant functions and to tensors.

Figure Legend

Figure 1, panels (a)-(d): Generalized dual $KL$ divergence, equation (1.3), plotted as a function of $\mu_{2}=\mu$ for $\mu_{1}=1$ and various choices of $\alpha$ , illustrating its convexity across the entire range of $\alpha$ . The values of $\alpha$ are indicated in the legend within each panel.

Acknowledgements

Research of the author was supported in part by NIH Grant P30 CA06927.

Bibliography

[1] Basu, A., Harris, I.R., Hjort, N.L. and Jones, M.C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559.
[2] Cameron, A.C., Windmeijer, F.A.G. (1997). An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics, 77(2):329-342.
[3] Chagoyen, M., Carmona-Saez, P., Shatkay, H., Carazo, J.M., Pascual-Montano, A. (2006). Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics, 7:41.
[4] Cheung, V.C.K., Tresch, M.C. (2005). Nonnegative matrix factorization algorithms modeling noise distributions within the exponential family. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, 4990-4993.
[5] Cichocki, A., Zdunek, R., Amari, S. (2006). Csiszar’s divergences for non-negative matrix factorization: Family of new algorithms. Lecture Notes in Computer Science, Independent Component Analysis and Blind Signal Separation, Springer, LNCS-3889, 32-39.
[6] Cichocki, A., Lee, H., Kim, Y.-D., Choi, S. (2008). Non-negative matrix factorization with $\alpha$-divergence. Pattern Recognition Letters, 29(9):1433-1440.
[7] Cichocki, A., Zdunek, R., Phan, A.H., Amari, S. (2009). Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley.
[8] Cichocki, A., Cruces, S., Amari, S. (2011). Generalized Alpha-Beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13:134-170.
