Non-negative matrix factorization based on generalized dual divergence
Karthik Devarajan

TL;DR
This paper introduces a comprehensive theoretical framework for non-negative matrix factorization using generalized dual divergence, encompassing various models and noise structures, with proven convergence and adaptable extensions.
Contribution
It develops a unified framework for NMF based on generalized dual divergence, including convergence proofs and potential for extensions like penalties and tensors.
Findings
Framework generalizes existing NMF methods
Convergence of algorithms is proven
Provides a goodness-of-fit measure
Abstract
A theoretical framework for non-negative matrix factorization based on generalized dual Kullback-Leibler divergence, which includes members of the exponential family of models, is proposed. A family of algorithms is developed using this framework and its convergence proven using the Expectation-Maximization algorithm. The proposed approach generalizes some existing methods for different noise structures and contrasts with the recently proposed quasi-likelihood approach, thus providing a useful alternative for non-negative matrix factorizations. A measure to evaluate the goodness-of-fit of the resulting factorization is described. This framework can be adapted to include penalty, kernel and discriminant functions as well as tensors.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBlind Source Separation Techniques · Statistical and numerical algorithms · Face and Expression Recognition
Non-negative matrix factorization based on generalized dual divergence
Karthik Devarajan
*Department of Biostatistics & Bioinformatics, Fox Chase Cancer Center,
Temple University Health System, Philadelphia, PA*
Keywords: nonnegative matrix factorization, Kullback-Leibler divergence, dual divergence, EM algorithm, high dimensional data, tensor
Abstract
A theoretical framework for non-negative matrix factorization based on generalized dual Kullback-Leibler divergence, which includes members of the exponential family of models, is proposed. A family of algorithms is developed using this framework and its convergence proven using the Expectation-Maximization algorithm. The proposed approach generalizes some existing methods for different noise structures and contrasts with the recently proposed quasi-likelihood approach, thus providing a useful alternative for non-negative matrix factorizations. A measure to evaluate the goodness-of-fit of the resulting factorization is described. This framework can be adapted to include penalty, kernel and discriminant functions as well as tensors.
1 Kullback-Leibler divergence and its dual
The Kullback-Leibler () information divergence between two distributions and with density (mass) functions and is
[TABLE]
given that is absolutely continuous with respect to , . The discrimination information function in equation (1.1) is a measure commonly used to compare two distributions, and was introduced in Kullback and Leibler (1951). information divergence, also referred to as relative entropy or cross-entropy, is the fundamental information measure with many desirable properties for developing probability and statistical methodologies. Similarly, the measure is known as dual Kullback-Leibler divergence between and . In light of the definition above, and are also known as directed divergences. These quantities are nonnegative definite and are zero if and only if almost everywhere (Kullback, 1959; Ebrahimi and Soofi, 2004). One issue pertaining to is that, apart from some exceptional cases such as and , is not symmetric in and where the latter is the reference distribution, i.e., . This lack of symmetry may be of no concern or even desirable in situations where a natural or ideal reference is at hand; e.g., when is uniform, a natural reference distribution for a problem. However, this is generally not the case for most problems and choice of reference is dependent on the particular application of interest.
Let and be the means of random variables corresponding to the probability models and with respective densities and . Then -divergence, , expressed in terms of the means and can be written as
[TABLE]
-divergence between two densities and was introduced by Basu et al. (1998) and Eguchi & Kano (2001). It has been used by Févotte & Idier (2011) for non-negative matrix factorizations (NMF) where -divergence between two objects is considered. In our case, the means and represent these objects and we will follow this notation in the remainder of this section. It is well known that -divergence in equation (1.2) includes members of the exponential family of models such as the Gaussian , Poisson , gamma and inverse Gaussian models as special cases. Within this context, -divergence can be interpreted as generalized divergence indexed by the parameter (Devarajan & Cheung, 2016). For example, when we obtain the Gaussian likelihood , and, in the limit , we obtain the gamma likelihood . In the limit , we obtain the Poisson likelihood used in Lee & Seung (2001). These quantities are commonly referred to as Euclidean distance (ED), Itakuro-Saito (IS) divergence and divergence, respectively, in the NMF literature (Févotte & Idier, 2011; Devarajan & Cheung, 2014; Lee & Seung, 2001). However, it should be noted that our use of the term divergence has a broader connotation similar to that in Devarajan & Cheung (2014, 2016) and is based on its original definition outlined in Kullback (1951).
We define the generalized dual divergence of order by reversing the roles of and in equation (1.2). It is given by
[TABLE]
where the superscript is used to denote this dual form which also includes, as special cases, members of the exponential family of models as outlined above. When we obtain the Gaussian likelihood which is identical to ED, and, in the limit , we obtain which can be viewed as the dual version of IS divergence. Consider as a function of with fixed. Following Févotte & Idier (2011), we find that the first and second derivatives of with respect to given by
[TABLE]
and
[TABLE]
respectively, are continuous in . It is evident from equations (1.4) and (1.5) that has a unique minimum at and that it is convex in for (see Figure 1). This contrasts significantly with -divergence which is convex in only for (Févotte & Idier, 2011). For a scalar , also satisfies the scale property of , i.e.,
[TABLE]
Scale invariance is attained for the case in equation (1.3) (dual version of IS divergence).
2 Motivating NMF using generalized dual divergence
Lee and Seung (1999, 2001) developed NMF algorithms for decomposing a non-negative matrix into the product of lower dimensional non-negative matrices and such that , where is the factorization rank. In order to find an approximation for the input matrix , cost functions that quantify the quality of the approximation need to be constructed using some measure of divergence between and the reconstructed matrix . This problem can be formulated in the form of the linear model
[TABLE]
where is noise. Lee & Seung’s algorithms were based on ED,
[TABLE]
and the directed divergence measure,
[TABLE]
which correspond to the addition of Gaussian and Poisson noise, respectively, in (2.1). As noted earlier, the quantity in equation (2.2) can be derived as divergence between two Gaussian random variables with means and (and equal variance) and the quantity in equation (2.3) can be derived as divergence between two Poisson random variables with means and (see also Devarajan & Cheung, 2016). Unlike which is symmetric, , so Lee and Seung (2001) referred to as the divergence of from . In order to distinguish between the two directed divergences, and , we use the slight change in notation, , introduced in equation (1.3). Recently, Devarajan et al. (2015b) derived an algorithm for NMF using the directed divergence for the Poisson model given by
[TABLE]
This quantity can be derived as dual divergence between two Poisson random variables with means and as in equation (1.3). Similarly, Devarajan & Cheung (2014) developed NMF algorithms for signal-dependent noise using
[TABLE]
for the gamma model and
[TABLE]
for the inverse Gaussian model, quantities that can be derived based on dual divergence for the respective models when and in equation (1.3). Furthermore, Dhillon & Sra (2006) and Cichocki et al. (2009) have proposed NMF algorithms using some special cases of dual divergence.
Since the seminal work of Lee & Seung (2001), a variety of generalized divergence measures have been utilized for NMF in different applications. Examples include Cheung & Tresch (2005), Dhillon & Sra (2006), Kompass (2007), Cichocki et al. (2006, 2008, 2009, 2011), Févotte & Idier (2011) and Devarajan et al. (2015a,b; 2016). The works of Cheung & Tresch (2005), Cichocki et al. (2006), Févotte & Idier (2011) and Devarajan & Cheung (2016) are particularly relevant to the context of this paper. Cheung & Tresch (2005) rely directly on the likelihood approach while Cichocki et al. (2006) and Févotte & Idier (2011) utilize -divergence in equation (1.2). Recently, Devarajan & Cheung (2016) proposed a quasi-likelihood approach to NMF based on a unifying theoretical framework using the theory of generalized linear models. It includes all members of the exponential family of models and enables the use of link functions for modeling nonlinear effects. An underlying feature of all these approaches is that they are based on a generalization of divergence in some form or another, unified by the approach in Devarajan & Cheung (2016). Although NMF algorithms for various special cases of generalized dual divergence in (1.3) exist as outlined earlier, a unifying approach that integrates different models and algorithms into a single framework has been lacking.
Within the context of NMF, we can express generalized dual divergence of order between the input matrix and reconstructed matrix as
[TABLE]
using equation (1.3) and the re-parametrization . It is evident from (2.7) that represents a continuum of divergence measures indexed by the parameter . More importantly, it embeds the dual KL divergence of well-known models like the Gaussian (), Poisson (), gamma () and inverse Gaussian () models. When , it includes the compound Poisson (CP) model which is continuous for but allows exact zeros. By appropriately incorporating a dispersion parameter in (2.7), includes the quasi-Poisson model which is useful for modeling over- or under-dispersion as . Furthermore, it includes the extreme stable () and positive stable models () (Tweedie, 1981; Jorgensen, 1987).
Although -divergence includes members of the exponential family of models, it is evident from the work of Févotte & Idier (2011) that a unified NMF algorithm is not feasible due to the non-convexity of the objective function (1.2) for certain ranges of the parameter . It turns out that this is not the case with generalized dual divergence (2.7) and that a unified algorithm is indeed possible as shown in the following section. Here, we develop such an algorithm for NMF indexed by the parameter by minimizing the cost function in equation (2.7). Such an approach generalizes prior work the work of Devarajan & Cheung (2014) and Devarajan et al. (2015b) and embeds algorithms for members of the exponential family of models as special cases within a unifying statistical framework.
3 A unified NMF algorithm based on dual divergence
We derive a unified NMF algorithm where in equation (2.1) is a member of the class of models included in (2.7). One can ignore in (2.7) and define the function
[TABLE]
Thus, for any information measure which is proportional to we obtain equation (3.1). In the case of signal-dependent data such as those observed in various signal processing applications, the divergence in equation (3.1) offers a flexible choice in decomposing a high-dimensional matrix.
Theorem 1**.**
For , the measure in equation (3.1) is non-increasing under the multiplicative update rules for and given by
[TABLE]
and
[TABLE]
This measure is also invariant under these updates if and only if and are at a stationary point of the divergence.
Proof. We provide a more general proof of the monotonicity of updates based on splitting the domain of the parameter into three disjoint regions and considering them separately. The update rules for and obtained under all cases, however, are the same. A detailed proof of the monotonicity of updates and update rules for the special cases and are provided in Devarajan & Cheung (2014). In §3.1, we prove monotonicity of updates and derive update rules for the special case .
First, we derive the update for and prove its monotonicity when or . Then we show how similar arguments can be used to prove the result for . We will make use of an auxiliary function similar to the one used in the EM algorithm (Dempster et al., 1977; Lee & Seung, 2001; Devarajan & Cheung, 2016). Note that for real, is an auxiliary function for if and where and are scalar valued functions. Also, if is an auxiliary function, then is non-increasing under the update . Using the first equation in (3.1), we define
[TABLE]
where denotes the entry of . Then the auxiliary function for is
[TABLE]
It is straightforward to show that . To show that , we use the convexity of when or and the fact that for any convex function for rational nonnegative numbers such that . We then obtain
[TABLE]
where . From this inequality it follows that . The minimizer of is obtained by solving
[TABLE]
The update rule for thus takes the form given in (3.2). For , using the second equation in (3.1) we define
[TABLE]
and the auxiliary function for as
[TABLE]
It is easy to see that . By using the convexity of for , we can show that and proceed to obtain the update rule for as described above. The update rule for this case is exactly as that specified for the case or . By using symmetry of the decomposition and by reversing the arguments on , one can easily obtain the update rule for given in (3.3) in the same manner as .
For a given , we will start with random initial values for and and iterate until convergence, i.e, iterate until where is a pre-specified threshold between [math] and and denotes iteration number.
3.1 Special Cases
As noted before, for the Gaussian model corresponding to . Hence the NMF algorithm for the Gaussian model based on dual KL divergence is identical o the standard algorithm based on Euclidean distance outlined in Lee & Seung (2001) (Devarajan & Cheung, 2014). When and in equation (2.7), we obtain dual divergence for the gamma and inverse Gaussian models in equations (2.5) and (2.6), respectively. As noted earlier, NMF algorithms for these two models have been described in Devarajan & Cheung (2014) where monotonicity of updates was proved and update rules were derived for each model. Even though the gamma model is obtained as the limiting case in (3.1), closed form update rules for and can be obtained using in the generalized update rules in equations (3.2) and (3.3). The Poisson special case is discussed below.
3.1.1 Poisson Model
When in equation (2.7), we obtain dual divergence for the Poisson model given in equation (2.4). Devarajan et al. (2015b) provide an algorithm for this model involving multiplicative updates for and but without a formal proof. These update rules are obtained from (3.2) and (3.3) in the limit and are derived in Theorem 2 below.
Theorem 2**.**
The measure in equation (2.4) is non-increasing under the multiplicative update rules for and given by
[TABLE]
and
[TABLE]
This measure is also invariant under these updates if and only if and are at a stationary point of the divergence.
Proof. Using (3.2), the update rule for for the Poisson model can be written as
[TABLE]
The right hand side of (3.8) can be re-written as a function of as
[TABLE]
Using (3.9) in (3.8) and taking logarithm on both sides, we get
[TABLE]
Applying l’Hospital’s rule to compute the limit, we obtain
[TABLE]
Hence
[TABLE]
Similarly, the update rule for can be obtained as specified in (3.7). Monotonicity of these updates follows directly from the monotonicity of generalized updates in equations (3.2) and (3.3) established in Theorem 1 when .
4 Measuring Goodness-of-fit
The updates derived in equations (3.2), (3.3), (3.6) and (3.7) ensure monotonicity of updates for a given run of the NMF algorithm for pre-specified and rank , based on random initial values for and . However, NMF algorithms are typically prone to the problem of local minima and, thus, require the algorithm using multiple random restarts. The factorization from the run that produces the best reconstruction, quantified by minimum reconstruction error across multiple runs, can be used for assessing goodness-of-fit. Following Devarajan & Cheung (2014, 2016), we propose a unified measure for this purpose based on model-specific minimum reconstruction error, . It quantifies the variation explained by the continuum of statistical models contained in equation (3.1). For a given rank the proportion of explained variation, , is dependent on the particular model, determined by , used in the factorization and is computed as
[TABLE]
where is the numerator on the right hand side of equation (4.1), is as specified in equation (3.1) and represents the reconstructed matrix. For rank , the entry of is ; in the denominator, each entry is replaced by the grand mean of all entries of the input matrix , . Note that when , these quantities can be interpreted as the residual and total sum of squares, respectively, associated with the Gaussian model. For the nonlinear models indexed by in equation (3.1), measures the proportion of empirical uncertainty explained by the inclusion of and (Cameron & Windmeijer, 1997; Devarajan & Cheung, 2014; 2016).
5 Applications
Several special cases of the proposed unifying framework have been utilized for NMFs involving a variety of applications. For instance, Devarajan & Cheung (2014) derived algorithms based on dual divergence for gamma and inverse Gaussian models - using equations (2.5) and (2.6), respectively - for handling signal-dependent noise structures and demonstrated their application in electromyography studies for extraction of muscle synergies. These methods explained more variation () in the data at the appropriate number of synergies identified for each data set in a study involving frog motor behaviors under different experimental conditions. Similarly, Devarajan et al. (2015b) proposed an algorithm for the Poisson model based on dual divergence in equation (2.4) for unsupervised dimension reduction of discrete multivariate data. Two benchmark data sets - the Reuters news groups data and the Sacchromyces Genome Database (Shahnaz et al., 2006; Chagoyen et al., 2006) - were utilized for this purpose. In both cases, the algorithm based on dual divergence resulted in the best reconstruction compared to other competing methods. The proposed approach consolidates the above methods as well as a spectrum of other methods into a unifying framework and, thus, provides a flexible alternative for exploratory analysis of high dimensional data generated by diverse mechanisms that are exclusive to different applications.
6 Conclusions
In summary, this paper presented a unified approach to NMF based on generalized dual divergence along with a rigorous proof of convergence. The proposed approach is broadly applicable to the exponential family of models and is particularly useful in applications where there is a priori knowledge or empirical evidence of signal-dependence in noise. Furthermore, it unifies various existing algorithms and contrasts with the recently proposed quasi-likelihood approach, thus providing a complementary view of NMF. The basic principle underlying this framework is broadly extensible to the use of penalty, kernel and discriminant functions and to tensors.
Figure Legend
Figure 1, panels (a)-(d): Generalized dual divergence, equation (1.3), plotted as a function of for and various choices of , illustrating its convexity across the entire range of . The values of are indicated in the legend within each panel.
Acknowledgements
Research of the author was supported in part by NIH Grant P30 CA06927.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Basu, A., Harris, I.R., Hjort, N.L. and Jones, M.C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika , 85(3):549–559.
- 2[2] Cameron, A.C., Windmeijer, F.A.G. (1997). An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics , 77(2):329-342.
- 3[3] Chagoyen, M., Carmona-Saez, P., Shatkay, H., Carazo, J.M., Pascual-Montano, A. (2006). Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics . 7:41.
- 4[4] Cheung, V.C.K. Tresch, M.C. (2005). Nonnegative matrix factorization algorithms modeling noise distributions within the exponential family. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference , 4990-4993.
- 5[5] Cichocki, A., Zdunek, R., Amari, S. (2006). Csiszar’s divergences for non-negative matrix factorization: Family of new algorithms. Lecture Notes in Computer Science, Independent Component Analysis and Blind Signal Separation , Springer, LNCS-3889, 32-39.
- 6[6] Cichocki, A., Lee, H., Kim, Y.-D., Choi, S. (2008). Non-negative matrix factorization with α 𝛼 \alpha -divergence. Pattern Recognition Letters , 29(9):1433-1440.
- 7[7] Cichocki, A., Zdunek, R., Phan, A.H., Amari, S. (2009). Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation . John Wiley.
- 8[8] Cichocki, A., Cruces, S., Amari, S. (2011). Generalized Alpha-Beta divergences and their application to robust nonnegative matrix factorization. Entropy , 13:134-170.
