Efficient Learning of Mixed Membership Models
Zilong Tan, Sayan Mukherjee

TL;DR
This paper introduces an efficient algorithm for learning mixed membership models that significantly reduces computational complexity and addresses issues with empirical estimators, demonstrating competitive results on various datasets.
Contribution
The paper proposes a novel, scalable algorithm for mixed membership models that improves over tensor methods and provides theoretical guarantees.
Findings
Reduces tensor decomposition complexity from O(p^3) to factorizing O(p/k) sub-tensors.
Addresses negative entries in empirical moment estimators with provable conditions.
Achieves competitive results on simulated and real datasets.
Abstract
We present an efficient algorithm for learning mixed membership models when the number of variables is much larger than the number of hidden components . This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an tensor, to factorizing sub-tensors each of size . In addition, we address the issue of negative entries in the empirical method of moments based estimators. We provide sufficient conditions under which our approach has provable guarantees. Our approach obtains competitive empirical results on both simulated and real data.
| Dataset | Birds | RTE | TREC | Dogs | Web |
|---|---|---|---|---|---|
| ptpqp | 11.11 | 7.75 | 30.81 | 15.37 | 14.44 |
| hals | 12.96 | 7.75 | 31.47 | 20.57 | 26.84 |
| tpm | 11.11 | 7.62 | 31.87 | 15.49 | 14.70 |
| nojd0 | 12.04 | 8.00 | 32.97 | 15.49 | 18.39 |
| nojd1 | 12.04 | 8.00 | 35.91 | 15.86 | 25.97 |
| MV+EM | 11.11 | 7.12 | 30.20 | 15.86 | 15.91 |
| Size | 108 | 800 | 19033 | 807 | 2665 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Tensor decomposition and applications · Bayesian Methods and Mixture Models
Efficient Learning of Mixed Membership Models
Zilong Tan
Department of Computer Science
Duke University
Email: [email protected]
Sayan Mukherjee
Departments of Statistical Science
Computer Science, Mathematics,
Biostatistics & Bioinformatics
Duke University
Email: [email protected]
Abstract
We present an efficient algorithm for learning mixed membership models when the number of variables is much larger than the number of hidden components . This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an tensor, to factorizing sub-tensors each of size . In addition, we address the issue of negative entries in the empirical method of moments based estimators. We provide sufficient conditions under which our approach has provable guarantees. Our approach obtains competitive empirical results on both simulated and real data.
1 Introduction
Mixed membership models [36, 24, 25, 4, 11] have been used extensively across applications ranging from modeling population structure in genetics [24, 25] to topic modeling of documents [36, 4, 11]. Mixed membership models use Dirichlet latent variables to define cluster membership where samples can partially belong to each of latent components. Parameter estimation for such latent variables models (LVMs) using maximum likelihood methods such as expectation maximization is computationally intensive for large data, for example, if number of samples is large.
Parameter estimation using the method of moments for LVMs is an attractive scalable alternative that has been shown to have certain theoretical and computational advantages over maximum likelihood methods in the setting when is large. For LVMs, method of moments approaches reduce to tensor methods—the moments of the model parameters are expressed as a function of statistics of the observations in a tensor form. Inference in this setting becomes a problem of tensor factorization. Computational advantages of using tensor methods have been observed for many popular models, including latent Dirichlet allocation [1], spherical Gaussian mixture models [15], hidden Markov models [1], independent component analysis [8], and multi-view models [2]. An appealing property of tensor methods is the guarantee of a unique decomposition under mild conditions [19, 22].
There are two complications to using standard tensor decomposition methods [3, 2, 14, 20, 23, 17, 7] for LVMs. The first problem is computation and space complexity. Given variables in the LVM, parameter inference requires factorizing typically a non-orthogonal estimator tensor of size [2, 20, 23], which is prohibitive for large . When the estimator is orthogonal and symmetric, this can be done in [34]. Online tensor decomposition [16] uses dimension reduction to instead factorize a reduced -by--by- tensor. However, the dimension reduction can be slower than decomposing the estimator directly for large sample sizes, as well as suffer from high variance [34]. We introduce a simple factorization with improved complexity for the general case where the parameters are not required to be orthogonal.
The second problem arises from negative entries in the empirical moments tensor. LVMs for count data are constrained to have nonnegative parameters. However, the empirical moments tensor computed from the data may contain negative elements due to sampling variation and noise. Indeed, for small sample sizes or data with many small or zero counts, there will be many negative entries in the empirical moments tensor. General tensor decomposition algorithms [20, 23], including the tensor power method (TPM) [2], do not guarantee the nonnegativity of model parameters. Approaches such as positive/nonnegative tensor factorization [7, 31, 35] also do not address this situation as they require all the elements of the tensor to be factorized to be nonnegative. With robust tensor methods [3, 14], sparse negative entries may potentially be treated as corrupted elements; however, these methods are not applicable in this setting since there can be many negative elements.
In this paper, we introduce a novel parameter inference algorithm called partitioned tensor parallel quadratic programming (PTPQP) that is efficient in the setting where the number of variables is much larger than the number of latent components . The algorithm is also robust to negative entries in the empirical moments tensor. There are two key innovations in the PTPQP algorithm. The first innovation is a partitioning technique which recovers the parameters through factorizing much smaller sub-tensors each of size . This technique can also be combined with methods [34, 32, 16] to obtain further improved complexities. The second innovation is a parallel quadratic programming [5] based algorithm to factor tensors with negative entries under the constraint that the factors are all nonnegative. To the best of our knowledge, this is the first algorithm designed to address the problem of negative entries in empirical estimator tensors. We show that the proposed factorization algorithm converges linearly with respect to each factor matrix. We also provide sufficient conditions under which the partitioned factorization scheme is consistent, the parameter estimates converge to the true parameters.
2 Preliminaries
Notations
We use bold lowercase letters to represent vectors and bold capital letters for matrices. Tensors are denoted by calligraphic capital letters. The subscript notation refers to -th column of matrix . We denote the -th column of the identity matrix as and is a vector of ones. We further write for a diagonal matrix whose diagonal entries are , and to mean a vector of the diagonal entries of .
Element-wise matrix operators include and , e.g., means that has nonnegative entries. refers to element-wise . and respectively represent element-wise multiplication and division. Moreover, refers to the outer product and denotes the Khatri-Rao product. and represent the Frobenius norm and spectral norm, respectively.
Tensor basics
This paper uses similar tensor notations as [18]. In particular, we are primarily concerned with Kruskal tensors in , which can be expressed in the form of
[TABLE]
where , , and are respectively -by-, -by-, and -by- factor matrices. The rank of is defined as the smallest that admits such a decomposition. The decomposition is known as the CP (CANDECOMP/PARAFAC) decomposition. The -mode unfolding of , denoted by , for is a -by- matrix whose rows are serializations of the tensor fixing the index of the -th dimension. The unfoldings have the following well-known compact expressions:
[TABLE]
3 Learning through Method of Moments
3.1 Generalized Dirichlet latent variable models
A generalized Dirichlet latent variable model (GDLM) was proposed in [39] for the joint distribution of observations . Each observation consists of variables . GDLM assumes a generative process involving hidden components. For each observation, sample a random Dirichlet vector with concentration parameter . The elements of are the membership probabilities for to belong to each of the components. Specifically,
[TABLE]
where is the density of the -th variable specific to component with parameter . One advantage of GLDM is that can take categorical values. Let denote the number of categories for the -th variable (set for scalar variables), becomes a -by- probability matrix where the -th row corresponds to category . We aim to accurately recover from independent copies of involving variables of mixed data types, either categorical or non-categorical.
3.2 Moment-based estimators
The moment estimators of latent variable models typically take the form of a tensor [2]. Consider the estimators of GDLM [39] for example. Let if variable is categorical; otherwise. The second- and third- order parameter estimators for variable , , and are written
[TABLE]
Alternatively, and have the following CP decomposition into parameters :
[TABLE]
Appendix A provides the derivation details of these estimators. For the special case of latent Dirichlet allocation, and are scalar joint probabilities.
The parameters are typically obtained by factorizing the block tensor whose -th element is the empirical and/or whose -th element is the empirical [3, 2, 39]. Note that are generally non-orthogonal, and thus preprocessing steps (see Appendix B) are needed for orthogonal decomposition methods [34, 32, 2]. The preprocessing can be expensive and often leads to suboptimal performance [33, 23]. Here, we highlight a few relevant observations:
- •
alone does not yield unique parameters due to the well-known rotation problem. Suppose that and are the ground-truth parameters satisfying (3) and any invertible , there exists decomposition and that also satisfy (3) but are not ground-truth parameters. The ground-truth parameters are not uniquely identifiable through , this is true even when enforcing nonnegativity constraints on parameters [10].
- •
is sufficient to uniquely recover the parameters under certain mild conditions [19]; for example, when any two of , , and have linearly independent columns and the columns of the third are pair-wise linearly independent [22].
- •
The empirical estimator generally contains negative entries due to variance and noise. The fraction of negative entries can approach 50%, as we shall see in experiments. We address this issue in § 4.4.
- •
While the decomposition (4) can be unique up to permutation and rescaling, the correspondence between each column of the factor matrix and each hidden component may not be consistent across multiple decompositions. Techniques for achieving consistency are developed in § 4.2.
3.3 Computational complexity
Tensor methods such as TPM typically decompose the full estimator tensor that includes all variables. More efficient algorithms have been developed for the case that parameters are orthogonal [34, 32], and when the sample size is small [16]. However, these methods do not apply in the general case where the parameters are non-orthogonal and the sample size can be potentially large. A key insight underlying our approach is that it is sufficient to recover the parameters by factorizing only much smaller sub-tensors each of size . This technique can also be combined with the aforementioned methods to further improve the complexity in certain cases.
4 An efficient algorithm
In this section, we develop partitioned tensor parallel quadratic programming (PTPQP) an efficient approximate algorithm for learning mixed membership models. We first introduce a novel partitioning-and-matching scheme that reduces parameter estimation to factorizing a sequence of sub-tensors. Then, we develop a nonnegative factorization algorithm that can handle negative entries in the sub-tensors.
4.1 Partitioned factorization
Factorizing the full tensor formed by all is expensive while a three-variable tensor in (4) alone may not be sufficient to determine when is large. In this section, we consider factorizing the sub-tensors corresponding to a cover of the set of variables such that each sub-tensor admits an identifiable CP decomposition (1), i.e. unique up to permutation and rescaling of columns. This gives the parameters for all variables. Suppose that and the maximum number of categories is a constant, the aggregated size of the sub-tensors can be much smaller, i.e., , than the size of the full estimator.
Let , , and denote ordered subsets , with cardinality , , and , respectively. Consider the -by--by- block tensor 111For block tensor operations, see e.g., [26]. whose -th element is the tensor . From (4), we have that
[TABLE]
Clearly, the block tensor is identifiable if it has an identifiable sub-tensor. Suppose that a sub-tensor is identifiable, then one can construct an identifiable tensor from by setting
[TABLE]
We further remark that a sub-tensor can be identifiable under mild conditions, for example, if the sum of the Kruskal rank of the three factor matrices is at least than [19].
Given an identifiable sub-tensor of anchor variables indexed by , , and , the partitioning produces a set of sub-tensors (partitions) constructed through (6), that includes all variables. Thus, is a common sub-tensor shared across all partitions. We choose anchor variables whose parameter matrices are of full column rank to obtain an identifiable . Finally, one can divide the rest of variables evenly and randomly into the partitions.
4.2 Matching parameters with hidden components
Since the factorization of a partition (5) can only be identifiable up to permutation and rescaling of the columns of constituent , the correspondence between the columns of and hidden components can differ across partitions. To enforce consistency, we associate a permutation operator for each variable such that are the parameters specific to hidden component across all variables . Consider the following vector representation of :
[TABLE]
Observe that within a factorization of , and this also holds for the partitioned factorization (5) of as well, i.e.,
Consider the factorizations of and and suppose that . The permutation operator for one factorization is determined given the other by column matching the parameters of variable in both factorizations. Thus, an inductive way to achieve a consistent factorization is to start with one factorization, and let its permutation be the identity , then perform the factorization over new sets of variables with at least one variable in common with the initial factorization. Permutations for the sequential factorizations are determined via column matching parameter matrices of the common variables.
Given two factorized parameter matrices and of variable , our goal is to find a consistent permutation (of with respect to ) such that and correspond to the same hidden component for all . We now present an algorithm with provable guarantees to compute a consistent permutation.
Smallest angle matching
A simple matching algorithm is to match the two columns of the two parameter matrices that have the smallest angle between them. Consider the factorizations of and which yield respectively parameters and for the common variable . Given the permutation for , the permutation for is computed by:
[TABLE]
Here, and represent respectively the normalized and with each column having unit Euclidean norm.
There are cases that computed via (7) is not consistent: 1) contains duplicate entries and hence is ineligible; and 2) since and are the factorized parameter matrices which are generally perturbed from the ground-truth, the resulting may differ from the consistent permutation. To cope with these cases, we establish in § 5 the sufficient conditions for to be consistent.
Orthogonal Procrustes matching
One issue with the smallest angle matching is that each column is paired independently. It is easy for multiple columns to be paired with a common nearest neighbor. We describe a more robust algorithm based on the orthogonal Procrustes problem, and show improved guarantees. Since a consistent permutation is orthogonal, a natural relaxation is to only require the operator to be orthogonal. This is an orthogonal Procrustes problem, formulated in the same settings as § 4.2
[TABLE]
Let be the singular value decomposition (SVD), the solution is given by the polar factor [29]
[TABLE]
Here, is orthogonal and does not immediately imply the desired permutation . To compute , one can additionally restrict to be a permutation matrix, and solve for using linear programming [13]. Aside from efficiency, one fundamental question is that under what assumptions the objective (8) yields the consistent permutation.
Given the solution to the Procrustes problem, we propose the following simple algorithm for computing :
[TABLE]
We first establish through Theorem 1 that if obtained using (10) is a valid permutation, i.e., no duplicate entries, then it is optimal in terms of the objective (8).
Theorem 1**.**
The obtained using satisfies
[TABLE]
for all permutations .
Proof.
First, rewrite the objective (8) as follows
[TABLE]
Recall the SVD , and write . Keeping only terms that depend on in (11) to obtain . Thus, the optimization (8) is equivalent to
[TABLE]
From the constraint, we obtain . The optimization now becomes
[TABLE]
Let us now restrict each column of to be in , but not necessarily distinct. Suppose that . We have that . Clearly, (12) and hence (8) are minimized with . ∎
In section § 5 we state sufficient conditions under which the objective (8) yields a consistent permutation.
4.3 Approximate nonnegative factorization
In previous sections, we reduced the inference problem to factorizing partitioned sub-tensors. We now present a factorization algorithm for the sub-tensors that contain negative entries. Our goal is to approximate a sub-tensor by a sub-tensor where the factors , , and are nonnegative. The Frobenius norm is used to quantify the approximation
[TABLE]
Note that we do not assume that in (13) which distinguishes our optimization problem from other approximate factorization algorithms [35, 7, 17, 31]. In § 4.4, we provide some details as to why negative entries are problematic for standard approximate factorization algorithms. We can rewrite (13) using the 1-mode unfolding as
[TABLE]
Equivalent formulations with respect to the 2-mode and 3-mode unfoldings can be readily obtained from (2).
We point out that another widely-used error measure — the I-divergence [12, 7] — may not be suitable for our learning problem. The optimization using I-divergence is given by
[TABLE]
This optimization is useful for nonnegative when each entry follows a Poisson distribution. In this case, the objective is equivalent to the sum of Kullback-Leibler divergence across all entries of :
[TABLE]
However, the Poisson assumption does not generally hold for the estimator tensor (4).
4.4 Handling negative entries in empirical estimators
We first illustrate that factorizing a tensor with negative entries using either positive tensor factorization [35] or nonnegative tensor factorization [7, 31] will either result in factors that violate the the nonnegativity constraint or the result of the algorithm diverges. In addition, we show that general tensor decompositions cannot enforce the factor nonnegativity even after rounding the negative entries to zero.
We then present a simple method based on weighted nonnegative matrix factorization (WNMF) [37] that enforce the factor nonnegativity constraint. We further generalize this method using parallel quadratic programming (PQP) [5] to obtain a method with a provable convergence rate.
Issue of negative entries
If the tensor is strictly nonnegative, the optimization specified in (13) can be reduced to nonnegative matrix factorization (NMF). Solvers abound for NMF including the celebrated Lee-Seung’s multiplicative updates [21]. The reduction is done by viewing (14) as with , , and , and alternating
[TABLE]
over each unfolding and factor matrix . Obviously, the updates may yield negative entries in when the unfolding contains negative entries. In addition, convergence relies on the nonnegativity of the unfolding [21]. This issue extends to their tensor factorization variants [35, 7, 17] known as the positive tensor factorization and nonnegative tensor factorization. For these approaches, a naive resolution is to round negative entries of to [math], this however lacks theoretical guarantees.
It is important to note that the rounding does not help general tensor decompositions like TPM. The following example illustrates that the unique decomposition (up to permutation and rescaling) of a positive tensor can contain negative entries. Consider a -by--by- positive tensor, whose -mode unfolding is given by
[TABLE]
where the vertical bar separates two frontal slices. It has the following decomposition, written in the form of (1):
[TABLE]
Since all factors are of full-rank, the decomposition is unique up to permutation and rescaling of columns [19]. Thus, a general tensor decomposition yields a with negative entries regardless of rescaling.
4.5 Factorization via WNMF
Since the ground-truth are nonnegative, we may “ignore” the negative entries of by treating them as missing values. This idea leads to the following modified objective:
[TABLE]
where , , are chosen identically as (15), and we define
[TABLE]
The optimization can be carried out using WNMF. Here, we modify the original updates by introducing a positive constant to ensure that the updates are well-defined:
[TABLE]
Theorem 2 states the correctness of the modified updates (17).
Theorem 2**.**
The objective (16) is non-increasing under the multiplicative updates (17).
Proof.
We prove the update for , and the update for follows by applying the update to . First, consider the error Frobenius norm for a column of , and the corresponding columns of and of ,
[TABLE]
The following is an auxiliary function of :
[TABLE]
where we define
[TABLE]
Clearly, , and one can show that by rewriting
[TABLE]
where we note that from the Boolean definition of . Comparing (18) with (19), it is sufficient to show that is positive semi-definite. Now consider the scaled matrix
[TABLE]
Observe that is strictly diagonally dominant as and the off-diagonal entries are negative. Also note that all diagonal entries of are positive, it follows that is positive semi-definite. We thereby conclude that is positive semi-definite.
Let , we have that . The minimizer is obtained by setting , which yields
[TABLE]
The particular choice of guarantees that is always positive. ∎
4.6 Parallel quadratic programming
We now generalize the WNMF approach using parallel quadratic programming to obtain a convergence rate. Let denote the set of symmetric positive definite matrices, we consider the following optimization problem
[TABLE]
which can be solved by iterating multiplicative updates [5, 30]. We use the parallel quadratic programming (PQP) algorithm [5, 6] to solve (20), partly because it has a provable linear convergence rate. The PQP multiplicative update for (20) takes the following simple form:
[TABLE]
with
[TABLE]
Here and are arguments to PQP, we will discuss these arguments in section § 5.2. The update maintains nonnegativity since all items are nonnegative. We make the following observation.
Theorem 3**.**
The multiplicative updates for Lee-Seung and WNMF are special cases of PQP.
Proof.
Since the WNMF (16) generalizes Lee-Seung, which is the case that has all ones, we need only to prove for WNMF. Let and , some matrix algebra reveals the following PQP updates
[TABLE]
Comparing (22) to (17), they are equivalent if . ∎
We can now solve the approximate nonnegative factorization problem stated in (13) using (21). Theorem 4 states the multiplicative updates. A more detailed discussion of is included in § 5.2. We present pseudo-code in Algorithm 1.
Theorem 4**.**
For optimization (13), the following update converges linearly to a local optimum
[TABLE]
with
[TABLE]
where is the smallest eigenvalue. Similar updates for and are obtained using (2).
Proof.
We apply PQP updates (21) to each row of . Let and be the -th row of and , respectively. Fixing the current factor estimates and , the optimization with respect to follows from (14):
[TABLE]
Now the updates (21) can be applied immediately, where we set and according to Theorem 7 in § 5. Using the identity and performing the updates simultaneous for all rows of give (23).
∎
4.7 Proposed approach
To summarize, the proposed approach, referred to as PTPQP, consists of three steps. Given the indexes of anchor variables , the variables are first evenly divided into partitions, and the anchor variables are added to each partition. The second step consists of forming and factorizing the sub-tensor of each partition using Algorithm 1, this step can be parallelized. Third, normalize the anchor matrix formed by the anchor variable parameters to have unit column Euclidean norm, and then use either (7) or (10) to match over the anchor matrix.
Efficiency
Most of the computational cost is in the factorization. Consider one partition, and let be the corresponding sub-tensor, the sub-tensor size is . The maximum number of categories for a variable is generally a constant for the GDLM. Under smallest partitioning, this size is determined by the sub-tensor of anchor variables, i.e., , which corresponds to partitions. One benefit of PTPQP is that the number of sub-tensor factorizations is linear in due to the partitioned factorization, this results in significant efficiency gains when . Furthermore, PTPQP is easy to be parallelized across multiple CPUs and machines, since the computation as well as data are not distributed across partitions.
5 Provable Guarantees
In this section, we state the main theoretical results of the proposed partitioned factorization and tensor PQP factorization.
5.1 Sufficient conditions for guaranteed matching
Theorem 5 and Theorem 6 state that when the anchor parameter matrices from two factorizations are “close”, the proposed matching algorithms obtain a consistent permutation.
Theorem 5**.**
Suppose that is the ground-truth matrix for variable . Solving (7) results in a consistent permutation if for all factors of variable
[TABLE]
for all , where .
Proof.
Consider the smallest pair-wise angle between the columns of , we have that
[TABLE]
Denote by the maximum angle between the column of a factorized parameter matrix and the corresponding column of the ground-truth. It is sufficient to ensure that
[TABLE]
Consider any two columns of the ground-truth parameter matrix, and the corresponding perturbed columns and from two factorizations. We have that
[TABLE]
From (24), we have that
[TABLE]
as desired for (7) to work correctly. Now consider the inner product of a perturbed column and the ground-truth, it holds that
[TABLE]
Thus, a sufficient condition for (7) to yield the consistent permutation is
[TABLE]
which written in analytic form proves the theorem. ∎
Theorem 5 states that one obtains a consistent permutation by solving (7) in the columns of the ground-truth parameter matrix are distinct from each other in angles and the factorized parameter matrix is near the ground-truth in Frobenius norm. Thus, a good anchor variable for the partitioned factorization (5) is one whose parameter matrix has distant columns in angles.
The bound in Theorem 5 can be made sharp for certain , and thus the smallest angle matching algorithm has general guarantees only when the perturbation is small, i.e., the relative error ratio is less than .
Theorem 6**.**
Suppose that and are two factorized parameter matrices for a variable. Solving (10) results in a consistent permutation , if
[TABLE]
with
[TABLE]
where the error matrix is define as , and denotes the -th largest singular value.
The proof of Theorem 6 follows from the following two Lemmas.
Lemma 1**.**
Suppose that is the consistent permutation of with respect to . Formula (10) is guaranteed to recover , if
[TABLE]
where and are the left and right singular matrices of .
Proof.
We need to show that (10) yields for the orthogonal Procrustes problem . From the solution (9), it is easy to show that the minimizer of and the minimizer of satisfy
[TABLE]
Note that is the desired minimizer of , and thus it remains to show that (10) gives when applied to , or equivalently
[TABLE]
Since the row and column vectors of have unit Euclidean norm, the following dual statements imply each other
[TABLE]
if condition holds. Under this condition, we also have that (10) gives the identity permutation for the orthogonal Procrustes problem . Thus, applying (10) to both sides of (26) yields
[TABLE]
which implies (27) from (28). ∎
Lemma 2**.**
*(Mathias)
Suppose that is nonsingular. Then for any with and any unitarily invariant norm , it holds that*
[TABLE]
where represents the unitary factor of the polar decomposition, and is the Ky Fan -norm.
Proof of Theorem 6.
Let , which has the same singular values as . Denote by the unitary factor of the polar decomposition. Using the fact that , the sufficient condition of Lemma 1 is restated as
[TABLE]
Also note that
[TABLE]
Thus, it suffices to enforce the right term to be less than . From Lemma 2, this can be achieved by letting
[TABLE]
∎
The first condition in Theorem 6 requires that at least one of and must have full column rank. We may exchange and in Theorem 6 to first obtain the consistent permutation of with respect to , then follows immediately.
Theorem 6 states that solving (10) recovers a consistent permutation whenever the error spectral norm is small as compared to the smallest singular value of . This is especially useful for with the number of rows much larger than the number of columns . In particular, for with independent and identically distributed subgaussian entries, is at least of the order [28].
5.2 Convergence
The following theorem states a sufficient condition for PQP to achieve linear convergence rate. The theorem statement and proof is an adaptation of results stated in [5]—the proof in [5] overlooks a required condition on and the condition in the original proof is unnecessary.
Theorem 7**.**
The PQP algorithm given by (21) monotonically decreases the objective (20) and has linear convergence, if
[TABLE]
where is the smallest eigenvalue.
Proof.
First, the condition suffices to ensure that the updates monotonically decrease (20) [6]. Thus, it remains to show the condition on . Suppose that the -th element of the optimum is perturbed by a non-zero . Let , and applying one update gives . Denote the -th row of , , and respectively by , , and , then it holds that and by definition. We now consider the ratio of errors between successive iterations:
[TABLE]
From the KKT first-order optimality condition , we simplify the ratio as
[TABLE]
Observe that the denominator is nonnegative. We also have that the denominator is greater than the numerator using the KKT optimality condition :
[TABLE]
To achieve linear convergence rate, we may enforce the ratio to be less than one. Equivalently,
[TABLE]
It suffices to set
[TABLE]
To get rid of in (31), we have the following inequality
[TABLE]
where the right term is the negative of the minimum of the unconstrained problem, assuming that is non-singular. If is singular, then can be unbounded. Further simplify the inequality using KKT optimality conditions as
[TABLE]
Combining with (31) completes the proof. ∎
6 Results on real and simulated data
We compare the proposed algorithm ptpqp with state-of-the-art approaches including: 1) the tensor power method tpm [2] and matrix simultaneous diagonalization, nojd0 and nojd1 [20]—two general tensor decomposition methods; 2) nonnegative tensor factorization hals [17]; and 3) generalized method of moments meld [39]. We use the online code provided by the corresponding authors.
6.1 Learning GDLMs on simulated data
We adapt a simulation study from [39] to compare runtime and accuracy of parameter estimation. We consider a GDLM where each variable takes categorical values and the parameters of the Dirichlet mixing distribution are . We initially consider variables. The true parameters for each hidden component are drawn from the Dirichlet distribution . The resulting moment estimator is a -by--by- tensor. We vary the number of components and add noise by replacing a fraction of the observations with draws from a discrete uniform distribution. We also vary the number of samples , number of clusters , and contamination . Across these settings we found that the empirical third-order estimator typically exhibits between and negative entries.
Accuracy of inference
Accuracy is measured by root-mean-square error (RMSE) which we compare across algorithms as a function of the number of components for various sample sizes and levels of contamination, see Figure 1. Both hals and ptpqp are consistently among the top estimators, and ptpqp outperforms hals as grows. For small sample sizes and many hidden components meld achieves the smallest RMSE. The RMSE of tpm is relatively large, probably due to the whitening technique used to approximately transform the nonorthogonal factorization into an orthogonal one, see [33, 23]. The most relevant observation is that ptpqp outperforms other methods for large, noisy data.
Computational cost
We examined how runtime scales as a function of the number of partitions. For the same model we set variables and samples. The tensor is now -by--by-. We evaluated the runtime of ptpqp (without parallelization) with the number of partitions set to . On a laptop with Intel [email protected] CPU and 8GB memory, ptpqp with partitions completes within min, min, and min for , respectively. In addition, the runtime monotonically decreases with the number of partitions. Further speedups can be obtained by parallelizing the factorization of partitions across multiple CPUs or machines.
6.2 Predicting crowdsourced labels
In [38], a combination of EM and tensor decompositions was used to predict crowdsourcing annotations. The task is to predict the true label given incomplete and noisy observations from a set of workers, this is a mixed membership problem [9]. In [38] a third-order tensor estimator was proposed to obtain an initial estimate for the EM algorithm. We compare the predictive performance on five data sets of several tensor decomposition methods as well as the EM algorithm initialized with majority voting by the workers (MV+EM). The fraction of incorrect predictions and the size of each dataset are in the table below. Note that ptpqp matches or outperforms the other tensor methods on all but one dataset, and even outperforms MV+EM on two datasets.
7 Conclusions
We proposed an efficient algorithm for learning mixed mixture models based on the idea of partitioned factorizations. The key challenge is to consistently match the partitioned parameters with the hidden components. We provided sufficient conditions to ensure consistency. In addition, we have also developed a nonnegative approximation to handle the negative entries in the empirical method of moments estimators, a problem not addressed by several recent tensor methods. Results on synthetic and real data corroborate that the proposed approach achieves improved inference accuracy as well as computational efficiency than state-of-the-art methods.
Code
Code for all the simulations is available from Zilong Tan’s GitHub repository
https://github.com/ZilongTan/ptpqp.
Acknowledgements
Z.T. would like to thank Rong Ge for sharing helpful insights. S.M. would like to thank Lek-Heng Lim for insights. Z.T. would like to acknowledge the support of grants NSF CNS-1423128, NSF IIS-1423124, and NSF CNS-1218981. S.M. would like to acknowledge the support of grants NSF IIS-1546331, NSF DMS-1418261, NSF IIS-1320357, NSF DMS-1045153, and NSF DMS-1613261.
Appendix A Dirichlet Moments
For a Dirichlet random vector with concentration parameters , the component moments can be easily shown by the integral
[TABLE]
where . Comparing the second and third order component moments, we arrive at the following cross-moments:
[TABLE]
To express the parameters as third-order cross-moments, first observe that the following holds for a Dirichlet random vector :
[TABLE]
This is an immediate result from the the second-order component moments. Combining with (34) yields
[TABLE]
A.1 Derivation of moment estimators
Our goal is to derive the estimators of parameter vectors for each variable using the first- and second- order empirical cross-moments of . In GDLM, the expectation of variable conditioned on is written
[TABLE]
Thus, the expected observation of variable is given by
[TABLE]
Now consider two variables and which are generated with the same latent factors . Combining (36) and (33) to obtain
[TABLE]
For three variables , , and , we can write . Using (35), we establish that
[TABLE]
Appendix B Approximate Orthogonalization in the Tensor Power Method
TPM requires the tensor to be decomposed to be symmetric, and the factor matrices to be orthogonal. Specifically, it performs the following decomposition
[TABLE]
where are orthonormal vectors. Thus, TPM does not immediately apply to the general CP decomposition (1).
The general resolution is to first use the symmetric tensor embedding [27, 3], forming a larger symmetric tensor that contains the asymmetric tensor to be decomposed. The formed is a sparse -by--by- tensor of which entries are zero. The space and computation complexities rapidly become prohibitive when the number of variables and the category counts grow.
Next, TPM requires an addition empirical second-order estimator for orthogonalizing the factor matrices of to obtain [2]. This is done by computing the whitening transformation from . However, the whitening technique based on empirical is often a cause of suboptimal performance [33, 23].
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] A. Anandkumar, D. P. Foster, D. J. Hsu, S. M. Kakade, and Y. kai Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS , pages 917–925. 2012.
- 2[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. JMLR , 15(1):2773–2832, Jan. 2014.
- 3[3] A. Anandkumar, P. Jain, Y. Shi, and U. N. Niranjan. Tensor vs. matrix methods: Robust tensor decomposition under block sparse perturbations. In AISTATS , pages 268–276, 2016.
- 4[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR , 3:993–1022, 2003.
- 5[5] M. Brand and D. Chen. Parallel quadratic programming for image processing. In IEEE International Conference on Image Processing (ICIP) , pages 2261–2264, Sept. 2011.
- 6[6] M. Brand, V. Shilpiekandula, and S. Bortoff. A parallel quadratic programming algorithm for model predictive control. In World Congress of the International Federation of Automatic Control (IFAC) , volume 18, Aug. 2011.
- 7[7] E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications , 33(4):1272–1299, December 2012.
- 8[8] P. Comon and C. Jutten, editors. Handbook of blind source separation : independent component analysis and applications . Communications engineering. Elsevier, 2010.
