Efficient Learning of Mixed Membership Models

Zilong Tan; Sayan Mukherjee

arXiv:1702.07933·cs.LG·July 4, 2017

TL;DR

This paper introduces an efficient algorithm for learning mixed membership models that significantly reduces computational complexity and addresses issues with empirical estimators, demonstrating competitive results on various datasets.

Contribution

The paper proposes a novel, scalable algorithm for mixed membership models that improves over tensor methods and provides theoretical guarantees.

Findings

01 Reduces the cost of decomposing an O(p^3) tensor to factorizing O(p/k) sub-tensors, each of size O(k^3).

02 Addresses negative entries in empirical moment estimators, with sufficient conditions for provable guarantees.

03 Achieves competitive results on simulated and real datasets.

Abstract

We present an efficient algorithm for learning mixed membership models when the number of variables $p$ is much larger than the number of hidden components $k$ . This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an $O (p^{3})$ tensor, to factorizing $O (p / k)$ sub-tensors each of size $O (k^{3})$ . In addition, we address the issue of negative entries in the empirical method of moments based estimators. We provide sufficient conditions under which our approach has provable guarantees. Our approach obtains competitive empirical results on both simulated and real data.

Tables

Table 1: Incorrectly predicted labels (%)

Dataset	Birds	RTE	TREC	Dogs	Web
ptpqp	11.11	7.75	30.81	15.37	14.44
hals	12.96	7.75	31.47	20.57	26.84
tpm	11.11	7.62	31.87	15.49	14.70
nojd0	12.04	8.00	32.97	15.49	18.39
nojd1	12.04	8.00	35.91	15.86	25.97
MV+EM	11.11	7.12	30.20	15.86	15.91
Size	108	800	19033	807	2665

Equations

$$\mathcal{T}=\sum_{j=1}^{r}\bm{A}_{j}\times\bm{B}_{j}\times\bm{C}_{j},$$

$$\mathcal{T}_{\left(1\right)}=\bm{A}\left(\bm{C}\odot\bm{B}\right)^{\top},\quad\mathcal{T}_{\left(2\right)}=\bm{B}\left(\bm{C}\odot\bm{A}\right)^{\top},\quad\mathcal{T}_{\left(3\right)}=\bm{C}\left(\bm{B}\odot\bm{A}\right)^{\top}.$$

$$y_{ij}\sim\sum_{h=1}^{k}x_{ih}\,g_{j}\left(\theta_{jh}\right),$$

$$\mathcal{M}^{js}=\mathbb{E}\left[\bm{b}_{ij}\times\bm{b}_{is}\right]-\frac{\alpha_{0}}{\alpha_{0}+1}\mathbb{E}\left[\bm{b}_{ij}\right]\times\mathbb{E}\left[\bm{b}_{is}\right],$$

$$\mathcal{M}^{jst}=\mathbb{E}\left[\bm{b}_{ij}\times\bm{b}_{is}\times\bm{b}_{it}\right]+\frac{2\alpha_{0}^{2}}{\left(\alpha_{0}+1\right)\left(\alpha_{0}+2\right)}\mathbb{E}\left[\bm{b}_{ij}\right]\times\mathbb{E}\left[\bm{b}_{is}\right]\times\mathbb{E}\left[\bm{b}_{it}\right]-\frac{\alpha_{0}}{\alpha_{0}+2}\left(\mathbb{E}\left[\mathbb{E}\left[\bm{b}_{ij}\right]\times\bm{b}_{is}\times\bm{b}_{it}\right]+\mathbb{E}\left[\bm{b}_{ij}\times\mathbb{E}\left[\bm{b}_{is}\right]\times\bm{b}_{it}\right]+\mathbb{E}\left[\bm{b}_{ij}\times\bm{b}_{is}\times\mathbb{E}\left[\bm{b}_{it}\right]\right]\right).$$

$$\mathcal{M}^{js}=\sum_{h=1}^{k}\frac{\alpha_{h}}{\alpha_{0}\left(\alpha_{0}+1\right)}\bm{\theta}_{jh}\times\bm{\theta}_{sh},\quad\mathcal{M}^{jst}=\sum_{h=1}^{k}\frac{2\alpha_{h}}{\alpha_{0}\left(\alpha_{0}+1\right)\left(\alpha_{0}+2\right)}\bm{\theta}_{jh}\times\bm{\theta}_{sh}\times\bm{\theta}_{th}.$$

$$\mathcal{M}^{\pi^{j}\pi^{s}\pi^{t}}=\sum_{h=1}^{k}\frac{2\alpha_{h}}{\alpha_{0}\left(\alpha_{0}+1\right)\left(\alpha_{0}+2\right)}\begin{bmatrix}\bm{\theta}_{\pi_{1}^{j}h}\\ \bm{\theta}_{\pi_{2}^{j}h}\\ \vdots\\ \bm{\theta}_{\pi_{p_{j}}^{j}h}\end{bmatrix}\times\begin{bmatrix}\bm{\theta}_{\pi_{1}^{s}h}\\ \bm{\theta}_{\pi_{2}^{s}h}\\ \vdots\\ \bm{\theta}_{\pi_{p_{s}}^{s}h}\end{bmatrix}\times\begin{bmatrix}\bm{\theta}_{\pi_{1}^{t}h}\\ \bm{\theta}_{\pi_{2}^{t}h}\\ \vdots\\ \bm{\theta}_{\pi_{p_{t}}^{t}h}\end{bmatrix}.$$

$$\pi^{j\prime}=\pi^{j}\cup\pi^{u},\quad\pi^{s\prime}=\pi^{s}\cup\pi^{v},\quad\pi^{t\prime}=\pi^{t}\cup\pi^{w}.$$

$$\psi_{s}^{u}=\arg\max_{t}\left(\bm{\bar{\theta}}_{j}^{\prime\top}\psi^{j}\bm{\bar{\theta}}_{j}\right)_{ts}.$$

$$\min_{\bm{\Psi}}\left\|\bm{\bar{\theta}}_{j}^{\prime}\bm{\Psi}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2},\quad\text{s.t. }\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}.$$

$$\bm{\Psi}^{*}=\bm{U}\bm{V}^{\top}.$$

$$\psi_{s}^{u}=\arg\max_{t}\bm{\Psi}_{ts}^{*}.$$

$$\left\|\psi^{u}\bm{\bar{\theta}}_{j}^{\prime}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2}\leq\left\|\psi\bm{\bar{\theta}}_{j}^{\prime}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2}$$

$$\left\|\bm{\bar{\theta}}_{j}^{\prime}\bm{\Psi}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2}=\left\|\bm{\bar{\theta}}_{j}^{\prime}\right\|_{F}^{2}+\left\|\bm{\bar{\theta}}_{j}\right\|_{F}^{2}-2\operatorname{tr}\bm{\Psi}^{\top}\bm{\bar{\theta}}_{j}^{\prime\top}\psi^{j}\bm{\bar{\theta}}_{j}.$$

$$\max_{\bm{E}}\operatorname{tr}\bm{V}^{\top}\bm{E}^{\top}\bm{U}\bm{\Sigma},\quad\text{s.t. }\left(\bm{U}\bm{V}^{\top}+\bm{E}\right)^{\top}\left(\bm{U}\bm{V}^{\top}+\bm{E}\right)=\bm{I}.$$

$$\min_{\bm{E}}\operatorname{tr}\bm{E}^{\top}\bm{E}\bm{\Sigma}=\min_{\bm{E}}\sum_{j}\left(\bm{E}^{\top}\bm{E}\right)_{jj}\bm{\Sigma}_{jj}.$$

$$\min_{\bm{A},\bm{B},\bm{C}\succeq 0}\left\|\mathcal{M}-\widetilde{\mathcal{M}}\right\|_{F}.$$

$$\min_{\bm{A},\bm{B},\bm{C}\succeq 0}\left\|\mathcal{M}_{\left(1\right)}-\bm{A}\left(\bm{C}\odot\bm{B}\right)^{\top}\right\|_{F}.$$

$$\min_{\bm{A},\bm{B},\bm{C}\succeq 0}\sum_{u,v,w}\left(\mathcal{M}_{uvw}\log\frac{\mathcal{M}_{uvw}}{\widetilde{\mathcal{M}}_{uvw}}-\mathcal{M}_{uvw}+\widetilde{\mathcal{M}}_{uvw}\right).$$

$$\sum_{u,v,w}D_{\text{KL}}\left(\text{Pois}\left(x;\mathcal{M}_{uvw}\right)\,\big\|\,\text{Pois}\left(x;\widetilde{\mathcal{M}}_{uvw}\right)\right).$$

$$\bm{W}_{st}\leftarrow\bm{W}_{st}\frac{\left(\bm{Y}\bm{H}^{\top}\right)_{st}}{\left(\bm{W}\bm{H}\bm{H}^{\top}\right)_{st}},$$

$$\left[\begin{array}{cc|cc}1&3&2&2\\2&2&2&2\end{array}\right],$$

$$\bm{A}=\bm{C}=\begin{bmatrix}1&1\\1&0\end{bmatrix},\quad\bm{B}=\begin{bmatrix}2&-1\\2&1\end{bmatrix}.$$

$$\min_{\bm{W},\bm{H}\succeq 0}\left\|\bm{\Omega}*\left(\bm{Y}-\bm{W}\bm{H}\right)\right\|_{F}^{2}$$

$$\bm{\Omega}_{uv}=\begin{cases}1,&\bm{Y}_{uv}\geq 0\\0,&\bm{Y}_{uv}<0.\end{cases}$$

$$\bm{W}_{uv}\leftarrow\bm{W}_{uv}\frac{\left[\left(\bm{\Omega}*\bm{Y}\right)\bm{H}^{\top}\right]_{uv}+\epsilon}{\left[\left(\left(\bm{W}\bm{H}\right)*\bm{\Omega}\right)\bm{H}^{\top}\right]_{uv}+\epsilon},\quad\bm{H}_{uv}\leftarrow\bm{H}_{uv}\frac{\left[\bm{W}^{\top}\left(\bm{\Omega}*\bm{Y}\right)\right]_{uv}+\epsilon}{\left[\bm{W}^{\top}\left(\bm{\Omega}*\left(\bm{W}\bm{H}\right)\right)\right]_{uv}+\epsilon}.$$

$$F\left(\bm{h}\right)=\left\|\bm{\omega}*\left(\bm{v}-\bm{W}\bm{h}\right)\right\|_{F}^{2}.$$

$$G\left(\bm{h},\bm{h}^{t}\right)=F\left(\bm{h}^{t}\right)+\left(\bm{h}-\bm{h}^{t}\right)^{\top}\nabla F\left(\bm{h}^{t}\right)+\frac{1}{2}\left(\bm{h}-\bm{h}^{t}\right)^{\top}\bm{K}\left(\bm{h}-\bm{h}^{t}\right),$$

Code & Models

Repository: ZilongTan/ptpqp (official)

Taxonomy

Topics: Topic Modeling · Tensor decomposition and applications · Bayesian Methods and Mixture Models

Full text

Efficient Learning of Mixed Membership Models

Zilong Tan

Department of Computer Science

Duke University

Sayan Mukherjee

Departments of Statistical Science

Computer Science, Mathematics,

Biostatistics & Bioinformatics

Duke University

1 Introduction

Mixed membership models [36, 24, 25, 4, 11] have been used extensively across applications ranging from modeling population structure in genetics [24, 25] to topic modeling of documents [36, 4, 11]. Mixed membership models use Dirichlet latent variables to define cluster membership where samples can partially belong to each of $k$ latent components. Parameter estimation for such latent variable models (LVMs) using maximum likelihood methods such as expectation maximization is computationally intensive for large data, for example, when the number of samples $n$ is large.

Parameter estimation using the method of moments for LVMs is an attractive scalable alternative that has been shown to have certain theoretical and computational advantages over maximum likelihood methods in the setting when $n$ is large. For LVMs, method of moments approaches reduce to tensor methods—the moments of the model parameters are expressed as a function of statistics of the observations in a tensor form. Inference in this setting becomes a problem of tensor factorization. Computational advantages of using tensor methods have been observed for many popular models, including latent Dirichlet allocation [1], spherical Gaussian mixture models [15], hidden Markov models [1], independent component analysis [8], and multi-view models [2]. An appealing property of tensor methods is the guarantee of a unique decomposition under mild conditions [19, 22].

There are two complications to using standard tensor decomposition methods [3, 2, 14, 20, 23, 17, 7] for LVMs. The first problem is computation and space complexity. Given $p$ variables in the LVM, parameter inference typically requires factorizing a non-orthogonal estimator tensor of size $O\left(p^{3}\right)$ [2, 20, 23], which is prohibitive for large $p$ . When the estimator is orthogonal and symmetric, this can be done in $O\left(p^{2}\log p\right)$ [34]. Online tensor decomposition [16] uses dimension reduction to instead factorize a reduced $k$ -by- $k$ -by- $k$ tensor. However, the dimension reduction can be slower than decomposing the estimator directly for large sample sizes, and it can suffer from high variance [34]. We introduce a simple factorization with improved complexity for the general case where the parameters are not required to be orthogonal.

The second problem arises from negative entries in the empirical moments tensor. LVMs for count data are constrained to have nonnegative parameters. However, the empirical moments tensor computed from the data may contain negative elements due to sampling variation and noise. Indeed, for small sample sizes or data with many small or zero counts, there will be many negative entries in the empirical moments tensor. General tensor decomposition algorithms [20, 23], including the tensor power method (TPM) [2], do not guarantee the nonnegativity of model parameters. Approaches such as positive/nonnegative tensor factorization [7, 31, 35] also do not address this situation as they require all the elements of the tensor to be factorized to be nonnegative. With robust tensor methods [3, 14], sparse negative entries may potentially be treated as corrupted elements; however, these methods are not applicable in this setting since there can be many negative elements.

In this paper, we introduce a novel parameter inference algorithm called partitioned tensor parallel quadratic programming (PTPQP) that is efficient in the setting where the number of variables $p$ is much larger than the number of latent components $k$ . The algorithm is also robust to negative entries in the empirical moments tensor. There are two key innovations in the PTPQP algorithm. The first innovation is a partitioning technique which recovers the parameters through factorizing $O\left(p/k\right)$ much smaller sub-tensors each of size $O\left(k^{3}\right)$ . This technique can also be combined with methods [34, 32, 16] to obtain further improved complexities. The second innovation is a parallel quadratic programming [5] based algorithm to factor tensors with negative entries under the constraint that the factors are all nonnegative. To the best of our knowledge, this is the first algorithm designed to address the problem of negative entries in empirical estimator tensors. We show that the proposed factorization algorithm converges linearly with respect to each factor matrix. We also provide sufficient conditions under which the partitioned factorization scheme is consistent, i.e., the parameter estimates converge to the true parameters.

2 Preliminaries

Notations

We use bold lowercase letters to represent vectors and bold capital letters for matrices. Tensors are denoted by calligraphic capital letters. The subscript notation $\bm{A}_{j}$ refers to the $j$ -th column of matrix $\bm{A}$ . We denote the $j$ -th column of the identity matrix as $\bm{e}_{j}$ , and $\bm{1}$ is a vector of ones. We further write $\text{diag}\left(\bm{x}\right)$ for a diagonal matrix whose diagonal entries are $\bm{x}$ , and $\text{diag}\left(\bm{A}\right)$ for the vector of the diagonal entries of $\bm{A}$ .

Element-wise matrix operators include $\succ$ and $\succeq$ , e.g., $\bm{A}\succeq 0$ means that $\bm{A}$ has nonnegative entries. $\left(\cdot\right)_{+}$ refers to element-wise $\max\left(\cdot,0\right)$ . $*$ and $\oslash$ respectively represent element-wise multiplication and division. Moreover, $\times$ refers to the outer product and $\odot$ denotes the Khatri-Rao product. $\left\|\cdot\right\|_{F}$ and $\left\|\cdot\right\|_{2}$ represent the Frobenius norm and spectral norm, respectively.

Tensor basics

This paper uses tensor notation similar to that of [18]. In particular, we are primarily concerned with Kruskal tensors in $\mathbb{R}^{d_{1}\times d_{2}\times d_{3}}$ , which can be expressed in the form of

$$\mathcal{T}=\sum_{j=1}^{r}\bm{A}_{j}\times\bm{B}_{j}\times\bm{C}_{j},\tag{1}$$

where $\bm{A}$ , $\bm{B}$ , and $\bm{C}$ are respectively $d_{1}$ -by- $r$ , $d_{2}$ -by- $r$ , and $d_{3}$ -by- $r$ factor matrices. The rank of $\mathcal{T}$ is defined as the smallest $r$ that admits such a decomposition. The decomposition is known as the CP (CANDECOMP/PARAFAC) decomposition. The $j$ -mode unfolding of $\mathcal{T}$ , denoted by $\mathcal{T}_{\left(j\right)}$ , for $j=1,2,3$ is a $d_{j}$ -by- $\left(\prod_{t\neq j}d_{t}\right)$ matrix whose rows are serializations of the tensor fixing the index of the $j$ -th dimension. The unfoldings have the following well-known compact expressions:

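$$\mathcal{T}_{\left(1\right)}=\bm{A}\left(\bm{C}\odot\bm{B}\right)^{\top},\quad\mathcal{T}_{\left(2\right)}=\bm{B}\left(\bm{C}\odot\bm{A}\right)^{\top},\quad\mathcal{T}_{\left(3\right)}=\bm{C}\left(\bm{B}\odot\bm{A}\right)^{\top}.\tag{2}$$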
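The unfolding identities for a Kruskal tensor can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper names `khatri_rao`, `cp_tensor`, and `unfold` are hypothetical, and the unfolding is assumed to order columns so that the lower-numbered remaining mode varies fastest (the convention under which the identities hold as stated).

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Kronecker product: column r is kron(X[:, r], Y[:, r])."""
    return np.einsum("ir,jr->ijr", X, Y).reshape(X.shape[0] * Y.shape[0], -1)

def cp_tensor(A, B, C):
    """Kruskal tensor with entries T[a, b, c] = sum_r A[a, r] * B[b, r] * C[c, r]."""
    return np.einsum("ar,br,cr->abc", A, B, C)

def unfold(T, mode):
    """Unfold along `mode` (0-based); the lower-numbered remaining mode varies fastest."""
    rest = [m for m in range(T.ndim) if m != mode]
    # Reverse the remaining modes so NumPy's row-major reshape matches the convention.
    return np.transpose(T, [mode] + rest[::-1]).reshape(T.shape[mode], -1)

rng = np.random.default_rng(0)
A, B, C = rng.random((4, 3)), rng.random((5, 3)), rng.random((6, 3))
T = cp_tensor(A, B, C)

ok1 = np.allclose(unfold(T, 0), A @ khatri_rao(C, B).T)  # 1-mode unfolding
ok2 = np.allclose(unfold(T, 1), B @ khatri_rao(C, A).T)  # 2-mode unfolding
ok3 = np.allclose(unfold(T, 2), C @ khatri_rao(B, A).T)  # 3-mode unfolding
```

With a different column ordering in `unfold`, the roles of the two factors inside the Khatri-Rao product would swap, so the convention matters when reusing these expressions.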

3 Learning through Method of Moments

3.1 Generalized Dirichlet latent variable models

A generalized Dirichlet latent variable model (GDLM) was proposed in [39] for the joint distribution of $n$ observations $\bm{y}_{1},\bm{y}_{2},\cdots,\bm{y}_{n}$ . Each observation $\bm{y}_{i}$ consists of $p$ variables $\bm{y}_{i}=\left(y_{i1},y_{i2},\cdots,y_{ip}\right)^{\top}$ . GDLM assumes a generative process involving $k$ hidden components. For each observation, sample a random Dirichlet vector $\bm{x}_{i}=\left(x_{i1},x_{i2},\cdots,x_{ik}\right)^{\top}\in\Delta^{k-1}$ with concentration parameter $\bm{\alpha}=\left(\alpha_{1},\alpha_{2},\cdots,\alpha_{k}\right)^{\top}$ . The elements of $\bm{x}_{i}$ are the membership probabilities for $\bm{y}_{i}$ to belong to each of the $k$ components. Specifically,

$$y_{ij}\sim\sum_{h=1}^{k}x_{ih}\,g_{j}\left(\theta_{jh}\right),$$

where $g_{j}\left(\theta_{jh}\right)$ is the density of the $j$ -th variable specific to component $h$ with parameter $\bm{\theta}_{j}=\left(\bm{\theta}_{j1},\bm{\theta}_{j2},\cdots,\bm{\theta}_{jk}\right)$ . One advantage of GDLM is that $y_{ij}$ can take categorical values. Let $d_{j}$ denote the number of categories for the $j$ -th variable (set $d_{j}=1$ for scalar variables); then $\bm{\theta}_{j}$ is a $d_{j}$ -by- $k$ probability matrix whose $c$ -th row corresponds to category $c$ . We aim to accurately recover $\bm{\theta}_{j}$ from independent copies of $\bm{y}_{i}$ involving variables of mixed data types, either categorical or non-categorical.
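The generative process can be sketched for the purely categorical case. This is an illustrative sampler under stated assumptions, not the authors' code: the function name `sample_gdlm` is hypothetical, every variable is taken to be categorical, and each $g_{j}\left(\theta_{jh}\right)$ is read as the categorical distribution given by column $h$ of $\bm{\theta}_{j}$ .

```python
import numpy as np

def sample_gdlm(n, alpha, thetas, seed=0):
    """Draw n observations of categorical variables from the mixed membership process.

    alpha  : length-k Dirichlet concentration vector
    thetas : list of d_j-by-k matrices; column h is the category distribution
             of variable j under hidden component h
    """
    rng = np.random.default_rng(seed)
    k = len(alpha)
    X = rng.dirichlet(alpha, size=n)            # row i is the membership vector x_i
    Y = np.empty((n, len(thetas)), dtype=int)
    for i in range(n):
        for j, theta in enumerate(thetas):
            h = rng.choice(k, p=X[i])            # component picked with probability x_ih
            Y[i, j] = rng.choice(theta.shape[0], p=theta[:, h])
    return X, Y

# Hypothetical toy parameters: k = 3 components, three variables with 2, 3, 4 categories.
alpha = np.array([1.0, 0.5, 0.5])
setup = np.random.default_rng(1)
thetas = [setup.dirichlet(np.ones(d), size=len(alpha)).T for d in (2, 3, 4)]
X, Y = sample_gdlm(n=50, alpha=alpha, thetas=thetas)
```

Drawing a component per variable and then a category from that component is one way to sample from the mixture $\sum_{h}x_{ih}g_{j}\left(\theta_{jh}\right)$ ; the explicit loops favor readability over speed.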

3.2 Moment-based estimators

The moment estimators of latent variable models typically take the form of a tensor [2]. Consider the estimators of GDLM [39] for example. Let $\bm{b}_{ij}=\bm{e}_{y_{ij}}$ if variable $j$ is categorical; $\bm{b}_{ij}=y_{ij}$ otherwise. The second- and third-order parameter estimators for variables $j$ , $s$ , and $t$ are written

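$$\begin{aligned}\mathcal{M}^{js}&=\mathbb{E}\left[\bm{b}_{ij}\times\bm{b}_{is}\right]-\frac{\alpha_{0}}{\alpha_{0}+1}\mathbb{E}\left[\bm{b}_{ij}\right]\times\mathbb{E}\left[\bm{b}_{is}\right],\\ \mathcal{M}^{jst}&=\mathbb{E}\left[\bm{b}_{ij}\times\bm{b}_{is}\times\bm{b}_{it}\right]+\frac{2\alpha_{0}^{2}}{\left(\alpha_{0}+1\right)\left(\alpha_{0}+2\right)}\mathbb{E}\left[\bm{b}_{ij}\right]\times\mathbb{E}\left[\bm{b}_{is}\right]\times\mathbb{E}\left[\bm{b}_{it}\right]\\ &\quad-\frac{\alpha_{0}}{\alpha_{0}+2}\left(\mathbb{E}\left[\mathbb{E}\left[\bm{b}_{ij}\right]\times\bm{b}_{is}\times\bm{b}_{it}\right]+\mathbb{E}\left[\bm{b}_{ij}\times\mathbb{E}\left[\bm{b}_{is}\right]\times\bm{b}_{it}\right]+\mathbb{E}\left[\bm{b}_{ij}\times\bm{b}_{is}\times\mathbb{E}\left[\bm{b}_{it}\right]\right]\right),\end{aligned}\tag{3}$$

where $\alpha_{0}=\sum_{h}\alpha_{h}$ .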

Alternatively, $\mathcal{M}^{js}$ and $\mathcal{M}^{jst}$ have the following CP decomposition into parameters $\bm{\theta}_{j}$ :

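$$\mathcal{M}^{js}=\sum_{h=1}^{k}\frac{\alpha_{h}}{\alpha_{0}\left(\alpha_{0}+1\right)}\bm{\theta}_{jh}\times\bm{\theta}_{sh},\quad\mathcal{M}^{jst}=\sum_{h=1}^{k}\frac{2\alpha_{h}}{\alpha_{0}\left(\alpha_{0}+1\right)\left(\alpha_{0}+2\right)}\bm{\theta}_{jh}\times\bm{\theta}_{sh}\times\bm{\theta}_{th}.\tag{4}$$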

Appendix A provides the derivation details of these estimators. For the special case of latent Dirichlet allocation, $\mathcal{M}^{js}$ and $\mathcal{M}^{jst}$ are scalar joint probabilities.
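The equivalence between the moment form and the CP form of the estimators can be checked at the population level. The sketch below is an illustration under assumptions, not the derivation of Appendix A: it takes the standard Dirichlet moment expressions, assumes the variables are conditionally independent given $\bm{x}_{i}$ with $\mathbb{E}\left[\bm{b}_{ij}\mid\bm{x}_{i}\right]=\bm{\theta}_{j}\bm{x}_{i}$ , and uses hypothetical names (`dirichlet_moments`, `th_j`, `th_s`, `th_t`) and toy parameters.

```python
import numpy as np

def dirichlet_moments(alpha):
    """Exact first, second, and third moments of x ~ Dirichlet(alpha)."""
    a0 = alpha.sum()
    D, I = np.diag(alpha), np.eye(len(alpha))
    m1 = alpha / a0
    m2 = (np.outer(alpha, alpha) + D) / (a0 * (a0 + 1))
    m3 = (np.einsum("a,b,c->abc", alpha, alpha, alpha)
          + np.einsum("ab,c->abc", D, alpha)
          + np.einsum("ac,b->abc", D, alpha)
          + np.einsum("bc,a->abc", D, alpha)
          + 2 * np.einsum("a,ab,ac->abc", alpha, I, I)) / (a0 * (a0 + 1) * (a0 + 2))
    return m1, m2, m3

rng = np.random.default_rng(0)
alpha = np.array([0.6, 0.3, 0.1])            # toy concentration parameters
a0, k = alpha.sum(), len(alpha)
# Toy category-probability matrices (d-by-k, columns sum to one).
th_j, th_s, th_t = (rng.dirichlet(np.ones(d), size=k).T for d in (4, 5, 6))

m1, m2, m3 = dirichlet_moments(alpha)
Ej, Es, Et = th_j @ m1, th_s @ m1, th_t @ m1
Ejs, Ejt, Est = th_j @ m2 @ th_s.T, th_j @ m2 @ th_t.T, th_s @ m2 @ th_t.T
Ejst = np.einsum("abc,ia,jb,kc->ijk", m3, th_j, th_s, th_t)

# Moment form of the second- and third-order estimators.
M2_moment = Ejs - a0 / (a0 + 1) * np.outer(Ej, Es)
M3_moment = (Ejst
             + 2 * a0**2 / ((a0 + 1) * (a0 + 2)) * np.einsum("i,j,k->ijk", Ej, Es, Et)
             - a0 / (a0 + 2) * (np.einsum("i,jk->ijk", Ej, Est)
                                + np.einsum("j,ik->ijk", Es, Ejt)
                                + np.einsum("k,ij->ijk", Et, Ejs)))

# CP form in terms of the parameter matrices.
M2_cp = th_j @ np.diag(alpha / (a0 * (a0 + 1))) @ th_s.T
M3_cp = np.einsum("h,ih,jh,kh->ijk", 2 * alpha / (a0 * (a0 + 1) * (a0 + 2)), th_j, th_s, th_t)
```

Both forms agree exactly here because population moments are used; with sample averages in place of expectations, the moment form is only an estimate of the CP form and may contain negative entries.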

The parameters $\bm{\theta}_{j}$ are typically obtained by factorizing the block tensor $\mathcal{M}_{2}$ whose $\left(j,s\right)$ -th element is the empirical $\widehat{\mathcal{M}}^{js}$ and/or $\mathcal{M}_{3}$ whose $\left(j,s,t\right)$ -th element is the empirical $\widehat{\mathcal{M}}^{jst}$ [3, 2, 39]. Note that $\bm{\theta}_{j}$ are generally non-orthogonal, and thus preprocessing steps (see Appendix B) are needed for orthogonal decomposition methods [34, 32, 2]. The preprocessing can be expensive and often leads to suboptimal performance [33, 23]. Here, we highlight a few relevant observations:

•

$\mathcal{M}^{js}$ alone does not yield unique parameters $\bm{\theta}_{j}$ due to the well-known rotation problem. Suppose that $\bm{\theta}_{j}^{*}$ and $\bm{\theta}_{s}^{*}$ are the ground-truth parameters satisfying (3). Then, for any invertible $\bm{R}$ , there exists a decomposition $\bm{\theta}_{j}^{\prime}=\bm{\theta}_{j}^{*}\bm{R}$ and $\bm{\theta}_{s}^{\prime}=\bm{R}^{-1}\bm{\theta}_{s}^{*}$ that also satisfies (3) but does not consist of the ground-truth parameters. The ground-truth parameters are therefore not uniquely identifiable through $\mathcal{M}^{js}$ ; this is true even when enforcing nonnegativity constraints on the parameters [10].

•

$\mathcal{M}^{jst}$ is sufficient to uniquely recover the parameters under certain mild conditions [19]; for example, when any two of $\bm{\theta}_{j}$ , $\bm{\theta}_{s}$ , and $\bm{\theta}_{t}$ have linearly independent columns and the columns of the third are pair-wise linearly independent [22].

•

The empirical estimator $\mathcal{\widehat{M}}^{jst}$ generally contains negative entries due to variance and noise. The fraction of negative entries can approach 50%, as we shall see in experiments. We address this issue in § 4.4.

•

While the decomposition (4) can be unique up to permutation and rescaling, the correspondence between each column of the factor matrix and each hidden component may not be consistent across multiple decompositions. Techniques for achieving consistency are developed in § 4.2.

3.3 Computational complexity

Tensor methods such as TPM typically decompose the $O\left(p^{3}d_{max}^{3}\right)$ full estimator tensor that includes all variables. More efficient algorithms have been developed for the case that parameters are orthogonal [34, 32], and when the sample size is small [16]. However, these methods do not apply in the general case where the parameters are non-orthogonal and the sample size can be potentially large. A key insight underlying our approach is that it is sufficient to recover the parameters by factorizing only $O\left(p/k\right)$ much smaller sub-tensors each of size $O\left(k^{3}\right)$ . This technique can also be combined with the aforementioned methods to further improve the complexity in certain cases.

4 An efficient algorithm

In this section, we develop partitioned tensor parallel quadratic programming (PTPQP), an efficient approximate algorithm for learning mixed membership models. We first introduce a novel partitioning-and-matching scheme that reduces parameter estimation to factorizing a sequence of sub-tensors. Then, we develop a nonnegative factorization algorithm that can handle negative entries in the sub-tensors.

4.1 Partitioned factorization

Factorizing the full tensor formed by all $\mathcal{M}^{jst}$ is expensive, while a three-variable tensor $\mathcal{M}^{jst}$ in (4) alone may not be sufficient to determine $\bm{\theta}_{j}$ when $k$ is large. In this section, we consider factorizing the sub-tensors corresponding to a cover of the set of variables $\left[p\right]$ such that each sub-tensor admits an identifiable CP decomposition (1), i.e., one that is unique up to permutation and rescaling of columns. This gives the parameters for all variables. If $p>k$ and the maximum number of categories $d_{\text{max}}$ is a constant, the aggregated size of the sub-tensors, $O\left(pk^{2}\right)$ , can be much smaller than the size $O\left(p^{3}\right)$ of the full estimator.
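To get a feel for the size reduction, the following back-of-the-envelope sketch compares entry counts. It is only an illustration: the constants hidden in the $O\left(\cdot\right)$ notation are taken to be 1, $d_{\text{max}}=1$ is assumed, and the function names are hypothetical.

```python
def full_tensor_entries(p):
    """Entry count of the full third-order estimator over p scalar variables."""
    return p ** 3

def partitioned_entries(p, k):
    """Entry count of ceil(p / k) sub-tensors, each k-by-k-by-k."""
    n_sub = -(-p // k)  # ceiling division
    return n_sub * k ** 3

p, k = 10_000, 20
full = full_tensor_entries(p)          # 10**12 entries
part = partitioned_entries(p, k)       # 500 sub-tensors of 8000 entries each
ratio = full // part
```

For these toy values the aggregated sub-tensor size is $pk^{2}=4\times 10^{6}$ entries, against $10^{12}$ for the full estimator.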

Let $\pi^{j}$ , $\pi^{s}$ , and $\pi^{t}$ denote ordered subsets $\subseteq\left[p\right]$ , with cardinality $\left|\pi^{j}\right|=p_{j}$ , $\left|\pi^{s}\right|=p_{s}$ , and $\left|\pi^{t}\right|=p_{t}$ , respectively. Consider the $p_{j}$ -by- $p_{s}$ -by- $p_{t}$ block tensor (for block tensor operations, see e.g., [26]) $\mathcal{M}^{\pi^{j}\pi^{s}\pi^{t}}$ whose $\left(u,v,w\right)$ -th element is the tensor $\mathcal{M}_{uvw}^{\pi^{j}\pi^{s}\pi^{t}}=\mathcal{M}^{\pi_{u}^{j}\pi_{v}^{s}\pi_{w}^{t}}$ . From (4), we have that

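$$\mathcal{M}^{\pi^{j}\pi^{s}\pi^{t}}=\sum_{h=1}^{k}\frac{2\alpha_{h}}{\alpha_{0}\left(\alpha_{0}+1\right)\left(\alpha_{0}+2\right)}\begin{bmatrix}\bm{\theta}_{\pi_{1}^{j}h}\\ \bm{\theta}_{\pi_{2}^{j}h}\\ \vdots\\ \bm{\theta}_{\pi_{p_{j}}^{j}h}\end{bmatrix}\times\begin{bmatrix}\bm{\theta}_{\pi_{1}^{s}h}\\ \bm{\theta}_{\pi_{2}^{s}h}\\ \vdots\\ \bm{\theta}_{\pi_{p_{s}}^{s}h}\end{bmatrix}\times\begin{bmatrix}\bm{\theta}_{\pi_{1}^{t}h}\\ \bm{\theta}_{\pi_{2}^{t}h}\\ \vdots\\ \bm{\theta}_{\pi_{p_{t}}^{t}h}\end{bmatrix}.\tag{5}$$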

Clearly, the block tensor is identifiable if it has an identifiable sub-tensor. If a sub-tensor $\mathcal{M}^{\pi^{u}\pi^{v}\pi^{w}}$ is identifiable, then one can construct an identifiable tensor $\mathcal{M}^{\pi^{j\prime}\pi^{s\prime}\pi^{t\prime}}$ from $\mathcal{M}^{\pi^{j}\pi^{s}\pi^{t}}$ by setting

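$$\pi^{j\prime}=\pi^{j}\cup\pi^{u},\quad\pi^{s\prime}=\pi^{s}\cup\pi^{v},\quad\pi^{t\prime}=\pi^{t}\cup\pi^{w}.\tag{6}$$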

We further remark that a sub-tensor can be identifiable under mild conditions, for example, if the sum of the Kruskal ranks of the three factor matrices is at least $2k+2$ [19].

Given an identifiable sub-tensor $\mathcal{M}^{\pi^{u}\pi^{v}\pi^{w}}$ of anchor variables indexed by $\pi^{u}$ , $\pi^{v}$ , and $\pi^{w}$ , the partitioning produces a set of sub-tensors (partitions), constructed through (6), that together include all variables. Thus, $\mathcal{M}^{\pi^{u}\pi^{v}\pi^{w}}$ is a common sub-tensor shared across all partitions. We choose anchor variables whose parameter matrices are of full column rank to obtain an identifiable $\mathcal{M}^{\pi^{u}\pi^{v}\pi^{w}}$ . Finally, one can divide the remaining variables evenly and randomly among the partitions.
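One possible way to build such partitions is sketched below. This is a hedged illustration rather than the paper's procedure: the function `make_partitions` and its `per_mode` parameter (the number of non-anchor variables appended to each mode of a partition) are hypothetical, and the anchor index sets are assumed to be given.

```python
import numpy as np

def make_partitions(p, anchors, per_mode, seed=0):
    """Return index triples (pi_j', pi_s', pi_t'), each the union of an anchor
    set with a random share of the remaining variables.

    anchors  : three lists of variable indices (pi_u, pi_v, pi_w)
    per_mode : non-anchor variables appended to each mode of a partition
    """
    rng = np.random.default_rng(seed)
    anchor_set = set().union(*anchors)
    rest = np.array([v for v in range(p) if v not in anchor_set])
    rng.shuffle(rest)                               # random assignment
    partitions = []
    for start in range(0, len(rest), 3 * per_mode):
        chunk = rest[start:start + 3 * per_mode]
        modes = np.array_split(chunk, 3)            # even split across the three modes
        partitions.append(tuple(list(a) + m.tolist() for a, m in zip(anchors, modes)))
    return partitions

anchors = ([0, 1], [2, 3], [4, 5])
partitions = make_partitions(p=20, anchors=anchors, per_mode=2)
covered = {v for part in partitions for mode in part for v in mode}
```

Every partition shares the anchor variables, which is what later allows the columns of different factorizations to be matched against each other.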

4.2 Matching parameters with hidden components

Since the factorization of a partition (5) can only be identifiable up to permutation and rescaling of the columns of constituent $\bm{\theta}_{j}$ , the correspondence between the columns of $\bm{\theta}_{j}$ and hidden components can differ across partitions. To enforce consistency, we associate a permutation operator $\psi^{j}$ for each variable $j$ such that $\left(\psi^{j}\bm{\theta}_{j}\right)_{h}$ are the parameters specific to hidden component $h$ across all variables $j$ . Consider the following vector representation of $\psi$ :

[TABLE]

Observe that $\psi^{j}=\psi^{s}=\psi^{t}$ within a factorization of $\mathcal{M}^{jst}$ , and this also holds for the partitioned factorization (5) of $\mathcal{M}^{\pi^{j}\pi^{s}\pi^{t}}$ , i.e., $\psi^{x}=\psi^{y},\quad\forall x,y\in\pi^{j}\cup\pi^{s}\cup\pi^{t}.$

Consider the factorizations of $\mathcal{\mathcal{M}}^{\pi^{j}\pi^{s}\pi^{t}}$ and $\mathcal{\mathcal{M}}^{\pi^{u}\pi^{v}\pi^{w}}$ and suppose that $\exists x\in\left(\pi^{j}\cup\pi^{s}\cup\pi^{t}\right)\cap\left(\pi^{u}\cup\pi^{v}\cup\pi^{w}\right)$ . The permutation operator for one factorization is determined given the other by column matching the parameters of variable $x$ in both factorizations. Thus, an inductive way to achieve a consistent factorization is to start with one factorization, and let its permutation be the identity $\left(1,2,\cdots,k\right)$ , then perform the factorization over new sets of variables with at least one variable in common with the initial factorization. Permutations for the sequential factorizations are determined via column matching parameter matrices of the common variables.

Given two factorized parameter matrices $\bm{\theta}_{j}$ and $\bm{\theta}_{j}^{\prime}$ of variable $j$ , our goal is to find a consistent permutation $\psi$ (of $\bm{\theta}_{j}$ with respect to $\bm{\theta}_{j}^{\prime}$ ) such that $\left(\psi\bm{\theta}_{j}\right)_{h}$ and $\bm{\theta}_{jh}^{\prime}$ correspond to the same hidden component for all $h\in\left[k\right]$ . We now present an algorithm with provable guarantees to compute a consistent permutation.

Smallest angle matching

A simple matching algorithm is to match the two columns of the two parameter matrices that have the smallest angle between them. Consider the factorizations of $\mathcal{M}^{jst}$ and $\mathcal{M}^{juv}$ which yield respectively parameters $\bm{\theta}_{j}$ and $\bm{\theta}_{j}^{\prime}$ for the common variable $j$ . Given the permutation $\psi^{j}$ for $\mathcal{M}^{jst}$ , the permutation $\psi^{u}$ for $\mathcal{M}^{juv}$ is computed by:

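$$\psi_{s}^{u}=\arg\max_{t}\left(\bm{\bar{\theta}}_{j}^{\prime\top}\psi^{j}\bm{\bar{\theta}}_{j}\right)_{ts}.\tag{7}$$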

Here, $\bm{\bar{\theta}}_{j}$ and $\bm{\bar{\theta}}_{j}^{\prime}$ represent respectively the normalized $\bm{\theta}_{j}$ and $\bm{\theta}_{j}^{\prime}$ with each column having unit Euclidean norm.

There are cases in which $\psi^{u}$ computed via (7) is not consistent: 1) $\psi^{u}$ contains duplicate entries and hence is not a valid permutation; and 2) since $\bm{\theta}_{j}$ and $\bm{\theta}_{j}^{\prime}$ are factorized parameter matrices that are generally perturbed from the ground truth, the resulting $\psi^{u}$ may differ from the consistent permutation. To cope with these cases, we establish in § 5 sufficient conditions for $\psi^{u}$ to be consistent.
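Smallest angle matching amounts to an argmax over cosine similarities between normalized columns. The sketch below is illustrative only: it assumes the reference permutation $\psi^{j}$ is the identity, and the names `normalize_columns` and `smallest_angle_match` are hypothetical.

```python
import numpy as np

def normalize_columns(M):
    """Scale every column to unit Euclidean norm."""
    return M / np.linalg.norm(M, axis=0, keepdims=True)

def smallest_angle_match(theta, theta_prime):
    """For each column s of theta, return the index t of the column of
    theta_prime with the largest cosine (smallest angle), plus a validity flag."""
    cos = normalize_columns(theta_prime).T @ normalize_columns(theta)  # entry (t, s)
    psi = cos.argmax(axis=0)
    is_valid = len(set(psi.tolist())) == len(psi)   # duplicates mean no permutation
    return psi, is_valid

# Toy check: theta_prime holds the columns of theta, permuted and rescaled.
rng = np.random.default_rng(1)
theta = rng.random((6, 3))
true_psi = np.array([2, 0, 1])
theta_prime = np.empty_like(theta)
theta_prime[:, true_psi] = theta * np.array([2.0, 0.5, 3.0])
psi, is_valid = smallest_angle_match(theta, theta_prime)
```

Because each column is matched independently, `is_valid` can be false when the factor matrices are noisy, which is the first failure case described above.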

Orthogonal Procrustes matching

One issue with the smallest angle matching is that each column is paired independently. It is easy for multiple columns to be paired with a common nearest neighbor. We describe a more robust algorithm based on the orthogonal Procrustes problem, and show improved guarantees. Since a consistent permutation is orthogonal, a natural relaxation is to only require the operator to be orthogonal. This is an orthogonal Procrustes problem, formulated in the same setting as the smallest angle matching above:

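$$\min_{\bm{\Psi}}\left\|\bm{\bar{\theta}}_{j}^{\prime}\bm{\Psi}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2},\quad\text{s.t. }\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}.\tag{8}$$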

Let $\bm{\bar{\theta}}_{j}^{\prime\top}\psi^{j}\bm{\bar{\theta}}_{j}=\bm{U}\bm{\Sigma}\bm{V}^{\top}$ be the singular value decomposition (SVD). The solution $\bm{\Psi}^{*}$ is then given by the polar factor [29]

$$\bm{\Psi}^{*}=\bm{U}\bm{V}^{\top}.\tag{9}$$

Here, $\bm{\Psi}^{*}$ is orthogonal and does not immediately imply the desired permutation $\psi^{u}$ . To compute $\psi^{u}$ , one can additionally restrict $\bm{\Psi}$ to be a permutation matrix, and solve for $\psi^{u}$ using linear programming [13]. Aside from efficiency, one fundamental question is under what assumptions the objective (8) yields the consistent permutation.

Given the solution $\bm{\Psi}^{*}$ to the Procrustes problem, we propose the following simple algorithm for computing $\psi^{u}$ :

$$\psi_{s}^{u}=\arg\max_{t}\bm{\Psi}_{ts}^{*}.\tag{10}$$

We first establish through Theorem 1 that if $\psi^{u}$ obtained using (10) is a valid permutation, i.e., no duplicate entries, then it is optimal in terms of the objective (8).

Theorem 1.

The $\psi^{u}$ obtained using (10) satisfies

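$$\left\|\psi^{u}\bm{\bar{\theta}}_{j}^{\prime}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2}\leq\left\|\psi\bm{\bar{\theta}}_{j}^{\prime}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2}$$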

for all permutations $\psi$ .

Proof.

First, rewrite the objective (8) as follows

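$$\left\|\bm{\bar{\theta}}_{j}^{\prime}\bm{\Psi}-\psi^{j}\bm{\bar{\theta}}_{j}\right\|_{F}^{2}=\left\|\bm{\bar{\theta}}_{j}^{\prime}\right\|_{F}^{2}+\left\|\bm{\bar{\theta}}_{j}\right\|_{F}^{2}-2\operatorname{tr}\bm{\Psi}^{\top}\bm{\bar{\theta}}_{j}^{\prime\top}\psi^{j}\bm{\bar{\theta}}_{j}.\tag{11}$$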

Recall the SVD $\bm{\bar{\theta}}_{j}^{\prime\top}\psi^{j}\bm{\bar{\theta}}_{j}=\bm{U}\bm{\Sigma}\bm{V}^{\top}$ , and write $\bm{\Psi}=\bm{U}\bm{V}^{\top}+\bm{E}$ . Keeping only the terms in (11) that depend on $\bm{E}$ yields $-2\operatorname{tr}\bm{E}^{\top}\bm{U}\bm{\Sigma}\bm{V}^{\top}$ . Thus, the optimization (8) is equivalent to

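$$\max_{\bm{E}}\operatorname{tr}\bm{V}^{\top}\bm{E}^{\top}\bm{U}\bm{\Sigma},\quad\text{s.t. }\left(\bm{U}\bm{V}^{\top}+\bm{E}\right)^{\top}\left(\bm{U}\bm{V}^{\top}+\bm{E}\right)=\bm{I}.$$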

From the constraint, we obtain $\operatorname{tr}\bm{E}^{\top}\bm{E}=-2\operatorname{tr}\bm{V}^{\top}\bm{E}^{\top}\bm{U}$ . The optimization now becomes

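$$\min_{\bm{E}}\operatorname{tr}\bm{E}^{\top}\bm{E}\bm{\Sigma}=\min_{\bm{E}}\sum_{j}\left(\bm{E}^{\top}\bm{E}\right)_{jj}\bm{\Sigma}_{jj}.\tag{12}$$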

Let us now restrict each column of $\bm{\Psi}$ to be in $\left\{\bm{e}_{j}\;|\;j=1,2,\cdots,k\right\}$ , but not necessarily distinct. Suppose that $\bm{\Psi}_{j}=\bm{e}_{y}$ . We have that $\bm{E}_{j}=\bm{e}_{y}-\left(\bm{U}\bm{V}^{\top}\right)_{j}$ . Clearly, (12) and hence (8) are minimized with $y=\arg\max_{t}\left(\bm{U}\bm{V}^{\top}\right)_{tj}$ . ∎

In § 5, we state sufficient conditions under which the objective (8) yields a consistent permutation.
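The Procrustes-based matching can be sketched with a single SVD. As before, this is a minimal illustration under assumptions: the reference permutation $\psi^{j}$ is taken to be the identity, the function name `procrustes_match` is hypothetical, and the toy data add only a small perturbation to permuted, rescaled columns.

```python
import numpy as np

def procrustes_match(theta, theta_prime):
    """Orthogonal Procrustes relaxation followed by a column-wise argmax.

    Returns psi, where psi[s] is the column of theta_prime matched to column s
    of theta, together with the orthogonal polar factor Psi = U V^T."""
    unit = lambda M: M / np.linalg.norm(M, axis=0, keepdims=True)
    G = unit(theta_prime).T @ unit(theta)       # k-by-k matrix of cosines
    U, _, Vt = np.linalg.svd(G)
    Psi = U @ Vt                                # polar factor of G
    psi = Psi.argmax(axis=0)                    # psi[s] = argmax_t Psi[t, s]
    return psi, Psi

rng = np.random.default_rng(2)
theta = rng.random((6, 3))
true_psi = np.array([1, 2, 0])
theta_prime = np.empty_like(theta)
theta_prime[:, true_psi] = theta * np.array([0.5, 2.0, 1.5]) + 1e-3 * rng.random((6, 3))
psi, Psi = procrustes_match(theta, theta_prime)
```

In the noise-free case the matrix of cosines equals a permutation matrix times a positive definite Gram matrix, so its polar factor is exactly that permutation matrix and the argmax recovers it.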

4.3 Approximate nonnegative factorization

In previous sections, we reduced the inference problem to factorizing partitioned sub-tensors. We now present a factorization algorithm for the sub-tensors that contain negative entries. Our goal is to approximate a sub-tensor $\mathcal{M}$ by a sub-tensor $\widetilde{\mathcal{M}}=\sum_{j}\bm{A}_{j}\times\bm{B}_{j}\times\bm{C}_{j}$ where the factors $\bm{A}$ , $\bm{B}$ , and $\bm{C}$ are nonnegative. The Frobenius norm is used to quantify the approximation

$$\min_{\bm{A},\bm{B},\bm{C}\succeq 0}\left\|\mathcal{M}-\widetilde{\mathcal{M}}\right\|_{F}.\tag{13}$$

Note that we do not assume that $\mathcal{M}\succeq 0$ in (13) which distinguishes our optimization problem from other approximate factorization algorithms [35, 7, 17, 31]. In § 4.4, we provide some details as to why negative entries are problematic for standard approximate factorization algorithms. We can rewrite (13) using the 1-mode unfolding as

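$$\min_{\bm{A},\bm{B},\bm{C}\succeq 0}\left\|\mathcal{M}_{\left(1\right)}-\bm{A}\left(\bm{C}\odot\bm{B}\right)^{\top}\right\|_{F}.\tag{14}$$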

Equivalent formulations with respect to the 2-mode and 3-mode unfoldings can be readily obtained from (2).

We point out that another widely-used error measure — the I-divergence [12, 7] — may not be suitable for our learning problem. The optimization using I-divergence is given by

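$$\min_{\bm{A},\bm{B},\bm{C}\succeq 0}\sum_{u,v,w}\left(\mathcal{M}_{uvw}\log\frac{\mathcal{M}_{uvw}}{\widetilde{\mathcal{M}}_{uvw}}-\mathcal{M}_{uvw}+\widetilde{\mathcal{M}}_{uvw}\right).$$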

This optimization is useful for nonnegative $\mathcal{M}$ when each entry follows a Poisson distribution. In this case, the objective is equivalent to the sum of Kullback-Leibler divergence across all entries of $\mathcal{M}$ :

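$$\sum_{u,v,w}D_{\text{KL}}\left(\text{Pois}\left(x;\mathcal{M}_{uvw}\right)\,\big\|\,\text{Pois}\left(x;\widetilde{\mathcal{M}}_{uvw}\right)\right).$$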

However, the Poisson assumption does not generally hold for the estimator tensor (4).

4.4 Handling negative entries in empirical estimators

We first illustrate that factorizing a tensor with negative entries using either positive tensor factorization [35] or nonnegative tensor factorization [7, 31] will either result in factors that violate the nonnegativity constraint or cause the algorithm to diverge. In addition, we show that general tensor decompositions cannot enforce the factor nonnegativity even after rounding the negative entries to zero.

We then present a simple method based on weighted nonnegative matrix factorization (WNMF) [37] that enforces the factor nonnegativity constraint. We further generalize this method using parallel quadratic programming (PQP) [5] to obtain a method with a provable convergence rate.

Issue of negative entries

If the tensor is strictly nonnegative, the optimization specified in (13) can be reduced to nonnegative matrix factorization (NMF). Solvers abound for NMF including the celebrated Lee-Seung’s multiplicative updates [21]. The reduction is done by viewing (14) as $\left\|\bm{Y}-\bm{W}\bm{H}\right\|_{F}^{2}$ with $\bm{Y}=\mathcal{M}_{\left(1\right)}^{jst}$ , $\bm{W}=\bm{A}$ , and $\bm{H}=\left(\bm{C}\odot\bm{B}\right)^{\top}$ , and alternating

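$$\bm{W}_{st}\leftarrow\bm{W}_{st}\frac{\left(\bm{Y}\bm{H}^{\top}\right)_{st}}{\left(\bm{W}\bm{H}\bm{H}^{\top}\right)_{st}},\tag{15}$$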

over each unfolding and factor matrix $\bm{W}$ . Obviously, the updates may yield negative entries in $\bm{W}$ when the unfolding contains negative entries. In addition, convergence relies on the nonnegativity of the unfolding [21]. This issue extends to their tensor factorization variants [35, 7, 17] known as the positive tensor factorization and nonnegative tensor factorization. For these approaches, a naive resolution is to round negative entries of $\mathcal{\widehat{M}}^{jst}$ to zero; this, however, lacks theoretical guarantees.

It is important to note that the rounding does not help general tensor decompositions like TPM. The following example illustrates that the unique decomposition (up to permutation and rescaling) of a positive tensor can contain negative entries. Consider a $2$ -by- $2$ -by- $2$ positive tensor, whose $1$ -mode unfolding is given by

$$\left[\begin{array}{cc|cc}1&3&2&2\\2&2&2&2\end{array}\right],$$

where the vertical bar separates two frontal slices. It has the following decomposition, written in the form of (1):

$$\bm{A}=\bm{C}=\begin{bmatrix}1&1\\1&0\end{bmatrix},\quad\bm{B}=\begin{bmatrix}2&-1\\2&1\end{bmatrix}.$$

Since all factors are of full-rank, the decomposition is unique up to permutation and rescaling of columns [19]. Thus, a general tensor decomposition yields a $\bm{B}$ with negative entries regardless of rescaling.
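The example can be verified numerically. The sketch below assumes one reading of the example's matrices, namely $\bm{A}=\bm{C}$ with rows $(1,1)$ and $(1,0)$ and $\bm{B}$ with rows $(2,-1)$ and $(2,1)$ ; under that assumption, the 1-mode unfolding $\bm{A}\left(\bm{C}\odot\bm{B}\right)^{\top}$ is strictly positive even though $\bm{B}$ has a negative entry.

```python
import numpy as np

# Factor matrices as read from the example (an assumption about the layout).
A = C = np.array([[1, 1],
                  [1, 0]])
B = np.array([[2, -1],
              [2, 1]])

# Khatri-Rao product of C and B: column r is kron(C[:, r], B[:, r]).
CkrB = np.einsum("ir,jr->ijr", C, B).reshape(4, 2)

# 1-mode unfolding; the first two columns form the first frontal slice.
T1 = A @ CkrB.T

all_positive = bool((T1 > 0).all())
b_has_negative = bool((B < 0).any())
```

Rescaling a column of $\bm{B}$ by a positive constant cannot remove the sign difference within its second column, and flipping that column's sign only moves the negative entry.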

4.5 Factorization via WNMF

Since the ground-truth $\mathcal{M}^{jst}$ are nonnegative, we may “ignore” the negative entries of $\widehat{\mathcal{M}}^{jst}$ by treating them as missing values. This idea leads to the following modified objective:

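$$\min_{\bm{W},\bm{H}\succeq 0}\left\|\bm{\Omega}*\left(\bm{Y}-\bm{W}\bm{H}\right)\right\|_{F}^{2}\tag{16}$$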

where $\bm{Y}$ , $\bm{W}$ , $\bm{H}$ are chosen identically as (15), and we define

$$\bm{\Omega}_{uv}=\begin{cases}1,&\bm{Y}_{uv}\geq 0\\0,&\bm{Y}_{uv}<0.\end{cases}$$

The optimization can be carried out using WNMF. Here, we modify the original updates by introducing a positive constant $\epsilon$ to ensure that the updates are well-defined:

[TABLE]

Theorem 2 states the correctness of the modified updates (17).

Theorem 2.

The objective (16) is non-increasing under the multiplicative updates (17).

Proof.

We prove the update for $\bm{H}$ ; the update for $\bm{W}$ follows by applying the same argument to $\left\|\bm{\Omega}^{\top}*\left(\bm{V}^{\top}-\bm{H}^{\top}\bm{W}^{\top}\right)\right\|_{F}$ , where $\bm{V}$ denotes the data matrix $\bm{Y}$ of (16). First, consider the Frobenius-norm error for a column $\bm{h}$ of $\bm{H}$ , and the corresponding columns $\bm{\omega}$ of $\bm{\Omega}$ and $\bm{v}$ of $\bm{V}$ ,

[TABLE]

The following $G\left(\cdot,\cdot\right)$ is an auxiliary function of $F\left(\cdot\right)$ :

[TABLE]

where we define

[TABLE]

Clearly, $G\left(\bm{h},\bm{h}\right)=F\left(\bm{h}\right)$ , and one can show that $G\left(\bm{h},\bm{h}^{t}\right)\geq F\left(\bm{h}\right)$ by rewriting

[TABLE]

where we note that $\bm{\omega}*\bm{\omega}=\bm{\omega}$ from the Boolean definition of $\bm{\omega}$ . Comparing (18) with (19), it is sufficient to show that $\bm{K}-\bm{W}^{\top}\text{diag}\left(\bm{\omega}\right)\bm{W}$ is positive semi-definite. Now consider the scaled matrix

[TABLE]

Observe that $\bm{U}$ is strictly diagonally dominant, as $\bm{U}\bm{1}\succ 0$ and the off-diagonal entries are negative. Since all diagonal entries of $\bm{U}$ are also positive, it follows that $\bm{U}$ is positive semi-definite. We thereby conclude that $\bm{K}-\bm{W}^{\top}\text{diag}\left(\bm{\omega}\right)\bm{W}$ is positive semi-definite.

Letting $\bm{h}^{t+1}=\arg\min_{\bm{h}}G\left(\bm{h},\bm{h}^{t}\right)$ , we have that $F\left(\bm{h}^{t}\right)=G\left(\bm{h}^{t},\bm{h}^{t}\right)\geq G\left(\bm{h}^{t+1},\bm{h}^{t}\right)\geq F\left(\bm{h}^{t+1}\right)$ . The minimizer $\bm{h}^{t+1}$ is obtained by setting $\nabla_{\bm{h}^{t+1}}G\left(\bm{h}^{t+1},\bm{h}^{t}\right)=0$ , which yields

[TABLE]

The particular choice of $\bm{\Omega}$ guarantees that $\bm{h}^{t+1}$ is always positive. ∎
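The masked updates can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation: it assumes that $\bm{\Omega}$ is the Boolean mask of nonnegative entries of $\bm{Y}$ , and that $\epsilon$ is added to both the numerator and the denominator of each update, which is the form consistent with the auxiliary-function argument above and with the PQP form used in the proof of Theorem 3. The function name `wnmf` and the synthetic data are hypothetical.

```python
import numpy as np

def wnmf(Y, k, eps=1e-9, iters=300, seed=0):
    """Masked multiplicative updates for ||Omega * (Y - W H)||_F^2.

    Negative entries of Y are treated as missing (weight 0).
    Assumption: eps enters numerator and denominator alike.
    """
    rng = np.random.default_rng(seed)
    M = (Y >= 0).astype(float)          # Boolean weights Omega
    Yp = M * Y                          # masked data, entrywise nonnegative
    m, n = Y.shape
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n)) + 0.1
    objs = []
    for _ in range(iters):
        H *= (W.T @ Yp + eps) / (W.T @ (M * (W @ H)) + eps)
        W *= (Yp @ H.T + eps) / ((M * (W @ H)) @ H.T + eps)
        objs.append(np.linalg.norm(M * (Y - W @ H)) ** 2)
    return W, H, objs

rng = np.random.default_rng(2)
truth = 0.3 * rng.random((8, 3)) @ rng.random((3, 10))   # nonnegative ground truth
Y = truth + 0.3 * rng.standard_normal(truth.shape)       # noise creates negative entries
W, H, objs = wnmf(Y, k=3)
```

Because every term in each ratio is positive, the factors remain strictly positive, and the recorded objective values form a non-increasing sequence, as Theorem 2 predicts.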

4.6 Parallel quadratic programming

We now generalize the WNMF approach using parallel quadratic programming to obtain a convergence rate. Let $\mathbb{S}_{++}$ denote the set of symmetric positive definite matrices. We consider the following optimization problem

[TABLE]

which can be solved by iterating multiplicative updates [5, 30]. We use the parallel quadratic programming (PQP) algorithm [5, 6] to solve (20), partly because it has a provable linear convergence rate. The PQP multiplicative update for (20) takes the following simple form:

[TABLE]

with

[TABLE]

Here $\bm{\gamma}$ and $\bm{\phi}$ are arguments to PQP; we discuss them in § 5.2. The update maintains nonnegativity since all terms are nonnegative. We make the following observation.

Theorem 3.

The multiplicative updates for Lee-Seung and WNMF are special cases of PQP.

Proof.

Since WNMF (16) generalizes Lee-Seung, which is the case where $\bm{\Omega}$ has all ones, it suffices to prove the claim for WNMF. Letting $\bm{\Lambda}=\bm{\Omega}*\bm{\Omega}$ and $\bm{\gamma}=0$ , some matrix algebra yields the following PQP updates

[TABLE]

Comparing (22) to (17), they are equivalent if $\Phi_{uv}=\Phi_{uv}^{\prime}=\epsilon$ . ∎
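To make the PQP iteration concrete, here is a minimal sketch for a problem of the form $\min_{\bm{x}\succeq 0}\frac{1}{2}\bm{x}^{\top}\bm{Q}\bm{x}+\bm{z}^{\top}\bm{x}$ . The objective form is an assumption, chosen to be consistent with the KKT condition $x_{i}^{*}\left(\bm{Q}_{i}\bm{x}^{*}+z_{i}\right)=0$ used in § 5.2, and the positive/negative splitting follows the style of [5]. The function name `pqp`, the choice $\bm{\gamma}=\left(-\bm{Q}\right)_{+}\bm{1}$ , and the small constant used for $\bm{\phi}$ are illustrative; in particular, this $\bm{\phi}$ is not the bound of Theorem 7.

```python
import numpy as np

def pqp(Q, z, phi=1e-3, iters=500):
    """Multiplicative PQP-style iteration for min 0.5 x'Qx + z'x, x >= 0.

    Assumed splitting: Q+ = max(Q, 0) + diag(gamma), Q- = max(-Q, 0) + diag(gamma),
    z+ = max(z, 0) + phi, z- = max(-z, 0) + phi.
    """
    gamma = np.maximum(-Q, 0).sum(axis=1)       # gamma = (-Q)_+ 1
    Qp = np.maximum(Q, 0) + np.diag(gamma)
    Qm = np.maximum(-Q, 0) + np.diag(gamma)
    zp = np.maximum(z, 0) + phi
    zm = np.maximum(-z, 0) + phi
    x = np.ones_like(z, dtype=float)            # strictly positive start
    for _ in range(iters):
        x = x * (Qm @ x + zm) / (Qp @ x + zp)   # every term is nonnegative
    return x

Q = np.array([[2.0, -1.0], [-1.0, 2.0]])        # symmetric positive definite
z = np.array([-1.0, 1.0])
x = pqp(Q, z)
```

For this small instance the unconstrained minimizer has a negative second coordinate, so the constrained optimum lies on the boundary at $\left(1/2,0\right)$ ; the iterates approach it while staying nonnegative throughout.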

We can now solve the approximate nonnegative factorization problem stated in (13) using (21). Theorem 4 states the multiplicative updates. A more detailed discussion of $\bm{\Phi}$ is included in § 5.2. We present pseudo-code in Algorithm 1.

Theorem 4.

For optimization (13), the following update converges linearly to a local optimum

[TABLE]

with

[TABLE]

where $\lambda_{\text{min}}\left(\cdot\right)$ is the smallest eigenvalue. Similar updates for $\bm{B}$ and $\bm{C}$ are obtained using (2).

Proof.

We apply PQP updates (21) to each row of $\bm{A}$ . Let $\bm{v}_{j:}$ and $\bm{A}_{j:}$ be the $j$ -th row of $\mathcal{M}_{\left(1\right)}$ and $\bm{A}$ , respectively. Fixing the current factor estimates $\bm{B}$ and $\bm{C}$ , the optimization with respect to $\bm{A}_{j:}$ follows from (14):

[TABLE]

Now the updates (21) can be applied immediately, where we set $\bm{\gamma}=0$ and $\bm{\Phi}$ according to Theorem 7 in § 5. Using the identity $\left(\bm{C}\odot\bm{B}\right)^{\top}\left(\bm{C}\odot\bm{B}\right)=\left(\bm{C}^{\top}\bm{C}\right)*\left(\bm{B}^{\top}\bm{B}\right)$ and performing the updates simultaneously for all rows of $\bm{A}$ gives (23).

∎
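The Khatri-Rao identity used in the proof can be checked numerically. The sketch below is only an illustration; the helper `khatri_rao` and the matrix sizes are chosen here for the example. It also shows why the identity matters in practice: the $k$ -by- $k$ Gram matrix is obtained from two small products without forming the tall matrix $\bm{C}\odot\bm{B}$ .

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: column r equals kron(C[:, r], B[:, r])."""
    return np.einsum('ir,jr->ijr', C, B).reshape(C.shape[0] * B.shape[0], -1)

rng = np.random.default_rng(0)
B = rng.random((5, 3))
C = rng.random((4, 3))

KR = khatri_rao(C, B)               # tall matrix of shape (20, 3)
lhs = KR.T @ KR                     # Gram matrix computed from the tall matrix
rhs = (C.T @ C) * (B.T @ B)         # Hadamard product of two small Gram matrices
```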

4.7 Proposed approach

To summarize, the proposed approach, referred to as PTPQP, consists of three steps. Given the indices of anchor variables $\pi^{u}\cup\pi^{v}\cup\pi^{w}$ , the variables $\left[p\right]\backslash\left(\pi^{u}\cup\pi^{v}\cup\pi^{w}\right)$ are first evenly divided into $r$ partitions, and the anchor variables are added to each partition. The second step consists of forming and factorizing the sub-tensor of each partition using Algorithm 1; this step can be parallelized. Third, normalize the anchor matrix $\left[\bm{\theta}^{\pi^{u}\top},\bm{\theta}^{\pi^{v}\top},\bm{\theta}^{\pi^{w}\top}\right]^{\top}$ formed by the anchor variable parameters to have unit column Euclidean norm, and then use either (7) or (10) to match over the anchor matrix.
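The first step, partitioning the non-anchor variables and adding the anchors to every partition, might look as follows. This is a sketch under stated assumptions: variables are indexed from zero, the three anchor sets are passed as one combined list, and the function name `make_partitions` is hypothetical.

```python
import numpy as np

def make_partitions(p, anchors, r):
    """Split the non-anchor variables of {0, ..., p-1} into r near-even parts
    and prepend all anchor indices to each part."""
    anchors = np.asarray(sorted(set(anchors)))
    rest = np.setdiff1d(np.arange(p), anchors)          # non-anchor variables
    return [np.concatenate([anchors, part]) for part in np.array_split(rest, r)]

parts = make_partitions(p=25, anchors=[0, 1, 2], r=4)
sizes = [len(part) for part in parts]
```

Each partition can then be factorized on its own, and the shared anchor variables provide the common reference used to match components across partitions in the third step.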

Efficiency

Most of the computational cost is in the factorization. Consider one partition, and let $\mathcal{M}^{\pi^{j}\pi^{s}\pi^{t}}$ be the corresponding sub-tensor, whose size is $\prod\limits_{\pi\in\left\{\pi^{j},\pi^{s},\pi^{t}\right\}}\sum_{h\in\pi}d_{h}$ . The maximum number of categories for a variable is generally a constant for the GDLM. Under the smallest partitioning, this size is determined by the sub-tensor of anchor variables, i.e., $O\left(k^{3}\right)$ , which corresponds to $O\left(p/k\right)$ partitions. One benefit of PTPQP is that the number of sub-tensor factorizations is linear in $p$ due to the partitioned factorization; this results in significant efficiency gains when $p\gg k$ . Furthermore, PTPQP is easy to parallelize across multiple CPUs and machines, since each partition can be processed independently of the others.

5 Provable Guarantees

In this section, we state the main theoretical results of the proposed partitioned factorization and tensor PQP factorization.

5.1 Sufficient conditions for guaranteed matching

Theorem 5 and Theorem 6 state that when the anchor parameter matrices from two factorizations are “close”, the proposed matching algorithms obtain a consistent permutation.

Theorem 5.

Suppose that $\bm{\theta}_{j}$ is the ground-truth matrix for variable $j$ . Solving (7) results in a consistent permutation if for all factors $\widehat{\bm{\theta}}_{j}$ of variable $j$

[TABLE]

for all $h\in\left[k\right]$ , where $\bm{\bar{\theta}}_{jh}=\bm{\theta}_{jh}/\left\|\bm{\theta}_{jh}\right\|_{2}$ .

Proof.

For the smallest pairwise angle $\alpha_{\text{min}}$ between the columns of $\bm{\bar{\theta}}_{j}$ , we have that

[TABLE]

Denote by $\alpha$ the maximum angle between the column of a factorized parameter matrix $\widehat{\bm{\theta}}_{j}$ and the corresponding column of the ground-truth. It is sufficient to ensure that

[TABLE]

Consider any two columns $s\neq t$ of the ground-truth parameter matrix, and the corresponding perturbed columns $\left\{\widehat{\bm{\theta}}_{js},\widehat{\bm{\theta}}_{jt}\right\}$ and $\left\{\widehat{\bm{\theta}}_{js}^{\prime},\widehat{\bm{\theta}}_{jt}^{\prime}\right\}$ from two factorizations. We have that

[TABLE]

From (24), we have that

[TABLE]

as desired for (7) to work correctly. For the inner product of a perturbed column and the corresponding ground-truth column, it holds that

[TABLE]

Thus, a sufficient condition for (7) to yield the consistent permutation is

[TABLE]

which written in analytic form proves the theorem. ∎

Theorem 5 states that one obtains a consistent permutation by solving (7) if the columns of the ground-truth parameter matrix are well separated in angle and the factorized parameter matrix is near the ground truth in Frobenius norm. Thus, a good anchor variable for the partitioned factorization (5) is one whose parameter matrix has columns that are far apart in angle.

The bound in Theorem 5 can be made sharp for certain $\bm{\theta}_{j}$ , and thus the smallest angle matching algorithm has general guarantees only when the perturbation is small, i.e., the relative error ratio is less than $1-\sqrt{2+\sqrt{2}}/2\approx 1/13$ .
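The idea behind smallest angle matching can be sketched as follows. This is an illustration under assumptions, not a transcription of (7): each column of one factorized matrix is paired with the column of the other that has the largest cosine similarity after normalization. The function name `match_by_angle` and the toy matrices are hypothetical. The last line evaluates the relative error threshold quoted above.

```python
import numpy as np

def match_by_angle(theta, theta2):
    """For each column of theta2, return the index of the column of theta
    at the smallest angle (largest cosine)."""
    a = theta / np.linalg.norm(theta, axis=0)       # unit-norm columns
    b = theta2 / np.linalg.norm(theta2, axis=0)
    return np.argmax(a.T @ b, axis=0)

theta = 0.7 * np.eye(4, 3) + 0.1                    # columns well separated in angle
perm = np.array([2, 0, 1])
rng = np.random.default_rng(0)
theta2 = theta[:, perm] + 0.01 * rng.standard_normal((4, 3))   # permuted, slightly perturbed

recovered = match_by_angle(theta, theta2)
threshold = 1 - np.sqrt(2 + np.sqrt(2)) / 2         # about 0.0761, close to 1/13
```

An independent argmax per column is not guaranteed to return a valid permutation when the perturbation is large, which is the regime excluded by the condition of Theorem 5.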

Theorem 6.

Suppose that $\bm{\theta}$ and $\bm{\theta}^{\prime}$ are two factorized parameter matrices for a variable. Solving (10) results in a consistent permutation $\psi$ , if

[TABLE]

with

[TABLE]

where the error matrix is defined as $\bm{E}=\left(\psi\bm{\theta}\right)^{\top}\left(\bm{\theta}^{\prime}-\psi\bm{\theta}\right)$ , and $\sigma_{j}\left(\cdot\right)$ denotes the $j$ -th largest singular value.

The proof of Theorem 6 follows from the following two lemmas.

Lemma 1.

Suppose that $\psi$ is the consistent permutation of $\bm{\theta}$ with respect to $\bm{\theta}^{\prime}$ . Formula (10) is guaranteed to recover $\psi$ , if

[TABLE]

where $\bm{U}$ and $\bm{V}$ are the left and right singular matrices of $\left(\psi\bm{\theta}\right)^{\top}\bm{\theta}^{\prime}$ .

Proof.

We need to show that (10) yields $\psi$ for the orthogonal Procrustes problem $\min\limits_{\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}}\left\|\bm{\theta}\bm{\Psi}-\bm{\theta}^{\prime}\right\|_{F}$ . From the solution (9), it is easy to show that the minimizer $\bm{\Psi}^{*}$ of $\min\limits_{\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}}\left\|\bm{\theta}^{\prime}\bm{\Psi}-\bm{\theta}\right\|_{F}$ and the minimizer $\bm{\Psi}^{\prime}$ of $\min\limits_{\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}}\left\|\left(\psi\bm{\theta}\right)\bm{\Psi}-\bm{\theta}^{\prime}\right\|_{F}$ satisfy

[TABLE]

Note that $\bm{\Psi}^{*\top}$ is the desired minimizer of $\min\limits_{\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}}\left\|\bm{\theta}\bm{\Psi}-\bm{\theta}^{\prime}\right\|_{F}$ , and thus it remains to show that (10) gives $\psi$ when applied to $\bm{\Psi}^{*\top}$ , or equivalently

[TABLE]

Since the row and column vectors of $\bm{\Psi}^{*}$ have unit Euclidean norm, the following dual statements imply each other

[TABLE]

if the condition of Lemma 1 holds. Under this condition, we also have that (10) gives the identity permutation $\left[1,2,\cdots,k\right]$ for the orthogonal Procrustes problem $\min\limits_{\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}}\left\|\left(\psi\bm{\theta}\right)\bm{\Psi}-\bm{\theta}^{\prime}\right\|_{F}$ . Thus, applying (10) to both sides of (26) yields

[TABLE]

which implies (27) from (28). ∎

Lemma 2.

(Mathias) Suppose that $\bm{A}\in\mathbb{R}^{n\times n}$ is nonsingular. Then for any $\bm{E}\in\mathbb{R}^{n\times n}$ with $\sigma_{1}\left(\bm{E}\right)<\sigma_{n}\left(\bm{A}\right)$ and any unitarily invariant norm $\left\|\cdot\right\|$ , it holds that

[TABLE]

where $\mu\left(\cdot\right)$ represents the unitary factor of the polar decomposition, and ${\left|\!\left|\!\left|\cdot\right|\!\right|\!\right|}_{k}$ is the Ky Fan $k$ -norm.

Proof of Theorem 6.

Let $\bm{H}=\left(\psi\bm{\theta}\right)^{\top}\psi\bm{\theta}$ , which has the same singular values as $\bm{\theta}^{\top}\bm{\theta}$ . Denote by $\mu\left(\cdot\right)$ the unitary factor of the polar decomposition. Using the fact that $\mu\left(\bm{H}\right)=\bm{I}$ , the sufficient condition of Lemma 1 is restated as

[TABLE]

Also note that

[TABLE]

Thus, it suffices to require the right-hand side to be less than $1-\sqrt{2}/2$ . From Lemma 2, this can be achieved by letting

[TABLE]

∎

The first condition in Theorem 6 requires that at least one of $\bm{\theta}$ and $\bm{\theta}^{\prime}$ have full column rank. We may exchange $\bm{\theta}$ and $\bm{\theta}^{\prime}$ in Theorem 6 to first obtain the consistent permutation of $\bm{\theta}^{\prime}$ with respect to $\bm{\theta}$ ; $\psi$ then follows immediately.

Theorem 6 states that solving (10) recovers a consistent permutation whenever the spectral norm of the error is small compared to the smallest singular value of $\bm{\theta}^{\top}\bm{\theta}$ . This is especially useful for $\bm{\theta}\in\mathbb{R}^{d\times k}$ with the number of rows $d$ much larger than the number of columns $k$ . In particular, for $\bm{\theta}$ with independent and identically distributed subgaussian entries, $\sigma_{k}\left(\bm{\theta}^{\top}\bm{\theta}\right)$ is at least of the order $\left(\sqrt{d}-\sqrt{k-1}\right)^{2}$ [28].
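Orthogonal Procrustes matching can be sketched in a few lines. The SVD solution of $\min_{\bm{\Psi}^{\top}\bm{\Psi}=\bm{I}}\left\|\bm{\theta}\bm{\Psi}-\bm{\theta}^{\prime}\right\|_{F}$ is standard; reading the permutation off the dominant entry of each column of $\bm{\Psi}$ is an assumption made here about the rounding step in (10). The function name `procrustes_match` and the synthetic matrices are hypothetical.

```python
import numpy as np

def procrustes_match(theta, theta2):
    """Solve min ||theta @ Psi - theta2||_F over orthogonal Psi via SVD,
    then read a column assignment off the dominant entries of Psi."""
    U, _, Vt = np.linalg.svd(theta.T @ theta2)
    Psi = U @ Vt                                    # orthogonal minimizer
    return np.argmax(np.abs(Psi), axis=0), Psi      # assumed rounding rule

rng = np.random.default_rng(0)
theta = rng.random((20, 4))                         # d = 20 rows, k = 4 columns
perm = np.array([3, 1, 0, 2])
theta2 = theta[:, perm] + 1e-3 * rng.standard_normal((20, 4))

recovered, Psi = procrustes_match(theta, theta2)
```

With $d$ much larger than $k$ and a small perturbation, $\bm{\Psi}$ is close to a permutation matrix, so the dominant entry in each column is unambiguous.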

5.2 Convergence

The following theorem states a sufficient condition for PQP to achieve a linear convergence rate. The theorem statement and proof are adapted from results stated in [5]; the proof in [5] overlooks a required condition on $\bm{\phi}$ , and the condition $\bm{\gamma}\succeq\text{diag}\left(Q_{jj}\right)$ in the original proof is unnecessary.

Theorem 7.

The PQP algorithm given by (21) monotonically decreases the objective (20) and has linear convergence, if

[TABLE]

where $\lambda_{\text{min}}\left(\cdot\right)$ is the smallest eigenvalue.

Proof.

First, the condition $\bm{\gamma}\geq\left(-\bm{Q}\right)_{+}\bm{1}$ suffices to ensure that the updates monotonically decrease (20) [6]. Thus, it remains to show the condition on $\bm{\phi}$ . Suppose that the $i$ -th element of the optimum $\bm{x}^{*}$ is perturbed by a non-zero $\epsilon>-x_{i}^{*}$ . Let $\bm{x}=\bm{x}^{*}+\epsilon\bm{e}_{i}$ , and let $\bm{x}^{\prime}$ be the result of applying one update to $\bm{x}$ . Denote the $i$ -th row of $\bm{Q}^{+}$ , $\bm{Q}^{-}$ , and $\bm{Q}$ respectively by $\bm{P}_{i}$ , $\bm{N}_{i}$ , and $\bm{Q}_{i}$ ; then it holds that $\bm{P}_{i}\bm{e}_{i}=Q_{ii}+\gamma_{i}$ and $\bm{N}_{i}\bm{e}_{i}=\gamma_{i}$ by definition. We now consider the ratio of errors between successive iterations:

[TABLE]

From the KKT first-order optimality condition $x_{i}^{*}\left(\bm{Q}_{i}\bm{x}^{*}+z_{i}\right)=0$ , we simplify the ratio as

[TABLE]

Observe that the denominator is nonnegative. We also have that the denominator is greater than the numerator using the KKT optimality condition $\bm{Q}_{i}\bm{x}^{*}+z_{i}\geq 0$ :

[TABLE]

To achieve linear convergence rate, we may enforce the ratio to be less than one. Equivalently,

[TABLE]

It suffices to set

[TABLE]

To eliminate $\bm{x}^{*}$ from (31), we use the following inequality

[TABLE]

where the right-hand side is the negative of the minimum of the unconstrained problem, assuming that $\bm{Q}$ is non-singular. If $\bm{Q}$ is singular, then $\bm{x}^{*}$ can be unbounded. We further simplify the inequality using the KKT optimality conditions as

[TABLE]

Combining with (31) completes the proof. ∎

6 Results on real and simulated data

We compare the proposed algorithm ptpqp with state-of-the-art approaches including: 1) the tensor power method tpm [2] and matrix simultaneous diagonalization, nojd0 and nojd1 [20]—two general tensor decomposition methods; 2) nonnegative tensor factorization hals [17]; and 3) generalized method of moments meld [39]. We use the online code provided by the corresponding authors.

6.1 Learning GDLMs on simulated data

We adapt a simulation study from [39] to compare runtime and accuracy of parameter estimation. We consider a GDLM where each variable takes categorical values $\left\{0,1,2,3\right\}$ and the parameters of the Dirichlet mixing distribution are $\{\alpha_{j}=0.1\}_{j=1}^{k}$ . We initially consider $25$ variables. The true parameters for each hidden component $h$ are drawn from the Dirichlet distribution $\text{Dir}\left(0.5,0.5,0.5,0.5\right)$ . The resulting moment estimator is a $100$ -by- $100$ -by- $100$ tensor. We add noise by replacing a fraction $\delta$ of the observations with draws from a discrete uniform distribution. We vary the number of samples $n=100,500,1000,5000$ , the number of components $k=3,5,10,20$ , and the contamination $\delta=0,0.05,0.1$ . Across these settings we found that the empirical third-order estimator typically exhibits between $20\%$ and $50\%$ negative entries.
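A data generator for this setup could be sketched as follows. It is not the authors' simulation code and rests on several assumptions: each observation of variable $j$ is drawn from the categorical distribution $\bm{\theta}_{j}\bm{x}$ given the Dirichlet membership vector $\bm{x}$ , contamination is applied independently to each entry with probability $\delta$ (rather than to an exact fraction), and the function name `simulate_gdlm` and its defaults are hypothetical.

```python
import numpy as np

def simulate_gdlm(n, p=25, d=4, k=3, alpha=0.1, delta=0.05, seed=0):
    """Draw n samples of p categorical variables with d categories each."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(d, 0.5), size=(p, k))   # shape (p, k, d)
    theta = np.transpose(theta, (0, 2, 1))                # (p, d, k); columns sum to 1
    X = rng.dirichlet(np.full(k, alpha), size=n)          # memberships, shape (n, k)
    probs = np.einsum('jdh,nh->njd', theta, X)            # category probabilities (n, p, d)
    cdf = np.cumsum(probs, axis=-1)
    u = rng.random((n, p, 1))
    B = np.minimum((u > cdf).sum(axis=-1), d - 1)         # inverse-CDF sampling
    mask = rng.random((n, p)) < delta                     # entries to contaminate
    B[mask] = rng.integers(0, d, size=int(mask.sum()))    # discrete uniform noise
    return B, theta, X

B, theta, X = simulate_gdlm(n=500)
```

With $25$ variables and $4$ categories each, one-hot encoding the observations gives $100$ indicator coordinates, which is where the $100$ -by- $100$ -by- $100$ moment tensor comes from.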

Accuracy of inference

Accuracy is measured by root-mean-square error (RMSE), which we compare across algorithms as a function of the number of components for various sample sizes and levels of contamination (see Figure 1). Both hals and ptpqp are consistently among the top estimators, and ptpqp outperforms hals as $n$ grows. For small sample sizes and many hidden components, meld achieves the smallest RMSE. The RMSE of tpm is relatively large, probably due to the whitening technique used to approximately transform the nonorthogonal factorization into an orthogonal one; see [33, 23]. The most relevant observation is that ptpqp outperforms other methods for large, noisy data.

Computational cost

We examined how runtime scales as a function of the number of partitions. For the same model we set $p=1000$ variables and $n=1000$ samples. The tensor is now $4000$ -by- $4000$ -by- $4000$ . We evaluated the runtime of ptpqp (without parallelization) with the number of partitions set to $\left\{30,40,50,100,200\right\}$ . On a laptop with an Intel CPU and 8GB memory, ptpqp with $100$ partitions completes within $3.5$ min, $4$ min, and $5$ min for $k=4,8,12$ , respectively. In addition, the runtime monotonically decreases with the number of partitions. Further speedups can be obtained by parallelizing the factorization of partitions across multiple CPUs or machines.
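A quick back-of-the-envelope calculation shows why the full tensor cannot be handled directly on such a machine. The only assumption here is dense storage in 8-byte floating point; the variable names are illustrative.

```python
p, d = 1000, 4                      # variables and categories per variable
side = p * d                        # length of each tensor mode
entries = side ** 3                 # number of entries in the full moment tensor
bytes_dense = entries * 8           # assuming 8 bytes per entry (float64)
gigabytes = bytes_dense / 1e9       # 512.0 GB, versus 8 GB of laptop memory
```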

6.2 Predicting crowdsourced labels

In [38], a combination of EM and tensor decompositions was used to predict crowdsourcing annotations. The task is to predict the true label given incomplete and noisy observations from a set of workers; this is a mixed membership problem [9]. In [38] a third-order tensor estimator was proposed to obtain an initial estimate for the EM algorithm. We compare the predictive performance on five data sets of several tensor decomposition methods as well as the EM algorithm initialized with majority voting by the workers (MV+EM). The fraction of incorrect predictions and the size of each dataset are reported in Table 1. Note that ptpqp matches or outperforms the other tensor methods on all but one dataset, and even outperforms MV+EM on two datasets.

7 Conclusions

We proposed an efficient algorithm for learning mixed membership models based on the idea of partitioned factorizations. The key challenge is to consistently match the partitioned parameters with the hidden components. We provided sufficient conditions to ensure consistency. In addition, we developed a nonnegative approximation to handle the negative entries in the empirical method of moments estimators, a problem not addressed by several recent tensor methods. Results on synthetic and real data corroborate that the proposed approach achieves better inference accuracy and computational efficiency than state-of-the-art methods.

Code

Code for all the simulations is available from Zilong Tan’s GitHub repository: https://github.com/ZilongTan/ptpqp.

Acknowledgements

Z.T. would like to thank Rong Ge for sharing helpful insights. S.M. would like to thank Lek-Heng Lim for insights. Z.T. would like to acknowledge the support of grants NSF CNS-1423128, NSF IIS-1423124, and NSF CNS-1218981. S.M. would like to acknowledge the support of grants NSF IIS-1546331, NSF DMS-1418261, NSF IIS-1320357, NSF DMS-1045153, and NSF DMS-1613261.

Appendix A Dirichlet Moments

For a Dirichlet random vector $\bm{x}$ with concentration parameters $\bm{\alpha}$ , the component moments can be derived from the integral

[TABLE]

where $\alpha_{0}=\sum_{i\geq 1}\alpha_{i}$ . Comparing the second and third order component moments, we arrive at the following cross-moments:

[TABLE]

To express the parameters as third-order cross-moments, first observe that the following holds for a Dirichlet random vector $\bm{x}$ :

[TABLE]

This is an immediate result of the second-order component moments. Combining with (34) yields

[TABLE]
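As a sanity check on moments of this kind, the standard closed forms for products of distinct Dirichlet components, $\mathbb{E}\left[x_{i}\right]=\alpha_{i}/\alpha_{0}$ , $\mathbb{E}\left[x_{i}x_{j}\right]=\alpha_{i}\alpha_{j}/\left(\alpha_{0}\left(\alpha_{0}+1\right)\right)$ , and $\mathbb{E}\left[x_{i}x_{j}x_{l}\right]=\alpha_{i}\alpha_{j}\alpha_{l}/\left(\alpha_{0}\left(\alpha_{0}+1\right)\left(\alpha_{0}+2\right)\right)$ for distinct $i,j,l$ , can be compared against Monte Carlo averages. The concentration parameters and sample size below are arbitrary choices for illustration, and the sketch is not claimed to reproduce the numbered equations of this appendix.

```python
import numpy as np

alpha = np.array([0.5, 1.0, 1.5])       # illustrative concentration parameters
a0 = alpha.sum()
rng = np.random.default_rng(0)
x = rng.dirichlet(alpha, size=400_000)

# Monte Carlo estimates of first-, second-, and third-order cross-moments
m1_mc = x[:, 0].mean()
m2_mc = (x[:, 0] * x[:, 1]).mean()
m3_mc = (x[:, 0] * x[:, 1] * x[:, 2]).mean()

# Closed forms for distinct indices
m1 = alpha[0] / a0
m2 = alpha[0] * alpha[1] / (a0 * (a0 + 1))
m3 = alpha.prod() / (a0 * (a0 + 1) * (a0 + 2))
```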

A.1 Derivation of moment estimators

Our goal is to derive the estimators of parameter vectors $\bm{\theta}_{j}$ for each variable $j$ using the first- and second- order empirical cross-moments of $\bm{b}_{ij}$ . In GDLM, the expectation of variable $j$ conditioned on $\bm{x}$ is written

[TABLE]

Thus, the expected observation of variable $j$ is given by

[TABLE]

Now consider two variables $\bm{b}_{ij}$ and $\bm{b}_{is}$ which are generated with the same latent factors $\bm{x}$ . Combining (36) and (33), we obtain

[TABLE]

For three variables $\bm{b}_{ij}$ , $\bm{b}_{is}$ , and $\bm{b}_{it}$ , we can write $\mathbb{E}\left[\bm{b}_{ij}\times\bm{b}_{is}\times\bm{b}_{it}\right]=\mathbb{E}\left[\bm{x}\times\bm{x}\times\bm{x}\right]\times_{1}\bm{\theta}_{j}\times_{2}\bm{\theta}_{s}\times_{3}\bm{\theta}_{t}$ . Using (35), we establish that

[TABLE]

Appendix B Approximate Orthogonalization in the Tensor Power Method

TPM requires that the tensor being decomposed be symmetric and that the factor matrices be orthogonal. Specifically, it performs the following decomposition

[TABLE]

where $\bm{u}_{i}$ are orthonormal vectors. Thus, TPM does not immediately apply to the general CP decomposition (1).

The general resolution is to first use the symmetric tensor embedding [27, 3], forming a larger symmetric tensor $\mathcal{M}_{3}$ that contains the asymmetric tensor to be decomposed. The formed $\mathcal{M}_{3}$ is a sparse $\left(\sum_{i=1}^{p}d_{i}\right)$ -by- $\left(\sum_{i=1}^{p}d_{i}\right)$ -by- $\left(\sum_{i=1}^{p}d_{i}\right)$ tensor in which $7/9$ of the entries are zero. The space and computational complexities rapidly become prohibitive as the number of variables $p$ and the category counts $d_{j}$ grow.

Next, TPM requires an additional empirical second-order estimator $\widehat{\mathcal{M}}_{2}$ for orthogonalizing the factor matrices of $\mathcal{M}_{3}$ to obtain $\mathcal{M}_{3}^{\prime}$ [2]. This is done by computing the whitening transformation from $\widehat{\mathcal{M}}_{2}$ . However, the whitening technique based on the empirical $\widehat{\mathcal{M}}_{2}$ is often a cause of suboptimal performance [33, 23].

Bibliography

[1] A. Anandkumar, D. P. Foster, D. J. Hsu, S. M. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, pages 917–925, 2012.
[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 15(1):2773–2832, Jan. 2014.
[3] A. Anandkumar, P. Jain, Y. Shi, and U. N. Niranjan. Tensor vs. matrix methods: Robust tensor decomposition under block sparse perturbations. In AISTATS, pages 268–276, 2016.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[5] M. Brand and D. Chen. Parallel quadratic programming for image processing. In IEEE International Conference on Image Processing (ICIP), pages 2261–2264, Sept. 2011.
[6] M. Brand, V. Shilpiekandula, and S. Bortoff. A parallel quadratic programming algorithm for model predictive control. In World Congress of the International Federation of Automatic Control (IFAC), volume 18, Aug. 2011.
[7] E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272–1299, Dec. 2012.
[8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Communications Engineering. Elsevier, 2010.
