Dirichlet-vMF Mixture Model

Shaohua Li

arXiv:1702.07495·cs.CL·February 27, 2017

TL;DR

This paper introduces VMFMix, a Dirichlet-vMF mixture model that, like LDA, captures co-occurrence patterns across documents, but defines the topic-word distribution on a continuous hypersphere so that topic embeddings can be derived from multiple sets of embedding vectors.

Contribution

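The paper proposes VMFMix, a simplification of the Bayesian vMF mixture model of Gopal and Yang that removes the priors on the vMF means and concentrations, together with an efficient variational EM inference algorithm for deriving topic embeddings on a hypersphere.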

Findings

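01

On 20News and Reuters, VMFMix matches or exceeds Doc2Vec, LDA, sLDA and LFTM in macro-averaged F1, but trails BOW, MeanWV, TWE and TopicVec.

02

Like LDA, the model captures co-occurrence patterns across multiple documents, but with topics defined on a continuous hypersphere.

03

The variational lower bound increases only slightly after 10~20 iterations, which suggests that VMFMix might easily get stuck in local optima.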

Abstract

This document describes the multi-document von Mises-Fisher mixture model with a Dirichlet prior, referred to as VMFMix. VMFMix is analogous to Latent Dirichlet Allocation (LDA) in that both can capture the co-occurrence patterns across multiple documents. The difference is that in VMFMix, the topic-word distribution is defined on a continuous n-dimensional hypersphere. Hence VMFMix can be used to derive topic embeddings, i.e., representative vectors, from multiple sets of embedding vectors. An efficient variational Expectation-Maximization inference algorithm is derived. The performance of VMFMix on two document classification tasks is reported, with some preliminary analysis.

Tables

Table 1: Performance on multi-class text classification (macro-averaged precision, recall and F1). The best score in each column is marked with *.

Method	20News Prec	20News Rec	20News F1	Reuters Prec	Reuters Rec	Reuters F1
BOW	69.1	68.5	68.6	92.5*	90.3	91.1
LDA	61.9	61.4	60.3	76.1	74.3	74.8
sLDA	61.4	60.9	60.9	88.3	83.3	85.1
LFTM	63.5	64.8	63.7	84.6	86.3	84.9
MeanWV	70.4	70.3	70.1	92.0	89.6	90.5
Doc2Vec	56.3	56.6	55.4	84.4	50.0	58.5
TWE	69.5	69.3	68.8	91.0	89.1	89.9
TopicVec	71.3*	71.3*	71.2*	92.5*	92.1*	92.2*
VMFMix	63.8	63.9	63.7	87.9	88.7	88.0

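Equations

Complete-data likelihood:

$$p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\Theta}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})=\prod_{i}\textrm{Dir}(\boldsymbol{\theta}_{i}\mid\alpha)\prod_{j}\theta_{i,z_{ij}}\,\textrm{vMF}(\boldsymbol{x}_{ij}\mid\boldsymbol{\mu}_{z_{ij}},\kappa_{z_{ij}}). \tag{1}$$

Incomplete-data likelihood:

$$p(\boldsymbol{X}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})=\int d\boldsymbol{\Theta}\cdot\prod_{i}\textrm{Dir}(\boldsymbol{\theta}_{i}\mid\alpha)\prod_{j}\sum_{k}\theta_{ik}\,\textrm{vMF}(\boldsymbol{x}_{ij}\mid\boldsymbol{\mu}_{k},\kappa_{k}). \tag{2}$$

Variational lower bound:

$$\log p(\boldsymbol{X}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})\geq\mathbb{E}_{q}\bigl[\log p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\Theta}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})-\log q(\boldsymbol{Z},\boldsymbol{\Theta})\bigr]=\mathcal{L}(q,\{\boldsymbol{\mu}_{k},\kappa_{k}\}). \tag{3}$$

Variational distribution:

$$q(\boldsymbol{Z},\boldsymbol{\Theta})=\prod_{i}\Bigl\{\textrm{Dir}(\boldsymbol{\theta}_{i}\mid\boldsymbol{\phi}_{i})\prod_{j}\textrm{Cat}(z_{ij}\mid\boldsymbol{\pi}_{ij})\Bigr\}. \tag{4}$$

Lower bound in expanded form:

$$\begin{aligned}\mathcal{L}(q,\{\boldsymbol{\mu}_{k},\kappa_{k}\})&=\textrm{const}+\mathcal{H}(q)+\mathbb{E}_{q}\Bigl[(\alpha-1)\sum_{i,k}\log\theta_{ik}+\sum_{i,j,k}\delta(z_{ij}=k)\bigl(\log\theta_{ik}+\log c_{d}(\kappa_{k})+\kappa_{k}\boldsymbol{\mu}_{k}^{\top}\boldsymbol{x}_{ij}\bigr)\Bigr]\\&=\textrm{const}+\mathcal{H}(q)+\sum_{i,k}(\alpha-1+n_{i\cdot k})\bigl(\psi(\phi_{ik})-\psi(\phi_{i0})\bigr)+\sum_{k}\Bigl(n_{\cdot\cdot k}\cdot\log c_{d}(\kappa_{k})+\kappa_{k}\boldsymbol{\mu}_{k}^{\top}\boldsymbol{r}_{k}\Bigr),\end{aligned} \tag{5}$$

where

$$n_{i\cdot k}=\sum_{j}\pi_{ijk},\qquad n_{\cdot\cdot k}=\sum_{i}n_{i\cdot k},\qquad\boldsymbol{r}_{k}=\sum_{i,j}\pi_{ijk}\,\boldsymbol{x}_{ij},\qquad\phi_{i0}=\sum_{k}\phi_{ik}.$$

Entropy of $q$:

$$\begin{aligned}\mathcal{H}(q)&=\sum_{i}\mathbb{E}_{q}\Bigl[\sum_{k}\log\Gamma(\phi_{ik})-\log\Gamma(\phi_{i0})-\sum_{k}(\phi_{ik}-1)\log\theta_{ik}-\sum_{j,k}\delta(z_{ij}=k)\log\pi_{ijk}\Bigr]\\&=\sum_{i}\Bigl(\sum_{k}\log\Gamma(\phi_{ik})-\log\Gamma(\phi_{i0})-\sum_{k}(\phi_{ik}-1)\psi(\phi_{ik})+(\phi_{i0}-K)\psi(\phi_{i0})-\sum_{j,k}\pi_{ijk}\log\pi_{ijk}\Bigr).\end{aligned}$$

E-step:

$$\pi_{ijk}\propto\exp\bigl\{\psi(\phi_{ik})+\log c_{d}(\kappa_{k})+\kappa_{k}\boldsymbol{\mu}_{k}^{\top}\boldsymbol{x}_{ij}\bigr\},\qquad\phi_{ik}=\alpha+n_{i\cdot k}.$$

M-step:

$$\boldsymbol{\mu}_{k}=\frac{\boldsymbol{r}_{k}}{\|\boldsymbol{r}_{k}\|},\qquad\bar{r}_{k}=\frac{\|\boldsymbol{r}_{k}\|}{n_{\cdot\cdot k}},\qquad\kappa_{k}=\frac{\bar{r}_{k}d-\bar{r}_{k}^{3}}{1-\bar{r}_{k}^{2}}.$$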

Full text

Dirichlet-vMF Mixture Model

Shaohua Li

National University of Singapore

We present a simplification of the Bayesian vMF mixture model proposed in [2] (see footnote 1). For computational efficiency, the priors on the vMF means $\{\boldsymbol{\mu}_{k}\}$ and on the vMF concentrations $\{\kappa_{k}\}$ are removed. This model is referred to as VMFMix.

Footnote 1: This model reappears in [4] under the name “mix-vMF topic model”. However, [4] only offers a sampling-based inference scheme, which is usually less accurate than the EM algorithm presented in this document.

A Python implementation of VMFMix is available at https://github.com/askerlee/vmfmix.

1 Model Specification

The generative process is as follows:

1. $\boldsymbol{\theta}_{i}\sim\textrm{Dir}(\alpha)$;

2. $z_{ij}\sim\textrm{Cat}(\boldsymbol{\theta}_{i})$;

3. $\boldsymbol{x}_{ij}\sim\textrm{vMF}(\boldsymbol{\mu}_{z_{ij}},\kappa_{z_{ij}})$.

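Here $i$ indexes documents and $j$ indexes the embedding vectors within a document, $\alpha$ is a hyperparameter, and $\{\boldsymbol{\mu}_{k},\kappa_{k}\}$ are the parameters of the mixture components, which are to be learned.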
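The three generative steps can be simulated directly. The following is a minimal sketch, not taken from the reference implementation: the function names `sample_vmf` and `generate_corpus` are hypothetical, a symmetric Dirichlet with scalar $\alpha$ is assumed, and the vMF draws use Wood's rejection sampler, a standard technique that this document does not itself describe.

```python
import numpy as np

def sample_vmf(mu, kappa, rng):
    """Draw one unit vector from vMF(mu, kappa) with Wood's rejection sampler.

    Assumes mu is a unit vector of dimension d >= 2 and kappa > 0.
    """
    d = mu.shape[0]
    # Rejection-sample w, the component of the draw along the mean direction.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0**2)
    while True:
        z = rng.beta((d - 1) / 2.0, (d - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * w + (d - 1) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
            break
    # Uniform direction in the (d-1)-dimensional subspace orthogonal to e_1.
    v = rng.normal(size=d - 1)
    v /= np.linalg.norm(v)
    x = np.concatenate(([w], np.sqrt(max(0.0, 1.0 - w**2)) * v))
    # Householder reflection that maps e_1 onto mu.
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = e1 - mu
    norm_u = np.linalg.norm(u)
    if norm_u > 1e-12:
        u /= norm_u
        x = x - 2.0 * np.dot(u, x) * u
    return x

def generate_corpus(mus, kappas, alpha, doc_lengths, seed=0):
    """Simulate steps 1-3 for documents with the given (positive) lengths."""
    rng = np.random.default_rng(seed)
    K = mus.shape[0]
    docs, assignments = [], []
    for n_i in doc_lengths:
        theta = rng.dirichlet(np.full(K, alpha))      # step 1: theta_i ~ Dir(alpha)
        z = rng.choice(K, size=n_i, p=theta)          # step 2: z_ij ~ Cat(theta_i)
        x = np.stack([sample_vmf(mus[k], kappas[k], rng) for k in z])  # step 3
        docs.append(x)
        assignments.append(z)
    return docs, assignments
```

For example, `generate_corpus(np.eye(4)[:2], np.array([50.0, 50.0]), alpha=0.5, doc_lengths=[30, 40])` yields two documents of unit vectors concentrated around the first two coordinate axes.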

2 Model Likelihood and Inference

Given parameters $\{\boldsymbol{\mu}_{k},\kappa_{k}\}$ , the complete-data likelihood of a dataset $\{\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\Theta}\}=\{\boldsymbol{x}_{ij},z_{ij},\boldsymbol{\theta}_{i}\}$ is:

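$$p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\Theta}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})=\prod_{i}\textrm{Dir}(\boldsymbol{\theta}_{i}\mid\alpha)\prod_{j}\theta_{i,z_{ij}}\,\textrm{vMF}(\boldsymbol{x}_{ij}\mid\boldsymbol{\mu}_{z_{ij}},\kappa_{z_{ij}}). \tag{1}$$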
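The vMF factor in the likelihood has the standard density $\textrm{vMF}(\boldsymbol{x}\mid\boldsymbol{\mu},\kappa)=c_{d}(\kappa)\exp(\kappa\boldsymbol{\mu}^{\top}\boldsymbol{x})$ with $c_{d}(\kappa)=\kappa^{d/2-1}/\bigl((2\pi)^{d/2}I_{d/2-1}(\kappa)\bigr)$, where $I_{v}$ is the modified Bessel function of the first kind. The sketch below evaluates it in log space. The helper names are hypothetical, and the power-series evaluation of $\log I_{v}$ is one simple choice made here to avoid extra dependencies; it is not necessarily what the reference implementation does.

```python
import math
import numpy as np

def log_bessel_i(v, x, extra_terms=60):
    """log I_v(x) for x > 0, from the power series summed in log space."""
    n_terms = int(x) + extra_terms  # the series terms peak near m ~ x / 2
    m = np.arange(n_terms)
    log_terms = (2 * m + v) * math.log(x / 2.0) - np.array(
        [math.lgamma(k + 1.0) + math.lgamma(k + v + 1.0) for k in m]
    )
    top = log_terms.max()
    return top + math.log(np.exp(log_terms - top).sum())

def log_vmf_normalizer(kappa, d):
    """log c_d(kappa), the log normalization constant of the vMF density."""
    v = d / 2.0 - 1.0
    return v * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(v, kappa)

def log_vmf_density(x, mu, kappa):
    """log vMF(x | mu, kappa) for unit vectors x and mu of equal dimension."""
    d = mu.shape[0]
    return log_vmf_normalizer(kappa, d) + kappa * float(np.dot(mu, x))
```

For $d=3$ the constant has the closed form $c_{3}(\kappa)=\kappa/(4\pi\sinh\kappa)$, which gives a convenient sanity check for `log_vmf_normalizer`.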

The incomplete-data likelihood of $\boldsymbol{X}=\{\boldsymbol{x}_{ij}\}$ is obtained by integrating out the latent variables $\boldsymbol{Z},\boldsymbol{\Theta}$:

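$$p(\boldsymbol{X}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})=\int d\boldsymbol{\Theta}\cdot\prod_{i}\textrm{Dir}(\boldsymbol{\theta}_{i}\mid\alpha)\prod_{j}\sum_{k}\theta_{ik}\,\textrm{vMF}(\boldsymbol{x}_{ij}\mid\boldsymbol{\mu}_{k},\kappa_{k}). \tag{2}$$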

Equation (2) is intractable, so we instead seek a variational lower bound:

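$$\log p(\boldsymbol{X}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})\geq\mathbb{E}_{q}\bigl[\log p(\boldsymbol{X},\boldsymbol{Z},\boldsymbol{\Theta}\mid\alpha,\{\boldsymbol{\mu}_{k},\kappa_{k}\})-\log q(\boldsymbol{Z},\boldsymbol{\Theta})\bigr]=\mathcal{L}(q,\{\boldsymbol{\mu}_{k},\kappa_{k}\}). \tag{3}$$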

It is natural to use the following variational distribution to approximate the posterior distribution of $\boldsymbol{Z},\boldsymbol{\Theta}$ :

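$$q(\boldsymbol{Z},\boldsymbol{\Theta})=\prod_{i}\Bigl\{\textrm{Dir}(\boldsymbol{\theta}_{i}\mid\boldsymbol{\phi}_{i})\prod_{j}\textrm{Cat}(z_{ij}\mid\boldsymbol{\pi}_{ij})\Bigr\}. \tag{4}$$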

Then the variational lower bound is

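$$\begin{aligned}\mathcal{L}(q,\{\boldsymbol{\mu}_{k},\kappa_{k}\})&=\textrm{const}+\mathcal{H}(q)+\mathbb{E}_{q}\Bigl[(\alpha-1)\sum_{i,k}\log\theta_{ik}+\sum_{i,j,k}\delta(z_{ij}=k)\bigl(\log\theta_{ik}+\log c_{d}(\kappa_{k})+\kappa_{k}\boldsymbol{\mu}_{k}^{\top}\boldsymbol{x}_{ij}\bigr)\Bigr]\\&=\textrm{const}+\mathcal{H}(q)+\sum_{i,k}(\alpha-1+n_{i\cdot k})\bigl(\psi(\phi_{ik})-\psi(\phi_{i0})\bigr)+\sum_{k}\Bigl(n_{\cdot\cdot k}\cdot\log c_{d}(\kappa_{k})+\kappa_{k}\boldsymbol{\mu}_{k}^{\top}\boldsymbol{r}_{k}\Bigr),\end{aligned} \tag{5}$$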

where

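$$n_{i\cdot k}=\sum_{j}\pi_{ijk},\qquad n_{\cdot\cdot k}=\sum_{i}n_{i\cdot k},\qquad\boldsymbol{r}_{k}=\sum_{i,j}\pi_{ijk}\,\boldsymbol{x}_{ij},$$

$c_{d}(\kappa)$ is the normalization constant of the $d$-dimensional vMF density, $\psi(\cdot)$ is the digamma function, $\phi_{i0}=\sum_{k}\phi_{ik}$,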

and $\mathcal{H}(q)$ is the entropy of $q(\boldsymbol{Z},\boldsymbol{\Theta})$ :

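$$\begin{aligned}\mathcal{H}(q)&=\sum_{i}\mathbb{E}_{q}\Bigl[\sum_{k}\log\Gamma(\phi_{ik})-\log\Gamma(\phi_{i0})-\sum_{k}(\phi_{ik}-1)\log\theta_{ik}-\sum_{j,k}\delta(z_{ij}=k)\log\pi_{ijk}\Bigr]\\&=\sum_{i}\Bigl(\sum_{k}\log\Gamma(\phi_{ik})-\log\Gamma(\phi_{i0})-\sum_{k}(\phi_{ik}-1)\psi(\phi_{ik})+(\phi_{i0}-K)\psi(\phi_{i0})-\sum_{j,k}\pi_{ijk}\log\pi_{ijk}\Bigr).\end{aligned}$$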
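The closed forms of the lower bound and of the entropy rest on two standard expectations under the factorized $q(\boldsymbol{Z},\boldsymbol{\Theta})$. The following derivation sketch spells them out, writing $\phi_{i0}=\sum_{k}\phi_{ik}$ and $\psi(\cdot)$ for the digamma function; it is a reconstruction from the Dirichlet and categorical factors of $q$, not a quotation from the paper.

```latex
\begin{align*}
\mathbb{E}_{q}[\delta(z_{ij}=k)] &= \pi_{ijk}, \\
\mathbb{E}_{q}[\log\theta_{ik}] &= \psi(\phi_{ik})-\psi(\phi_{i0}), \\
\mathbb{E}_{q}\Bigl[\sum_{j}\delta(z_{ij}=k)\log\theta_{ik}\Bigr]
  &= \Bigl(\sum_{j}\pi_{ijk}\Bigr)\bigl(\psi(\phi_{ik})-\psi(\phi_{i0})\bigr), \\
-\sum_{k}(\phi_{ik}-1)\,\mathbb{E}_{q}[\log\theta_{ik}]
  &= -\sum_{k}(\phi_{ik}-1)\psi(\phi_{ik})+(\phi_{i0}-K)\psi(\phi_{i0}).
\end{align*}
```

The third line uses the independence of $z_{ij}$ and $\boldsymbol{\theta}_{i}$ under $q$; the fourth uses $\sum_{k}(\phi_{ik}-1)=\phi_{i0}-K$.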

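Setting the partial derivatives of (5) with respect to $\pi_{ijk}$, $\phi_{ik}$, $\boldsymbol{\mu}_{k}$ and $\kappa_{k}$ to zero, subject to the normalization constraints on $\boldsymbol{\pi}_{ij}$ and $\boldsymbol{\mu}_{k}$, we obtain the following variational EM update equations [1, 2, 4].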

2.1 E-Step

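$$\pi_{ijk}\propto\exp\bigl\{\psi(\phi_{ik})+\log c_{d}(\kappa_{k})+\kappa_{k}\boldsymbol{\mu}_{k}^{\top}\boldsymbol{x}_{ij}\bigr\},\qquad\sum_{k}\pi_{ijk}=1,$$

$$\phi_{ik}=\alpha+n_{i\cdot k}.$$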
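A minimal sketch of one E-step pass is given below, under the assumption that the updates take the usual mean-field form: $\pi_{ijk}$ is a softmax over $k$ of $\psi(\phi_{ik})+\log c_{d}(\kappa_{k})+\kappa_{k}\boldsymbol{\mu}_{k}^{\top}\boldsymbol{x}_{ij}$, and $\phi_{ik}$ is $\alpha$ plus the summed responsibilities. The function names, the argument `log_c` (precomputed values of $\log c_{d}(\kappa_{k})$), and the small series-based `digamma` helper are all hypothetical choices made for self-containment.

```python
import numpy as np

def digamma(x):
    """Digamma for positive inputs: shift the argument by 6, then use the asymptotic series."""
    x = np.array(x, dtype=float)  # copy, so the caller's array is not modified
    result = np.zeros_like(x)
    for _ in range(6):
        result -= 1.0 / x         # psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += np.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def e_step(docs, phi, mus, kappas, log_c, alpha):
    """One E-step pass.

    docs:   list of (n_i, d) arrays of unit vectors, one array per document
    phi:    (N, K) current Dirichlet parameters of q(theta_i)
    mus:    (K, d) mean directions; kappas, log_c: (K,) arrays
    Returns the responsibilities (list of (n_i, K) arrays) and the updated phi.
    """
    pis = []
    new_phi = np.empty_like(phi)
    for i, x in enumerate(docs):
        # Unnormalized log pi_ijk; the row-wise max is subtracted for numerical stability.
        scores = digamma(phi[i]) + log_c + (x @ mus.T) * kappas
        scores -= scores.max(axis=1, keepdims=True)
        pi = np.exp(scores)
        pi /= pi.sum(axis=1, keepdims=True)
        pis.append(pi)
        new_phi[i] = alpha + pi.sum(axis=0)   # phi_ik = alpha + n_{i.k}
    return pis, new_phi
```

Whether $\pi$ and $\phi$ are updated once per iteration, as here, or alternated until they stabilize is a design choice that this sketch leaves open.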

2.2 M-Step

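$$\boldsymbol{\mu}_{k}=\frac{\boldsymbol{r}_{k}}{\|\boldsymbol{r}_{k}\|},\qquad\bar{r}_{k}=\frac{\|\boldsymbol{r}_{k}\|}{n_{\cdot\cdot k}},\qquad\kappa_{k}=\frac{\bar{r}_{k}d-\bar{r}_{k}^{3}}{1-\bar{r}_{k}^{2}}.$$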

The update equation of $\kappa_{k}$ adopts the approximation proposed in [1].
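The M-step can be sketched as follows, assuming the mean direction is the normalized responsibility-weighted sum of the vectors and that the concentration uses the closed-form approximation $\kappa\approx(\bar{r}d-\bar{r}^{3})/(1-\bar{r}^{2})$ from [1]. The function name `m_step` and its argument layout are hypothetical.

```python
import numpy as np

def m_step(docs, pis):
    """Update mean directions and concentrations from the responsibilities.

    docs: list of (n_i, d) arrays of unit vectors
    pis:  list of (n_i, K) arrays of responsibilities pi_ijk
    Assumes every component has positive total responsibility and that
    its vectors are not all identical (otherwise r_bar = 1 and kappa diverges).
    """
    d = docs[0].shape[1]
    K = pis[0].shape[1]
    r = np.zeros((K, d))   # r_k = sum_ij pi_ijk * x_ij
    n = np.zeros(K)        # n_{..k} = sum_ij pi_ijk
    for x, pi in zip(docs, pis):
        r += pi.T @ x
        n += pi.sum(axis=0)
    r_norm = np.linalg.norm(r, axis=1)
    mus = r / r_norm[:, None]
    r_bar = r_norm / n     # mean resultant length, between 0 and 1
    kappas = (r_bar * d - r_bar**3) / (1.0 - r_bar**2)
    return mus, kappas
```

As a small check, two orthogonal unit vectors in three dimensions assigned fully to one component give $\bar{r}=1/\sqrt{2}$ and hence $\kappa=2.5\sqrt{2}\approx 3.54$.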

3 Evaluation

VMFMix was evaluated on two text classification tasks, one on 20 Newsgroups (20News) and one on Reuters. The experimental setup for the compared methods was identical to that in [3]. Similar to TopicVec, VMFMix learns a separate set of $K$ topic embeddings from each category of documents, and all these sets are combined into a larger set of topic embeddings for the whole corpus. This combined set is used to derive the topic proportions of each document, which are taken as features for the SVM classifier. $K$ is set to 15 for 20News and 12 for Reuters, the same values as used for TopicVec.

The macro-averaged precision, recall and F1 scores of all methods are presented in Table 1.
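For readers unfamiliar with macro averaging, the sketch below computes the three scores from true and predicted labels. It assumes macro-F1 is the mean of the per-class F1 scores (the convention in common toolkits); the source does not state which variant was used, and the function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2.0 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    # Each class contributes equally, regardless of how many documents it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because every class has equal weight, the macro-averaged F1 can fall below both the macro-averaged precision and recall, as it does for several rows of Table 1.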

Table 1 shows that, in terms of F1, VMFMix matches or exceeds Doc2Vec, LDA, sLDA and LFTM on both datasets (it ties with LFTM on 20News at 63.7). However, its performance is still inferior to BOW, mean word embeddings (MeanWV), TWE and TopicVec. A possible reason is that restricting the embeddings to the unit hypersphere (effectively normalizing them to unit vectors) sacrifices some representational flexibility.

Empirically, VMFMix approaches convergence very quickly: the variational lower bound increases only slightly after 10~20 iterations. Manual inspection of the intermediate parameter values shows that the parameters also change very little beyond that point. This suggests that VMFMix might easily get stuck in local optima.

Nonetheless, VMFMix might still be relevant when the embedding vectors under consideration are infinitely many and continuously distributed in the embedding space, as opposed to the finite vocabulary of word embeddings (see footnote 2). Such scenarios include the neural encodings of images from a convolutional neural network (CNN).

Footnote 2: Each set of word embeddings can be viewed as a finite and discrete sample from a continuous embedding space.

Bibliography

[1] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382, 2005.
[2] Siddharth Gopal and Yiming Yang. Von Mises-Fisher clustering models. In ICML, pages 154–162, 2014.
[3] Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. Generative topic embedding: a continuous representation of documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
[4] Ximing Li, Jinjin Chi, Changchun Li, Jihong Ou Yang, and Bo Fu. Integrating topic modeling with word embeddings by mixtures of vMFs. In COLING, 2016.