
TL;DR
This paper introduces VMFMix, a Dirichlet-vMF mixture model that captures co-occurrence patterns across documents with topic distributions defined on a continuous hypersphere, enabling the derivation of topic embeddings from multiple sets of embedding vectors.
Contribution
The paper proposes VMFMix, a Dirichlet-vMF mixture model that simplifies the Bayesian vMF mixture model of Gopal and Yang, together with an efficient variational EM inference algorithm for deriving topic embeddings on a hypersphere.
Findings
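On 20News and Reuters document classification, VMFMix outperforms Doc2Vec, LDA and sLDA, but trails BOW, MeanWV, TWE and TopicVec.
Restricting the embeddings to the unit hypersphere may cost representational flexibility, which the author offers as a possible reason for the gap.
The variational lower bound increases only slightly after 10 to 20 iterations, suggesting that the model might easily get stuck in local optima.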
Abstract
This document is about the multi-document von Mises-Fisher mixture model with a Dirichlet prior, referred to as VMFMix. VMFMix is analogous to Latent Dirichlet Allocation (LDA) in that it can capture the co-occurrence patterns across multiple documents. The difference is that in VMFMix, the topic-word distribution is defined on a continuous n-dimensional hypersphere. Hence VMFMix can be used to derive topic embeddings, i.e., representative vectors, from multiple sets of embedding vectors. An efficient variational Expectation-Maximization inference algorithm is derived. The performance of VMFMix on two document classification tasks is reported, with some preliminary analysis.
Table 1: Macro-averaged precision, recall and F1 scores of all methods on 20News and Reuters.
| Method | 20News Prec | 20News Rec | 20News F1 | Reuters Prec | Reuters Rec | Reuters F1 |
|---|---|---|---|---|---|---|
| BOW | 69.1 | 68.5 | 68.6 | 92.5 | 90.3 | 91.1 |
| LDA | 61.9 | 61.4 | 60.3 | 76.1 | 74.3 | 74.8 |
| sLDA | 61.4 | 60.9 | 60.9 | 88.3 | 83.3 | 85.1 |
| LFTM | 63.5 | 64.8 | 63.7 | 84.6 | 86.3 | 84.9 |
| MeanWV | 70.4 | 70.3 | 70.1 | 92.0 | 89.6 | 90.5 |
| Doc2Vec | 56.3 | 56.6 | 55.4 | 84.4 | 50.0 | 58.5 |
| TWE | 69.5 | 69.3 | 68.8 | 91.0 | 89.1 | 89.9 |
| TopicVec | 71.3 | 71.3 | 71.2 | 92.5 | 92.1 | 92.2 |
| VMFMix | 63.8 | 63.9 | 63.7 | 87.9 | 88.7 | 88.0 |
Dirichlet-vMF Mixture Model
Shaohua Li
National University of Singapore
We present a simplification of the Bayesian vMF mixture model proposed in [2] (this model reappears in [4] under the name “mix-vMF topic model”, but [4] only offers a sampling-based inference scheme, which is usually less accurate than the EM algorithm presented in this document). For computational efficiency, the priors on the vMF mean and on the vMF concentration are removed. This model is referred to as VMFMix.
A Python implementation of VMFMix is available at https://github.com/askerlee/vmfmix.
1 Model Specification
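The generative process is as follows:
1. $\theta_i \sim \mathrm{Dir}(\alpha)$;
2. $z_{ij} \sim \mathrm{Cat}(\theta_i)$;
3. $x_{ij} \sim \mathrm{vMF}(\mu_{z_{ij}}, \kappa_{z_{ij}})$.
Here $i$ indexes documents, $j$ indexes the embedding vectors within document $i$, and $k$ indexes the mixture components. $\alpha$ is a hyperparameter, and $\{\mu_k, \kappa_k\}$ are parameters of mixture components to be learned.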
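The generative process can be simulated directly. The following is a minimal sketch rather than code from the author's repository. It assumes the process is: draw mixing proportions from a Dirichlet, draw a component index from the resulting categorical distribution, then draw a unit vector from that component's vMF distribution. The vMF draw uses Wood's rejection sampler, which is an assumption of this sketch (the text does not prescribe a sampler), and the names `sample_vmf` and `sample_corpus` are hypothetical.

```python
import math
import random


def sample_vmf(mu, kappa, rng):
    """Draw one unit vector from vMF(mu, kappa) with Wood's rejection sampler.

    Assumes len(mu) >= 2, mu has unit norm, and kappa >= 0.
    """
    d = len(mu)
    # Sample w, the component of the draw along the mean direction.
    b = (-2.0 * kappa + math.sqrt(4.0 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * math.log(1.0 - x0 ** 2)
    while True:
        z = rng.betavariate((d - 1) / 2.0, (d - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        u = 1.0 - rng.random()  # uniform on (0, 1], keeps log(u) finite
        if kappa * w + (d - 1) * math.log(1.0 - x0 * w) - c >= math.log(u):
            break
    # Uniform random direction in the subspace orthogonal to the first axis.
    v = [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
    norm_v = math.sqrt(sum(a * a for a in v))
    s = math.sqrt(max(0.0, 1.0 - w * w))
    x = [w] + [s * a / norm_v for a in v]
    # Householder reflection that maps the first axis e1 onto mu.
    h = [(1.0 if t == 0 else 0.0) - m for t, m in enumerate(mu)]
    norm_h = math.sqrt(sum(a * a for a in h))
    if norm_h < 1e-12:  # mu already equals e1
        return x
    h = [a / norm_h for a in h]
    dot = sum(p * q for p, q in zip(h, x))
    return [p - 2.0 * dot * q for p, q in zip(x, h)]


def sample_corpus(alpha, mus, kappas, doc_lengths, seed=0):
    """Return a list of (theta_i, [(z_ij, x_ij), ...]) pairs, one per document."""
    rng = random.Random(seed)
    corpus = []
    for n_vectors in doc_lengths:
        # Step 1: theta_i ~ Dir(alpha), via normalized Gamma draws.
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(gammas)
        theta = [g / total for g in gammas]
        doc = []
        for _ in range(n_vectors):
            # Step 2: z_ij ~ Cat(theta_i).
            z = rng.choices(range(len(alpha)), weights=theta)[0]
            # Step 3: x_ij ~ vMF(mu_z, kappa_z).
            doc.append((z, sample_vmf(mus[z], kappas[z], rng)))
        corpus.append((theta, doc))
    return corpus
```

For example, `sample_corpus([0.5, 0.5], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [50.0, 50.0], [20, 30])` produces two documents whose vectors cluster tightly around the two mean directions.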
2 Model Likelihood and Inference
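Given parameters $\alpha, \{\mu_k, \kappa_k\}$, the complete-data likelihood of a dataset $\{X, Z, \Theta\} = \{x_{ij}, z_{ij}, \theta_i\}$ is:
$$p(X, Z, \Theta \mid \alpha, \{\mu_k, \kappa_k\}) = \prod_i \mathrm{Dir}(\theta_i \mid \alpha) \prod_j \theta_{i, z_{ij}} \, \mathrm{vMF}(x_{ij} \mid \mu_{z_{ij}}, \kappa_{z_{ij}}). \tag{1}$$
The incomplete-data likelihood of $X$ is obtained by integrating out the latent variables $Z, \Theta$:
$$p(X \mid \alpha, \{\mu_k, \kappa_k\}) = \int \prod_i \Big[ \mathrm{Dir}(\theta_i \mid \alpha) \prod_j \sum_k \theta_{ik} \, \mathrm{vMF}(x_{ij} \mid \mu_k, \kappa_k) \Big] \, d\Theta. \tag{2}$$
(2) is intractable, and instead we seek its variational lower bound:
$$\log p(X \mid \alpha, \{\mu_k, \kappa_k\}) \ge E_q\big[\log p(X, Z, \Theta \mid \alpha, \{\mu_k, \kappa_k\}) - \log q(Z, \Theta)\big] = \mathcal{L}(q, \{\mu_k, \kappa_k\}). \tag{3}$$
It is natural to use the following variational distribution to approximate the posterior distribution of $Z, \Theta$:
$$q(Z, \Theta) = \prod_i \Big[ \mathrm{Dir}(\theta_i \mid \phi_i) \prod_j \mathrm{Cat}(z_{ij} \mid \pi_{ij}) \Big]. \tag{4}$$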
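Then the variational lower bound is
$$\mathcal{L}(q, \{\mu_k, \kappa_k\}) = E_q\big[\log p(X, Z, \Theta \mid \alpha, \{\mu_k, \kappa_k\})\big] + \mathcal{H}(q), \tag{5}$$
where
$$E_q\big[\log p(X, Z, \Theta \mid \alpha, \{\mu_k, \kappa_k\})\big] = C_0 + \sum_i \Big\{ \sum_k \Big(\alpha_k - 1 + \sum_j \pi_{ij}^k\Big)\big(\psi(\phi_{ik}) - \psi(\phi_{i0})\big) + \sum_j \sum_k \pi_{ij}^k \big(\log c_D(\kappa_k) + \kappa_k \mu_k^\top x_{ij}\big) \Big\}, \tag{6}$$
and $\mathcal{H}(q)$ is the entropy of $q(Z, \Theta)$:
$$\mathcal{H}(q) = \sum_i \Big\{ \sum_k \log\Gamma(\phi_{ik}) - \log\Gamma(\phi_{i0}) + (\phi_{i0} - K)\,\psi(\phi_{i0}) - \sum_k (\phi_{ik} - 1)\,\psi(\phi_{ik}) - \sum_j \sum_k \pi_{ij}^k \log \pi_{ij}^k \Big\}. \tag{7}$$
Here $\pi_{ij}^k = q(z_{ij} = k)$, $\phi_{i0} = \sum_k \phi_{ik}$, $\psi$ is the digamma function, $K$ is the number of mixture components, $C_0$ collects the terms that depend only on $\alpha$, and $c_D(\kappa) = \kappa^{D/2-1} / \big((2\pi)^{D/2} I_{D/2-1}(\kappa)\big)$ is the vMF normalization constant in $D$ dimensions, with $I_\nu$ the modified Bessel function of the first kind.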
By taking the partial derivatives of (5) w.r.t. $\pi_{ij}^k, \phi_{ik}, \mu_k, \kappa_k$, respectively, we can obtain the following variational EM update equations [1, 2, 4].
2.1 E-Step
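$$\pi_{ij}^k \propto \exp\{\psi(\phi_{ik})\} \cdot \mathrm{vMF}(x_{ij} \mid \mu_k, \kappa_k),$$
$$\phi_{ik} = \alpha_k + \sum_j \pi_{ij}^k.$$
Here $\psi$ is the digamma function, $\pi_{ij}^k$ is the variational probability that $z_{ij} = k$ (normalized so that $\sum_k \pi_{ij}^k = 1$), and $\phi_i$ is the variational Dirichlet parameter of document $i$.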
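The E-step can be sketched for a single document as follows. This is an illustration under stated assumptions, not the author's implementation: it assumes the updates take the usual LDA-style form, with each responsibility proportional to exp(digamma(phi_k)) times the vMF density of the vector under component k, and phi_k equal to alpha_k plus the summed responsibilities. To stay within the standard library, the digamma function and the log of the modified Bessel function are approximated by short helper routines; all function names are hypothetical, and a numerical library such as SciPy would normally supply these special functions.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x up by recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_bessel_i(nu, kappa, terms=200):
    """log I_nu(kappa) from the power series, summed in log space.

    Assumes kappa > 0 and moderate in size (a few hundred terms must suffice).
    """
    logs = [
        (2 * m + nu) * math.log(kappa / 2.0) - math.lgamma(m + 1) - math.lgamma(m + nu + 1)
        for m in range(terms)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def log_vmf(x, mu, kappa):
    """Log density of vMF(mu, kappa) at the unit vector x."""
    d = len(x)
    nu = d / 2.0 - 1.0
    log_c = nu * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(nu, kappa)
    return log_c + kappa * sum(a * b for a, b in zip(mu, x))


def e_step_document(xs, alpha, mus, kappas, n_iters=20):
    """Alternate the pi and phi updates for one non-empty document; return (pi, phi)."""
    n_components = len(alpha)
    log_lik = [[log_vmf(x, mus[k], kappas[k]) for k in range(n_components)] for x in xs]
    # Start from uniform responsibilities.
    phi = [a + len(xs) / n_components for a in alpha]
    for _ in range(n_iters):
        pi = []
        for row in log_lik:
            scores = [digamma(phi[k]) + row[k] for k in range(n_components)]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            pi.append([w / total for w in weights])
        phi = [alpha[k] + sum(p[k] for p in pi) for k in range(n_components)]
    return pi, phi
```

Working in log space and subtracting the maximum score before exponentiating keeps the normalization stable when the concentration parameters are large.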
2.2 M-Step
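$$r_k = \sum_{i,j} \pi_{ij}^k x_{ij}, \qquad \mu_k = \frac{r_k}{\|r_k\|},$$
$$\bar{r}_k = \frac{\|r_k\|}{\sum_{i,j} \pi_{ij}^k}, \qquad \kappa_k = \frac{\bar{r}_k D - \bar{r}_k^3}{1 - \bar{r}_k^2}.$$
Here $D$ is the dimensionality of the embedding vectors. The update equation of $\kappa_k$ adopts the approximation proposed in [1].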
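A minimal sketch of the M-step, assuming the responsibilities from the E-step are given. It assumes each mean direction is the normalized responsibility-weighted sum of the vectors, and that the concentration uses the closed-form approximation of [1], kappa ≈ (r̄D − r̄³)/(1 − r̄²), where r̄ is the weighted mean resultant length. The function name `m_step`, the nested-list data layout and the clamping of r̄ below 1 are assumptions of this sketch, not details from the paper.

```python
import math


def m_step(docs, pis, n_components):
    """Update the vMF parameters of every component.

    docs[i][j] is the unit vector x_ij; pis[i][j][k] is the responsibility
    of component k for x_ij. Returns (mus, kappas).
    """
    dim = len(docs[0][0])
    mus, kappas = [], []
    for k in range(n_components):
        resultant = [0.0] * dim
        weight = 0.0
        for xs, pi in zip(docs, pis):
            for x, p in zip(xs, pi):
                weight += p[k]
                for t in range(dim):
                    resultant[t] += p[k] * x[t]
        norm_r = math.sqrt(sum(v * v for v in resultant))
        # Mean direction: the normalized weighted resultant vector.
        mus.append([v / norm_r for v in resultant])
        # Mean resultant length, clamped below 1 to avoid division by zero
        # when all vectors of a component coincide (an assumption of this sketch).
        r_bar = min(norm_r / weight, 1.0 - 1e-9)
        # Closed-form approximation to the concentration from [1].
        kappas.append((r_bar * dim - r_bar ** 3) / (1.0 - r_bar ** 2))
    return mus, kappas
```

The closer r̄ is to 1, the more tightly the vectors of a component agree in direction, and the larger the estimated concentration.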
3 Evaluation
The performance of this model was evaluated on two text classification tasks, on 20 Newsgroups (20News) and Reuters, respectively. The experimental setup for the compared methods was identical to that in [3]. Similar to TopicVec, VMFMix learns an individual set of topic embeddings from each category of documents, and all these sets are combined to form a bigger set of topic embeddings for the whole corpus. This set of topic embeddings is used to derive the topic proportions of each document, which are taken as features for the SVM classifier. The number of topics $K$ for 20News and Reuters is set to 15 and 12, respectively, identical to the values used for TopicVec.
The macro-averaged precision, recall and F1 scores of all methods are presented in Table 1.
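For reference, macro-averaging computes each metric per class and then takes the unweighted mean over classes, so small classes count as much as large ones. The sketch below is illustrative only: the function name `macro_prf` is hypothetical, and it assumes macro-F1 is the mean of the per-class F1 scores (another convention takes the harmonic mean of macro precision and macro recall instead).

```python
def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (no predictions or no true instances) count as 0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Under this convention the macro-F1 need not lie between macro precision and macro recall, because it averages per-class harmonic means.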
We can see from Table 1 that VMFMix achieves better performance than Doc2Vec, LDA and sLDA, and matches or exceeds LFTM in F1 (63.7 vs. 63.7 on 20News, 88.0 vs. 84.9 on Reuters). However, its performance is still inferior to BOW, mean word embeddings (MeanWV), TWE and TopicVec. The reason might be that by restricting the embeddings to the unit hypersphere (effectively normalizing them to unit vectors), some representational flexibility is lost.
An empirical observation is that VMFMix approaches convergence very quickly: the variational lower bound increases only slightly after 10 to 20 iterations. By manually checking the intermediate parameter values, we see that after that many iterations the parameters change very little too. This suggests that VMFMix might easily get stuck in local optima.
Nonetheless, VMFMix might still be relevant when the embedding vectors under consideration are infinite in number and continuously distributed in the embedding space, as opposed to the finite vocabulary of word embeddings (each set of word embeddings can be viewed as a finite and discrete sample from a continuous embedding space). Such scenarios include the neural encodings of images from a convolutional neural network (CNN).
References
- [1] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382, 2005.
- [2] Siddharth Gopal and Yiming Yang. Von Mises-Fisher clustering models. In ICML, pages 154–162, 2014.
- [3] Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. Generative topic embedding: a continuous representation of documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
- [4] Ximing Li, Jinjin Chi, Changchun Li, Jihong Ou Yang, and Bo Fu. Integrating topic modeling with word embeddings by mixtures of vMFs. In COLING, 2016.
