Topic Modeling over Short Texts by Incorporating Word Embeddings
Jipeng Qiang, Ping Chen, Tong Wang, Xindong Wu

TL;DR
This paper introduces ETM, a novel topic modeling approach for short texts that leverages word embeddings and word correlation knowledge to improve topic coherence and overcome data sparsity issues.
Contribution
The paper proposes a new embedding-based topic model that combines pseudo-text aggregation with a Markov Random Field regularization to enhance short text topic modeling.
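As a rough illustration of what "word correlation knowledge" from embeddings could look like in practice, the sketch below picks out word pairs whose embedding vectors are close in cosine similarity. Pairs like these are the kind of signal an MRF-style regularizer could use to encourage related words to share a topic. This is a minimal sketch under stated assumptions, not the paper's procedure: the function names `cosine` and `related_pairs`, the threshold of 0.8, and the toy three-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_pairs(tokens, embeddings, threshold=0.8):
    """Return sorted word pairs from one text whose embeddings are similar.

    `threshold` is an assumed cutoff for illustration, not a value from the paper.
    Words without an embedding are skipped.
    """
    vocab = sorted({t for t in tokens if t in embeddings})
    return [
        (a, b)
        for a, b in combinations(vocab, 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


# Toy 3-dimensional vectors, made up for illustration only.
toy_embeddings = {
    "soccer": [0.9, 0.1, 0.0],
    "football": [0.8, 0.2, 0.1],
    "stock": [0.0, 0.1, 0.9],
}

pairs = related_pairs(["soccer", "football", "stock", "soccer"], toy_embeddings)
print(pairs)
```

With real pretrained embeddings the vectors would have hundreds of dimensions and the threshold would need tuning; the structure of the computation stays the same.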
Findings
ETM outperforms state-of-the-art models on real-world datasets.
Incorporating word embeddings improves topic coherence.
Using MRF regularization enhances word-topic assignments.
Abstract
Inferring topics from the overwhelming amount of short texts has become a critical but challenging task for many content analysis tasks, such as content characterization, user interest profiling, and emerging topic detection. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well, since only very limited word co-occurrence information is available in short texts. This paper studies how to incorporate external word correlation knowledge into short texts to improve the coherence of topic modeling. Based on recent results in word embeddings that learn semantic representations for words from a large corpus, we introduce a novel method, Embedding-based Topic Model (ETM), to learn latent topics from short texts. ETM not only solves the problem of very limited word co-occurrence information by…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Advanced Text Analysis Techniques
