Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec
Christopher E Moody

TL;DR
lda2vec combines word embeddings with topic models, learning dense word vectors jointly with sparse, interpretable document representations.
Contribution
This work introduces lda2vec, a model that jointly learns dense word vectors and document-level topic mixtures within a simple, differentiable framework, yielding interpretable document representations.
Findings
Produces sparse, interpretable document mixtures
Jointly learns word vectors, the linear relationships between them, and document-level topic mixtures
Simple to incorporate into existing automatic differentiation frameworks
Abstract
Distributed dense word vectors have been shown to be effective at capturing token-level semantic and syntactic regularities in language, while topic models can form interpretable representations over documents. In this work, we describe lda2vec, a model that learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors. In contrast to continuous dense document representations, this formulation produces sparse, interpretable document mixtures through a non-negative simplex constraint. Our method is simple to incorporate into existing automatic differentiation frameworks and allows for unsupervised document representations geared for use by scientists while simultaneously learning word vectors and the linear relationships between them.
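The simplex-constrained mixture described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation. The function names, the softmax parameterization of the document weights, the additive combination of word and document vectors, and the unnormalized Dirichlet term are all assumptions made for illustration, written in plain Python with no automatic differentiation.

```python
import math

def softmax(weights):
    """Map unconstrained document weights onto the simplex (non-negative, sums to 1)."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def document_vector(doc_weights, topic_vectors):
    """Document vector as a mixture of topic vectors (assumed parameterization)."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [
        sum(p * topic[d] for p, topic in zip(proportions, topic_vectors))
        for d in range(dim)
    ]

def context_vector(word_vector, doc_vector):
    """Combine a word vector and a document vector by addition (assumption)."""
    return [w + d for w, d in zip(word_vector, doc_vector)]

def dirichlet_log_prior(proportions, alpha):
    """Unnormalized Dirichlet log-density with a symmetric concentration alpha.

    With alpha < 1, mixtures concentrated on a few topics score higher,
    which is what encourages sparse document mixtures.
    """
    return (alpha - 1.0) * sum(math.log(p) for p in proportions)

if __name__ == "__main__":
    # Hypothetical toy values: 3 topics in a 2-dimensional embedding space.
    topics = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    doc_weights = [2.0, -1.0, 0.5]
    proportions = softmax(doc_weights)
    doc_vec = document_vector(doc_weights, topics)
    print("topic proportions:", proportions)
    print("context vector:", context_vector([0.2, -0.3], doc_vec))
    print("log prior (alpha=0.5):", dirichlet_log_prior(proportions, 0.5))
```

In a full model the document weights, topic vectors, and word vectors would all be trained by gradient descent, with the Dirichlet term added to the word-prediction loss. The sketch only shows how the pieces fit together.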
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
Methods: lda2vec
