A Latent Variable Model Approach to PMI-based Word Embeddings
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

TL;DR
This paper introduces a probabilistic generative model for word embeddings that explains the effectiveness of PMI, word2vec, and GloVe, providing theoretical insights and experimental validation for their structure and hyperparameters.
Contribution
It presents a new latent variable model that offers a theoretical foundation for nonlinear word embedding methods and explains their geometric properties.
Findings
Supports the generative model assumptions with experimental evidence.
Shows that latent word vectors are fairly uniformly dispersed in space.
Provides closed-form expressions for word statistics using the prior.
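The dispersion finding can be probed numerically. The sketch below is an illustration under stated assumptions, not the authors' experiment: it uses synthetic Gaussian vectors as stand-ins for trained word vectors, and it measures dispersion by how little the sum Z_c = sum_w exp(<v_w, c>) varies across random unit directions c. That measure is an assumption of this example, not something stated in the summary above. The names `partition_value` and `dispersion_ratio` are hypothetical.

```python
import math
import random
import statistics


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def partition_value(word_vectors, c):
    """Z_c = sum over words of exp(<v_w, c>)."""
    return sum(math.exp(sum(a * b for a, b in zip(v, c))) for v in word_vectors)


def dispersion_ratio(num_words=2000, dim=20, num_directions=20, seed=0):
    """Relative spread (std / mean) of Z_c over random unit directions c.

    Assumption: word vectors are i.i.d. standard Gaussian, a synthetic
    stand-in for real embeddings. A small ratio suggests that no direction
    is strongly favoured, i.e. the vectors are roughly uniformly dispersed.
    """
    rng = random.Random(seed)
    word_vectors = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_words)
    ]
    values = [
        partition_value(word_vectors, random_unit_vector(dim, rng))
        for _ in range(num_directions)
    ]
    return statistics.stdev(values) / statistics.mean(values)


ratio = dispersion_ratio()
print(f"relative spread of Z_c: {ratio:.4f}")
```

To run the same check on real embeddings, replace the synthetic `word_vectors` with vectors loaded from a trained model. A ratio that stays small would be consistent with the finding, though it would not prove it.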
Abstract
Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods. This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The methodological novelty is to use the prior to compute closed form expressions for word statistics. This provides a theoretical justification for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter choices. It also helps explain why low-dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013) and many subsequent papers. Experimental support is provided for the generative model assumptions, the most important of which is that…
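For readers unfamiliar with the statistic the abstract refers to, PMI (pointwise mutual information) of a word pair is log(p(w, w') / (p(w) p(w'))), estimated from co-occurrence counts. The following is a minimal sketch of that empirical computation, not the paper's pipeline. The toy corpus, the window size, and the function names `cooccurrence_counts` and `pmi_table` are assumptions made for illustration.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences of word pairs within `window` positions."""
    pair_counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair_counts[(w, tokens[j])] += 1
            pair_counts[(tokens[j], w)] += 1
    return pair_counts


def pmi_table(pair_counts):
    """PMI(w, w') = log(p(w, w') / (p(w) * p(w'))) from empirical pair counts.

    Marginals p(w) are taken as row sums of the pair counts, so they refer
    to the same co-occurrence distribution as the joint probabilities.
    Only pairs that were observed at least once get an entry.
    """
    total = sum(pair_counts.values())
    word_counts = Counter()
    for (w, _), c in pair_counts.items():
        word_counts[w] += c
    pmi = {}
    for (w, u), c in pair_counts.items():
        p_joint = c / total
        p_w = word_counts[w] / total
        p_u = word_counts[u] / total
        pmi[(w, u)] = math.log(p_joint / (p_w * p_u))
    return pmi


# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat the dog sat on the rug".split()
pmi = pmi_table(cooccurrence_counts(tokens, window=2))
print(pmi[("cat", "sat")])
```

Unobserved pairs have no finite PMI, which is one reason practical methods apply the smoothing and reweighting choices the abstract mentions.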
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: GloVe Embeddings
