Word Embeddings as Statistical Estimators
Neil Dey, Matthew Singer, Jonathan P. Williams, Srijan Sengupta

TL;DR
This paper provides a statistical theoretical framework for understanding word embeddings, interpreting Word2Vec as an estimator of PMI, and introduces a new estimator with comparable performance.
Contribution
It develops a copula-based statistical model for text data and proposes a new, interpretable estimator that matches Word2Vec's performance and offers theoretical insights.
Findings
The new estimator's estimation error is comparable to that of Word2Vec.
The new estimator outperforms the truncation-based method.
The new estimator performs well on a sentiment analysis benchmark.
Abstract
Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation method for estimating the theoretical pointwise mutual information (PMI). Next, by building on the work of Levy and Goldberg (2014), we develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to…
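For readers unfamiliar with the quantity being estimated, the standard definition of pointwise mutual information for a word w and a context c is PMI(w, c) = log( P(w, c) / (P(w) P(c)) ). The sketch below computes a plug-in (empirical) version of it from a small co-occurrence count table. It is an illustration only, not the estimator proposed in the paper: the function name `empirical_pmi`, the toy counts, and the choice to mark zero-count cells as NaN (that is, as missing, since log 0 is undefined) are all assumptions made for this example.

```python
import math


def empirical_pmi(cooc):
    """Plug-in PMI from a word-by-context co-occurrence count table.

    cooc is a list of rows of non-negative counts. Cells with a zero
    count have an undefined PMI (log 0), so they are returned as NaN,
    i.e. flagged as missing, rather than clipped to some finite value.
    """
    total = sum(sum(row) for row in cooc)
    row_tot = [sum(row) for row in cooc]        # marginal counts per word
    col_tot = [sum(col) for col in zip(*cooc)]  # marginal counts per context

    pmi = []
    for i, row in enumerate(cooc):
        out = []
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                out.append(math.nan)
            else:
                # log( (n_ij / N) / ((n_i. / N) * (n_.j / N)) )
                out.append(math.log(n_ij * total / (row_tot[i] * col_tot[j])))
        pmi.append(out)
    return pmi


# Hypothetical toy counts for a three-word vocabulary.
counts = [
    [0, 4, 1],
    [4, 0, 0],
    [1, 0, 2],
]
pmi = empirical_pmi(counts)
```

Keeping unobserved pairs as missing values, rather than forcing them to a number, leaves the decision about how to treat them to whatever low-rank fitting step comes afterwards.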
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining
