Word Embeddings as Statistical Estimators

Neil Dey; Matthew Singer; Jonathan P. Williams; Srijan Sengupta

arXiv:2301.06710·stat.ME·January 18, 2023

TL;DR

This paper provides a statistical theoretical framework for understanding word embeddings, interpreting Word2Vec as an estimator of PMI, and introduces a new estimator with comparable performance.

Contribution

It develops a copula-based statistical model for text data and proposes a new, interpretable estimator that matches Word2Vec's performance and offers theoretical insights.

Findings

01. The new estimator's estimation error is comparable to that of Word2Vec.

02. The new estimator outperforms the truncation-based method.

03. The new estimator performs well on a sentiment analysis benchmark.

Abstract

Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation method for estimating the theoretical pointwise mutual information (PMI). Next, by building on the work of Levy and Goldberg (2014), we develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to…
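To make the quantity at the center of the abstract concrete, the sketch below computes an empirical PMI matrix, PMI(w, c) = log(p(w, c) / (p(w) p(c))), from windowed co-occurrence counts on a toy corpus. It is only an illustration under stated assumptions, not the paper's estimator or its copula model: the function names, the window size, and the toy corpus are hypothetical. Word-context pairs that never co-occur are marked as NaN (missing) rather than assigned a log of zero, and `truncated_pmi` shows one common truncation (positive PMI), which may not be the exact truncation the paper compares against.

```python
import numpy as np


def cooccurrence_counts(corpus, window=2):
    """Count symmetric word-context co-occurrences within a fixed window."""
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[sent[j]]] += 1
    return vocab, counts


def empirical_pmi(counts):
    """Plug-in PMI estimate; entries with a zero count are left as NaN."""
    total = counts.sum()
    p_wc = counts / total                   # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)   # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc) - np.log(p_w) - np.log(p_c)
    pmi[counts == 0] = np.nan               # unobserved pairs are "missing"
    return pmi


def truncated_pmi(pmi):
    """Positive PMI: negative and unobserved entries are set to zero."""
    return np.where(np.isnan(pmi), 0.0, np.maximum(pmi, 0.0))


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    corpus = [
        ["the", "movie", "was", "great"],
        ["the", "movie", "was", "bad"],
        ["great", "acting"],
    ]
    vocab, counts = cooccurrence_counts(corpus, window=2)
    pmi = empirical_pmi(counts)
    print(vocab)
    print(np.round(pmi, 2))
    print(np.round(truncated_pmi(pmi), 2))
```

The design point the sketch highlights is the choice of what to do with unobserved pairs: truncation overwrites them with zero, whereas a missing-value treatment keeps them out of the fit. How the paper's estimator fills or handles those entries is not specified in the abstract, so that step is deliberately left out here.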

Code & Models

Repositories

neil-dey/word-embeddings-as-estimators
tf · Official

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining