Statistical Uncertainty in Word Embeddings: GloVe-V

Andrea Vallebueno; Cassandra Handan-Nader; Christopher D. Manning; Daniel E. Ho

arXiv:2406.12165 · cs.CL · June 19, 2024

TL;DR

This paper introduces GloVe-V, a scalable method to estimate statistical uncertainty in GloVe word embeddings, enabling more rigorous hypothesis testing and bias analysis in social science applications.

Contribution

The paper derives approximate reconstruction error variance estimates for GloVe embeddings, using an analytical approximation to a multivariate normal model, so that uncertainty can be assessed and hypotheses tested in downstream analyses.

Findings

01. Enables principled hypothesis testing with embedding variance

02. Allows comparison of model performance considering uncertainty

03. Supports bias analysis in social science datasets
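
To illustrate how embedding variance could feed into a hypothesis test, here is a minimal sketch. It assumes each word comes with a mean vector and a covariance matrix under a multivariate normal approximation, and that the two words' vectors are independent. The function name, the toy dimensions, and the covariance values are hypothetical; nothing here is taken from the reglab/glove-v repository. The sketch uses plain Monte Carlo sampling to propagate uncertainty into a cosine similarity, which is one possible approach and not necessarily the procedure used in the paper.

```python
import numpy as np


def cosine_similarity_interval(mean_a, cov_a, mean_b, cov_b,
                               n_draws=2000, level=0.95, seed=0):
    """Monte Carlo interval for the cosine similarity of two uncertain vectors.

    Assumption (not from the paper): each vector is drawn independently from
    a multivariate normal with the given mean and covariance.
    """
    rng = np.random.default_rng(seed)
    draws_a = rng.multivariate_normal(mean_a, cov_a, size=n_draws)
    draws_b = rng.multivariate_normal(mean_b, cov_b, size=n_draws)

    # Cosine similarity for each pair of sampled vectors.
    dots = np.sum(draws_a * draws_b, axis=1)
    norms = np.linalg.norm(draws_a, axis=1) * np.linalg.norm(draws_b, axis=1)
    sims = dots / norms

    # Percentile interval at the requested level.
    alpha = 1.0 - level
    lower, upper = np.quantile(sims, [alpha / 2, 1 - alpha / 2])
    return float(sims.mean()), float(lower), float(upper)


# Toy inputs: made-up 5-dimensional embeddings, not real GloVe vectors.
dim = 5
toy_rng = np.random.default_rng(1)
mean_a = toy_rng.normal(size=dim)
mean_b = toy_rng.normal(size=dim)

# A well-estimated (frequent) word versus a poorly estimated (rare) word.
cov_frequent = 0.01 * np.eye(dim)
cov_rare = 0.5 * np.eye(dim)

est_freq, lo_freq, hi_freq = cosine_similarity_interval(
    mean_a, cov_frequent, mean_b, cov_frequent)
est_rare, lo_rare, hi_rare = cosine_similarity_interval(
    mean_a, cov_rare, mean_b, cov_rare)

print(f"frequent words: {est_freq:.3f} [{lo_freq:.3f}, {hi_freq:.3f}]")
print(f"rare words:     {est_rare:.3f} [{lo_rare:.3f}, {hi_rare:.3f}]")
```

With the larger covariance the interval widens. A conclusion drawn from the point estimate alone would hide that difference.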

Abstract

Static word embeddings are ubiquitous in computational social science applications and contribute to practical decision-making in a variety of fields including law and healthcare. However, assessing the statistical uncertainty in downstream conclusions drawn from word embedding statistics has remained challenging. When using only point estimates for embeddings, researchers have no streamlined way of assessing the degree to which their model selection criteria or scientific conclusions are subject to noise due to sparsity in the underlying data used to generate the embeddings. We introduce a method to obtain approximate, easy-to-use, and scalable reconstruction error variance estimates for GloVe (Pennington et al., 2014), one of the most widely used word embedding models, using an analytical approximation to a multivariate normal model. To demonstrate the value of embeddings with…

Code & Models

Repositories

reglab/glove-v (official)

Datasets

reglab/glove-v (dataset, 82 downloads)

Videos

Statistical Uncertainty in Word Embeddings: GloVe-V (Underline)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: GloVe Embeddings