On the Emergence of Linear Analogies in Word Embeddings

Daniel J. Korchinski; Dhruva Karkada; Yasaman Bahri; Matthieu Wyart

arXiv:2505.18651 · cs.CL · October 24, 2025

TL;DR

This paper investigates the origin of the linear analogy structure in word embeddings. It shows that the structure emerges from co-occurrence statistics and attribute-based interactions between words, and supports this with a theoretical model that agrees with empirical data.

Contribution

The paper introduces a generative model that explains how linear analogies arise naturally in word embeddings from co-occurrence data and attribute interactions.

Findings

01

Linear analogies already emerge in the top eigenvectors of the co-occurrence matrix.

02

Analogy strength increases and then saturates as more eigenvectors are included, and it is enhanced by taking the logarithm of the matrix.

03

The model reproduces empirical analogy properties and is robust to noise.

Abstract

Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i, j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i, j) = P(i, j) / (P(i) P(j))$, (ii) strengthens and then saturates as more eigenvectors of $M(i, j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i, j)$ rather than $M(i, j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To…
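To make observations (i) and (iii) concrete, here is a minimal NumPy sketch on a four-word toy vocabulary. It is not the paper's code or experiments. The attribute table, the strengths `0.5` and `0.3`, the product form $M(i, j) = \prod_k (1 + s_k a_i^k a_j^k)$, and the helper names `ratio_matrix`, `spectral_embeddings`, and `analogy_residual` are all assumptions made for illustration. Embeddings are built by scaling the leading eigenvectors by the square root of the absolute eigenvalue, which is one common convention and not necessarily the one used in the paper.

```python
import numpy as np

def ratio_matrix(P):
    """Return M(i, j) = P(i, j) / (P(i) P(j)) for a symmetric joint table P."""
    P = np.asarray(P, dtype=float)
    P = P / P.sum()                  # normalize to a joint probability
    p = P.sum(axis=1)                # marginals P(i)
    return P / np.outer(p, p)

def spectral_embeddings(S, k):
    """Rows are word vectors built from the k leading eigenvectors of symmetric S."""
    evals, evecs = np.linalg.eigh(S)
    top = np.argsort(-np.abs(evals))[:k]       # rank eigenvectors by |eigenvalue|
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

def analogy_residual(W, a, b, c, d):
    """Norm of W_a - W_b + W_c - W_d; zero means a perfect linear analogy."""
    return float(np.linalg.norm(W[a] - W[b] + W[c] - W[d]))

# Hypothetical toy vocabulary: each word carries two binary attributes (royal, male).
words = ["king", "queen", "man", "woman"]
attrs = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
strengths = np.array([0.5, 0.3])     # assumed attribute strengths, chosen arbitrarily

# Assumed toy form: M(i, j) = prod_k (1 + s_k * a_ik * a_jk).
M_toy = np.prod(1.0 + strengths * attrs[:, None, :] * attrs[None, :, :], axis=2)

# The implied joint table has uniform marginals, so ratio_matrix recovers M_toy.
M = ratio_matrix(M_toy / M_toy.sum())
king, queen, man, woman = range(4)

# Residual of W_king - W_man + W_woman - W_queen under three constructions.
res_top3 = analogy_residual(spectral_embeddings(M, 3), king, man, woman, queen)
res_full = analogy_residual(spectral_embeddings(M, 4), king, man, woman, queen)
res_log = analogy_residual(spectral_embeddings(np.log(M), 4), king, man, woman, queen)
print(res_top3, res_full, res_log)
```

In this toy setting the three leading eigenvectors of $M$ are the constant vector and the two attribute vectors, so the analogy holds up to floating-point error. The fourth eigenvector of $M$ is the product of the two attributes and breaks it, whereas $\log M$ is additive in the attributes and keeps the analogy at full dimension. With only four words this is an illustration of the mechanism, not a reproduction of the saturation behaviour in (ii).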

Videos

On the Emergence of Linear Analogies in Word Embeddings · SlidesLive

Taxonomy

Methods: GloVe Embeddings