TL;DR
This paper investigates the origins of linear analogy structures in word embeddings, revealing their emergence from co-occurrence statistics and attribute-based interactions, supported by a theoretical model that aligns with empirical data.
Contribution
It introduces a generative model explaining how linear analogies naturally arise in word embeddings from co-occurrence data and attribute interactions.
Findings
Linear analogies emerge from top eigenvectors of co-occurrence matrices.
Analogy strength increases and then saturates as more eigenvectors are included, and is enhanced when the log of the co-occurrence matrix is used.
The model reproduces empirical analogy properties and is robust to noise.
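The findings above can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration, not the paper's actual model or code: each toy word is a vector of d binary attributes in {-1, +1}, the normalized co-occurrence matrix is taken to be M(i,j) = prod_k (1 + s_k a_i[k] a_j[k]) with hypothetical strengths s_k, and embeddings are built from the eigenvectors with the largest absolute eigenvalues, each scaled by the square root of that absolute eigenvalue. The function names (toy_cooccurrence, spectral_embedding, analogy_error) are likewise hypothetical.

```python
import itertools
import numpy as np

def toy_cooccurrence(strengths):
    """Toy matrix M(i,j) = prod_k (1 + s_k * a_i[k] * a_j[k]).

    Words are all sign patterns a in {-1,+1}^d. This is an assumed toy
    model for illustration, not the paper's generative model.
    """
    d = len(strengths)
    attrs = np.array(list(itertools.product([1, -1], repeat=d)), dtype=float)
    M = np.ones((len(attrs), len(attrs)))
    for k, s in enumerate(strengths):
        M *= 1.0 + s * np.outer(attrs[:, k], attrs[:, k])
    return attrs, M

def spectral_embedding(S, K):
    """Embed each word with the K eigenvectors of symmetric S that have
    the largest |eigenvalue|, scaled by sqrt(|eigenvalue|).

    Using the absolute value is a simplification so that negative
    eigenvalues (which log M can have) do not break the square root.
    """
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(-np.abs(vals))[:K]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def analogy_error(W, a, b, c, d):
    """Relative error of the parallelogram relation W[a] - W[b] + W[c] = W[d]."""
    return np.linalg.norm(W[a] - W[b] + W[c] - W[d]) / np.linalg.norm(W[d])

# Two attributes (royal, male); rows follow itertools.product order:
# 0 = (+royal, +male), 1 = (+royal, -male), 2 = (-royal, +male), 3 = (-royal, -male)
attrs, M = toy_cooccurrence([0.6, 0.4])
king, queen, man, woman = 0, 1, 2, 3

# king - man + woman vs. queen, under three embedding choices
err_top = analogy_error(spectral_embedding(M, 3), king, man, woman, queen)
err_full = analogy_error(spectral_embedding(M, 4), king, man, woman, queen)
err_log = analogy_error(spectral_embedding(np.log(M), 4), king, man, woman, queen)
print(err_top, err_full, err_log)
```

In this toy setting the top three eigenvectors of M are the constant vector and the two single-attribute vectors, so the analogy holds up to floating-point error. Adding the fourth eigenvector, which carries the product of the two attributes, breaks the parallelogram. Taking log M turns the product over attributes into a sum, so the interaction direction has an eigenvalue of essentially zero and the analogy is recovered even with all four eigenvectors. The sketch does not attempt to reproduce the saturation behaviour or the robustness to noise.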
Abstract
Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$ are included, their number setting the dimension of the embeddings, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To…
Taxonomy
Methods: GloVe Embeddings
