Shape of Elephant: Study of Macro Properties of Word Embeddings Spaces
Alexey Tikhonov

TL;DR
This paper investigates the global geometric structure of pre-trained word embeddings, revealing they form a high-dimensional simplex with identifiable vertices, and introduces a method to enumerate these vertices for GloVe and fastText.
Contribution
It uncovers the high-dimensional simplex shape of word embeddings and proposes a novel method to identify its vertices, enhancing understanding of embedding geometry.
Findings
Word embeddings form a high-dimensional simplex.
Vertices of the simplex can be effectively detected.
The method applies to GloVe and fastText embeddings.
Abstract
Pre-trained word representations have become a key component in many NLP tasks. However, the global geometry of word embeddings remains poorly understood. In this paper, we demonstrate that a typical word embedding cloud is shaped as a high-dimensional simplex with interpretable vertices and propose a simple yet effective method for enumerating these vertices. We show that the proposed method can detect and describe the vertices of the simplex for GloVe and fastText spaces.
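The abstract does not spell out how the vertices are enumerated, so the following is only an illustrative sketch of the general idea of recovering the vertices of a simplex-shaped point cloud. It uses a generic successive-projection search, which is a swapped-in technique and not the paper's method. The function name `find_simplex_vertices`, the synthetic data, and all sizes are assumptions made for the example; real GloVe or fastText vectors would replace the synthetic `cloud`.

```python
import numpy as np

def find_simplex_vertices(points, n_vertices):
    """Greedy successive-projection search for extreme points of a cloud.

    Hypothetical helper for illustration; not the method from the paper.
    Assumes the true vertices are linearly independent and present in `points`.
    """
    residual = np.asarray(points, dtype=float).copy()
    chosen = []
    for _ in range(n_vertices):
        # Squared norm of every residual row; the largest one is an extreme point.
        norms = np.einsum("ij,ij->i", residual, residual)
        idx = int(np.argmax(norms))
        chosen.append(idx)
        # Project every row onto the orthogonal complement of the chosen direction.
        direction = residual[idx] / np.sqrt(norms[idx])
        residual -= np.outer(residual @ direction, direction)
    return chosen

# Synthetic stand-in for an embedding cloud (assumption): 5 vertices in 20 dimensions,
# plus 500 points drawn as random convex combinations of those vertices.
rng = np.random.default_rng(0)
true_vertices = rng.normal(size=(5, 20))
weights = rng.dirichlet(np.ones(5), size=500)
cloud = np.vstack([true_vertices, weights @ true_vertices])

found = find_simplex_vertices(cloud, n_vertices=5)
print(sorted(found))
```

In this toy setup the true vertices are rows 0 to 4 of `cloud`, so the search should return exactly those indices. With real embeddings no word sits exactly on a vertex, so one would more plausibly inspect the words nearest to each detected extreme direction.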
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: fastText · GloVe Embeddings
