Shape of Elephant: Study of Macro Properties of Word Embeddings Spaces

Alexey Tikhonov

arXiv:2106.06964·cs.CL·June 15, 2021

TL;DR

This paper studies the global geometry of pre-trained word embeddings. It shows that a typical embedding cloud is shaped like a high-dimensional simplex with interpretable vertices, and it proposes a method for enumerating those vertices in GloVe and fastText spaces.

Contribution

The paper identifies the high-dimensional simplex shape of word embedding clouds and proposes a simple method for enumerating the vertices of that simplex.

Findings

01. Word embeddings form a high-dimensional simplex.

02. Vertices of the simplex can be effectively detected.

03. The method applies to GloVe and fastText embeddings.

Abstract

Pre-trained word representations have become a key component in many NLP tasks. However, the global geometry of word embeddings remains poorly understood. In this paper, we demonstrate that a typical word embedding cloud is shaped as a high-dimensional simplex with interpretable vertices and propose a simple yet effective method for enumerating these vertices. We show that the proposed method can detect and describe the vertices of the simplex for GloVe and fastText spaces.
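To make "vertices of a simplex-shaped cloud" concrete, here is a minimal sketch on synthetic data. It does not reproduce the paper's method, which the abstract does not describe. It swaps in the successive projection algorithm (SPA), a standard vertex-finding technique, and assumes that the vertices are linearly independent and that they appear as rows of the data. The function name `successive_projection` and all sizes are hypothetical choices for illustration.

```python
import numpy as np


def successive_projection(points, k):
    """Return row indices of k candidate simplex vertices (SPA sketch).

    Assumption: rows of `points` are convex combinations of k linearly
    independent vertices, and the vertices themselves are among the rows.
    This is an illustrative stand-in, not the paper's procedure.
    """
    residual = np.asarray(points, dtype=float).copy()
    chosen = []
    for _ in range(k):
        # Squared norm of every residual row.
        norms = np.einsum("ij,ij->i", residual, residual)
        idx = int(np.argmax(norms))
        if norms[idx] == 0.0:
            break  # nothing left to explain
        chosen.append(idx)
        # Project all rows onto the orthogonal complement of the chosen row.
        u = residual[idx] / np.sqrt(norms[idx])
        residual -= np.outer(residual @ u, u)
    return chosen


# Synthetic cloud: 4 hidden vertices in 10 dimensions plus 500 interior points.
rng = np.random.default_rng(0)
k, dim, n = 4, 10, 500
vertices = rng.normal(size=(k, dim))
weights = rng.dirichlet(np.ones(k), size=n)  # each row sums to 1
cloud = np.vstack([vertices, weights @ vertices])

found = successive_projection(cloud, k)
print(sorted(found))
```

On this synthetic cloud the returned indices should point at the first four rows, which are the planted vertices. Real GloVe or fastText vectors are noisy and would need preprocessing and a way to choose `k`, neither of which this sketch addresses.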

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: fastText · GloVe Embeddings