To Know by the Company Words Keep and What Else Lies in the Vicinity

Jake Ryland Williams; Hunter Scott Heidenreich

arXiv:2205.00148·cs.CL·May 3, 2022

TL;DR

This paper introduces an analytic model of the statistics learned by NLP algorithms such as GloVe and Word2Vec. It derives what the authors believe to be the first known solution to Word2Vec's softmax-optimized skip-gram algorithm, and it reveals universal properties of word vectors that support detecting bias before deep learning models absorb it.

Contribution

The paper derives, to the authors' knowledge, the first known solution to Word2Vec's softmax-optimized skip-gram algorithm, and explores universal properties of word vectors for bias detection and for understanding co-occurrence statistics.

Findings

01 First solution to Word2Vec's skip-gram algorithm

02 Universal property of word vectors enabling bias detection

03 Insights into co-occurrence statistical dependencies

Abstract

The development of state-of-the-art (SOTA) Natural Language Processing (NLP) systems has steadily been establishing new techniques to absorb the statistics of linguistic data. These techniques often trace well-known constructs from traditional theories, and we study these connections to close gaps around key NLP methods as a means to orient future work. For this, we introduce an analytic model of the statistics learned by seminal algorithms (including GloVe and Word2Vec), and derive insights for systems that use these algorithms and the statistics of co-occurrence, in general. In this work, we derive -- to the best of our knowledge -- the first known solution to Word2Vec's softmax-optimized, skip-gram algorithm. This result presents exciting potential for future development as a direct solution to a deep learning (DL) language model's (LM's) matrix factorization. However, we use the…
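
For readers unfamiliar with the "statistics of co-occurrence" that the abstract refers to, the following is a minimal sketch of how such counts are commonly gathered with a sliding context window. It is an illustration only and is not the paper's analytic model or its skip-gram solution; the function name `cooccurrence_counts`, the `window` parameter, and the toy sentence are hypothetical choices made for this example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count (center, context) pairs inside a symmetric context window.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter()
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                counts[(center, tokens[j])] += 1
    return counts


# Toy corpus, assumed purely for demonstration.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
```

With a symmetric window the resulting counts are symmetric in the two words. Windowed counts of this kind are the raw input that GloVe factorizes directly and that skip-gram samples from implicitly.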

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques

Methods: GloVe Embeddings