Do Word Embeddings Really Understand Loughran-McDonald's Polarities?
Mengda Li, Charles-Albert Lehalle

TL;DR
This paper provides a mathematical analysis of the word2vec model, revealing how it captures language structure and frequently treats antonyms as synonyms, especially in financial texts, with empirical validation on large corpora.
Contribution
It offers a theoretical framework explaining how word2vec embeddings form under Markovian assumptions and evaluates their ability to capture linguistic and domain-specific structures.
Findings
Word2vec captures frequentist synonyms under certain assumptions.
Embeddings tend to mix antonyms with opposite polarities in financial texts.
Financial corpus embeddings reflect non-stationarity and complex semantic distributions.
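The first two findings can be illustrated with a toy sketch. It assumes, for illustration only, that "frequentist synonyms" are words whose surrounding-context distributions coincide; the corpus, the window size, and the helper names `context_distributions` and `total_variation` are hypothetical and are not taken from the paper. The sketch computes empirical context distributions rather than training word2vec, so it shows only why a context-based model has no signal to separate "gain" from "loss" when both appear in the same contexts.

```python
from collections import Counter, defaultdict


def context_distributions(sentences, window=2):
    """Empirical distribution of context words around each word (toy helper)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    dists = {}
    for word, ctr in counts.items():
        total = sum(ctr.values())
        dists[word] = {c: n / total for c, n in ctr.items()}
    return dists


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy corpus: "gain" and "loss" occur in identical contexts.
toy_corpus = [
    "the firm reported a gain this quarter",
    "the firm reported a loss this quarter",
    "analysts expect a gain next year",
    "analysts expect a loss next year",
]

dists = context_distributions(toy_corpus, window=2)
d_antonyms = total_variation(dists["gain"], dists["loss"])
d_unrelated = total_variation(dists["gain"], dists["firm"])

print("gain vs loss:", d_antonyms)
print("gain vs firm:", d_unrelated)
```

In this toy corpus the two antonyms have identical context distributions, so their distance is zero, while "gain" and "firm" are clearly separated. Any model that learns only from such co-occurrence statistics would be expected to place "gain" and "loss" close together despite their opposite polarities.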
Abstract
In this paper we perform a rigorous mathematical analysis of the word2vec model, especially when it is equipped with the Skip-gram learning scheme. Our goal is to explain how embeddings, which are now widely used in NLP (Natural Language Processing), are influenced by the distribution of terms in the documents of the considered corpus. We use a mathematical formulation to shed light on how the decision to use such a model makes implicit assumptions on the structure of the language. We show how the Markovian assumptions that we discuss lead to a very clear theoretical understanding of the formation of embeddings, and in particular of the way they capture what we call frequentist synonyms. These assumptions allow us to produce generative models and to conduct an explicit analysis of the loss function commonly used by these NLP techniques. Moreover, we produce synthetic corpora with different levels…
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
