Generic Embedding-Based Lexicons for Transparent and Reproducible Text Scoring
Catherine Moez

TL;DR
This paper introduces a method for building transparent, high-performance text scoring lexicons from pretrained word embeddings such as FastText and GloVe with minimal researcher input, bridging the gap between opaque models and manually crafted tools.
Contribution
It proposes a novel approach to generate lexicons from generic embeddings, combining transparency with competitive performance.
Findings
Lexicons created from FastText and GloVe embeddings are effective.
Embedding-based lexicons offer transparency and high performance.
The method requires minimal researcher input.
Abstract
As text analysis tools have grown more sophisticated over the last decade, researchers face a choice. State-of-the-art models offer high performance but can be highly opaque in their operations and computationally intensive to run. The frequent alternative is to rely on older, manually crafted text scoring tools that are transparent and easy to apply but can suffer from limited performance. I present an alternative that combines the strengths of both: lexicons created with minimal researcher input from generic (pretrained) word embeddings. Presenting a number of conceptual lexicons produced from FastText and GloVe (6B) vector representations of words, I argue that embedding-based lexicons respond to a need for transparent yet high-performance text measurement tools.
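To make the idea concrete, here is a minimal sketch of one way a lexicon could be derived from word embeddings and then applied to text. It is not the paper's procedure, which the abstract does not specify. The sketch assumes that the researcher's input is a short list of seed words, that candidate words are ranked by cosine similarity to the seed centroid, and that a text's score is the mean lexicon weight of its tokens. The vectors are made-up 3-dimensional toy values standing in for FastText or GloVe, and the names `build_lexicon` and `score_text` are hypothetical.

```python
import math

# Made-up toy vectors standing in for pretrained embeddings (assumption:
# real use would load FastText or GloVe vectors instead).
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "joyful": [0.8, 0.2, 0.1],
    "glad": [0.85, 0.15, 0.05],
    "sad": [-0.8, 0.1, 0.1],
    "table": [0.0, 0.9, 0.3],
    "chair": [0.05, 0.85, 0.35],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_lexicon(seeds, vectors, top_k):
    """Hypothetical helper: keep the top_k words closest to the seed centroid.

    The returned dict maps each word to its similarity weight, so the
    lexicon can be inspected and shared as a plain word list.
    """
    dim = len(next(iter(vectors.values())))
    centroid = [sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dim)]
    scored = {word: cosine(vec, centroid) for word, vec in vectors.items()}
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_k])


def score_text(text, lexicon):
    """Hypothetical helper: mean lexicon weight over whitespace tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0.0) for token in tokens) / len(tokens)


# The only researcher input in this sketch is the seed list.
lexicon = build_lexicon(["happy", "joyful"], TOY_VECTORS, top_k=3)
```

Because the resulting lexicon is an ordinary word-to-weight mapping, every term that contributes to a score can be read and audited directly, which is the transparency property the abstract emphasizes.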
Taxonomy
Topics: Natural Language Processing Techniques
Methods: GloVe Embeddings · fastText
