Measuring Word Significance using Distributed Representations of Words

Adriaan M. J. Schakel; Benjamin J. Wilson

arXiv:1508.02297·cs.CL·August 11, 2015·47 cites

TL;DR

This paper proposes using the length of word vectors from word2vec, combined with term frequency, as a new measure of word significance in text corpora, supported by experimental evidence and visualization techniques.

Contribution

It introduces a novel method to measure word importance using vector length and term frequency, enhancing text analysis and visualization.

Findings

01 Vector length correlates with word significance.

02 Combined measure improves importance ranking.

03 Visualization aids in understanding corpus structure.

Abstract

Distributed representations of words as real-valued vectors in a relatively low-dimensional space aim at extracting syntactic and semantic features from large text corpora. A recently introduced neural network, named word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b), was shown to encode semantic information in the direction of the word vectors. In this brief report, it is proposed to use the length of the vectors, together with the term frequency, as a measure of word significance in a corpus. Experimental evidence using a domain-specific corpus of abstracts is presented to support this proposal. A useful visualization technique for text corpora emerges, where words are mapped onto a two-dimensional plane and automatically ranked by significance.
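
The idea in the abstract can be illustrated with a minimal sketch. It assumes the word vectors have already been trained (for example with word2vec) and are available as plain lists of floats. The toy vectors, the helper names (`vector_length`, `significance_table`, `rank_within_frequency_band`), and the choice to compare vector lengths only among words of similar term frequency are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def vector_length(vec):
    """Euclidean (L2) norm of a word vector."""
    return math.sqrt(sum(x * x for x in vec))


def significance_table(tokens, vectors):
    """Pair each word's term frequency in the corpus with its vector length."""
    tf = Counter(tokens)
    return {
        word: (tf[word], vector_length(vec))
        for word, vec in vectors.items()
        if word in tf
    }


def rank_within_frequency_band(table, min_tf, max_tf):
    """Among words with min_tf <= tf <= max_tf, sort by vector length, longest first.

    Assumption: comparing lengths inside a frequency band is one simple way
    to use length "together with" term frequency.
    """
    band = [
        (word, tf, length)
        for word, (tf, length) in table.items()
        if min_tf <= tf <= max_tf
    ]
    return sorted(band, key=lambda row: row[2], reverse=True)


# Toy corpus and made-up 2-d vectors, purely for illustration.
tokens = "the quark the model the quark lattice of the model of".split()
vectors = {
    "the": [0.1, 0.2],
    "of": [0.2, 0.1],
    "quark": [3.0, 4.0],
    "model": [0.6, 0.8],
    "lattice": [1.0, 0.0],
}

table = significance_table(tokens, vectors)
ranked = rank_within_frequency_band(table, 2, 2)
for word, tf, length in ranked:
    print(f"{word}\ttf={tf}\tlength={length:.3f}")
```

Each `(tf, length)` pair in `table` can also serve as a point in a two-dimensional scatter plot, which is one plausible reading of the visualization the abstract mentions; the exact axes used by the authors are not stated here.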

Code & Models

Repositories

vackosar/fasttext-vector-norms-and-oov-words
tf

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques