Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are Complementary to Neural Embeddings

Neil R. Smalheiser; Gary Bonifield

arXiv:1801.01884·cs.CL·January 10, 2018·2 cites

Open Access

TL;DR

This paper introduces a simple, interpretable, unsupervised low-dimensional vector representation for words, phrases, and text, built from PubMed titles and abstracts. Its implicit word-word similarity metrics correlate well with human judgments and match or outperform neural embeddings on biomedical benchmarks.

Contribution

The authors present a transparent, scalable vector method for text representation that produces novel similarity metrics and is easier to interpret than neural embeddings.

Findings

01

Implicit similarity metrics outperform or match neural embeddings on biomedical benchmarks.

02

The vector representations are publicly available for research and practical use.

03

Implicit metrics capture different aspects of word relatedness than neural embedding-based metrics.

Abstract

Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of…
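The abstract says the word-word similarity metrics are compared against human judgement. The following is a minimal sketch of that style of evaluation, not the authors' implementation. It assumes cosine similarity as a stand-in metric (the paper's own implicit metrics are not defined in this excerpt) and Spearman rank correlation as the agreement measure. The vectors, word pairs, and human ratings are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical 3-dimensional word vectors (illustrative values only).
vectors = {
    "kidney": [0.9, 0.1, 0.0],
    "renal": [0.8, 0.2, 0.1],
    "liver": [0.5, 0.5, 0.1],
    "guitar": [0.0, 0.1, 0.9],
}

# Hypothetical word pairs with made-up human relatedness ratings.
pairs = [("kidney", "renal"), ("kidney", "liver"), ("kidney", "guitar")]
human_scores = [4.0, 2.5, 0.2]

model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
agreement = spearman(model_scores, human_scores)
print(f"Spearman correlation with human ratings: {agreement:.3f}")
```

Rank correlation is used here because human ratings and model scores sit on different scales, so only the ordering of the pairs is compared.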

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques