Distributed Representations of Sentences and Documents

Quoc V. Le; Tomas Mikolov

arXiv:1405.4053 · cs.CL · May 26, 2014 · 5.1k cites

TL;DR

This paper introduces Paragraph Vector, an unsupervised method that learns fixed-length, dense vector representations of variable-length texts. The vectors are designed to capture word semantics and word order, and they outperform bag-of-words models on text classification and sentiment analysis tasks.

Contribution

The paper presents Paragraph Vector, an unsupervised algorithm that encodes variable-length texts (sentences, paragraphs, and documents) into fixed-length dense vectors. It targets two weaknesses of bag-of-words features: they lose word order and they ignore word semantics.
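
As a rough illustration of mapping variable-length texts to fixed-length vectors, the sketch below gives each document its own dense vector and trains it, by plain gradient descent, to predict the words that occur in that document. This is a toy under stated assumptions, not the authors' implementation: the full softmax, the toy corpus, and every name and parameter (`train_doc_vectors`, `dim`, `epochs`, `lr`) are hypothetical choices made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_doc_vectors(docs, dim=8, epochs=200, lr=0.1, seed=0):
    """Learn one dense vector per document by predicting its words.

    docs is a list of token lists. Returns (doc_vectors, losses), where
    losses holds the mean negative log-likelihood of each epoch.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    # D: one trainable vector per document. U: one output vector per word.
    D = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in docs]
    U = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    losses = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for doc_id, doc in enumerate(docs):
            d = D[doc_id]
            for word in doc:
                target = index[word]
                scores = [sum(uj * dj for uj, dj in zip(u, d)) for u in U]
                probs = softmax(scores)
                total -= math.log(probs[target])
                count += 1
                # Cross-entropy gradient w.r.t. the scores is (probs - one_hot).
                grad_d = [0.0] * dim
                for i, u in enumerate(U):
                    g = probs[i] - (1.0 if i == target else 0.0)
                    for j in range(dim):
                        grad_d[j] += g * u[j]   # uses u[j] before its update
                        u[j] -= lr * g * d[j]
                for j in range(dim):
                    d[j] -= lr * grad_d[j]
        losses.append(total / count)
    return D, losses


# Toy corpus (an assumption for illustration): documents of different lengths.
docs = [
    "the movie was powerful and moving".split(),
    "paris is the capital of france".split(),
    "a strong film".split(),
]
vectors, losses = train_doc_vectors(docs)
print([len(v) for v in vectors])  # every document gets a vector of length dim
```

Documents of 6, 6, and 3 tokens all end up as vectors of the same length, which is the property a downstream classifier needs. A practical system would replace the full softmax with something cheaper on a large vocabulary; that choice is outside what this listing describes.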

Findings

01. Paragraph Vectors outperform bag-of-words models.
02. Achieves state-of-the-art results on text classification.
03. Effective in sentiment analysis tasks.

Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words…
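
The two bag-of-words weaknesses named in the abstract can be checked with a few lines of Python. This is a minimal sketch: the vocabulary, the example sentences, and the helper names (`bag_of_words`, `euclidean`) are assumptions made for illustration and do not come from the paper.

```python
import math
from collections import Counter


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vocabulary.
vocab = ["powerful", "strong", "paris", "the", "dog", "bit", "man"]

# Weakness 1: word order is lost. Two sentences with opposite meanings
# produce the same count vector.
v1 = bag_of_words("the dog bit the man", vocab)
v2 = bag_of_words("the man bit the dog", vocab)
same_vector = v1 == v2

# Weakness 2: semantics are ignored. Any two distinct single words are
# equally far apart, whether or not they are related in meaning.
powerful = bag_of_words("powerful", vocab)
strong = bag_of_words("strong", vocab)
paris = bag_of_words("paris", vocab)
d_powerful_strong = euclidean(powerful, strong)
d_powerful_paris = euclidean(powerful, paris)

print(same_vector)
print(d_powerful_strong, d_powerful_paris)
```

Both distances equal the square root of 2, so "strong" is no closer to "powerful" than "Paris" is, which is the situation the abstract describes.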

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques