Efficient Vector Representation for Documents through Corruption

Minmin Chen

arXiv:1707.02377 · cs.CL · July 11, 2017 · 78 cites

TL;DR

This paper introduces Doc2VecC, an efficient document embedding method that represents each document as the average of its word embeddings and adds a corruption-based regularizer. It matches or outperforms state-of-the-art document representation methods on sentiment analysis and document classification.

Contribution

The paper proposes a simple, scalable document representation framework whose corruption model acts as a data-dependent regularizer: it favors informative or rare words and pushes the embeddings of common, non-discriminative words toward zero.

Findings

01 Matches or outperforms the state of the art in sentiment analysis and document classification

02 Produces better word embeddings than Word2Vec

03 Enables fast training on billions of words

Abstract

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic…
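
A minimal sketch of the averaging-plus-corruption idea described in the abstract, in pure Python. The abstract does not spell out the corruption mechanism; this sketch assumes each word is dropped independently with probability q and the surviving words are rescaled by 1/(1-q), so that the corrupted average equals the plain average in expectation. The function names, the toy embeddings, and the parameter q are illustrative assumptions and are not taken from the author's repository.

```python
import random


def average_embedding(tokens, embeddings, dim):
    """Represent a document as the plain average of its word embeddings.

    Tokens missing from `embeddings` are skipped (an assumption of this sketch).
    """
    known = [t for t in tokens if t in embeddings]
    vec = [0.0] * dim
    if not known:
        return vec
    for t in known:
        for i, v in enumerate(embeddings[t]):
            vec[i] += v
    return [v / len(known) for v in vec]


def corrupted_average(tokens, embeddings, dim, q, rng):
    """Average the embeddings after randomly dropping words.

    Assumed corruption: each word is removed with probability q, and each
    surviving word is scaled by 1 / (1 - q). The sum is still divided by the
    original word count, so the expectation matches average_embedding.
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    known = [t for t in tokens if t in embeddings]
    vec = [0.0] * dim
    if not known:
        return vec
    scale = 1.0 / (1.0 - q)
    for t in known:
        if rng.random() < q:
            continue  # this word is corrupted away for this sample
        for i, v in enumerate(embeddings[t]):
            vec[i] += scale * v
    return [v / len(known) for v in vec]


# Hypothetical two-dimensional embeddings, for illustration only.
toy_embeddings = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
}

doc = ["good", "movie"]
clean = average_embedding(doc, toy_embeddings, 2)
noisy = corrupted_average(doc, toy_embeddings, 2, q=0.5, rng=random.Random(0))
```

Here `clean` is `[0.5, 0.5]`, while `noisy` varies with the random draw. During training only a random subset of each document's words would contribute to the document vector on any given update, which is what makes this kind of corruption cheap; the regularization effect on common words described in the abstract arises from the training objective and is not shown in this sketch.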

Code & Models

Repositories

mchen24/iclr2017 (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms