Efficient Vector Representation for Documents through Corruption
Minmin Chen

TL;DR
This paper introduces Doc2VecC, an efficient document embedding method that represents each document as the average of its word embeddings and applies corruption-based regularization during training. It matches or outperforms existing document representation models on sentiment analysis and document classification.
Contribution
The paper proposes a simple, scalable document representation framework in which a corruption model acts as a data-dependent regularizer: it favors informative or rare words and pushes the embeddings of common, non-discriminative words toward zero.
Findings
Matches or outperforms the state of the art in sentiment analysis and document classification
Produces superior word embeddings compared to Word2Vec
Enables fast training on billions of words
Abstract
We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic…
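The two ideas in the abstract, averaging word embeddings and corrupting the input during learning, can be illustrated with a minimal sketch. The abstract does not spell out the corruption mechanism, so this sketch assumes dropout-style corruption: each word is removed with probability q and the survivors are rescaled by 1/(1-q) so that the corrupted average equals the plain average in expectation. The function names, the toy vocabulary, and the parameter q are hypothetical and are not taken from the paper's code.

```python
import random


def average_embedding(tokens, embeddings):
    """Represent a document as the plain average of its word vectors."""
    if not tokens:
        raise ValueError("document has no tokens")
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, value in enumerate(embeddings[tok]):
            total[i] += value
    return [t / len(tokens) for t in total]


def corrupted_average(tokens, embeddings, q, rng):
    """Averaged representation of a randomly corrupted document.

    Assumption (not stated in the abstract): each token is dropped
    independently with probability q, and surviving tokens are scaled
    by 1 / (1 - q) so the result is unbiased for the plain average.
    """
    if not tokens:
        raise ValueError("document has no tokens")
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    dim = len(next(iter(embeddings.values())))
    scale = 1.0 / (1.0 - q)
    total = [0.0] * dim
    for tok in tokens:
        if rng.random() < q:
            continue  # this token is corrupted away
        for i, value in enumerate(embeddings[tok]):
            total[i] += scale * value
    # Divide by the original length, not the number of survivors.
    return [t / len(tokens) for t in total]


# Toy two-dimensional embeddings, made up for illustration only.
toy_embeddings = {
    "the": [0.0, 0.0],
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
}
doc = ["the", "good", "movie"]
doc_vector = average_embedding(doc, toy_embeddings)
noisy_vector = corrupted_average(doc, toy_embeddings, q=0.5, rng=random.Random(0))
```

Generating a representation for a document needs only the averaging step, which is one pass over its tokens; under the assumption above, the corrupted variant would be used only while learning the embeddings.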
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
