Recurrent Neural Network Language Model Adaptation Derived Document Vector
Wei Li, Brian Kan Wing Mak

TL;DR
This paper introduces a document vector representation obtained by adapting an RNN language model to each document, so that the vector captures sequential information and improves genre classification over traditional methods such as TF-IDF.
Contribution
It proposes a new document vector method based on adapting RNN and LSTM language models, capturing sequential information ignored by previous models.
Findings
DV-LSTM outperforms TF-IDF and PV-DM in genre classification
Combining proposed vectors with existing methods further improves accuracy
Document vectors effectively encode high-level sequential information
Abstract
In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word orders that carry syntactic and semantic relationships among the words in a document, and they can be important in some NLP tasks such as genre classification. This paper proposes a novel distributed vector representation of a document: a simple recurrent-neural-network language model (RNN-LM) or a long short-term memory RNN language model (LSTM-LM) is first created from all documents in a task; some of the LM parameters are then adapted by each document, and the adapted parameters are vectorized to represent the document. The new document vectors are labeled as DV-RNN and DV-LSTM respectively. We believe that our new document…
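The adapt-then-vectorize idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several things here are assumptions: the shared language model uses random weights as a stand-in for an RNN-LM trained on all documents, the adapted parameters are taken to be the output weight matrix (the abstract only says "some of the LM parameters"), and the names `make_shared_lm`, `hidden_states`, and `adapt_document_vector` are hypothetical.

```python
import numpy as np

def make_shared_lm(vocab_size, hidden_size, seed=0):
    """Stand-in for an RNN-LM trained on all documents (random weights: an assumption)."""
    rng = np.random.default_rng(seed)
    return {
        "E": rng.normal(0, 0.5, (vocab_size, hidden_size)),       # input embeddings
        "W_hh": rng.normal(0, 0.5, (hidden_size, hidden_size)),   # recurrent weights
        "W_out": rng.normal(0, 0.5, (vocab_size, hidden_size)),   # output weights
    }

def hidden_states(lm, token_ids):
    """Run the frozen recurrent layer; each state depends on the order of earlier tokens."""
    h = np.zeros(lm["W_hh"].shape[0])
    states = []
    for t in token_ids:
        h = np.tanh(lm["E"][t] + lm["W_hh"] @ h)
        states.append(h)
    return np.array(states)

def adapt_document_vector(lm, token_ids, steps=20, lr=0.1):
    """Adapt a copy of the output weights to one document and flatten it.

    Which parameters to adapt is an assumption of this sketch.
    """
    ids = np.asarray(token_ids)
    H = hidden_states(lm, ids[:-1])      # inputs: every token except the last
    targets = ids[1:]                    # next-token prediction targets
    W = lm["W_out"].copy()               # shared weights stay untouched
    for _ in range(steps):
        logits = H @ W.T                                  # shape (T, V)
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        probs[np.arange(len(targets)), targets] -= 1.0    # gradient w.r.t. logits
        grad = probs.T @ H / len(targets)                 # shape (V, H)
        W -= lr * grad                                    # gradient step on cross-entropy
    return W.ravel()                     # vectorized adapted parameters

if __name__ == "__main__":
    lm = make_shared_lm(vocab_size=4, hidden_size=3)
    doc_a = [0, 1, 2, 3]
    doc_b = [3, 2, 1, 0]                 # same bag of words, reversed order
    vec_a = adapt_document_vector(lm, doc_a)
    vec_b = adapt_document_vector(lm, doc_b)
    print(vec_a.shape, np.allclose(vec_a, vec_b))
```

In this sketch, `doc_a` and `doc_b` contain the same words and would receive identical bag-of-words counts, yet their adapted vectors differ because the hidden states depend on word order. Adapting only the output weights keeps the gradient simple; adapting recurrent parameters would require backpropagation through time.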
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Advanced Text Analysis Techniques
