The Influence of Feature Representation of Text on the Performance of Document Classification
Sanda Martinčić-Ipšić, Tanja Miličić, Ljupčo Todorovski

TL;DR
This study compares three text feature representation models—bag-of-words, word2vec/doc2vec, and language networks—for document classification, finding that bag-of-words and doc2vec perform similarly, with doc2vec excelling on large documents.
Contribution
It provides a comprehensive empirical comparison of traditional and emerging text representation models for document classification.
Findings
Bag-of-words and doc2vec have comparable performance.
Low-dimensional doc2vec variants perform well.
Doc2vec outperforms in classifying large documents.
Abstract
In this paper we perform a comparative analysis of three models for feature representation of text documents in the context of document classification. In particular, we consider the most often used family of bag-of-words models, the recently proposed continuous space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-words models have been extensively used for the document classification task, the performance of the other two models for the same task has not been well understood. This is especially true for the network-based model, which has rarely been considered for representation of text documents for classification. In this study, we measure the performance of the document classifiers trained using the method of random forests for features generated by the three models and their variants. The results of the…
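As a rough illustration of the simplest of the three representations, the following is a minimal sketch of bag-of-words feature extraction using only the Python standard library. It is not the authors' pipeline: the whitespace tokenizer, the raw term counts, the toy documents, and the helper names `build_vocabulary` and `bag_of_words` are all assumptions made for this example. The paper's own variants and preprocessing are not specified in the text above.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = {tok for doc in documents for tok in doc.lower().split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def bag_of_words(document, vocabulary):
    """Return a term-count vector for one document; unknown tokens are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Toy corpus (hypothetical, for illustration only).
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
features = [bag_of_words(d, vocab) for d in docs]
```

Each row of `features` is a fixed-length vector with one column per vocabulary term, which is the kind of input a classifier such as a random forest could be trained on.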
