An alternative text representation to TF-IDF and Bag-of-Words
Zhixiang (Eddie) Xu, Minmin Chen, Kilian Q. Weinberger, Fei Sha

TL;DR
This paper introduces Dense Cohort of Terms (dCoT), an unsupervised method that transforms sparse Bag of Words representations into dense features, improving text classification, especially when labeled data is limited or the documents are short.
Contribution
The paper presents dCoT, a novel unsupervised algorithm that reconstructs frequent words from co-occurring infrequent words, reducing sparsity and enhancing feature quality for text classification.
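The idea of removing words at random and learning to reconstruct the frequent ones can be sketched in a few lines. This is a minimal illustration, not the authors' formulation: it assumes a plain ridge-regression mapping fitted on explicitly corrupted copies of the corpus, and the names `fit_reconstruction`, `densify`, `drop_prob`, `n_copies`, and `reg` are hypothetical.

```python
import numpy as np

def fit_reconstruction(X, frequent_idx, drop_prob=0.5, n_copies=5, reg=1e-2, seed=0):
    """Learn a linear map from corrupted BoW vectors to the frequent-word columns.

    X            : (n_docs, vocab_size) array of word counts
    frequent_idx : column indices treated as "frequent" words (assumed given)
    Returns W with shape (vocab_size, len(frequent_idx)).
    """
    rng = np.random.default_rng(seed)
    targets = X[:, frequent_idx]  # uncorrupted frequent-word counts

    corrupted, repeated_targets = [], []
    for _ in range(n_copies):
        keep = rng.random(X.shape) >= drop_prob  # drop each entry at random
        corrupted.append(X * keep)
        repeated_targets.append(targets)

    Xc = np.vstack(corrupted)
    Y = np.vstack(repeated_targets)

    # Ridge regression in closed form: W = (Xc^T Xc + reg * I)^-1 Xc^T Y
    d = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + reg * np.eye(d), Xc.T @ Y)

def densify(X, W):
    """Map sparse BoW rows to predicted frequent-word activations."""
    return X @ W

# Toy corpus: 4 documents, 6 vocabulary words; columns 0 and 1 play the
# role of frequent words (an assumption made only for this example).
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1],
], dtype=float)

W = fit_reconstruction(X, frequent_idx=[0, 1])
Z = densify(X, W)  # one row per document, one column per frequent word
```

In this sketch a document that contains only rare words can still receive non-zero values for the frequent words those rare words tend to co-occur with, which is the intuition behind the reduced sparsity described above.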
Findings
dCoT significantly improves classification accuracy on benchmark datasets.
The method effectively captures word co-occurrence patterns.
Dense features outperform traditional TF-IDF and Bag-of-Words representations.
Abstract
In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF). Although simple and intuitive, sBoW style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random sub-sets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from…
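The over-sparsity the abstract describes is easy to see on a toy corpus. The sketch below builds count and TF-IDF vectors by hand with NumPy; the four example documents are invented for illustration, and the unsmoothed `log(N / df)` weighting is one common TF-IDF variant, assumed here rather than taken from the paper.

```python
import numpy as np

# Four short, made-up documents (illustrative only).
docs = [
    "cheap flights to paris",
    "paris hotel deals",
    "learn python programming fast",
    "python tutorial for beginners",
]

vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}

# Bag-of-Words count matrix: one row per document, one column per word.
X = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i, index[word]] += 1

# TF-IDF with the plain log(N / df) inverse document frequency.
df = (X > 0).sum(axis=0)
tfidf = X * np.log(len(docs) / df)

# Fraction of zero entries in the count matrix.
sparsity = 1.0 - np.count_nonzero(X) / X.size
print(f"vocabulary size: {len(vocab)}, sparsity: {sparsity:.2f}")
```

Even with only thirteen distinct words, most entries are zero, and two documents about the same topic overlap in at most one word. With realistic vocabularies and short texts the effect is far stronger.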
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Algorithms and Data Compression
