Short-Text Classification Using Unsupervised Keyword Expansion
Duncan Cameron-Steinke

TL;DR
This paper introduces an unsupervised method for expanding short texts with relevant keywords generated from a pre-trained language model, improving classification accuracy with limited data.
Contribution
It presents a novel unsupervised keyword expansion technique that generates topic-relevant words directly from input sentences without additional datasets or training.
Findings
Generated 3-10 relevant keywords per target topic, versus just 1 for each non-target topic
Improved classification accuracy with limited training data
Effective in expanding short news headlines
Abstract
Short-text classification, like all data science, struggles to achieve high performance using limited data. As a solution, a short sentence may be expanded with new and relevant feature words to form an artificially enlarged dataset and to add new features to test data. This paper applies a novel approach to text expansion by generating new words directly for each input sentence, thus requiring no additional datasets or previous training. In this unsupervised approach, new keywords are formed within the hidden states of a pre-trained language model and then used to create extended pseudo documents. The word generation process was assessed by examining how well the predicted words matched the topics of the input sentence. It was found that this method could produce 3-10 relevant new words for each target topic, while generating just 1 word related to each non-target topic. Generated…
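The expansion step described in the abstract (append generated keywords to a short sentence to form a pseudo document) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `expand_text`, `toy_keyword_generator`, and `TOY_ASSOCIATIONS` are hypothetical, and the toy lookup table merely stands in for the pre-trained language model that the paper uses to produce keywords.

```python
from typing import Callable, Dict, List

# ASSUMPTION: a tiny hand-written lookup table standing in for the
# pre-trained language model that generates keywords in the paper.
TOY_ASSOCIATIONS: Dict[str, List[str]] = {
    "striker": ["football", "goal", "league"],
    "shares": ["stock", "market", "investor"],
}

PUNCTUATION = ".,!?;:"


def toy_keyword_generator(text: str, max_keywords: int) -> List[str]:
    """Hypothetical generator: return up to max_keywords associated words."""
    keywords: List[str] = []
    for token in text.lower().split():
        for candidate in TOY_ASSOCIATIONS.get(token.strip(PUNCTUATION), []):
            if candidate not in keywords:
                keywords.append(candidate)
    return keywords[:max_keywords]


def expand_text(
    text: str,
    generate: Callable[[str, int], List[str]],
    max_keywords: int = 10,
) -> str:
    """Build a pseudo document: the original text plus new keywords.

    Words already present in the text are skipped so that only new
    features are added.
    """
    existing = {tok.strip(PUNCTUATION).lower() for tok in text.split()}
    new_words = [
        word for word in generate(text, max_keywords)
        if word.lower() not in existing
    ]
    return " ".join([text] + new_words)


if __name__ == "__main__":
    headline = "Striker scores twice in cup final"
    print(expand_text(headline, toy_keyword_generator, max_keywords=3))
```

Because the generator is passed in as a callable, the same `expand_text` function could be applied to both training and test headlines, with a model-backed generator swapped in for the toy one.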
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
