Short-Text Classification Using Unsupervised Keyword Expansion

Duncan Cameron-Steinke

arXiv:1909.07512 · cs.CL · September 18, 2019

TL;DR

This paper introduces an unsupervised method for expanding short texts with relevant keywords generated from a pre-trained language model, improving classification accuracy with limited data.

Contribution

It presents a novel unsupervised keyword expansion technique that generates topic-relevant words directly from input sentences without additional datasets or training.
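The expansion step can be illustrated with a minimal sketch. It assumes a keyword generator with a simple "sentence in, keywords out" interface; the lookup table `TOY_ASSOCIATIONS` and the functions `toy_keyword_generator` and `expand_to_pseudo_document` are hypothetical names used only for illustration. The paper draws its keywords from a pre-trained language model, which this toy table merely stands in for.

```python
from typing import Callable, Dict, List

# Hypothetical lookup table standing in for a pre-trained language model.
# It only imitates the interface (sentence in, keywords out); it is not
# the paper's generation method.
TOY_ASSOCIATIONS: Dict[str, List[str]] = {
    "election": ["vote", "ballot", "candidate"],
    "striker": ["goal", "football", "league"],
}


def toy_keyword_generator(sentence: str, max_keywords: int = 5) -> List[str]:
    """Return up to max_keywords new words associated with the sentence."""
    tokens = sentence.lower().split()
    keywords: List[str] = []
    for token in tokens:
        for word in TOY_ASSOCIATIONS.get(token, []):
            # Skip words already in the sentence or already generated.
            if word not in tokens and word not in keywords:
                keywords.append(word)
    return keywords[:max_keywords]


def expand_to_pseudo_document(
    sentence: str,
    generate: Callable[[str], List[str]] = toy_keyword_generator,
) -> str:
    """Append generated keywords to the sentence to form a pseudo document."""
    keywords = generate(sentence)
    return " ".join([sentence] + keywords)


headline = "Striker signs new contract"
print(expand_to_pseudo_document(headline))
```

Because the generator is passed in as a callable, the toy table could be swapped for a model-backed function without changing how the pseudo document is assembled.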

Findings

01 Generated 3-10 relevant keywords per target topic, versus just 1 per non-target topic

02 Improved classification accuracy with limited training data

03 Effective in expanding short news headlines

Abstract

Short-text classification, like data science in general, struggles to achieve high performance with limited data. As a solution, a short sentence may be expanded with new and relevant feature words to form an artificially enlarged dataset and to add new features to the test data. This paper applies a novel approach to text expansion by generating new words directly for each input sentence, thus requiring no additional datasets or prior training. In this unsupervised approach, new keywords are formed within the hidden states of a pre-trained language model and then used to create extended pseudo documents. The word generation process was assessed by examining how well the predicted words matched the topics of the input sentence. The method produced 3-10 relevant new words for each target topic, while generating just 1 word related to each non-target topic. Generated…
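The topic-matching assessment described in the abstract can be pictured with a small sketch. It assumes each topic is represented by a fixed word set and that a generated word "matches" a topic when it appears in that set; the vocabularies in `TOPIC_VOCAB` and the function `count_topic_matches` are hypothetical, and the paper's actual matching procedure may differ.

```python
from typing import Dict, Iterable, Set

# Hypothetical topic vocabularies, invented for illustration only.
TOPIC_VOCAB: Dict[str, Set[str]] = {
    "sports": {"goal", "football", "league", "match"},
    "politics": {"vote", "ballot", "candidate", "league"},
}


def count_topic_matches(
    generated: Iterable[str],
    vocab: Dict[str, Set[str]],
) -> Dict[str, int]:
    """Count how many distinct generated words fall in each topic's word set."""
    unique = {word.lower() for word in generated}
    return {topic: len(unique & words) for topic, words in vocab.items()}


# Toy generated words for a sports headline; not data from the paper.
counts = count_topic_matches(["goal", "football", "league", "contract"], TOPIC_VOCAB)
print(counts)
```

In this toy case the target topic (sports) collects more matches than the non-target topic (politics), which is the kind of contrast the assessment looks for.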

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques