Representation Learning for Short Text Clustering
Hui Yin, Xiangyu Song, Shuiqiao Yang, Guangyan Huang, Jianxin Li

TL;DR
This paper introduces two autoencoder-based methods to enhance short text representations derived from pre-trained models, significantly improving clustering performance by exploiting structural information and clustering constraints.
Contribution
It proposes novel unsupervised autoencoder techniques, STN-GAE and SCA-AE, to fine-tune pre-trained models for better short text clustering.
Findings
SCA-AE improves clustering accuracy by up to 14% over BERT alone.
Pre-trained models like BERT outperform traditional methods in short text clustering.
Tuning pre-trained representations with the proposed methods enhances clustering performance.
Abstract
Effective representation learning is critical for short text clustering because short text corpora are sparse, high-dimensional and noisy. Existing pre-trained models (e.g., Word2vec and BERT) have greatly improved the expressiveness of short text representations, offering more condensed, low-dimensional and continuous features than the traditional Bag-of-Words (BoW) model. However, these models are trained for general purposes and are thus suboptimal for the short text clustering task. In this paper, we propose two methods that exploit the unsupervised autoencoder (AE) framework to further tune the short text representations produced by these pre-trained text models for optimal clustering performance. In our first method, Structural Text Network Graph Autoencoder (STN-GAE), we exploit the structural text information among the corpus by constructing a text network, and then…
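The general idea of tuning fixed embeddings with an unsupervised autoencoder and then clustering the learned codes can be illustrated with a minimal sketch. This is not STN-GAE or SCA-AE, whose details are not given in the excerpt above. It assumes a plain one-hidden-layer autoencoder, k-means as the clustering step, and random toy vectors standing in for pre-trained sentence embeddings. The names `train_autoencoder` and `kmeans` are hypothetical helpers written for this example.

```python
import numpy as np

def train_autoencoder(X, hidden_dim=2, epochs=500, lr=0.01, seed=0):
    """Fit a one-hidden-layer tanh autoencoder by gradient descent.

    Illustrative only: a generic AE, not the paper's architecture.
    Returns a function that maps inputs to their hidden codes.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, hidden_dim))
    b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.1, size=(hidden_dim, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # encoder
        X_hat = H @ W2 + b2                 # linear decoder
        err = (X_hat - X) / n               # gradient of the mean squared-error loss
        dH = (err @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
        W2 -= lr * (H.T @ err)
        b2 -= lr * err.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return lambda Z: np.tanh(Z @ W1 + b1)

def kmeans(Z, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one integer cluster label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):         # leave an empty cluster's center unchanged
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Toy stand-in for pre-trained embeddings: two noisy groups in 8 dimensions.
rng = np.random.default_rng(42)
group_means = np.vstack([np.ones(8), -np.ones(8)])
X = np.vstack([m + 0.1 * rng.normal(size=(20, 8)) for m in group_means])

encode = train_autoencoder(X)   # tune a compact representation without labels
Z = encode(X)                   # 2-dimensional codes for the 40 toy "texts"
labels = kmeans(Z, k=2)         # cluster in the learned space
```

In practice `X` would hold embeddings from a model such as BERT, and the hidden size, learning rate, and number of clusters would need tuning for the corpus at hand.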
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dense Connections · Multi-Head Attention · Softmax · Linear Warmup With Linear Decay · Dropout · Attention Dropout · Weight Decay
