Representation Learning for Short Text Clustering

Hui Yin; Xiangyu Song; Shuiqiao Yang; Guangyan Huang; Jianxin Li

arXiv:2109.09894 · cs.CL · September 22, 2021

TL;DR

This paper introduces two autoencoder-based methods to enhance short text representations derived from pre-trained models, significantly improving clustering performance by exploiting structural information and clustering constraints.

Contribution

It proposes novel unsupervised autoencoder techniques, STN-GAE and SCA-AE, to fine-tune pre-trained models for better short text clustering.

Findings

01

SCA-AE improves clustering accuracy by up to 14% over BERT alone.

02

Pre-trained models like BERT outperform traditional methods in short text clustering.

03

Tuning pre-trained representations with proposed methods enhances clustering performance.

Abstract

Effective representation learning is critical for short text clustering because short text corpora are sparse, high-dimensional, and noisy. Existing pre-trained models (e.g., Word2vec and BERT) have greatly improved the expressiveness of short text representations, providing more condensed, low-dimensional, and continuous features than the traditional Bag-of-Words (BoW) model. However, these models are trained for general purposes and are thus suboptimal for the short text clustering task. In this paper, we propose two methods that exploit the unsupervised autoencoder (AE) framework to further tune the short text representations from these pre-trained text models for optimal clustering performance. In our first method, Structural Text Network Graph Autoencoder (STN-GAE), we exploit the structural text information in the corpus by constructing a text network, and then…
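The general pipeline the abstract describes (start from pre-trained embeddings, tune them with an autoencoder, then cluster) can be illustrated with a minimal sketch. This is not STN-GAE or SCA-AE: it is a plain one-hidden-layer autoencoder followed by k-means, written with NumPy only. The random vectors stand in for BERT or Word2vec sentence embeddings, and every function name and hyperparameter here (`fake_embeddings`, `train_autoencoder`, `encode`, `kmeans`, `latent_dim=8`, and so on) is an assumption made for illustration.

```python
import numpy as np

def fake_embeddings(n_per_cluster=50, dim=32, k=3, seed=0):
    """Stand-in for pre-trained sentence embeddings (assumption: real
    vectors would come from BERT, Word2vec, or a similar model)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(k, dim))
    X = np.vstack([c + 0.3 * rng.normal(size=(n_per_cluster, dim))
                   for c in centers])
    # Unit-normalize rows, a common preprocessing step for embeddings.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def encode(X, W_enc):
    """Map embeddings to the low-dimensional latent space."""
    return np.tanh(X @ W_enc)

def train_autoencoder(X, latent_dim=8, epochs=500, lr=0.05, seed=0):
    """Fit a tanh encoder and linear decoder by full-batch gradient
    descent on the per-sample squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, latent_dim))
    W_dec = rng.normal(scale=0.1, size=(latent_dim, d))
    losses = []
    for _ in range(epochs):
        Z = encode(X, W_enc)
        err = Z @ W_dec - X
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        grad_out = 2.0 * err / n                  # dL/dX_hat
        grad_W_dec = Z.T @ grad_out
        grad_pre = (grad_out @ W_dec.T) * (1.0 - Z ** 2)  # through tanh
        grad_W_enc = X.T @ grad_pre
        W_dec -= lr * grad_W_dec
        W_enc -= lr * grad_W_enc
    return W_enc, W_dec, losses

def kmeans(Z, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per row."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = Z[labels == j]
            if len(members) > 0:      # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

X = fake_embeddings()
W_enc, W_dec, losses = train_autoencoder(X)
Z = encode(X, W_enc)                  # tuned, lower-dimensional representations
labels = kmeans(Z, k=3)
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("cluster sizes:", np.bincount(labels, minlength=3))
```

In this sketch the autoencoder is trained on reconstruction alone and clustering happens afterwards as a separate step. The paper's methods go further by using a text network and clustering constraints during training, which this example does not attempt to reproduce.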

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dense Connections · Multi-Head Attention · Softmax · Linear Warmup With Linear Decay · Dropout · Attention Dropout · Weight Decay