An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data
Lili Wang, Chongyang Gao, Jason Wei, Weicheng Ma, Ruibo Liu, Soroush Vosoughi

TL;DR
This paper empirically evaluates various unsupervised text representation methods on Twitter data, revealing that advanced models do not always outperform simpler ones in noisy, user-generated text clustering tasks.
Contribution
It provides a comprehensive experimental comparison of text representation techniques specifically on Twitter data, highlighting the need for further research in this area.
Findings
Advanced models do not always outperform simpler ones on Twitter data
Noisy user-generated text poses challenges for existing representation methods
Further exploration is needed for effective text representations in social media contexts
Abstract
The field of NLP has seen unprecedented achievements in recent years. Most notably, with the advent of large-scale pre-trained Transformer-based language models, such as BERT, there has been a noticeable improvement in text representation. It is, however, unclear whether these improvements translate to noisy user-generated text, such as tweets. In this paper, we present an experimental survey of a wide range of well-known text representation techniques for the task of text clustering on noisy Twitter data. Our results indicate that the more advanced models do not necessarily work best on tweets and that more exploration in this area is needed.
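To make the evaluation setting concrete, the sketch below shows the general shape of "represent tweets as vectors, then cluster them". It is an illustration only, not the authors' pipeline: TF-IDF is used as a stand-in for a simple representation, and the greedy threshold clustering, the function names (`tokenize`, `tfidf_vectors`, `cosine`, `cluster`), the `threshold` value, and the four example tweets are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(tweet):
    """Lowercase the tweet and keep word, hashtag, and mention tokens."""
    return re.findall(r"[#@]?\w+", tweet.lower())


def tfidf_vectors(tweets):
    """Return one sparse TF-IDF vector (a dict) per tweet."""
    docs = [Counter(tokenize(t)) for t in tweets]
    n_docs = len(docs)
    # Document frequency: in how many tweets each token appears.
    doc_freq = Counter(tok for doc in docs for tok in doc)
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.1):
    """Greedy single-pass clustering (hypothetical stand-in for k-means etc.).

    Each vector joins the first cluster whose seed vector is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    seeds, labels = [], []
    for vec in vectors:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Made-up example tweets: two about phones, two about football.
tweets = [
    "new phone camera is amazing #tech",
    "this phone battery is amazing",
    "great goal in the football match tonight",
    "what a football match, great goal!",
]
labels = cluster(tfidf_vectors(tweets))
print(labels)
```

Swapping `tfidf_vectors` for a function that returns embeddings from a pre-trained model would keep the rest of the sketch unchanged, which is the kind of comparison the survey describes. On real tweets, misspellings, slang, and very short texts reduce token overlap, so a purely lexical representation like this one may behave quite differently than on the clean toy input above.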
Taxonomy
Methods
Linear Layer · Linear Warmup With Linear Decay · WordPiece · Residual Connection · Multi-Head Attention · Adam · Dense Connections · Weight Decay · Dropout
