SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding
Vasilisa Bashlovkina, Riley Matthews, Zhaobin Kuang, Simon Baumgartner, Michael Bendersky

TL;DR
This paper assesses how well transformer-based language models understand social media language, introduces a new benchmark called SMILE, and shows that learning a tokenizer and pretraining on a mix of social media and conventional language improves performance on social media tasks.
Contribution
It introduces SMILE, a benchmark for social media language understanding that covers four platforms and eleven tasks, and shows that pretraining on a mix of social media and conventional language improves model performance in this domain.
Findings
Social media language differs significantly from conventional language, both in token distribution and in rate of linguistic shift.
Learning a tokenizer and pretraining on a mix of social media and conventional language improves model performance.
The resulting model outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.
Abstract
We study the ability of transformer-based language models (LMs) to understand social media language. Social media (SM) language is distinct from standard written language, yet existing benchmarks fall short of capturing LM performance in this socially, economically, and politically important domain. We quantify the degree to which social media language differs from conventional language and conclude that the difference is significant both in terms of token distribution and rate of linguistic shift. Next, we introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four SM platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.
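To make the idea of a difference "in terms of token distribution" concrete, here is a minimal sketch that compares unigram token distributions of two small corpora using Jensen-Shannon divergence. The choice of metric, the whitespace tokenization, the helper names (`token_distribution`, `js_divergence`), and the toy sentences are all assumptions for illustration; the abstract does not say which measure or tokenizer the authors use.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Return a unigram probability distribution over lowercase whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no tokens in common.
    """
    div = 0.0
    for tok in set(p) | set(q):
        pi, qi = p.get(tok, 0.0), q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution at this token
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


# Toy, made-up corpora (assumption: not data from the paper).
social = ["omg lol this is so lit", "lol idk tbh"]
conventional = ["the committee approved the proposal", "this is the final report"]

p = token_distribution(social)
q = token_distribution(conventional)
score = js_divergence(p, q)
print(f"JS divergence between the toy corpora: {score:.3f}")
```

A higher score means the two corpora use tokens more differently. On real data one would swap the whitespace split for the model's own tokenizer and use much larger samples.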
