Variation is the Norm: Embracing Sociolinguistics in NLP
Anne-Marie Lutgen, Alistair Plum, Verena Blaschke, Barbara Plank, Christoph Purschke

TL;DR
This paper argues for integrating sociolinguistic variation into NLP research and uses a case study on Luxembourgish to show how orthographic variation affects model performance and how accounting for it can improve robustness.
Contribution
It introduces a framework combining sociolinguistics with NLP, emphasizing the importance of including language variation in model training and evaluation.
Findings
Models perform substantially worse on data with high orthographic variation than on data closer to the orthographic standard.
Including variation in fine-tuning improves NLP model robustness.
Variation-aware models show better generalization across language forms.
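
The first finding can be made concrete with a small evaluation sketch. The code below is not the authors' setup. It assumes a hypothetical set of standard spellings (`standard_lexicon`), uses the share of out-of-lexicon tokens as a crude proxy for orthographic variation, and reports accuracy separately for low-variation and high-variation inputs. The function names, the 0.5 threshold, and the toy data are all illustrative assumptions.

```python
from collections import defaultdict


def variation_rate(tokens, standard_lexicon):
    """Share of tokens not found in the standard-spelling lexicon.

    A crude, assumed proxy for orthographic variation; the paper may
    measure variation differently.
    """
    if not tokens:
        return 0.0
    nonstandard = sum(1 for t in tokens if t.lower() not in standard_lexicon)
    return nonstandard / len(tokens)


def accuracy_by_variation(examples, standard_lexicon, threshold=0.5):
    """Accuracy per variation bucket.

    examples: iterable of (tokens, gold_label, predicted_label).
    Inputs whose variation rate is >= threshold go to the "high" bucket,
    all others to "low". The threshold is an arbitrary illustrative choice.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for tokens, gold, pred in examples:
        rate = variation_rate(tokens, standard_lexicon)
        bucket = "high" if rate >= threshold else "low"
        total[bucket] += 1
        correct[bucket] += int(gold == pred)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Toy data with placeholder tokens, not real Luxembourgish and not real results.
toy_lexicon = {"a", "b", "c"}
toy_examples = [
    (["a", "b"], "pos", "pos"),
    (["a", "x"], "pos", "neg"),
    (["x", "y"], "neg", "neg"),
    (["a", "b", "c"], "neg", "neg"),
]
scores = accuracy_by_variation(toy_examples, toy_lexicon)
print(scores)
```

Reporting the two buckets side by side, rather than a single pooled score, is what makes a gap between standard-like and variation-heavy data visible in the first place.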
Abstract
In Natural Language Processing (NLP), variation is typically seen as noise and "normalised away" before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore,…
Taxonomy
Topics: Natural Language Processing Techniques · Computational and Text Analysis Methods · Language and cultural evolution
