Losing our Tail, Again: (Un)Natural Selection & Multilingual LLMs
Eva Vanmassenhove

TL;DR
This paper discusses how the use of multilingual large language models can lead to the erosion of linguistic diversity through model collapse and self-reinforcing data loops, urging a reevaluation of NLP practices.
Contribution
It highlights the risk of linguistic flattening caused by model collapse in multilingual NLP and advocates for protecting expressive linguistic diversity.
Findings
Model collapse can distort the data distribution and diminish low-probability linguistic features.
Self-consuming training loops, in which automatically generated data re-enters the training data, lead to underrepresentation of linguistic diversity.
The paper calls for reimagining NLP to preserve multilingual expressiveness.
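The mechanism behind the first two findings can be illustrated with a toy simulation. The sketch below is not from the paper: it assumes a hypothetical vocabulary of "linguistic forms" with Zipf-like frequencies, and it stands in for "training" with simple re-estimation of frequencies from a finite sample. Each generation samples from the previous generation's estimate, so a form that is missed once gets probability zero and cannot return. All names and parameter values here are illustrative assumptions.

```python
import random
from collections import Counter


def zipf_distribution(num_forms, exponent=1.0):
    """Return normalized Zipf-like probabilities for ranks 1..num_forms."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_forms + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_self_consuming_loop(num_forms=200, sample_size=500,
                                 generations=10, seed=0):
    """Toy model of a self-consuming loop (an assumption, not the paper's setup).

    Each generation draws `sample_size` forms from the current distribution
    and re-estimates the distribution from those counts alone.
    Returns the number of forms with non-zero probability per generation,
    plus the final estimated distribution.
    """
    rng = random.Random(seed)
    forms = list(range(num_forms))
    probs = zipf_distribution(num_forms)
    surviving = [sum(p > 0 for p in probs)]

    for _ in range(generations):
        # "Generate" synthetic data from the current model.
        sample = rng.choices(forms, weights=probs, k=sample_size)
        counts = Counter(sample)
        # "Retrain" on the synthetic data only: relative frequencies.
        probs = [counts[f] / sample_size for f in forms]
        surviving.append(sum(p > 0 for p in probs))

    return surviving, probs


if __name__ == "__main__":
    surviving, _ = simulate_self_consuming_loop()
    # One entry per generation, starting with the original distribution.
    print(surviving)
```

In this toy setting the number of surviving forms can only stay the same or shrink from one generation to the next, and the rare forms in the tail are the first to disappear. Real training pipelines mix human and synthetic data and are far more complex, so this is only an intuition aid for why low-probability features are the most exposed.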
Abstract
Multilingual Large Language Models have considerably changed how technologies influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling models to change our languages more directly. While they provide us with quick access to information and impressively fluent output, beneath their (apparent) sophistication lies a subtle, insidious threat: the gradual decline and loss of linguistic diversity. In this position paper, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the consequences of self-consuming training loops, where automatically generated data (re-)enters the training data, leading to a gradual distortion of the data distribution and the…
