Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy Z. Wang, Zeyu Wang, Luke Zettlemoyer, Noah A. Smith

TL;DR
This paper examines how the quality filters used to select training data for language models favor text from particular socioeconomic and geographic backgrounds, revealing biases and arguing for more transparent data selection practices.
Contribution
It introduces a new dataset of U.S. high school newspaper articles and uses it to show biases in quality filtering, highlighting how language ideologies shape the training data of language models.
Findings
Newspapers from larger schools in wealthier, more educated, and urban ZIP codes are more likely to be classified as high quality.
The filter's measure of quality does not align with other sensible metrics, such as factuality or literary acclaim.
Biases in data selection reflect underlying language ideologies.
Abstract
Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language…
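The quality filtering described in the abstract can be illustrated with a small sketch. It is not the GPT-3 filter itself: it assumes a logistic regression over hashed bag-of-words features, trained to separate a trusted reference corpus (label 1) from raw web text (label 0), with a fixed keep threshold. The toy documents, `N_BUCKETS`, the learning rate, and all function names are hypothetical choices made for illustration.

```python
import math
import zlib

N_BUCKETS = 2 ** 18  # assumed size of the hashed feature space


def featurize(text):
    """Map a document to hashed bag-of-words counts (bucket -> count)."""
    feats = {}
    for raw in text.lower().split():
        token = raw.strip(".,!?;:\"'")
        if not token:
            continue
        idx = zlib.crc32(token.encode("utf-8")) % N_BUCKETS
        feats[idx] = feats.get(idx, 0.0) + 1.0
    return feats


def score(weights, bias, feats):
    """Estimated probability that a document resembles the reference corpus."""
    z = bias + sum(weights[i] * v for i, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(reference_docs, web_docs, epochs=200, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    data = [(featurize(d), 1.0) for d in reference_docs]
    data += [(featurize(d), 0.0) for d in web_docs]
    weights, bias = [0.0] * N_BUCKETS, 0.0
    for _ in range(epochs):
        for feats, label in data:
            err = score(weights, bias, feats) - label
            for i, v in feats.items():
                weights[i] -= lr * err * v
            bias -= lr * err
    return weights, bias


# Toy stand-ins for the "anchor" corpora and for unfiltered web text.
reference = [
    "The committee published its annual report on regional education policy.",
    "The novel was praised by critics for its careful historical detail.",
]
web = [
    "click here to win free prizes now now now",
    "lol omg best deal ever buy cheap pills",
]

weights, bias = train(reference, web)

ref_score = score(weights, bias, featurize(
    "Critics praised the annual report on education policy."))
web_score = score(weights, bias, featurize("click now to win cheap prizes"))

# A document is kept when it scores above an assumed threshold of 0.5.
kept = [name for name, s in [("ref_like", ref_score), ("web_like", web_score)]
        if s > 0.5]
print(kept)
```

Because the classifier only learns which words resemble the reference corpus, whatever that corpus over-represents (topics, dialects, registers) is what it rewards. That is the mechanism by which a filter of this kind can encode the preferences the paper measures.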
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Byte Pair Encoding · Cosine Annealing · Dense Connections · Weight Decay · Layer Normalization
