Classist Tools: Social Class Correlates with Performance in NLP
Amanda Cercas Curry, Giuseppe Attanasio, Zeerak Talat, Dirk Hovy

TL;DR
This paper demonstrates that NLP systems perform worse on language data from lower socioeconomic groups, highlighting the need to consider social class in developing fairer language technologies.
Contribution
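The paper annotates a corpus of 95K movie utterances with social class, ethnicity, and geographical language variety, and shows empirically that performance on language modelling, automatic speech recognition, and grammar error correction varies with socioeconomic status.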
Findings
NLP models perform significantly worse on language produced by lower socioeconomic groups.
Performance disparities are also observed across ethnicity and geographical language varieties.
The study advocates for including social class in NLP fairness considerations.
Abstract
Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age and gender are somewhat well represented, but Labov's original target, socioeconomic status, is noticeably absent. And yet it matters. We show empirically that NLP disadvantages less-privileged socioeconomic groups. We annotate a corpus of 95K utterances from movies with social class, ethnicity and geographical language variety and measure the performance of NLP systems on three tasks: language modelling, automatic speech recognition, and grammar error correction. We find significant…
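The abstract describes measuring system performance separately for each annotated group. As a minimal sketch of what such a per-group comparison could look like for automatic speech recognition, the snippet below computes word error rate (WER) per utterance and averages it within each group. This is not the authors' evaluation code: the function names, the group labels, and the toy utterances are all hypothetical and invented for illustration, and the paper may aggregate or test for significance differently.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average per-utterance WER within each group.

    `samples` is an iterable of (group_label, reference, hypothesis) tuples.
    """
    scores = defaultdict(list)
    for group, reference, hypothesis in samples:
        scores[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


# Toy data, invented for illustration only (not taken from the paper's corpus).
toy_samples = [
    ("group_a", "the cat sat on the mat", "the cat sat on the mat"),
    ("group_a", "she has not seen it", "she has not seen it yet"),
    ("group_b", "the dog ran to the park", "the dog ran through the park"),
    ("group_b", "we were never told", "we are never told"),
]

wer_by_group = mean_wer_by_group(toy_samples)
for group, wer in sorted(wer_by_group.items()):
    print(f"{group}: mean WER = {wer:.3f}")
```

A gap between the group means is only a starting point. Whether it is meaningful depends on sample sizes and an appropriate significance test, neither of which this sketch covers.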
Taxonomy
Topics: Online Learning and Analytics
