Demographic Dialectal Variation in Social Media: A Case Study of African-American English
Su Lin Blodgett, Lisa Green, and Brendan O'Connor

TL;DR
This paper investigates African-American English on Twitter, proposing a model to identify AAE-like language, analyzing existing NLP tools' performance on it, and releasing a new corpus for further research.
Contribution
It introduces a distantly supervised model for identifying AAE-like language in social media, evaluates existing language identification and dependency parsing tools on that text, and releases a new corpus of tweets containing AAE-like language.
Findings
Existing language identification and dependency parsing tools perform worse on AAE-like text than on text associated with white speakers.
The proposed ensemble classifier for language identification eliminates this disparity.
A new corpus of AAE-like tweets is released for future research.
Abstract
Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers. We also provide an ensemble classifier for language identification which eliminates this disparity and release a new corpus of tweets containing AAE-like language.
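The ensemble idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes an "accept if either component accepts" rule, which is one plausible way to combine a generic language identifier with a dialect-aware model; the combination rule the paper actually uses is not stated in this summary. The function names, the `"en"` label, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
from typing import Callable


def ensemble_is_english(
    text: str,
    generic_lid: Callable[[str], str],
    dialect_posterior: Callable[[str], float],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> bool:
    """Treat a message as English if either component accepts it.

    generic_lid: any off-the-shelf language identifier returning a language code.
    dialect_posterior: a model's probability that the text is dialectal English
        (for example, a demographically aligned model as described in the abstract).
    """
    # First opinion: the generic identifier already says English.
    if generic_lid(text) == "en":
        return True
    # Second opinion: rescue messages the generic tool rejected
    # when the dialect-aware model is confident enough.
    return dialect_posterior(text) >= threshold


if __name__ == "__main__":
    # Stand-in components for demonstration only (assumptions, not real models).
    rejects_everything = lambda text: "und"
    confident_dialect_model = lambda text: 0.95
    print(ensemble_is_english("example tweet", rejects_everything, confident_dialect_model))
```

With a rule of this shape, a tweet the generic identifier misclassifies as non-English can still be kept when the dialect-aware model is confident, which is the kind of failure the disparity finding describes.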
