Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties
Nhi Pham, Lachlan Pham, Adam L. Meyers

TL;DR
This paper introduces a diverse Twitter corpus of 170,800 tweets from seven countries, annotated to reflect various English varieties, aiming to reduce bias in NLP tools by improving data representation of underrepresented English dialects.
Contribution
The paper presents a new annotated tweet dataset from multiple countries, focusing on underrepresented English varieties, and proposes a classification framework to measure standardness and linguistic differences.
Findings
Identified accuracy gaps in language identification for non-standard English varieties.
Created a large, regionally annotated tweet corpus to support bias reduction in NLP.
Highlighted the need for diverse data to improve NLP fairness and inclusivity.
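One way an accuracy gap of this kind could be quantified is sketched below. This is an illustration only, not the authors' evaluation code: the function names, the group labels ("US", "SG"), and the toy records are all hypothetical, and no figures from the paper are reproduced. The sketch assumes each tweet carries a group label (for example, a country), a language code predicted by some language-identification tool, and a gold language code.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return language-ID accuracy per group.

    records: iterable of (group, predicted_lang, gold_lang) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, gold in records:
        total[group] += 1
        if predicted == gold:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group, reference):
    """Return reference accuracy minus each other group's accuracy."""
    ref_acc = per_group[reference]
    return {g: ref_acc - acc for g, acc in per_group.items() if g != reference}


# Toy, invented records (hypothetical): every tweet is English in the gold
# labels, but the imagined classifier mislabels half of the "SG" tweets.
toy_records = [
    ("US", "en", "en"), ("US", "en", "en"), ("US", "en", "en"), ("US", "en", "en"),
    ("SG", "en", "en"), ("SG", "ms", "en"), ("SG", "en", "en"), ("SG", "id", "en"),
]

per_group = accuracy_by_group(toy_records)
gaps = accuracy_gap(per_group, reference="US")
```

A positive gap for a group means the tool recognises that group's English less reliably than the reference group's; the choice of reference group is itself an assumption of this sketch.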
Abstract
The prevalence of social media presents a growing opportunity to collect and analyse examples of English varieties. Whilst these varieties were - and, in many cases, still are - used only in spoken contexts or hard-to-access private messages, social media sites like Twitter provide a platform for users to communicate informally in a scrapeable format. Notably, Indian English (Hinglish), Singaporean English (Singlish), and African-American English (AAE) can be commonly found online. These varieties pose a challenge to existing natural language processing (NLP) tools as they often differ orthographically and syntactically from standard English, for which the majority of these tools are built. NLP models trained on standard English texts produce biased outcomes for users of underrepresented varieties. Some research has aimed to overcome the inherent biases caused by unrepresentative…
Taxonomy
Topics: Linguistics, Language Diversity, and Identity · Linguistic Variation and Morphology · Natural Language Processing Techniques
