Evaluating Dialect Robustness of Language Models via Conversation Understanding
Dipankar Srirag, Nihar Ranjan Sahoo, Aditya Joshi

TL;DR
This paper assesses how well large language models handle different English dialects, revealing biases towards US English and demonstrating that fine-tuning with dialectal data can improve dialect understanding.
Contribution
Introduces a novel evaluation framework for dialect robustness using conversation datasets and extends MD3 to create M-MD3 for dialect-specific testing.
Findings
LLMs perform better on US English than Indian English.
GPT models outperform smaller models, but fine-tuning improves smaller models' dialect understanding.
Fine-tuning with dialectal data enhances LLMs' dialect comprehension.
Abstract
With an ever-growing number of LLMs reporting superlative performance for English, their ability to perform equitably for different dialects of English (i.e., dialect robustness) needs to be ascertained. Specifically, we use English language (US English or Indian English) conversations between humans who play the word-guessing game of 'taboo'. We formulate two evaluative tasks: target word prediction (TWP) (i.e., predict the masked target word in a conversation) and target word selection (TWS) (i.e., select the most likely masked target word in a conversation, from among a set of candidate words). Extending MD3, an existing dialectal dataset of taboo-playing conversations, we introduce M-MD3, a target-word-masked version of MD3 with the en-US and en-IN subsets. We create two subsets: en-MV (where en-US is transformed to include dialectal information) and…
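The two tasks named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `[MASK]` token, the prompt wording, the helper names (`mask_target`, `twp_prompt`, `tws_prompt`, `accuracy`), the exact-match scoring, and the example conversation are all hypothetical.

```python
import re

MASK = "[MASK]"  # assumed mask token; the paper's actual token may differ


def mask_target(turns, target, mask=MASK):
    """Replace whole-word, case-insensitive occurrences of `target` in each turn."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    return [pattern.sub(mask, turn) for turn in turns]


def twp_prompt(masked_turns):
    """Target word prediction: ask for the masked word with no candidates given."""
    convo = "\n".join(masked_turns)
    return f"{convo}\n\nWhat word does {MASK} stand for? Answer with one word."


def tws_prompt(masked_turns, candidates):
    """Target word selection: ask the model to choose among candidate words."""
    convo = "\n".join(masked_turns)
    options = ", ".join(candidates)
    return f"{convo}\n\nWhich of these words does {MASK} stand for: {options}?"


def accuracy(predictions, targets):
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    correct = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, targets)
    )
    return correct / len(targets)


# Made-up taboo-style exchange, not drawn from MD3.
turns = [
    "Describer: It is a yellow fruit that monkeys like.",
    "Guesser: Banana?",
    "Describer: Yes, banana!",
]
masked = mask_target(turns, "banana")
print(twp_prompt(masked))
print(tws_prompt(masked, ["apple", "banana", "mango"]))
```

In this sketch the same masked conversation feeds both prompts; only the presence of the candidate list distinguishes TWS from TWP. Model outputs would then be compared against the held-out target words with `accuracy`.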
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
