DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English
Jio Oh, Paul Vicinanza, Thomas Butler, Steven Euijong Whang, Dezhi Hong, Amani Namboori

TL;DR
DialectLLM introduces a large-scale, dialect-aware dialog dataset and benchmark, improving the authenticity of non-SAE dialect generation and highlighting current model limitations in dialect identification.
Contribution
The paper presents DialectLLM, a novel framework for creating authentic multi-dialectal dialog data and a benchmark for evaluating dialect understanding in LLMs.
Findings
Human evaluators prefer DialectLLM data over data from prior methods in 98.8% of comparisons.
Current LLMs achieve under 70% accuracy in dialect identification.
Models often misclassify non-SAE dialects as American or British.
Abstract
More than 80% of the 1.6B English speakers do not use Standard American English (SAE), yet LLMs often fail to correctly identify non-SAE dialects and generate stereotyped responses for their speakers. We introduce DialectLLM, the first large-scale framework for generating high-quality multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features. DialectLLM produces a dialect-parallel dialog dataset spanning nine English dialects. Partnering with native linguists, we design and validate SAE-to-dialect transformation rules, ensuring authenticity. Our approach challenges the prevailing practice of applying a single morphosyntactic feature set to both user utterances and model responses, showing that models should not reproduce up to 90% of the grammatical features of a dialect.…
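To make the idea of SAE-to-dialect transformation rules concrete, here is a minimal sketch of applying lexical and orthographic substitutions to an SAE sentence. It is not the paper's method: the rule tables, the function names (`sae_to_dialect`, `_apply_word_rules`), and the choice of British English as the target are all assumptions made for illustration, and morphosyntactic (grammar) rules are left out because they cannot be captured by word-level substitution.

```python
import re

# Hypothetical toy rules for SAE -> British English.
# These are illustrative only, NOT the linguist-validated rules from the paper.
LEXICAL_RULES = {"truck": "lorry", "apartment": "flat"}               # vocabulary
ORTHOGRAPHIC_RULES = {"color": "colour", "organize": "organise"}      # spelling


def _apply_word_rules(text, rules):
    """Replace whole words found in `rules`, preserving an initial capital."""
    def swap(match):
        word = match.group(0)
        target = rules.get(word.lower())
        if target is None:
            return word
        return target.capitalize() if word[0].isupper() else target

    return re.sub(r"[A-Za-z]+", swap, text)


def sae_to_dialect(text):
    """Apply lexical rules first, then orthographic rules (assumed order)."""
    text = _apply_word_rules(text, LEXICAL_RULES)
    return _apply_word_rules(text, ORTHOGRAPHIC_RULES)


if __name__ == "__main__":
    print(sae_to_dialect("The truck outside my apartment is a nice color."))
```

Keeping the lexical and orthographic tables separate mirrors the distinction the abstract draws between the pillars of written dialect; a fuller pipeline would need a third, syntax-aware stage for grammatical features.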
