DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English

Jio Oh; Paul Vicinanza; Thomas Butler; Steven Euijong Whang; Dezhi Hong; Amani Namboori

arXiv:2601.22888·cs.CL·May 8, 2026

TL;DR

DialectLLM introduces a large-scale, dialect-aware dialog dataset and benchmark, improving the authenticity of non-SAE dialect generation and highlighting current model limitations in dialect identification.

Contribution

The paper presents DialectLLM, a novel framework for creating authentic multi-dialectal dialog data and a benchmark for evaluating dialect understanding in LLMs.

Findings

01. Human evaluators prefer DialectLLM data over prior methods in 98.8% of comparisons.
02. Current LLMs achieve under 70% accuracy in dialect identification.
03. Models often misclassify non-SAE dialects as American or British.

Abstract

More than 80% of the 1.6B English speakers do not use Standard American English (SAE), yet LLMs often fail to correctly identify non-SAE dialects and generate stereotyped responses for their speakers. We introduce DialectLLM, the first large-scale framework for generating high-quality multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features. DialectLLM produces a dialect-parallel dialog dataset spanning nine English dialects. Partnering with native linguists, we design and validate SAE-to-dialect transformation rules, ensuring authenticity. Our approach challenges the prevailing practice of applying a single morphosyntactic feature set to both user utterances and model responses, showing that models should not reproduce up to 90% of the grammatical features of a dialect.…
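To make the three pillars concrete, here is a minimal Python sketch of a rule-based SAE-to-dialect rewrite. It is not the paper's implementation: the rule tables, the function name `sae_to_dialect`, and the `pillars` parameter are hypothetical, and the handful of British English substitutions are textbook examples standing in for the linguist-validated rules the abstract mentions. The optional `pillars` argument only illustrates the idea that a user utterance and a model response need not share the same morphosyntactic feature set.

```python
import re

# Hypothetical toy rules for illustration only; NOT the paper's validated rule set.
LEXICAL = {"apartment": "flat", "elevator": "lift"}          # vocabulary
ORTHOGRAPHIC = {"color": "colour", "favorite": "favourite"}  # spelling
MORPHOSYNTACTIC = [(re.compile(r"\bhas gotten\b"), "has got")]  # grammar


def _swap_words(text, mapping):
    """Replace whole-word occurrences of each key with its mapped value."""
    for src, dst in mapping.items():
        text = re.sub(rf"\b{re.escape(src)}\b", dst, text)
    return text


def sae_to_dialect(text, pillars=("lexical", "orthographic", "morphosyntactic")):
    """Apply the selected feature pillars to an SAE sentence (assumed API)."""
    if "lexical" in pillars:
        text = _swap_words(text, LEXICAL)
    if "orthographic" in pillars:
        text = _swap_words(text, ORTHOGRAPHIC)
    if "morphosyntactic" in pillars:
        for pattern, replacement in MORPHOSYNTACTIC:
            text = pattern.sub(replacement, text)
    return text


sae = "The elevator in my apartment has gotten a new color."

# User turn: all three pillars applied.
user_turn = sae_to_dialect(sae)

# Model turn: grammar features withheld, vocabulary and spelling kept.
model_turn = sae_to_dialect(sae, pillars=("lexical", "orthographic"))

print(user_turn)
print(model_turn)
```

Keeping each pillar in its own table lets a caller switch feature sets per speaker role. How the actual framework selects or weights features is not specified in the text above.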
