Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
Mir Tafseer Nayeem, Davood Rafiei

TL;DR
This paper investigates the structural bias of large language models towards American English over British English, revealing systematic skew across data, tokenization, and model outputs, and discusses implications for linguistic diversity.
Contribution
It introduces DiAlign, a novel method for estimating dialectal alignment, and provides a comprehensive analysis of dialectal asymmetries in LLM development and deployment.
Findings
LLMs show a systematic bias towards American English across pretraining data, tokenization, and model outputs
Tokenization analysis reveals higher segmentation costs for British English forms
Generative evaluations indicate a persistent preference for American English in model outputs
Abstract
Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably "English (US)," despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE–BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that…
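To make the corpus-audit stage mentioned in the abstract concrete, here is a minimal sketch of how the relative frequency of AmE and BrE spellings could be counted in a piece of text. It is not the paper's DiAlign method or its audit code. The three variant pairs, the function names `count_variants` and `ame_share`, and the sample sentence are all illustrative assumptions, and the paper's curated list of 1,813 variants is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical sample of (AmE, BrE) spelling pairs, chosen for illustration only.
VARIANT_PAIRS = [
    ("color", "colour"),
    ("center", "centre"),
    ("organize", "organise"),
]


def count_variants(text, pairs=VARIANT_PAIRS):
    """Return (AmE count, BrE count) for whole-word matches in the text."""
    # Lowercase the text and split it into alphabetic words.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    ame = sum(words[a] for a, _ in pairs)
    bre = sum(words[b] for _, b in pairs)
    return ame, bre


def ame_share(text, pairs=VARIANT_PAIRS):
    """Fraction of matched variants that are AmE, or None if none matched."""
    ame, bre = count_variants(text, pairs)
    total = ame + bre
    return ame / total if total else None


# Made-up sample sentence, not drawn from any pretraining corpus.
sample = "The color of the centre panel matches the color of the frame."
print(count_variants(sample))
print(ame_share(sample))
```

A share above 0.5 on a large corpus would point to AmE forms dominating among the listed pairs. A real audit would also need to handle inflected forms and words that are valid in both varieties, which this sketch ignores.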
