Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models

Mir Tafseer Nayeem; Davood Rafiei

arXiv:2604.04204·cs.CL·April 7, 2026

TL;DR

This paper investigates the structural bias of large language models towards American English over British English, revealing systematic skew across data, tokenization, and model outputs, and discusses implications for linguistic diversity.

Contribution

The paper introduces DiAlign, a dynamic, training-free method for estimating dialectal alignment from distributional evidence, and analyzes dialectal asymmetries across the stages of LLM development and deployment.

Findings

01. LLMs show a systematic bias towards American English across multiple stages: pretraining data, tokenization, and model outputs.

02. Tokenization analysis reveals higher segmentation costs for British English forms.

03. Generative evaluations indicate a persistent preference for American English in model outputs.
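
To make "segmentation cost" concrete, here is a minimal sketch that counts how many tokens each spelling of a variant pair is split into. Everything in it is an assumption for illustration: the greedy longest-match tokenizer, the toy vocabulary (deliberately skewed toward American spellings), the three word pairs, and the names `greedy_tokenize` and `segmentation_gap`. It is not the paper's metric or corpus, and production tokenizers (for example, BPE-based ones) behave differently.

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown spans fall back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            # j == i + 1 is the single-character fallback, so the loop always advances.
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical toy vocabulary: whole-word entries exist only for the AmE spellings.
TOY_VOCAB = {"color", "col", "our", "center", "cent", "re", "organize", "organ", "ise"}

# Illustrative (AmE, BrE) spelling pairs, chosen by hand for this sketch.
PAIRS = [("color", "colour"), ("center", "centre"), ("organize", "organise")]


def segmentation_gap(pairs: list[tuple[str, str]], vocab: set[str]) -> float:
    """Mean of (BrE token count - AmE token count) over the variant pairs."""
    gaps = [
        len(greedy_tokenize(bre, vocab)) - len(greedy_tokenize(ame, vocab))
        for ame, bre in pairs
    ]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    for ame, bre in PAIRS:
        print(ame, greedy_tokenize(ame, TOY_VOCAB), "|", bre, greedy_tokenize(bre, TOY_VOCAB))
    print("mean gap (BrE - AmE):", segmentation_gap(PAIRS, TOY_VOCAB))
```

A positive gap means the British forms are split into more pieces than their American counterparts under this vocabulary. With a real tokenizer, the same comparison would be run over that tokenizer's own encoding of each variant.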

Abstract

Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably "English (US)," despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE–BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that…
