The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

Chen Shani; Yuval Reif; Nathan Roll; Dan Jurafsky; Ekaterina Shutova

arXiv:2601.07220 · cs.CL · April 13, 2026

TL;DR

This survey investigates whether performance gaps in multilingual language models are due to intrinsic linguistic complexity or modeling choices, highlighting how design decisions influence language fairness.

Contribution

The survey systematically analyzes linguistic features and the modeling mechanisms they interact with, and offers design recommendations for reducing performance disparities across typologically diverse languages.

Findings

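01. Performance gaps often shrink when segmentation, encoding, and data exposure are normalized.

02. Design choices such as tokenization, encoding, data exposure, and parameter sharing significantly affect multilingual model fairness.

03. Each linguistic feature reviewed (orthography, morphology, lexical diversity, syntax, information density, typological distance) is linked to a concrete modeling mechanism.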

Abstract

Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling…
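
One encoding-level effect of the kind the abstract alludes to can be illustrated with a minimal sketch. It is not taken from the survey. UTF-8 spends one byte per code point on ASCII letters, two on Cyrillic letters, and three on Devanagari characters, so a byte-level model sees sequences of different lengths for the same number of characters. The function name `utf8_bytes_per_char` and the sample words are illustrative assumptions.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample words (greetings), chosen only to cover three scripts.
samples = {
    "English (Latin)": "hello",
    "Russian (Cyrillic)": "привет",
    "Hindi (Devanagari)": "नमस्ते",
}

if __name__ == "__main__":
    for name, word in samples.items():
        # A higher ratio means more bytes are needed per character.
        print(f"{name}: {utf8_bytes_per_char(word):.2f} bytes per code point")
```

Dividing byte counts by such a ratio is one simple way to compare sequence lengths on a per-character basis. Whether this matches the normalizations discussed in the survey is an assumption.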
