Translation as a Scalable Proxy for Multilingual Evaluation

Sheriff Issaka; Erick Rosas Gonzalez; Lieqi Liu; Evans Kofi Agyei; Lucas Bandarkar; Nanyun Peng; David Ifeoluwa Adelani; Francisco Guzmán; Saadia Gabriel

arXiv:2601.11778 · cs.CL · January 21, 2026


TL;DR

This paper finds that translation quality can serve as an inexpensive proxy for the broader multilingual capabilities of large language models, which helps with the problem of benchmarking across thousands of languages.

Contribution

It introduces a scalable method to assess multilingual performance using translation quality, reducing reliance on extensive, language-specific benchmarks.

Findings

01. Translation performance is a good indicator of downstream task success.

02. For Phi-4, the median Pearson r between translation metric scores and downstream benchmark performance is 0.89 for MetricX, 0.91 for xCOMET, and 0.87 for SSA-COMET.

03. Using translation as a proxy enables efficient screening of models for multilingual capabilities.

Abstract

The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs claim multilingual proficiency, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving >98% of the world's 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? Through systematic evaluation of 14 models (1B-72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a good indicator of downstream task success (e.g., Phi-4, median Pearson r: MetricX = 0.89, xCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful…
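The abstract reports a median Pearson r between translation metrics and downstream results. The sketch below shows one plausible way such a number could be computed: correlate a model's per-language translation score with its per-language score on each benchmark, then take the median across benchmarks. This is an illustration, not the authors' pipeline. The function names, the language codes, and all numbers in the toy data are hypothetical, and aggregating by benchmark is an assumption, since the abstract does not say how the median is taken. The toy translation metric is assumed to be oriented so that higher is better.

```python
import math
from statistics import median


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def median_correlation(translation_scores, benchmark_scores):
    """Median over benchmarks of the per-benchmark Pearson r.

    translation_scores: {language: translation metric score}
    benchmark_scores:   {benchmark: {language: task score}}
    Only languages present in both inputs are compared (an assumption).
    """
    rs = []
    for per_language in benchmark_scores.values():
        shared = sorted(set(translation_scores) & set(per_language))
        xs = [translation_scores[lang] for lang in shared]
        ys = [per_language[lang] for lang in shared]
        rs.append(pearson_r(xs, ys))
    return median(rs)


if __name__ == "__main__":
    # Invented numbers for one hypothetical model; not taken from the paper.
    translation = {"swh": 0.82, "hau": 0.71, "yor": 0.55, "fra": 0.93}
    benchmarks = {
        "benchmark_a": {"swh": 0.64, "hau": 0.58, "yor": 0.41, "fra": 0.80},
        "benchmark_b": {"swh": 0.50, "hau": 0.47, "yor": 0.39, "fra": 0.66},
        "benchmark_c": {"swh": 0.71, "hau": 0.60, "yor": 0.52, "fra": 0.85},
    }
    print(f"median Pearson r: {median_correlation(translation, benchmarks):.2f}")
```

Taking the median rather than the mean keeps a single benchmark with an unusually weak or strong correlation from dominating the summary.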


Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling