Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results
Jan-Thorsten Peter, David Vilar, Tobias Domhan, Dan Malkin, Markus Freitag

TL;DR
This paper investigates the perceived performance gap of multilingual large language models in math tasks, revealing that translation errors and evaluation inconsistencies significantly skew results, and proposes solutions to address these issues.
Contribution
The paper identifies translation and evaluation issues in multilingual benchmarks and introduces a method for automatic quality assurance, which reduces the apparent language performance gap.
Findings
Translation errors inflate performance gaps across languages.
Standardized answer extraction impacts evaluation results.
Corrected dataset shows the language gap largely disappears.
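To illustrate why answer extraction matters, here is a minimal sketch of a language-aware numeric parser. It is not the authors' implementation. The function names, the regular expression, and the `DECIMAL_COMMA_LANGS` set are assumptions made for illustration only.

```python
import re
import unicodedata

# Assumption: languages whose convention is a decimal comma ("1.234,5").
# This set is illustrative and not taken from the paper.
DECIMAL_COMMA_LANGS = {"de", "es", "fr", "ru"}

# Digits optionally grouped by '.', ',', no-break space or narrow no-break space.
# Signs are deliberately ignored to keep the sketch short.
NUMBER = re.compile(r"\d+(?:[.,\u00a0\u202f]\d+)*")


def normalize_digits(text):
    """Map every Unicode decimal digit (e.g. Bengali, Telugu) to ASCII 0-9."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


def extract_last_number(text, lang):
    """Return the last number in `text` as a float, or None if none parses."""
    matches = NUMBER.findall(normalize_digits(text))
    if not matches:
        return None
    token = matches[-1].replace("\u00a0", "").replace("\u202f", "")
    if lang in DECIMAL_COMMA_LANGS:
        # '.' groups thousands, ',' marks the decimal point.
        token = token.replace(".", "").replace(",", ".")
    else:
        # ',' groups thousands, '.' marks the decimal point.
        token = token.replace(",", "")
    try:
        return float(token)
    except ValueError:
        return None


german = extract_last_number("Die Antwort ist 1.234,5 Euro.", "de")
english = extract_last_number("The answer is 1,234.5.", "en")
bengali = extract_last_number("\u09e7\u09ee", "bn")  # Bengali digits for 18
```

Parsing the German string with English conventions would read `1.234` as a value just above one instead of one thousand two hundred thirty-four, which is the kind of mismatch that can make a correct answer count as wrong.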
Abstract
Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g. German, Chinese, French), as well as low-resource ones (e.g. Swahili, Telugu). In addition they have also shown impressive capabilities in different domains, like coding, science and math. In this short paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. We hope that these results influence further research into cross-lingual capability generalization for next generation LLMs. If it weren't for the fact that they are false! By analyzing one of the standard…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- The authors identified translation errors and ambiguities in MGSM that distort multilingual reasoning results.
- The authors revealed that English-centric answer extraction misinterprets locale-specific number formats and non-Arabic digits.
- They cleaned the dataset and applied a language-aware parser, reducing the reported language gap to about 2.8%, and will provide the corrected MGSM and evaluation tools for reliable multilingual benchmarking.
- The authors urged the community
- The identified issue is not new and was reported in earlier works: https://www.arxiv.org/pdf/2505.18978, https://arxiv.org/pdf/2507.05418, etc.
- MGSM's age and near-saturated scores suggest that some models may have seen the data during training, conflating reasoning ability with memorization.
- The analysis focuses solely on MGSM, offering no direct evidence that similar issues affect other multilingual benchmarks.
- The semi-automatic correction process depends on high-perform
This paper presents a detailed analysis of MGSM, corrects errors in it, and reports new results, which can be seen as a contribution to the community and brings some new empirical insights.
0. The paper is poorly formatted; I don't think it follows the ICLR template.
1. The paper reads more like a technical analysis report written by an undergraduate student than a research paper; no new method is proposed. The only experimental results are the performance of current SOTA LLMs on the original MGSM and on the corrected version. I don't think correcting errors in one existing benchmark can be seen as a novel method or written up as a research paper.
The authors identify translation issues in MGSM and propose an automatic quality assurance methodology that effectively addresses these problems at scale, demonstrating strong practical value. Additionally, they provide an improved approach for parsing answers from LLM outputs.
1. The scope of the paper is relatively narrow, focusing primarily on the MGSM multilingual math benchmark. The main contribution lies in identifying translation errors and releasing an updated MGSM version, which may limit the overall impact.
2. The writing style lacks academic rigor in several places. For example, the use of exclamation marks in the abstract and introduction is inappropriate for a scholarly publication.
3. The evaluation should be expanded to include more multilingual datase
- The paper is clear and easy to understand.
- The data cleaning pipeline is thorough; not only were a plurality of top LLMs used, but human experts as well.
- There is a great level of attention to detail employed, both in terms of evaluation and in terms of going over the mistakes where even the English version is incorrect.
While the paper addresses a relevant and valuable problem, I believe it may not fully meet the bar for ICLR. The focus on a single, moderately popular benchmark from three years ago limits the broader impact. Moreover, for leading models such as Gemini 2.5 Pro, GPT-5, Claude 3.7, and DeepSeek V3, the performance gap across most languages is under 10%, with only one language showing a larger discrepancy. Although this work is of interest to the community, I'm not convinced that it passes the thr
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
