MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Tianyi Xu; Kosei Uemura; Alfred Malengo Kondoro; Tadesse Destaw Belay; Catherine Nana Nyaah Essuman; Ifeoma Okoh; Ganiyat Afolabi; Ayodele Awokoya; David Ifeoluwa Adelani

arXiv:2601.21225·cs.CL·April 29, 2026

TL;DR

This paper introduces MGSM-Pro, a multilingual mathematical reasoning dataset with multiple instantiations per question, revealing significant robustness issues in models across languages and instantiations, and proposing improved evaluation practices.

Contribution

The paper extends the MGSM dataset with GSM-Symbolic-inspired instantiations across nine languages, highlighting robustness challenges and recommending evaluation on multiple instantiations per question for fair comparison.

Findings

01

Models show large performance drops with different digit instantiations in low-resource languages.

02

Proprietary models like Gemini 2.5 Flash and GPT-4.1 are less robust to digit variations.

03

Open models GPT-OSS 120B and DeepSeek v3 demonstrate stronger robustness.

Abstract

Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that model robustness in the high-resource language (HRL) setting does not necessarily translate to low-resource languages (LRLs). Moreover, proprietary…
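To make the idea of multiple instantiations concrete, the following is a minimal sketch of how one question template could yield five variants that differ in names, digits, and irrelevant context. The template, name list, distractor sentences, and function names are all hypothetical and are not taken from MGSM-Pro or GSM-Symbolic; the real dataset may be constructed quite differently.

```python
import random

# Hypothetical question template; not an actual MGSM-Pro item.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. {distractor}"
    "How many apples does {name} have now?"
)
# Hypothetical pools of names and irrelevant-context sentences.
NAMES = ["Amina", "Kofi", "Yuki", "Lena", "Tunde"]
DISTRACTORS = [
    "",
    "The shop closes early on Sundays. ",
    "Apples were on sale last week. ",
]


def instantiate(seed):
    """Build one variant of the question and recompute its gold answer."""
    rng = random.Random(seed)  # seeded so each variant is reproducible
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    distractor = rng.choice(DISTRACTORS)
    question = TEMPLATE.format(name=name, a=a, b=b, distractor=distractor)
    # The answer is derived from the sampled digits, not stored statically.
    return {"question": question, "answer": a + b, "name": name, "a": a, "b": b}


def make_instantiations(n=5):
    """Return n variants of the same underlying question (assumed n=5)."""
    return [instantiate(seed) for seed in range(n)]


instances = make_instantiations()
for item in instances:
    print(item["question"], "->", item["answer"])
```

Because the gold answer is recomputed from the sampled values, a model that has memorized one fixed version of a question gains no advantage on the others, which is the kind of robustness the abstract describes measuring.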
