MMATH: A Multilingual Benchmark for Mathematical Reasoning
Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen

TL;DR
This paper introduces MMATH, a multilingual benchmark for complex mathematical reasoning across diverse languages, revealing performance disparities and proposing strategies to improve multilingual reasoning in large language models.
Contribution
The paper presents MMATH, the first comprehensive multilingual benchmark for complex reasoning, and explores methods to enhance multilingual reasoning capabilities of large models.
Findings
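Even advanced models such as DeepSeek R1 show substantial performance gaps across languages.
Prompting and training strategies that have the model reason in English and answer in the target language improve performance.
Models often respond in unintended languages (the off-target issue); the same strategies preserve target-language consistency.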
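The two quantities behind these findings, per-language accuracy and how often a response stays in the requested language, can be sketched as follows. This is a minimal illustration and not the paper's evaluation code: the record fields `target_lang`, `response_lang`, and `correct` are hypothetical, and the response language is assumed to have been detected beforehand by some external tool.

```python
from collections import defaultdict


def per_language_metrics(records):
    """Aggregate accuracy and language consistency per target language.

    Each record is a dict with hypothetical keys:
      target_lang   - language the problem was posed in
      response_lang - language detected in the model's response
      correct       - whether the final answer was right
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "on_target": 0})
    for r in records:
        s = totals[r["target_lang"]]
        s["n"] += 1
        s["correct"] += int(r["correct"])
        # A response is "on target" when it is written in the requested language.
        s["on_target"] += int(r["response_lang"] == r["target_lang"])
    return {
        lang: {
            "accuracy": s["correct"] / s["n"],
            "consistency": s["on_target"] / s["n"],
        }
        for lang, s in totals.items()
    }


def accuracy_gap(metrics):
    """Difference between the best and worst per-language accuracy."""
    accs = [m["accuracy"] for m in metrics.values()]
    return max(accs) - min(accs)


# Toy records, invented purely for illustration.
records = [
    {"target_lang": "en", "response_lang": "en", "correct": True},
    {"target_lang": "en", "response_lang": "en", "correct": True},
    {"target_lang": "ja", "response_lang": "en", "correct": True},
    {"target_lang": "ja", "response_lang": "ja", "correct": False},
]
metrics = per_language_metrics(records)
gap = accuracy_gap(metrics)
```

Reporting the two numbers separately matters because a model can answer correctly while drifting into another language, as in the third toy record.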
Abstract
The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics Education and Teaching Techniques
