M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models
Aleix Torres-Camps, Nathaniel Mitrani Hadida, Víctor Conchello Vendrell, Àlex Batlle Casellas, Arnau Padrés Masdemont, Jordi Ros-Giralt

TL;DR
This paper introduces M3Kang, a large multilingual multimodal dataset for mathematical reasoning in vision-language models, revealing current models' limitations and the benefits of multilingual techniques, with extensive benchmarking and human comparison.
Contribution
The paper presents M3Kang, the first large-scale multilingual multimodal mathematical reasoning dataset, and demonstrates its utility in benchmarking and improving vision-language models.
Findings
Models struggle with basic math and diagram reasoning.
Multilingual techniques improve model performance.
Model performance correlates with a language's Internet presence and with model size.
Abstract
Although state-of-the-art vision-language models (VLMs) have demonstrated strong reasoning capabilities, their performance in multilingual mathematical reasoning remains underexplored, particularly when compared to human performance. To bridge this gap, we introduce M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset for VLMs. It is derived from the Kangaroo Math Competition, the world's largest mathematics contest, which annually engages over six million participants under the age of 18 across more than 90 countries. M3Kang includes 1,747 unique multiple-choice problems organized by grade-level difficulty, with translations into 108 culturally diverse languages; some problems include diagrams that are essential for solving them. Using this dataset, we conduct extensive benchmarking on both closed- and open-source SOTA models. We observe that, despite recent…
Peer Reviews
Decision·Submitted to ICLR 2026
(1) M3Kang fills an important gap at the intersection of multilingual, multimodal, and mathematical reasoning. Prior datasets have addressed these dimensions separately; this benchmark enables joint evaluation. (2) The authors test about 10 VLMs, analyze correlations between accuracy and each language's Internet presence, compare text-only vs. figure-based questions, and benchmark multilingual reasoning methods. The analysis is thorough and supported by clear figures. (3) Using performance data from
(1) Section 2 omits key multilingual multimodal datasets such as EXAMS-V (ACL 2024), M4U (2024), and M3Exam (NeurIPS 2024), mentioning only M5 (Schneider & Sitaram 2024) without detailed comparison. A table contrasting coverage, modality, and translation strategy would strengthen the contribution claim. (2) The dataset originates from Catalan, chosen for data availability rather than linguistic suitability. The paper does not analyze potential bias or LLM performance limits in Catalan processin
1. M3Kang unites multilingual, multimodal, and mathematical reasoning, enabling rigorous evaluation of VLMs. 2. The study benchmarks a diverse set of models, compares text-only vs. diagram-based performance, tests multilingual techniques, and includes human baselines, offering insights into VLM capabilities and limitations. 3. By leveraging real-world competition data and student performance, the benchmark has direct implications for educational AI development and multilingual model optimization.
1. The automated translation pipeline may introduce uneven quality across languages, particularly low-resource ones, and human translation (though resource-intensive) is not explored as a refinement. 2. Without detailed classification of problem types (e.g., geometry, arithmetic, logical reasoning), it is difficult to pinpoint specific reasoning components where VLMs fail most frequently.
- A multilingual and multimodal math benchmark with a reproducible pipeline. - This work offers a scalable data translation pipeline. - Comprehensive benchmarking across open and closed models.
- Reliance on backtranslation may systematically disadvantage low-resource languages. - Cross-language fairness relies on filtered subsets, with limited statistical testing of comparability. - Some models (Gemma) perform below chance; the analysis of why (prompting, vision adapters) is shallow.
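The "below chance" remark above can be made concrete with a small sketch. For multiple-choice questions, a model guessing uniformly at random would score about 1 / (number of options), and an exact one-sided binomial tail gives the probability of scoring that low or lower by guessing alone. The function names, the choice of five answer options, and the counts below are illustrative assumptions; none of them come from the paper or the reviews.

```python
from math import comb


def chance_accuracy(n_options: int) -> float:
    """Expected accuracy of uniform random guessing on n_options-way questions."""
    return 1.0 / n_options


def below_chance_p_value(correct: int, total: int, n_options: int) -> float:
    """Exact one-sided binomial tail: P(X <= correct) under random guessing.

    A small value suggests the model is doing systematically worse than
    guessing (e.g. an answer-format or prompting problem), not just unlucky.
    """
    p = chance_accuracy(n_options)
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct + 1)
    )


# Hypothetical counts, NOT results from the paper: 150 correct out of 1000,
# assuming five answer options per question.
p_value = below_chance_p_value(correct=150, total=1000, n_options=5)
is_significantly_below_chance = p_value < 0.05
```

A test like this only flags that accuracy is below chance; it does not explain why, which is the gap the reviewer points to.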
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
