Evaluating Visual Mathematics in Multimodal LLMs: A Multilingual Benchmark Based on the Kangaroo Tests

Arnau Igualde Sáez; Lamyae Rhomrasi; Yusef Ahsini; Ricardo Vinuesa; Sergio Hoyas; Jose P. García Sabater; Marius J. Fullana i Alfonso; J. Alberto Conejero

arXiv:2506.07418 · cs.AI · June 10, 2025

TL;DR

This paper evaluates the mathematical reasoning capabilities of multimodal large language models across multiple languages and mathematical domains, revealing moderate performance and highlighting areas for improvement in diagram understanding and reasoning.

Contribution

It introduces a multilingual benchmark for visual mathematics in MLLMs and provides a comprehensive analysis of several models' reasoning abilities and limitations.

Findings

01

Gemini 2.0 Flash achieves the highest accuracy on image-based tasks.

02

Models perform better on easier questions but struggle with advanced reasoning.

03

Significant variation exists across languages and difficulty levels.

Abstract

Multimodal Large Language Models (MLLMs) promise advanced vision-language capabilities, yet their effectiveness in visually presented mathematics remains underexplored. This paper analyzes the development and evaluation of MLLMs for mathematical problem solving, focusing on diagrams, multilingual text, and symbolic notation. We then assess several models, including GPT-4o, Pixtral, Qwen VL, Llama 3.2 Vision variants, and Gemini 2.0 Flash, on a multilingual Kangaroo-style benchmark spanning English, French, Spanish, and Catalan. Our experiments reveal four key findings. First, overall precision remains moderate across geometry, visual algebra, logic, patterns, and combinatorics: no single model excels in every topic. Second, while most models see improved accuracy with questions that do not have images, the gain is often limited; performance for some remains nearly unchanged without…

Taxonomy

Topics: Spatial Cognition and Navigation · Cognitive and developmental aspects of mathematical skills · Visual and Cognitive Learning Processes