Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

Mohamed Huti; Alasdair Mackintosh; Amy Waldock; Dominic Andrews; Maxime Lelièvre; Moritz Boos; Tobias Murray; Paul Atherton; Robin A. A. Ince; Oliver G. B. Garrod

arXiv:2602.12196 · cs.CL · February 13, 2026

Open Access

TL;DR

This paper introduces the Visual Reasoning Benchmark (VRB), a dataset of primary education visual problems to evaluate multimodal large language models' reasoning abilities, revealing strengths in static skills and limitations in dynamic spatial tasks.

Contribution

The paper presents the VRB dataset and methodology for assessing multimodal models on authentic classroom visual reasoning tasks, highlighting current capability gaps.

Findings

01. Models perform well on counting and scaling tasks.

02. Models struggle with folding, reflection, and rotation operations.

03. Weaknesses could impact classroom application and student assessment.

Abstract

AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the Visual Reasoning Benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. The benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, covering tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark, which intentionally uses unedited, minimal-text images to test whether models can meet the realistic needs of primary education. Our findings reveal a "jagged frontier" of capability…

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Intelligent Tutoring Systems and Adaptive Learning