VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Hanoona Rasheed; Abdelrahman Shaker; Anqi Tang; Muhammad Maaz; Ming-Hsuan Yang; Salman Khan; Fahad Shahbaz Khan

arXiv:2506.05349 · cs.CV · June 25, 2025

TL;DR

VideoMathQA introduces a comprehensive benchmark for evaluating multimodal mathematical reasoning in videos, emphasizing the integration of visual, audio, and textual information over extended timeframes to assess model reasoning capabilities.

Contribution

This work presents VideoMathQA, a novel benchmark of diverse, multimodal, and temporally extended mathematical questions, with high-quality annotations for evaluating reasoning beyond perception.

Findings

01. Existing models struggle with multimodal, long-duration reasoning tasks.

02. The benchmark reveals significant gaps in current model capabilities.

03. Multi-step reasoning annotations enable detailed diagnosis of model performance.

Abstract

Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio,…

Code & Models

Datasets

MBZUAI/VideoMathQA (dataset, 461 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Mathematics Education and Teaching Techniques · Intelligent Tutoring Systems and Adaptive Learning