MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Zikui Cai; Andrew Wang; Anirudh Satheesh; Ankit Nakhawa; Hyunwoo Jae; Keenan Powell; Minghui Liu; Neel Jay; Sungbin Oh; Xiyao Wang; Yongyuan Liang; Tom Goldstein; Furong Huang

arXiv:2506.05523·cs.CV·June 9, 2025

Open Access · 1 Repo · 5 Datasets

TL;DR

MORSE-500 is a controllable, script-generated video benchmark designed to evaluate and stress-test multimodal reasoning across diverse, complex, and evolving scenarios, revealing significant gaps in current models.

Contribution

The paper introduces MORSE-500, a novel, programmatically generated video benchmark with adjustable difficulty, covering multiple reasoning categories to better evaluate multimodal intelligence.

Findings

01 State-of-the-art models perform poorly on MORSE-500, especially in abstract and planning tasks.

02 The benchmark's controllable generation pipeline allows systematic difficulty scaling.

03 MORSE-500 reveals substantial performance gaps in current multimodal reasoning systems.

Abstract

Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python…
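To make the idea of deterministic, script-driven generation with adjustable difficulty concrete, here is a minimal Python sketch. It is not the authors' pipeline: `ClipSpec`, `generate_clip`, the grid-world tracking task, and the way difficulty maps to object count and clip length are all hypothetical choices made for illustration. It only shows how a seeded script can emit the same clip description and ground-truth answer on every run, while a single parameter makes the task harder.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical description of one scripted clip (not the MORSE-500 format)."""
    seed: int
    difficulty: int
    frames: tuple   # one entry per frame: a tuple of (x, y) object positions
    question: str
    answer: int


def generate_clip(seed: int, difficulty: int, grid: int = 8) -> ClipSpec:
    """Build a toy tracking task whose clutter and length grow with difficulty."""
    if difficulty < 1:
        raise ValueError("difficulty must be >= 1")

    # A local RNG keyed on the inputs: the same (seed, difficulty) always
    # yields the same clip, independent of global random state.
    rng = random.Random(seed * 1_000_003 + difficulty)

    n_objects = 1 + difficulty   # more distractors at higher difficulty
    n_frames = 4 * difficulty    # longer clips at higher difficulty

    positions = [(rng.randrange(grid), rng.randrange(grid)) for _ in range(n_objects)]
    frames = [tuple(positions)]
    target_moves = 0

    for _ in range(n_frames - 1):
        idx = rng.randrange(n_objects)  # exactly one object moves per frame
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = positions[idx]
        positions[idx] = ((x + dx) % grid, (y + dy) % grid)  # wrap at the edges
        if idx == 0:
            target_moves += 1
        frames.append(tuple(positions))

    question = "How many times does object 0 move during the clip?"
    return ClipSpec(seed, difficulty, tuple(frames), question, target_moves)


if __name__ == "__main__":
    for level in (1, 2, 3):
        clip = generate_clip(seed=7, difficulty=level)
        print(level, len(clip.frames), len(clip.frames[0]), clip.answer)
```

Because the answer is computed by the same script that produces the frames, the ground truth is exact by construction, and regenerating harder variants only requires raising `difficulty`.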

Code & Models

Repositories

morse-benchmark/morse-500-code
none · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Speech and dialogue systems

Methods: Focus