MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

Arushi Goel; Sreyan Ghosh; Vatsal Agarwal; Nishit Anand; Kaousheik Jayakumar; Lasha Koroshinadze; Yao Xu; Katie Lyons; James Case; Karan Sapra; Kevin J. Shih; Siddharth Gururani; Abhinav Shrivastava; Ramani Duraiswami; Dinesh Manocha; Andrew Tao; Bryan Catanzaro; Mohammad Shoeybi; Wei Ping

arXiv:2603.14145 · cs.CL · March 17, 2026

TL;DR

This paper introduces MMOU, a comprehensive benchmark for evaluating multimodal understanding and reasoning in long, complex videos across visual, audio, and textual modalities, revealing significant gaps in current models' capabilities.

Contribution

The paper presents MMOU, a large-scale, high-quality benchmark with diverse questions and videos, designed to systematically assess and analyze multimodal reasoning in real-world, long-form videos.

Findings

01. Current models perform poorly on long videos, with accuracy below 65%.

02. Models struggle to integrate evidence across modalities and over time.

03. Analysis reveals systematic failure modes in multimodal reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9,038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We…

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech and Audio Processing