MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

TL;DR
MMWorld introduces a comprehensive video benchmark for evaluating multimodal language models across multiple disciplines and reasoning tasks, highlighting current limitations and guiding future improvements.
Contribution
The paper presents MMWorld, a novel multi-discipline, multi-faceted video understanding benchmark with extensive datasets and evaluation protocols for assessing world model capabilities in videos.
Findings
MLLMs perform poorly on MMWorld, with GPT-4V achieving only 52.3% accuracy.
Models exhibit skill sets that differ from those of humans.
Significant room for improvement remains in multimodal video understanding.
Abstract
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit that videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within…
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Surveillance and Tracking Methods
