MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Xuehai He; Weixi Feng; Kaizhi Zheng; Yujie Lu; Wanrong Zhu; Jiachen Li; Yue Fan; Jianfeng Wang; Linjie Li; Zhengyuan Yang; Kevin Lin; William Yang Wang; Lijuan Wang; Xin Eric Wang

arXiv:2406.08407·cs.CV·July 31, 2024·1 cites

TL;DR

MMWorld introduces a comprehensive video benchmark for evaluating multimodal language models across multiple disciplines and reasoning tasks, highlighting current limitations and guiding future improvements.

Contribution

The paper presents MMWorld, a novel multi-discipline, multi-faceted video understanding benchmark with extensive datasets and evaluation protocols for assessing world model capabilities in videos.

Findings

01

MLLMs perform poorly on MMWorld, with GPT-4V achieving only 52.3% accuracy.

02

Models exhibit skill sets that differ from those of humans.

03

Significant room for improvement in multimodal video understanding.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit that videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within…

Code & Models

Repositories

eric-ai-lab/mmworld
none · Official

Videos

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos · slideslive

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Surveillance and Tracking Methods