Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu

TL;DR
Video-MMMU introduces a comprehensive benchmark to evaluate how well large multimodal models acquire and utilize knowledge from videos across multiple disciplines, highlighting current limitations and gaps in model learning capabilities.
Contribution
The paper presents Video-MMMU, a new multi-disciplinary benchmark with a knowledge gain metric to systematically assess LMMs' knowledge acquisition from videos.
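The text here names the knowledge gain metric but does not give its formula. As a minimal sketch, assuming the metric is a normalized accuracy gain (the improvement after watching the video, divided by the headroom that remained before watching), it could be computed as follows. The function name, argument names, and the handling of a perfect pre-video score are illustrative assumptions, not the authors' code.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Normalized accuracy gain, in percent.

    ASSUMPTION: the metric is (after - before) / (100 - before) * 100,
    with both accuracies given as percentages in [0, 100].
    """
    for acc in (acc_before, acc_after):
        if not 0.0 <= acc <= 100.0:
            raise ValueError("accuracies must be percentages in [0, 100]")

    headroom = 100.0 - acc_before
    if headroom == 0.0:
        # ASSUMPTION: with a perfect pre-video score there is nothing left
        # to gain, so report 0 instead of dividing by zero.
        return 0.0
    return (acc_after - acc_before) / headroom * 100.0


if __name__ == "__main__":
    # Hypothetical numbers, not results from the paper.
    print(knowledge_gain(40.0, 55.0))  # model improved after the video
    print(knowledge_gain(50.0, 40.0))  # model got worse after the video
```

Normalizing by the remaining headroom keeps the score comparable between a model that starts at 20% and one that starts at 80%, and a negative value signals that watching the video hurt performance.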
Findings
LMMs' performance declines with increasing cognitive demands
Humans outperform models in knowledge acquisition tasks
Significant gap exists between human and model learning from videos
Abstract
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies…
Taxonomy
Topics: Online and Blended Learning
