TL;DR
HumanVBench is a new benchmark with automated pipelines for evaluating nuanced human-centric video understanding in MLLMs, revealing significant gaps in current models' capabilities.
Contribution
We introduce a scalable, automated benchmark construction method and provide the first comprehensive evaluation of MLLMs on nuanced human-centric video tasks.
Findings
Current MLLMs struggle with subtle emotions and speech-visual alignment.
Even top models underperform compared to humans on HumanVBench.
The benchmark and its construction pipelines are open-sourced to foster future research.
Abstract
Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a great challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable "machine" for creating nuanced evaluation suites. Our extensive evaluation of 30 leading MLLMs on HumanVBench…
