HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

Ting Zhou; Daoyuan Chen; Qirui Jiao; Bolin Ding; Yaliang Li; Ying Shen

arXiv:2412.17574·cs.CV·April 14, 2026

TL;DR

HumanVBench is a new benchmark with automated pipelines for evaluating nuanced human-centric video understanding in MLLMs, revealing significant gaps in current models' capabilities.

Contribution

We introduce a scalable, automated benchmark construction method and provide the first comprehensive evaluation of MLLMs on nuanced human-centric video tasks.

Findings

01 Current MLLMs struggle with subtle emotions and speech-visual alignment.

02 Even top models underperform humans on HumanVBench.

03 The benchmark and pipelines are open-sourced to foster future research.

Abstract

Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a great challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable "machine" for creating nuanced evaluation suites. Our extensive evaluation of 30 leading MLLMs on HumanVBench…
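To make the idea of turning model-induced errors into distractors concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function name `build_mcq`, its parameters, the frequency-based ranking of wrong answers, and the sample data are all assumptions made for illustration. It takes a verified answer plus answers produced by several models, keeps the wrong ones, and uses the most frequent errors as distractors.

```python
import random
from collections import Counter


def build_mcq(question, correct, model_answers, num_options=4, seed=0):
    """Hypothetical helper: build a multiple-choice item from model errors."""
    target = correct.strip().lower()
    # Keep only model answers that disagree with the verified answer.
    wrong = [a.strip() for a in model_answers if a.strip().lower() != target]
    # Assumption: errors made by several models are the most plausible distractors.
    ranked = [answer for answer, _ in Counter(wrong).most_common()]
    distractors = ranked[: num_options - 1]
    if len(distractors) < num_options - 1:
        raise ValueError("not enough distinct model errors to fill the options")
    options = distractors + [correct]
    # Seeded shuffle so the correct option's position is reproducible.
    random.Random(seed).shuffle(options)
    letters = "ABCDEFGH"
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(correct)],
    }


# Illustrative data only; not drawn from HumanVBench.
item = build_mcq(
    "Which emotion does the speaker show?",
    "sad",
    ["angry", "sad", "angry", "neutral", "fearful"],
)
print(item)
```

Ranking by frequency is one possible design choice; a real pipeline would likely also filter distractors that are accidentally correct or near-duplicates of the answer.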
