HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding

Yuxuan Cai; Jiangning Zhang; Zhenye Gan; Qingdong He; Xiaobin Hu; Junwei Zhu; Yabiao Wang; Chengjie Wang; Zhucun Xue; Chaoyou Fu; Xinwei He; Xiang Bai

arXiv:2507.04909·cs.CV·October 1, 2025

3 Reviews

TL;DR

This paper introduces HV-MMBench, a comprehensive benchmark for evaluating multimodal large language models on human-centric video understanding across diverse tasks, data types, and temporal scales.

Contribution

It presents a new holistic benchmark with 13 tasks, multiple data formats, and broad video scenarios to better assess MLLMs' human-centric video understanding capabilities.

Findings

01. Benchmark covers 13 diverse tasks, including attribute perception and cognitive reasoning.

02. Includes multiple data formats, such as multiple-choice and open-ended questions.

03. Evaluates models across 50 visual scenarios and videos from 10 seconds to 30 minutes.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking the essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The focus on a "human-centric" video understanding benchmark fills a potential gap in existing evaluations that may overly emphasize task-specific or non-human-perspective video analysis. Shifting the focus to human-centric comprehension introduces a novel angle for assessing video understanding models, which aligns with real-world scenarios where human relevance is crucial.

Weaknesses

- Redundant question type design: Incorporating four question types (selection, fill-in-the-blank, judgment, open-ended) within a single benchmark is unnecessary. A benchmark's effectiveness lies in its ability to accurately and efficiently measure the target capability (human-centric video understanding). If one or two question types are most suitable for this purpose, prioritizing those would reduce complexity and focus the evaluation. The current multiplicity may dilute the benchmark's core v

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper introduces HumanVideo-MME, a benchmark covering 13 human-centric tasks across perception and reasoning, offering unprecedented task diversity and evaluation formats.
2. Its automated annotation pipeline combining MLLMs and human validation ensures both scalability and data quality.
3. The evaluation compares multiple MLLMs using certain metrics, revealing concrete gaps between closed-form accuracy and genuine reasoning performance.

Weaknesses

1. This is a nice-to-know study, but I question its main research value. The dataset construction, though large-scale, heavily relies on synthetic and pre-existing public datasets, limiting novelty in raw video acquisition.
2. The evaluation design should be more balanced: open-ended reasoning tasks remain small in sample size, potentially constraining the generalizability of the conclusions.
3. The study focuses solely on open-source MLLMs; inclusion of closed-source or proprietary baselines (e.g., G

Reviewer 03 · Rating 0 · Confidence 5

Strengths

- **Rigorous Construction**: Uses a semi-automated pipeline with both model-generated annotations and human review for quality control.
- **Insightful Findings**: Reveals a significant performance gap between closed-form and generative tasks, highlighting a key weakness in current MLLMs.

Weaknesses

- **Weak Justification for "Human-Centric" Focus**: The paper does not sufficiently explain why human-centric videos require a specialized benchmark beyond general video understanding.
- **Unconvincing Metrics**: Some older/smaller models outperform newer/larger ones in certain tasks, suggesting possible flaws in metric design or benchmark construction.
- **Limited Model Variety**: Only open-source models are tested; including proprietary models (e.g., GPT-4o, Gemini-2.5-Pro) could provide a
