TL;DR
This paper introduces HV-MMBench, a comprehensive benchmark for evaluating multimodal large language models on human-centric video understanding across diverse tasks, data types, and temporal scales.
Contribution
It presents a new holistic benchmark with 13 tasks, multiple data formats, and broad video scenarios to better assess MLLMs' human-centric video understanding capabilities.
Findings
Benchmark covers 13 diverse tasks including attribute perception and cognitive reasoning.
Includes multiple data formats like multiple-choice and open-ended questions.
Evaluates models across 50 visual scenarios, with videos ranging from 10 seconds to 30 minutes.
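As a rough illustration of what scoring across several question formats can involve, the sketch below groups closed-form results by format and reports accuracy for each. The record fields (`format`, `prediction`, `answer`), the format names, and the exact-match rule are assumptions made for this example, not the benchmark's actual schema or metrics.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return {question_format: fraction of exact-match answers}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        fmt = rec["format"]
        totals[fmt] += 1
        # Case-insensitive exact match; only meaningful for closed-form answers.
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[fmt] += 1
    return {fmt: correct[fmt] / totals[fmt] for fmt in totals}


# Hypothetical results, invented purely for illustration.
records = [
    {"format": "multiple_choice", "prediction": "B", "answer": "B"},
    {"format": "multiple_choice", "prediction": "a", "answer": "C"},
    {"format": "true_false", "prediction": "True", "answer": "true"},
]
scores = accuracy_by_format(records)
```

Exact matching does not carry over to open-ended answers, which need a different scoring approach; that is one reason per-format results are usually reported separately rather than pooled.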
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the…
Peer Reviews
Decision·Submitted to ICLR 2026
The focus on a "human-centric" video understanding benchmark fills a potential gap in existing evaluations that may overly emphasize task-specific or non-human-perspective video analysis. Shifting the focus to human-centric comprehension introduces a novel angle for assessing video understanding models, which aligns with real-world scenarios where human relevance is crucial.
- Redundant question type design: Incorporating four question types (selection, fill-in-the-blank, judgment, open-ended) within a single benchmark is unnecessary. A benchmark's effectiveness lies in its ability to accurately and efficiently measure the target capability (human-centric video understanding). If one or two question types are most suitable for this purpose, prioritizing those would reduce complexity and focus the evaluation. The current multiplicity may dilute the benchmark's core v
1. The paper introduces HumanVideo-MME, a benchmark covering 13 human-centric tasks across perception and reasoning, offering unprecedented task diversity and evaluation formats.
2. Its automated annotation pipeline combining MLLMs and human validation ensures both scalability and data quality.
3. The evaluation compares multiple MLLMs using certain metrics, revealing concrete gaps between closed-form accuracy and genuine reasoning performance.
1. This is a nice-to-know study, but I question its main research value. The dataset construction, though large-scale, heavily relies on synthetic and pre-existing public datasets, limiting novelty in raw video acquisition.
2. The evaluation design should be more balanced: open-ended reasoning tasks remain small in sample size, potentially constraining generalizability of conclusions.
3. The study focuses solely on open-source MLLMs; inclusion of closed-source or proprietary baselines (e.g., G
- **Rigorous Construction**: Uses a semi-automated pipeline with both model-generated annotations and human review for quality control.
- **Insightful Findings**: Reveals a significant performance gap between closed-form and generative tasks, highlighting a key weakness in current MLLMs.
- **Weak Justification for "Human-Centric" Focus**: The paper does not sufficiently explain why human-centric videos require a specialized benchmark beyond general video understanding.
- **Unconvincing Metrics**: Some older/smaller models outperform newer/larger ones in certain tasks, suggesting possible flaws in metric design or benchmark construction.
- **Limited Model Variety**: Only open-source models are tested; including proprietary models (e.g., GPT-4o, Gemini-2.5-Pro) could provide a
