HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes

Keliang Li; Hongze Shen; Hao Shi; Ruibing Hou; Hong Chang; Jie Huang; Chenghao Jia; Wen Wang; Yiling Wu; Dongmei Jiang; Shiguang Shan; Xilin Chen

arXiv:2508.13692·cs.CV·August 20, 2025

3 Reviews

TL;DR

HumanPCR is a comprehensive evaluation suite designed to assess multimodal large language models' abilities in understanding, reasoning, and interpreting human-centric visual scenes across perception, comprehension, and reasoning levels, revealing significant challenges.

Contribution

We introduce HumanPCR, a novel benchmark with over 6,000 questions and video reasoning tasks to evaluate and analyze the human-centric visual understanding capabilities of multimodal models.

Findings

01

Models struggle with detailed spatial perception and temporal understanding.

02

Proactive visual evidence extraction remains challenging for current models.

03

Scaling context and test-time thinking offer limited improvements.

Abstract

The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity to understand human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions that cover a large set of tasks across 9 dimensions, including essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT)…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 4

Strengths

1. The paper proposes a fairly reasonable pipeline to curate complex VLM evaluation benchmarks across different difficulty levels (Figure 3). 2. The paper takes extra care to ensure that the quality and difficulty of the dataset are high. For instance, questions solvable by a blind LLM are filtered out. In addition, reasoning questions that rely on only one piece of evidence, or that require no reasoning, were not kept. 3. The paper tests a wide range of open and closed models across several model scales. The

Weaknesses

1. The process of human participation is not very clear at different stages of the work. (a) Who are the domain experts, and for what domains? (b) Where were these domain experts sourced from, and how many of them were there? (c) Did the same people also solve the task for the human evaluation number in Table 3? (d) Is the human evaluation taken from the same kind of domain experts who made the dataset, or from average human workers on some platform? 2. The writing style and presentation could be improved. For i

Reviewer 02·Rating 4·Confidence 3

Strengths

The benchmark’s most distinctive contribution is Human-R’s explicit requirement for multi-evidence reasoning with at least one proactive, non-referred cue—an evaluation target largely missing from existing video QA benchmarks. This forces models beyond query-matching shortcuts and surfaces a realistic capability gap in long, complex, human-centric videos. The taxonomy across Human-P/C delivers both breadth and diagnostic depth; macro-averaging at task/dimension level makes failure modes actionab
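The macro-averaging mentioned in this review can be illustrated with a minimal sketch. It assumes per-question records of the form (dimension, task, is_correct); the record layout, the function name `macro_accuracy`, and the task names in the example are hypothetical and are not taken from the paper's scoring code.

```python
from collections import defaultdict


def macro_accuracy(records):
    """Compute task-level, dimension-level, and overall macro accuracy.

    records: iterable of (dimension, task, is_correct) tuples.
    This layout is an assumption made for illustration only.
    """
    # Step 1: accuracy per (dimension, task), so large tasks do not dominate.
    per_task = defaultdict(list)
    for dimension, task, is_correct in records:
        per_task[(dimension, task)].append(1.0 if is_correct else 0.0)
    task_acc = {key: sum(vals) / len(vals) for key, vals in per_task.items()}

    # Step 2: average the task accuracies within each dimension.
    per_dim = defaultdict(list)
    for (dimension, _task), acc in task_acc.items():
        per_dim[dimension].append(acc)
    dim_acc = {dim: sum(accs) / len(accs) for dim, accs in per_dim.items()}

    # Step 3: average the dimension accuracies for a single headline number.
    overall = sum(dim_acc.values()) / len(dim_acc)
    return task_acc, dim_acc, overall


# Hypothetical example: one task has three questions, the others fewer.
example = [
    ("perception", "pose", True),
    ("perception", "pose", True),
    ("perception", "pose", True),
    ("perception", "gaze", False),
    ("comprehension", "intent", True),
    ("comprehension", "intent", False),
]
task_acc, dim_acc, overall = macro_accuracy(example)
```

In this toy example the plain (micro) accuracy would be 4/6, while the macro average weights each task and dimension equally, so a weak task with few questions remains visible in the dimension score.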

Weaknesses

While Human-P/C are valuable and thorough, their task space overlaps with prior perception/comprehension benchmarks; the strongest novelty lies in Human-R. The paper would benefit from ablations demonstrating added diagnostic value of Human-P/C design (beyond coverage), or clearer evidence that specific sub-tasks reduce confounds seen in earlier datasets. Human-R’s reliance on an LLM judge, albeit validated, still risks metric drift as judges evolve; a small, fully human-adjudicated gold subset

Reviewer 03·Rating 4·Confidence 3

Strengths

1) A reasoning paradigm that enforces multi-evidence integration and proactive evidence seeking, with rigorous filtering to avoid shortcutting. 2) Most datasets just ask questions where the answer is directly visible in one frame or clip. This benchmark forces models to search multiple places in the video to figure things out—like humans would.

Weaknesses

1) The paper says annotations will be CC BY 4.0, but the DUA also says “non-commercial academic only.” CC BY allows commercial use—this must be reconciled (e.g., CC BY-NC for annotations, or remove NC language from DUA). 2) All answers are scored by a single LLM, even if correlations with humans are decent. To be rigorous, they should show confidence intervals, try multiple judges, and share prompts/seeds for transparency. 3) Only ~10% of Human-R questions were answered by actual humans. To make
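One common way to report the confidence intervals this review asks for is a percentile bootstrap over per-question judge scores. The sketch below assumes binary scores (1 for correct, 0 for incorrect) from a single judge; the function name `bootstrap_ci` and its parameters are hypothetical, and nothing here reflects the paper's actual evaluation code.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    scores: list of per-question scores (assumed 0/1 here).
    A fixed seed keeps the interval reproducible across runs.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the mean score.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Read the lower and upper percentiles off the sorted resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper


# Hypothetical judge scores for ten questions.
mean, lower, upper = bootstrap_ci([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
```

Running the same procedure with scores from several judges, and publishing the seed alongside the prompts, would make the comparison between judges reproducible.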
