TL;DR
HumanPCR is an evaluation suite that tests how well multimodal large language models understand human-centric visual scenes at three levels: perception, comprehension, and reasoning. It shows that current models still struggle, particularly with spatial perception, temporal understanding, and proactive extraction of visual evidence.
Contribution
We introduce HumanPCR, a benchmark that combines over 6,000 human-verified multiple-choice questions with a manually curated video reasoning test to evaluate and analyze the human-centric visual understanding of multimodal models.
Findings
Models struggle with fine-grained spatial perception and temporal understanding.
Proactive visual evidence extraction remains challenging for current models.
Scaling context and test-time thinking offer limited improvements.
Abstract
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity in human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions, assessing a large set of tasks across 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT)…
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper proposes a fairly reasonable pipeline to curate complex VLM evaluation benchmarks across different difficulty levels (Figure 3). 2. The paper takes extra care to ensure that the quality and difficulty of the dataset are high. For instance, questions solvable by a blind LLM are filtered out. In addition, reasoning questions that rely on only one piece of evidence, or that lack reasoning, were not kept. 3. The paper tests a wide range of models in the open and closed world across several model scales. The
1. The process of human participation at different stages of the work is not very clear. (a) Who are the domain experts, and for what domains? (b) Where were these domain experts sourced from, and how many of them were there? (c) Did the same people also solve the task for the human evaluation numbers in Table 3? (d) Is the human evaluation taken from the same kind of domain experts that made the dataset, or from average human workers on some platform? 2. The writing style and presentation could be improved. For i
The benchmark’s most distinctive contribution is Human-R’s explicit requirement for multi-evidence reasoning with at least one proactive, non-referred cue—an evaluation target largely missing from existing video QA benchmarks. This forces models beyond query-matching shortcuts and surfaces a realistic capability gap in long, complex, human-centric videos. The taxonomy across Human-P/C delivers both breadth and diagnostic depth; macro-averaging at task/dimension level makes failure modes actionab
While Human-P/C are valuable and thorough, their task space overlaps with prior perception/comprehension benchmarks; the strongest novelty lies in Human-R. The paper would benefit from ablations demonstrating added diagnostic value of Human-P/C design (beyond coverage), or clearer evidence that specific sub-tasks reduce confounds seen in earlier datasets. Human-R’s reliance on an LLM judge, albeit validated, still risks metric drift as judges evolve; a small, fully human-adjudicated gold subset
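The macro-averaging at task and dimension level that this review credits can be illustrated with a minimal sketch. It assumes a two-stage scheme (mean accuracy per task, then mean over tasks within a dimension, then mean over dimensions); the exact aggregation used by HumanPCR is not stated here, and the dimension and task names in the example are hypothetical.

```python
from collections import defaultdict


def macro_average(records):
    """Two-stage macro-average over (dimension, task, is_correct) records.

    Assumption: every task counts equally within its dimension, and every
    dimension counts equally overall, regardless of how many questions
    each contains.
    """
    # Stage 1: accuracy per (dimension, task).
    per_task = defaultdict(list)
    for dimension, task, is_correct in records:
        per_task[(dimension, task)].append(1.0 if is_correct else 0.0)
    task_acc = {key: sum(v) / len(v) for key, v in per_task.items()}

    # Stage 2: average task accuracies within each dimension.
    per_dim = defaultdict(list)
    for (dimension, _task), acc in task_acc.items():
        per_dim[dimension].append(acc)
    dim_acc = {dim: sum(v) / len(v) for dim, v in per_dim.items()}

    # Overall score: unweighted mean over dimensions.
    overall = sum(dim_acc.values()) / len(dim_acc)
    return task_acc, dim_acc, overall


# Hypothetical records, for illustration only (not benchmark data).
example_records = [
    ("pose", "task_a", True),
    ("pose", "task_a", False),
    ("pose", "task_b", True),
    ("emotion", "task_c", False),
]
task_acc, dim_acc, overall = macro_average(example_records)
```

On these toy records the plain per-question accuracy is 2/4 = 0.5, while the macro-average is 0.375, because the single failed task in the small "emotion" dimension carries as much weight as the whole "pose" dimension. That reweighting is what keeps a weak dimension from being hidden by larger ones.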
1) A reasoning paradigm that enforces multi-evidence integration and proactive evidence seeking, with rigorous filtering to avoid shortcutting. 2) Most datasets just ask questions where the answer is directly visible in one frame or clip. This benchmark forces models to search multiple places in the video to figure things out—like humans would.
1) The paper says annotations will be CC BY 4.0, but the DUA also says “non-commercial academic only.” CC BY allows commercial use—this must be reconciled (e.g., CC BY-NC for annotations, or remove NC language from DUA). 2) All answers are scored by a single LLM, even if correlations with humans are decent. To be rigorous, they should show confidence intervals, try multiple judges, and share prompts/seeds for transparency. 3) Only ~10% of Human-R questions were answered by actual humans. To make
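The confidence intervals this review asks for could be obtained with a percentile bootstrap over per-question judge scores. The following is a minimal sketch under that assumption, not the authors' procedure; the function name, the fixed seed, and the toy 0/1 scores are all hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    `scores` holds one judge score per question (e.g. 0 or 1).
    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the resampled mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, lower, upper


# Toy scores for illustration only (not results from the paper).
toy_scores = [1] * 30 + [0] * 70
mean, lower, upper = bootstrap_ci(toy_scores)
```

Repeating the same computation with scores from a second judge model and comparing the intervals would address the reviewer's multiple-judge request as well.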
