MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

Kangkang Wang; Qinting Jiang; Wanping Zhang; Bowen Ren; Shengzhao Wen

arXiv:2605.03485 · cs.CV · May 6, 2026

TL;DR

MHPR is a comprehensive benchmark designed to evaluate and improve large vision-language models' ability to understand and reason about human-centric scenes across multiple dimensions.

Contribution

The paper introduces MHPR, a new multidimensional benchmark with an automated annotation pipeline, to advance human perception and reasoning in vision-language models.

Findings

1. Format-aligned supervised fine-tuning data improves instruction following.

2. Reinforcement learning data enhances performance on difficult instances.

3. Training Qwen2.5-VL-7B with MHPR achieves near-parity with larger models.

Abstract

Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes spanning individual, multi-person, and human-object interaction dimensions. MHPR comprises a multi-level data design: Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D). It is paired with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. We evaluate state-of-the-art vision-language models on fine-grained attributes…
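The abstract names multi-model voting as one ACVG step but does not state the voting rule. As a rough illustration only, the sketch below assumes a simple strict-majority rule over answers that several models give to the same question. The function name `majority_vote`, the `min_agreement` threshold, and the text normalization are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the answer shared by more than `min_agreement` of the models.

    Hypothetical helper: the actual ACVG voting rule is not given in the
    abstract. Returns None when no answer clears the threshold, so the
    sample could be dropped or sent for manual review.
    """
    if not answers:
        return None
    # Normalize case and whitespace so trivially different strings match.
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    # Strict inequality: an even split does not count as agreement.
    if count / len(normalized) > min_agreement:
        return top_answer
    return None
```

For example, `majority_vote(["Yes", "yes ", "no"])` keeps the annotation with the answer `"yes"`, while an even split such as `majority_vote(["a", "b"])` returns `None`.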
