HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
Bingzi Zhang, Kaisi Guan, Ruihua Song

TL;DR
HuM-Eval is a new human-centric video evaluation framework that combines global assessment with detailed analysis of human motion to better match human preferences.
Contribution
The paper introduces a coarse-to-fine evaluation approach that uses vision-language models and human motion analysis, along with a new benchmark, HuM-Bench, for assessing human motion in videos.
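As a rough illustration of the coarse-to-fine idea, the sketch below combines a coarse global score with two fine-grained checks. This summary does not say how the paper computes or combines its scores, so everything here is an assumption made for illustration: the function names (`anatomy_score`, `stability_score`, `coarse_to_fine`), the use of limb-length variation as an anatomical check, the use of second differences as a stability check, and the multiplicative combination. The coarse score is taken as an input rather than produced by an actual vision-language model.

```python
from statistics import mean, pstdev
from typing import Sequence


def anatomy_score(limb_lengths: Sequence[float]) -> float:
    """Hypothetical 2D-pose check: a limb whose length varies strongly
    across frames is treated as anatomically implausible."""
    m = mean(limb_lengths)
    if m == 0:
        return 0.0
    cv = pstdev(limb_lengths) / m  # coefficient of variation across frames
    return max(0.0, 1.0 - cv)


def stability_score(joint_positions: Sequence[float]) -> float:
    """Hypothetical 3D-motion check on one coordinate of one joint:
    large second differences (acceleration) are treated as jitter."""
    if len(joint_positions) < 3:
        return 1.0
    accel = [
        abs(joint_positions[i + 1] - 2 * joint_positions[i] + joint_positions[i - 1])
        for i in range(1, len(joint_positions) - 1)
    ]
    return 1.0 / (1.0 + mean(accel))


def coarse_to_fine(vlm_score: float,
                   limb_lengths: Sequence[float],
                   joint_positions: Sequence[float]) -> float:
    """Assumed combination rule: the coarse score (in [0, 1]) is scaled
    by the weaker of the two fine-grained checks."""
    fine = min(anatomy_score(limb_lengths), stability_score(joint_positions))
    return vlm_score * fine


# A smooth clip with constant limb length keeps its coarse score;
# a jittery clip is penalized by the stability check.
smooth = coarse_to_fine(0.8, [1.0, 1.0, 1.0], [0, 1, 2, 3])
jittery = coarse_to_fine(0.8, [1.0, 1.0, 1.0], [0, 1, 0, 1, 0])
```

Taking the minimum of the fine-grained checks means a single severe defect (a distorted limb or a jittering joint) lowers the final score even when the global impression is good, which is one plausible way to let fine-grained verification override a coarse assessment.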
Findings
HuM-Eval achieves 58.2% average human correlation, outperforming previous metrics.
The framework effectively combines global quality assessment with detailed human motion verification.
HuM-Bench provides a diverse dataset for evaluating human motion generation models.
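For readers unfamiliar with "human correlation", the sketch below shows one common way to measure agreement between a metric and human ratings: Spearman rank correlation. This summary does not state which correlation coefficient the paper reports, so the choice of Spearman, the function names, and the sample numbers are all assumptions for illustration and are not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up metric scores and human ratings for four videos.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 4]
rho = spearman(metric_scores, human_ratings)
```

In this toy example the metric orders the four videos exactly as the human raters do, so the rank correlation is 1; a metric that ranked them in the opposite order would score -1.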
Abstract
Video generation models have developed rapidly in recent years, and generating natural human motion plays a pivotal role in them. However, accurately evaluating the quality of generated human motion videos remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with subjective human preferences. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first uses a Vision Language Model to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average human correlation of…
