HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding

Keliang Li; Zaifei Yang; Jiahe Zhao; Hongze Shen; Ruibing Hou; Hong Chang; Shiguang Shan; Xilin Chen

arXiv:2410.06777 · cs.CV · October 10, 2024

TL;DR

This paper introduces HERM-Bench, a benchmark for evaluating how well multimodal large language models understand human-centric scenarios, and HERM-100K, a dataset for improving that ability through training. The resulting model, HERM-7B, outperforms existing models on human-centric tasks.

Contribution

The paper presents a novel benchmark, HERM-Bench, and a comprehensive dataset, HERM-100K, to enhance training and evaluation of MLLMs for human-centric understanding, along with a new model HERM-7B.

Findings

1. HERM-7B outperforms existing MLLMs on human-centric tasks.

2. Existing MLLMs have limitations in understanding complex human-centric scenarios.

3. Specialized datasets improve MLLMs' human-centric understanding.

Abstract

The significant advancements in visual understanding and instruction following from Multimodal Large Language Models (MLLMs) have opened up more possibilities for broader applications in diverse and universal human-centric scenarios. However, existing image-text data may not support the precise modality alignment and integration of multi-grained information, which is crucial for human-centric visual understanding. In this paper, we introduce HERM-Bench, a benchmark for evaluating the human-centric understanding capabilities of MLLMs. Our work reveals the limitations of existing MLLMs in understanding complex human-centric scenarios. To address these challenges, we present HERM-100K, a comprehensive dataset with multi-level human-centric annotations, aimed at enhancing MLLMs' training. Furthermore, we develop HERM-7B, an MLLM that leverages enhanced training data from HERM-100K.…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques