HumanVLM: Foundation for Human-Scene Vision-Language Model

Dawei Dai; Xu Long; Li Yutang; Zhang Yuanhui; Shuyin Xia

arXiv:2411.03034·cs.AI·November 6, 2024

TL;DR

This paper introduces HumanVLM, a large domain-specific vision-language model focused on human-scene understanding, trained on extensive human-centered datasets, and demonstrating superior performance in human-related tasks compared to similar models.

Contribution

The paper presents a new large-scale human-scene multimodal dataset and trains a specialized vision-language model that outperforms comparable models in human-centric tasks.

Findings

01 HumanVLM outperforms similar models like Qwen2VL and ChatGPT-4o in human-related tasks.

02 The HumanCaption-10M and HumanCaptionHQ datasets were created for domain-specific training.

03 HumanVLM demonstrates superior performance across various downstream tasks.

Abstract

Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advancements predominantly rely on models specifically tailored to individual tasks. Emerging research indicates that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks. However, general-domain models often underperform in specialized fields. This study introduces a domain-specific Large Vision-Language Model, Human-Scene Vision-Language Model (HumanVLM), designed to provide a foundation for human-scene Vision-Language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) develop a captioning approach for human-centered images, capturing human faces, bodies, and backgrounds, and construct a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques