HumanVLM: Foundation for Human-Scene Vision-Language Model
Dawei Dai, Xu Long, Li Yutang, Zhang Yuanhui, Shuyin Xia

TL;DR
This paper introduces HumanVLM, a large domain-specific vision-language model focused on human-scene understanding, trained on extensive human-centered datasets, and demonstrating superior performance in human-related tasks compared to similar models.
Contribution
The paper presents a new large-scale human-scene multimodal dataset and trains a specialized vision-language model that outperforms comparable models in human-centric tasks.
Findings
HumanVLM outperforms similar models like Qwen2VL and ChatGPT-4o in human-related tasks.
The authors created the HumanCaption-10M and HumanCaptionHQ datasets for domain-specific training.
HumanVLM demonstrated superior performance across various downstream tasks.
Abstract
Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advancements predominantly rely on models specifically tailored to individual tasks. Emerging research indicates that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks. However, general-domain models often underperform in specialized fields. This study introduces a domain-specific Large Vision-Language Model, Human-Scene Vision-Language Model (HumanVLM), designed to provide a foundation for human-scene Vision-Language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) develop a captioning approach for human-centered images, capturing human faces, bodies, and backgrounds, and construct a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
