Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Yizhou Wang; Yixuan Wu; Weizhen He; Xun Guo; Feng Zhu; Lei Bai; Rui Zhao; Jian Wu; Tong He; Wanli Ouyang; Shixiang Tang

arXiv:2312.01697 · cs.CV · August 7, 2025 · 5 cites

TL;DR

Hulk is a multimodal generalist model that unifies human-centric perception tasks, spanning 2D vision, 3D vision, skeleton-based, and vision-language tasks, without task-specific finetuning. It achieves state-of-the-art results on 11 of 12 benchmarks.

Contribution

Hulk is a universal human-centric model that condenses task-specific heads into two general heads, treating diverse tasks as modality translation and applying broadly without task-specific finetuning.
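The idea of replacing per-task heads with a small set of shared heads can be pictured with a toy sketch. Everything below is hypothetical and for illustration only: the function names, the stand-in computations, the split of the two heads into a discrete-output head and a continuous-output head, and the assignment of tasks to heads are all assumptions, not the paper's implementation.

```python
# Illustrative sketch only. All names, the stand-in math, and the
# discrete/continuous split of the two heads are assumptions made for
# this example; they are not taken from the Hulk codebase.
from typing import Callable, Dict, List

Features = List[float]


def shared_encoder(inputs: List[float]) -> Features:
    """Stand-in for a shared backbone: mean-centers the input values."""
    mean = sum(inputs) / len(inputs)
    return [x - mean for x in inputs]


def discrete_head(features: Features) -> int:
    """Stand-in for a general head with discrete output (index of the largest feature)."""
    return max(range(len(features)), key=lambda i: features[i])


def continuous_head(features: Features) -> List[float]:
    """Stand-in for a general head with continuous output (scaled features)."""
    return [0.5 * f for f in features]


# Each task only declares which general head it routes to (assumed mapping);
# there is no per-task head to add or finetune.
TASK_TO_HEAD: Dict[str, Callable[[Features], object]] = {
    "skeleton_action_recognition": discrete_head,
    "pose_estimation": continuous_head,
}


def run_task(task: str, inputs: List[float]):
    """Encode once with the shared encoder, then decode with the task's general head."""
    features = shared_encoder(inputs)
    return TASK_TO_HEAD[task](features)


if __name__ == "__main__":
    print(run_task("skeleton_action_recognition", [1.0, 3.0, 2.0]))
    print(run_task("pose_estimation", [1.0, 3.0, 2.0]))
```

In this toy setup, supporting a new task means adding one entry to the routing table rather than training a new head, which is the design pressure the contribution statement points at.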

Findings

01. Achieves state-of-the-art performance on 11 out of 12 benchmarks.

02. Capable of handling 2D, 3D, skeleton, and vision-language tasks.

03. Demonstrates the effectiveness of unified modality translation for human-centric perception.

Abstract

Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they do not explore 3D and vision-language tasks for human-centric perception, and they require task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems