Hulk: A Universal Knowledge Translator for Human-Centric Tasks
Yizhou Wang, Yixuan Wu, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang, Shixiang Tang

TL;DR
Hulk is a versatile multimodal model that unifies various human-centric perception tasks, including 2D/3D vision and vision-language tasks, without task-specific finetuning, achieving state-of-the-art results across multiple benchmarks.
Contribution
Hulk is a universal human-centric model that condenses task-specific heads into two general heads, enabling modality translation and broad applicability without task-specific finetuning.
Findings
Achieves state-of-the-art performance on 11 out of 12 benchmarks.
Capable of handling 2D, 3D, skeleton, and vision-language tasks.
Demonstrates the effectiveness of unified modality translation for human-centric perception.
Abstract
Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore human-centric 3D and vision-language tasks, and they required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
