Versatile Multi-Modal Pre-Training for Human-Centric Perception
Fangzhou Hong, Liang Pan, Zhongang Cai, Ziwei Liu

TL;DR
HCMoCo is a versatile multi-modal pre-training framework for human-centric perception that effectively leverages diverse human data modalities and priors, improving downstream task performance especially in data-scarce scenarios.
Contribution
The paper introduces HCMoCo, a novel contrastive learning framework that hierarchically learns modal-invariant representations using dense and sparse contrastive objectives for multi-modal human data.
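This summary does not spell out the exact form of the dense and sparse objectives. As a rough illustration of the general idea behind contrastive learning across modalities, the sketch below implements a generic InfoNCE loss between two sets of paired features. It is not the authors' implementation: the function names, the temperature value, and the toy "rgb"/"depth" vectors are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_modal_info_nce(feats_a, feats_b, temperature=0.07):
    """Generic InfoNCE loss between two modalities (hypothetical helper).

    Row i of feats_a and row i of feats_b are assumed to come from the
    same sample (the positive pair); all other rows act as negatives.
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every feature in the other modality.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true pair.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy features: two samples, two modalities (values invented for illustration).
rgb = [[1.0, 0.0], [0.0, 1.0]]
depth_aligned = [[0.9, 0.1], [0.1, 0.9]]
depth_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = cross_modal_info_nce(rgb, depth_aligned)
loss_shuffled = cross_modal_info_nce(rgb, depth_shuffled)
```

With these toy inputs, the loss is lower when the paired features agree across modalities than when the pairing is shuffled, which is the behavior a modal-invariant representation is meant to encourage.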
Findings
Significant improvements in DensePose estimation and human parsing, with gains of 7.16% and 12%, respectively.
Effective cross-modality supervision and missing-modality inference demonstrated.
Versatility across multiple downstream tasks validated.
Abstract
Human-centric perception plays a vital role in vision and graphics. However, the required data annotations are prohibitively expensive. Therefore, it is desirable to have a versatile pre-trained model that serves as a foundation for data-efficient downstream task transfer. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework (HCMoCo), which leverages the multi-modal nature of human data (e.g. RGB, depth, 2D keypoints) for effective representation learning. The objective comes with two main challenges: dense pre-training for multi-modal data and efficient usage of sparse human priors. To tackle the challenges, we design the novel Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets by hierarchically learning a modal-invariant latent space featured with continuous and ordinal feature distribution and structure-aware semantic…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Contrastive Learning
