Versatile Multi-Modal Pre-Training for Human-Centric Perception

Fangzhou Hong; Liang Pan; Zhongang Cai; Ziwei Liu

arXiv:2203.13815·cs.CV·March 28, 2022

TL;DR

HCMoCo is a versatile multi-modal pre-training framework for human-centric perception that effectively leverages diverse human data modalities and priors, improving downstream task performance especially in data-scarce scenarios.

Contribution

The paper introduces HCMoCo, a novel contrastive learning framework that hierarchically learns modal-invariant representations using dense and sparse contrastive objectives for multi-modal human data.
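
As a rough illustration of what a contrastive objective across modalities looks like, the sketch below computes a generic InfoNCE loss between paired RGB and depth feature vectors. It is not HCMoCo's Dense Intra-sample or Sparse Structure-aware objective; the function names, the temperature of 0.07, and the toy 2-D features are all assumptions made for this example, and only the Python standard library is used.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_info_nce(rgb_feats, depth_feats, temperature=0.07):
    """Mean InfoNCE loss over a batch (hypothetical helper, not the paper's code).

    Row i of rgb_feats and row i of depth_feats are assumed to come from the
    same sample (the positive pair); every other row acts as a negative.
    """
    rgb = [l2_normalize(v) for v in rgb_feats]
    depth = [l2_normalize(v) for v in depth_feats]
    losses = []
    for i, anchor in enumerate(rgb):
        logits = [dot(anchor, d) / temperature for d in depth]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matching depth feature.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy features (assumed): two samples, two modalities.
rgb_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_info_nce(rgb_feats, [[0.9, 0.1], [0.1, 0.9]])
swapped = cross_modal_info_nce(rgb_feats, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is small when each RGB feature is closest to the depth feature of the same sample and large when the pairing is swapped, which is the pressure that pulls modalities toward a shared, modal-invariant space.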

Findings

01 DensePose Estimation and Human Parsing improve by 7.16% and 12%, respectively.

02 HCMoCo supports effective cross-modality supervision and missing-modality inference.

03 Its versatility is validated across multiple downstream tasks.

Abstract

Human-centric perception plays a vital role in vision and graphics, but its data annotations are prohibitively expensive. It is therefore desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework, HCMoCo, which leverages the multi-modal nature of human data (e.g., RGB, depth, 2D keypoints) for effective representation learning. This objective comes with two main challenges: dense pre-training for multi-modality data and efficient usage of sparse human priors. To tackle these challenges, we design the novel Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets by hierarchically learning a modal-invariant latent space featured with continuous and ordinal feature distribution and structure-aware semantic…

Code & Models

Repositories

hongfz16/hcmoco (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Contrastive Learning