PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Yuting Gao; Jinfeng Liu; Zihan Xu; Jun Zhang; Ke Li; Rongrong Ji; Chunhua Shen

arXiv:2204.14095·cs.CV·May 31, 2022·42 cites

TL;DR

PyramidCLIP introduces a hierarchical feature alignment method for vision-language pretraining that improves semantic matching and data efficiency, outperforming existing models on multiple downstream tasks.

Contribution

The paper proposes PyramidCLIP, a hierarchical alignment framework that addresses semantic mismatches and negative sample constraints in vision-language pretraining.

Findings

1. PyramidCLIP outperforms CLIP on ImageNet zero-shot classification by over 10% with the same data.

2. It achieves state-of-the-art results on several downstream tasks with larger datasets.

3. PyramidCLIP improves data efficiency, surpassing CLIP with fewer training pairs.

Abstract

Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods rely heavily on the assumption that image-text pairs crawled from the Internet are in perfect one-to-one correspondence. In real scenarios, however, this assumption can be hard to satisfy: the text description, obtained by crawling the metadata affiliated with the image, often suffers from semantic mismatch and mutual compatibility. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality and aligns visual and linguistic elements hierarchically via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss of negative samples (unpaired samples) so as to weaken the strict constraint during the pre-training stage, thus mitigating the…
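The idea of softening the loss on negative (unpaired) samples can be illustrated with a small sketch. The abstract as shown here does not say how the softening is done, so the code below assumes label smoothing on a symmetric CLIP-style contrastive loss: the paired sample keeps a target of `1 - smoothing` and the remaining mass is spread evenly over the negatives. The function name, the `temperature` and `smoothing` defaults, and the pure-Python formulation are all illustrative assumptions, not the authors' implementation, and the sketch does not cover the input pyramid or the peer-level and cross-level alignment.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def softened_contrastive_loss(image_emb, text_emb, temperature=0.07, smoothing=0.1):
    """Symmetric contrastive loss with label-smoothed targets.

    Assumption (not from the source): negatives are softened via label
    smoothing, giving each unpaired sample a target of smoothing / (n - 1).
    With smoothing=0.0 this reduces to the usual hard-target loss.
    """
    n = len(image_emb)
    if n < 2 or len(text_emb) != n:
        raise ValueError("need matching batches of at least two pairs")

    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]

    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction

    def direction_loss(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            log_probs = _log_softmax(row)
            for j, log_p in enumerate(log_probs):
                # Soft target: most mass on the pair, a little on each negative.
                target = 1.0 - smoothing if i == j else smoothing / (n - 1)
                total -= target * log_p
        return total / n

    return 0.5 * (direction_loss(logits) + direction_loss(logits_t))


# Toy batch of two perfectly aligned pairs (hypothetical embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
hard = softened_contrastive_loss(images, texts, smoothing=0.0)
soft = softened_contrastive_loss(images, texts, smoothing=0.1)
```

With hard targets the model is rewarded for pushing every unpaired similarity as low as possible, so on this toy batch `hard` is close to zero. With smoothed targets the same confident predictions are penalized (`soft` is larger), which is the sense in which the constraint on negatives is weakened.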

Videos

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Bioinformatics

Methods: Contrastive Language-Image Pre-training