PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Rongrong Ji, Chunhua Shen

TL;DR
PyramidCLIP introduces a hierarchical feature alignment method for vision-language pretraining that improves semantic matching and data efficiency, outperforming existing models on multiple downstream tasks.
Contribution
The paper proposes PyramidCLIP, a hierarchical alignment framework that addresses semantic mismatches and negative sample constraints in vision-language pretraining.
Findings
PyramidCLIP outperforms CLIP on ImageNet zero-shot classification by over 10% with the same data.
It achieves state-of-the-art results on several downstream tasks with larger datasets.
PyramidCLIP improves data efficiency, surpassing CLIP with fewer training pairs.
Abstract
Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods rely heavily on the assumption that image-text pairs crawled from the Internet are in perfect one-to-one correspondence. In real scenarios, however, this assumption rarely holds: the text description, obtained by crawling the metadata affiliated with the image, often suffers from semantic mismatch and mutual compatibility. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality, and aligns visual and linguistic elements hierarchically via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss of negative samples (unpaired samples) so as to weaken the strict constraint during the pre-training stage, thus mitigating the…
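The idea of softening the loss on negative (unpaired) samples can be illustrated with a small sketch. This is not the authors' implementation: it assumes a label-smoothing style target in which each paired sample keeps weight 1 - alpha and the remaining alpha is spread uniformly over the unpaired samples in the batch. The function names, the alpha parameter, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def softened_targets(n, alpha):
    """Assumed target scheme: 1 - alpha on the paired index,
    alpha / (n - 1) on each unpaired index."""
    if n < 2:
        raise ValueError("need at least two samples to have negatives")
    off = alpha / (n - 1)
    return [[1.0 - alpha if i == j else off for j in range(n)]
            for i in range(n)]


def softened_contrastive_loss(sim, alpha=0.2, temperature=0.07):
    """Symmetric image-to-text / text-to-image cross-entropy with
    softened targets. sim[i][j] is the similarity between image i
    and text j; pairs sit on the diagonal."""
    n = len(sim)
    targets = softened_targets(n, alpha)
    logits_i2t = [[s / temperature for s in row] for row in sim]
    # Transpose to score each text against every image.
    logits_t2i = [list(col) for col in zip(*logits_i2t)]

    def cross_entropy(logits):
        total = 0.0
        for row, tgt in zip(logits, targets):
            log_probs = log_softmax(row)
            total -= sum(t * lp for t, lp in zip(tgt, log_probs))
        return total / n

    return 0.5 * (cross_entropy(logits_i2t) + cross_entropy(logits_t2i))


if __name__ == "__main__":
    sim = [[1.0, 0.2, 0.1],
           [0.3, 0.9, 0.0],
           [0.1, 0.2, 0.8]]
    print("hard targets:", softened_contrastive_loss(sim, alpha=0.0))
    print("soft targets:", softened_contrastive_loss(sim, alpha=0.2))
```

With alpha set to 0 the function reduces to the usual one-hot contrastive objective. With alpha above 0, the targets no longer require unpaired similarities to be pushed toward zero probability, which is the sense in which the constraint on negatives is weakened in this sketch.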
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Bioinformatics
Methods: Contrastive Language-Image Pre-training
