SuperCLIP: CLIP with Simple Classification Supervision
Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

TL;DR
SuperCLIP enhances CLIP's visual-text alignment by integrating token-level supervision through a simple classification augmentation, significantly improving performance across multiple vision-language tasks without extra data or substantial computational cost.
Contribution
SuperCLIP introduces a lightweight classification supervision framework that improves fine-grained alignment in CLIP models without additional annotated data or significant FLOPs increase.
Findings
Improves zero-shot classification accuracy
Enhances image-text retrieval performance
Mitigates small-batch training issues
Abstract
Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no…
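The combination the abstract describes, a contrastive objective plus classification supervision from a linear layer on the vision encoder, can be illustrated with a minimal sketch. The abstract as shown here is truncated and does not state the exact classification loss, so everything specific below is an assumption rather than the authors' implementation: the bag-of-tokens target built from caption token IDs, the softmax cross-entropy averaged over the tokens present, the temperature of 0.07, the `cls_weight` mixing factor, and all function names are hypothetical.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; matching pairs share an index."""
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    n = len(img)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    # Image-to-text: each row should pick its own column.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same computation on the transposed logits.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def classification_loss(img_feats, weight, token_ids, vocab_size):
    """Assumed form: a linear head maps image features to vocabulary logits,
    and the caption's token IDs act as the (multi-label) targets.
    Each caption is assumed to contain at least one token."""
    total = 0.0
    for feat, ids in zip(img_feats, token_ids):
        logits = [sum(w * x for w, x in zip(weight[v], feat))
                  for v in range(vocab_size)]
        logp = log_softmax(logits)
        present = set(ids)  # bag of tokens: order and repeats are ignored
        total += -sum(logp[v] for v in present) / len(present)
    return total / len(img_feats)

def combined_loss(img_feats, img_emb, txt_emb, weight, token_ids,
                  vocab_size, cls_weight=1.0):
    # cls_weight is a hypothetical mixing factor, not a value from the paper.
    return (contrastive_loss(img_emb, txt_emb)
            + cls_weight * classification_loss(img_feats, weight,
                                               token_ids, vocab_size))
```

In this sketch the only added parameters are the rows of `weight`, one per vocabulary entry, which mirrors the abstract's description of a single lightweight linear layer. The classification term is computed per image and needs no in-batch negatives, unlike the contrastive term.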
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
