SuperCLIP: CLIP with Simple Classification Supervision

Weiheng Zhao; Zilong Huang; Jiashi Feng; Xinggang Wang

arXiv:2512.14480·cs.CV·December 17, 2025

TL;DR

SuperCLIP augments CLIP's contrastive objective with token-level classification supervision, implemented as a lightweight linear layer on the vision encoder. It improves zero-shot classification and image-text retrieval without additional annotated data, at a 0.077% increase in total FLOPs.

Contribution

SuperCLIP introduces a lightweight classification supervision framework that improves fine-grained alignment in CLIP models without additional annotated data or significant FLOPs increase.

Findings

1. Improves zero-shot classification accuracy
2. Enhances image-text retrieval performance
3. Mitigates small-batch training issues

Abstract

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision, limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment, with just a 0.077% increase in total FLOPs, and no…
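The abstract describes the idea only at a high level, so the following is a minimal sketch of how a contrastive loss could be combined with a classification loss computed from a single linear layer on the image features. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the bag-of-tokens target built from caption token IDs, the softmax cross-entropy form of the classification loss, the temperature of 0.07, and the `weight` balancing term.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric CLIP-style loss over a batch of paired embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def token_targets(token_ids, vocab_size):
    """Assumed target: a normalized bag of the tokens in each caption.

    Each caption must contain at least one token.
    """
    targets = np.zeros((len(token_ids), vocab_size))
    for row, ids in enumerate(token_ids):
        targets[row, list(set(ids))] = 1.0
        targets[row] /= targets[row].sum()
    return targets

def classification_loss(img, W, b, targets):
    """Cross-entropy between one linear layer's prediction and the token targets."""
    logits = img @ W + b  # the only added parameters: W and b
    return -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()

def combined_loss(img, txt, token_ids, W, b, weight=1.0):
    """Contrastive loss plus a weighted classification term (weight is hypothetical)."""
    targets = token_targets(token_ids, W.shape[1])
    return contrastive_loss(img, txt) + weight * classification_loss(img, W, b, targets)
```

In this sketch the classification term needs no labels beyond the caption tokens already present in the training pairs, and its cost is a single matrix multiply per image, which is consistent with the abstract's claim of no extra annotated data and a very small FLOPs increase.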

Videos

SuperCLIP: CLIP with Simple Classification Supervision · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning