Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP
Anant Mehta, Xiyuan Wei, Xingyu Chen, Tianbao Yang

TL;DR
This paper introduces TuneCLIP, a self-supervised fine-tuning framework that improves open-weight CLIP models across a range of downstream tasks without retraining from scratch, while avoiding the performance degradation that standard training protocols cause when started from released weights.
Contribution
We propose TuneCLIP, a novel fine-tuning method with a warm-up stage and a contrastive loss adjustment, improving open-weight CLIP models' generalization and performance.
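The reviews on this page describe the warm-up stage as running updates with frozen weights to estimate statistics before fine-tuning starts. A minimal sketch of that idea follows, assuming the statistics are Adam-style first and second moment estimates; this page does not say which statistics the paper tracks, and `warmup_moments`, `toy_grad`, and the least-squares toy problem are hypothetical names chosen for illustration, not the authors' code.

```python
import numpy as np


def warmup_moments(grad_fn, weights, batches, beta1=0.9, beta2=0.999):
    """Accumulate Adam-style moment estimates without updating the weights.

    Assumption: the warm-up statistics are exponential moving averages of
    the gradient and the squared gradient, as in Adam.
    """
    frozen = weights.copy()          # weights are read, never written
    m = np.zeros_like(frozen)        # first-moment estimate
    v = np.zeros_like(frozen)        # second-moment estimate
    steps = 0
    for batch in batches:
        g = grad_fn(frozen, batch)   # gradient evaluated at the frozen weights
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        steps += 1
    if steps == 0:
        raise ValueError("need at least one batch")
    # Standard Adam bias correction for zero-initialised averages.
    return m / (1.0 - beta1 ** steps), v / (1.0 - beta2 ** steps)


def toy_grad(w, batch):
    """Gradient of a mean squared error, standing in for a contrastive loss."""
    x, y = batch
    return x.T @ (x @ w - y) / len(y)


rng = np.random.default_rng(0)
w0 = rng.normal(size=3)
batches = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(20)]
m_hat, v_hat = warmup_moments(toy_grad, w0, batches)
```

Under this assumption, the returned estimates would seed the optimizer state before the first real weight update, so the early steps are not driven by zero-initialised moments.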
Findings
Achieves up to +2.5% on ImageNet and OOD benchmarks.
Improves SigLIP (ViT-B/16) performance without retraining.
Sets a new baseline for efficient post-pretraining adaptation.
Abstract
CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of…
Peer Reviews
Decision: Submitted to ICLR 2026
- Addresses the important problem of improving existing CLIP models without expensive retraining
- Provides formal analysis of cold-start bias with convergence guarantees
- Tests across multiple architectures (OpenAI, SigLIP, LAION) and diverse benchmarks
- The two-stage approach is well-motivated with good empirical evidence of the problems being solved

- Both OSR and HGCL are relatively straightforward modifications of existing techniques. OSR is essentially warmup with frozen parameters, while HGCL applies a standard hinge loss variant.
- For SigLIP ViT-B/16, improvements are only +0.11% on retrieval and +1.15% on DataComp, raising questions about practical significance.
- No comparison with other fine-tuning strategies like LoRA or prompt tuning
- Limited ablation on hyperparameters (only margin m is studied)
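To make the reviewer's "standard hinge loss variant" concrete, here is a minimal sketch of a generic margin-based contrastive loss over a batch of paired image and text embeddings. It is an assumption-laden illustration, not the paper's HGCL objective, whose exact form is not given on this page; the function name `hinge_contrastive_loss` and the default `margin=0.1` are hypothetical, with `margin` playing the role of the margin m mentioned in the review.

```python
import numpy as np


def hinge_contrastive_loss(img, txt, margin=0.1):
    """Generic symmetric hinge loss on cosine similarities (batch size >= 2).

    Row i of `img` and row i of `txt` form a positive pair; every other
    pairing in the batch is treated as a negative.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T                      # sim[i, j] = cos(image i, text j)
    pos = np.diag(sim)                     # similarities of the true pairs
    # A negative is penalised only when it comes within `margin` of the positive.
    i2t = np.maximum(0.0, margin + sim - pos[:, None])   # image -> text
    t2i = np.maximum(0.0, margin + sim - pos[None, :])   # text -> image
    off_diag = ~np.eye(len(sim), dtype=bool)             # exclude positives
    return 0.5 * (i2t[off_diag].mean() + t2i[off_diag].mean())


aligned = hinge_contrastive_loss(np.eye(3), np.eye(3))
swapped = hinge_contrastive_loss(np.eye(3), np.eye(3)[[1, 0, 2]])
```

In this toy case the aligned batch incurs zero loss because every negative sits more than the margin below its positive, while swapping two captions produces a positive loss. A larger margin makes the loss active for more negatives.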
The paper offers a novel and practical perspective on improving CLIP models by focusing on efficient post-pretraining enhancement rather than costly retraining. It systematically analyzes the causes of performance degradation during early fine-tuning and provides clear theoretical justification for the proposed solution. The two-stage optimization framework is conceptually sound and empirically validated through extensive experiments across multiple CLIP architectures.
The experiments primarily focus on single-object datasets, limiting the evaluation of TuneCLIP’s effectiveness on more complex, multi-object benchmarks such as COCO. All results are reported on ViT-B/16 models, making it difficult to assess scalability across larger or smaller architectures. Additionally, while the method is more efficient than full pretraining, the added warm-up stage introduces extra computation, and the paper does not quantify the actual wall-clock or resource savings.
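The extra computation mentioned above can be bounded with simple arithmetic. Another review on this page notes that the paper reports E=5 warm-up epochs over the full fine-tuning corpus. The sketch below counts sample passes only; it assumes the dataset names DFN-12M and DFN-60M reflect nominal sizes of 12 and 60 million pairs, and that a warm-up pass costs about as much per sample as a fine-tuning pass. Neither assumption is confirmed here, and the helper names are hypothetical.

```python
def warmup_sample_passes(num_samples, warmup_epochs):
    """Number of samples processed during the warm-up stage."""
    return num_samples * warmup_epochs


def warmup_overhead_ratio(warmup_epochs, finetune_epochs):
    """Warm-up cost relative to fine-tuning, assuming equal per-sample cost."""
    return warmup_epochs / finetune_epochs


E = 5  # warm-up epochs, as quoted in the reviews
passes_12m = warmup_sample_passes(12_000_000, E)   # assumed size of DFN-12M
passes_60m = warmup_sample_passes(60_000_000, E)   # assumed size of DFN-60M
```

Under these assumptions the warm-up processes 60 million samples on the smaller corpus and 300 million on the larger one. Turning that into GPU-hours requires measured throughput, which is the figure the reviewers ask the authors to report.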
1. **Clear practical problem with simple remedy:** The cold-start observation is concrete and likely familiar to practitioners; OSR is simple and easy to implement (run updates with frozen weights). The intuition is compelling and backed by analysis.
2. **Thorough experiments across models and data scales:** Results include multiple model checkpoints, DFN-12M/60M fine-tuning, ImageNet variants, retrieval (MSCOCO/Flickr30k), and the DataComp 38-dataset suite. Ablations isolate the contribution of

1. **Lack of practical GPU-cost analysis:** OSR requires running several epochs (paper reports E=5) over the full fine-tuning corpus to estimate statistics. For DFN-12M/60M, this is non-trivial. The paper lacks a clear breakdown of wall-clock cost (GPU-hours) of OSR + TuneCLIP versus, e.g., a baseline “just run FastCLIP for 5 epochs”. Provide explicit compute numbers (GPU-hours, FLOPs), and a cost/benefit curve (OSR epochs vs gain).
2. **More baselines and alternative fixes:** The paper compare
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
