In Search of the Successful Interpolation: On the Role of Sharpness in CLIP Generalization
Alireza Abdollahpoorrostam

TL;DR
This paper investigates how layer-wise sharpness, especially in straggler layers, influences the generalization of CLIP models during Robust Fine-Tuning, challenging traditional beliefs about flat minima and proposing new insights for model robustness.
Contribution
It introduces the concept of layer-wise sharpness in straggler layers as a reliable indicator of CLIP's OOD generalization during interpolation, and explores sparsity to improve robustness.
Findings
Layer-wise sharpness correlates with OOD generalization in CLIP.
Sharpness of the model as a whole is not a reliable predictor of OOD generalization for modern architectures like CLIP.
Sparsity in straggler layers mitigates failure modes in RFT.
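The findings above rely on a per-layer notion of sharpness, which this summary does not define. The sketch below is one common perturbation-based proxy, offered only as an illustration and not as the paper's measure: for each layer, perturb only that layer's weights on a sphere of radius rho and record the largest loss increase observed. The function name `layerwise_sharpness`, the parameters `rho` and `n_samples`, the toy quadratic loss, and the layer names are all assumptions made for this example.

```python
import math
import random
from typing import Callable, Dict, List

Weights = Dict[str, List[float]]


def layerwise_sharpness(
    loss_fn: Callable[[Weights], float],
    weights: Weights,
    rho: float = 0.1,
    n_samples: int = 20,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate, per layer, the largest loss increase under a radius-rho perturbation.

    Illustrative proxy only: random directions, one layer perturbed at a time.
    """
    rng = random.Random(seed)
    base_loss = loss_fn(weights)
    scores: Dict[str, float] = {}
    for name, w in weights.items():
        worst = 0.0
        for _ in range(n_samples):
            # Sample a random direction and rescale it to length rho.
            direction = [rng.gauss(0.0, 1.0) for _ in w]
            norm = math.sqrt(sum(d * d for d in direction)) or 1.0
            perturbed = dict(weights)  # shallow copy; other layers stay untouched
            perturbed[name] = [wi + rho * d / norm for wi, d in zip(w, direction)]
            worst = max(worst, loss_fn(perturbed) - base_loss)
        scores[name] = worst
    return scores


def toy_loss(weights: Weights) -> float:
    """Hypothetical quadratic loss with a different curvature per layer."""
    curvature = {"flat_layer": 1.0, "sharp_layer": 10.0}
    return sum(curvature[n] * sum(x * x for x in w) for n, w in weights.items())


scores = layerwise_sharpness(
    toy_loss, {"flat_layer": [0.0, 0.0], "sharp_layer": [0.0, 0.0]}
)
```

On this toy loss the layer with the larger curvature receives the larger score, which is the behaviour a layer-wise sharpness measure is meant to expose. A real study would replace `toy_loss` with the model's loss on held-out data and would likely use gradient-based worst-case perturbations rather than random ones.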
Abstract
Zero-shot models like CLIP are often fine-tuned on a target dataset to further improve their accuracy, but this can compromise out-of-distribution (OOD) robustness. Robust Fine-Tuning (RFT) (Wortsman et al., 2021), which interpolates between the zero-shot and fine-tuned models, has been proposed to address this issue. However, understanding of when RFT actually improves OOD error remains limited. In this work, we empirically investigate the robustness of RFT in CLIP models, with a focus on the sharpness of the CLIP model during interpolation. First, we demonstrate that sharpness may not serve as a reliable indicator for predicting the generalization of modern architectures like CLIP on OOD data, which challenges the conventional belief in the generalization benefits of flat minima in foundation models. However, by…
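The interpolation between zero-shot and fine-tuned models mentioned in the abstract can be sketched as a parameter-wise linear blend. This is a minimal illustration assuming both checkpoints share the same parameter names and shapes. The function `interpolate_weights`, the mixing coefficient `alpha`, and the parameter names in the toy example are hypothetical, and weights are plain Python lists rather than tensors to keep the example self-contained.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def interpolate_weights(
    zero_shot: StateDict, fine_tuned: StateDict, alpha: float
) -> StateDict:
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must have the same parameter names")
    merged: StateDict = {}
    for name, w_zs in zero_shot.items():
        w_ft = fine_tuned[name]
        if len(w_zs) != len(w_ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(w_zs, w_ft)]
    return merged


# Toy checkpoints with made-up parameter names and values.
zero_shot = {"visual.proj": [0.0, 2.0], "text.proj": [1.0, 1.0]}
fine_tuned = {"visual.proj": [4.0, 0.0], "text.proj": [3.0, -1.0]}

midpoint = interpolate_weights(zero_shot, fine_tuned, alpha=0.5)
```

Setting `alpha=0` recovers the zero-shot weights and `alpha=1` the fine-tuned ones. Sweeping `alpha` between the two and evaluating each blended model is how one would trace accuracy or sharpness along the interpolation path.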
Taxonomy
Topics: Linguistic Studies and Language Acquisition · Mathematics, Computing, and Information Processing · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Focus
