Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S. Sekhon, Lawrence Staib, James S. Duncan

TL;DR
This paper proposes a lightweight group-robustness calibration method for CLIP that mitigates reliance on spurious features without needing group annotations, improving generalization across diverse tasks.
Contribution
It introduces a novel representation calibration approach using contrastive learning on a calibration set, enhancing group robustness without group labels.
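As a rough illustration of what contrastive learning on a calibration set can look like, the sketch below computes a generic supervised-contrastive-style loss over a small batch of embeddings. It is not the paper's exact objective: the function name `contrastive_calibration_loss`, the `temperature` value, and the use of class labels to pick positives are all assumptions made for this example. Group labels are not used.

```python
import numpy as np

def contrastive_calibration_loss(features, labels, temperature=0.1):
    """Generic supervised-contrastive-style loss (illustrative, hypothetical helper).

    features: (n, d) embeddings of calibration samples.
    labels:   (n,) integer class labels; group labels are not needed.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    # L2-normalise so dot products become cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # Exclude each sample's similarity with itself.
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    sim = np.where(not_self, sim, -np.inf)

    # Numerically stable log-softmax over all other samples.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    # Mean log-probability of same-class samples, for anchors with a positive.
    has_pos = positives.any(axis=1)
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / positives[has_pos].sum(axis=1)
    return float(per_anchor.mean())


# Toy 2-D embeddings: the same points score lower when labels match the clusters.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_aligned = contrastive_calibration_loss(feats, [0, 0, 1, 1])
loss_mixed = contrastive_calibration_loss(feats, [0, 1, 0, 1])
```

In this toy case the loss is lower when same-class samples sit close together in embedding space, which is the behaviour a calibration step of this kind would push the representation toward.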
Findings
Significant reduction in reliance on spurious features.
Improved generalization across multiple benchmarks.
Effective calibration without group annotations.
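Group robustness is commonly summarized by worst-group accuracy, the accuracy on the group where the model performs worst. The sketch below is a minimal version of that metric, assuming integer group identifiers are available at evaluation time only; the function name `worst_group_accuracy` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst accuracy, per-group accuracy dict); illustrative helper.

    groups holds integer group ids used for evaluation only, not for training.
    """
    y_true, y_pred, groups = (np.asarray(a) for a in (y_true, y_pred, groups))
    correct = y_true == y_pred
    per_group = {
        int(g): float(correct[groups == g].mean()) for g in np.unique(groups)
    }
    return min(per_group.values()), per_group


# Toy example: group 1 has one error out of two samples.
worst, per_group = worst_group_accuracy(
    y_true=[0, 0, 1, 1, 1, 0],
    y_pred=[0, 0, 1, 0, 1, 0],
    groups=[0, 0, 1, 1, 2, 2],
)
```

A model can have high average accuracy while this number stays low, which is why it is a useful check on reliance on spurious features.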
Abstract
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models becomes both time-intensive and computationally costly. Additionally, these tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features -- patterns that correlate with the target in training data, but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, largely based on the assumption that we can identify such features, do not provide definitive assurance for real-world applications. As a pilot study, this work focuses on exploring mitigating the…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
