RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz

TL;DR
RaVL is a novel method that identifies and reduces spurious correlations in fine-tuned vision-language models by focusing on local image features, significantly improving zero-shot robustness across diverse models and domains.
Contribution
RaVL introduces a region-level clustering approach for discovering spurious correlations in fine-tuned VLMs and a region-aware loss for mitigating them, advancing fine-grained robustness techniques.
Findings
RaVL improves spurious correlation discovery by 191% over baselines.
RaVL enhances worst-group accuracy by 8.2%.
Qualitative results confirm effective mitigation of spurious correlations.
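For readers unfamiliar with the metric behind the second finding, worst-group accuracy is the accuracy on the subgroup where the model performs worst. The sketch below is a minimal illustration, not the paper's evaluation code. The function name `worst_group_accuracy` and the toy labels and group names are hypothetical; how RaVL defines its groups is not stated in this summary.

```python
from collections import defaultdict


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst_group, its_accuracy, per_group_accuracy).

    Groups are arbitrary hashable labels. In spurious-correlation work a
    group is often a (class, spurious attribute) pair, but that choice is
    an assumption here, not something taken from the RaVL summary.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    per_group = {g: correct[g] / total[g] for g in total}
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst], per_group


# Toy data: the model is perfect on group "a" but mostly wrong on group "b".
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
worst, acc, per_group = worst_group_accuracy(y_true, y_pred, groups)
```

On this toy data the overall accuracy is 4/6, while the worst group ("b") reaches only 1/3. That gap is what the metric is designed to expose.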
Abstract
Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel…
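The discovery step described above can be illustrated with a toy sketch. It assumes image regions have already been clustered, so each image is represented by the set of cluster ids it contains, and that a correct/incorrect flag is available for each zero-shot prediction. The scoring rule below (error rate with a cluster minus error rate without it) is a simple stand-in chosen for illustration. It is not RaVL's actual scoring function, which the truncated abstract does not describe, and the name `rank_clusters_by_error_gap` is hypothetical.

```python
def rank_clusters_by_error_gap(image_clusters, is_correct):
    """Rank region clusters by how strongly they co-occur with errors.

    image_clusters: list of sets; each set holds the cluster ids of the
        regions found in one image (the clustering itself is assumed done).
    is_correct: list of bools; True if the zero-shot prediction for that
        image was correct.
    Returns a list of (cluster_id, gap) pairs sorted by descending gap,
    where gap = error rate of images containing the cluster minus the
    error rate of images that do not contain it.
    """
    all_clusters = set().union(*image_clusters)
    gaps = {}
    for c in all_clusters:
        with_c = [ok for cs, ok in zip(image_clusters, is_correct) if c in cs]
        without_c = [ok for cs, ok in zip(image_clusters, is_correct) if c not in cs]
        if not with_c or not without_c:
            continue  # the gap is undefined if one side is empty
        err_with = 1 - sum(with_c) / len(with_c)
        err_without = 1 - sum(without_c) / len(without_c)
        gaps[c] = err_with - err_without
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: every image containing cluster 1 is misclassified.
image_clusters = [{0, 1}, {0}, {1, 2}, {2}, {0, 2}]
is_correct = [False, True, False, True, True]
ranked = rank_clusters_by_error_gap(image_clusters, is_correct)
```

Here cluster 1 ranks first, since both images that contain it are misclassified and none of the others are. In a real setting, a top-ranked cluster would be a candidate spurious feature to inspect, not a confirmed one.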
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · COVID-19 diagnosis using AI
Methods: Focus
