Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
Youssef Zaazou, Mark Thomas

TL;DR
This paper introduces a pre-training method for vision-language models that leverages linear additivity in embeddings to create background-invariant representations, significantly improving robustness against background biases.
Contribution
It presents a novel pre-training approach exploiting linear structure in VLMs to achieve background invariance without real-world debiased data, surpassing 90% worst-group accuracy on Waterbirds.
Findings
Achieves over 90% worst-group accuracy on Waterbirds with perfect spurious correlation
Demonstrates strong transfer from synthetic to real-world data
Requires no access to real-world debiased datasets
Abstract
Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding 90% on Waterbirds under perfect (100%) spurious correlation (i.e., no minority-group examples…
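To make the linear-additivity idea concrete, the following toy sketch assumes that a scene embedding is approximately the sum of a foreground vector and a background vector plus a small residual. It uses random vectors rather than real CLIP or SigLIP 2 embeddings, and every name in it (`toy_scene_embedding`, `bird`, `water`, `land`) is hypothetical. It is not the authors' pre-training method; it only illustrates why subtracting a background component could bring two views of the same object closer together.

```python
import math
import random


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Element-wise difference of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_vector(rng, dim):
    """Random Gaussian vector standing in for an embedding."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
DIM = 64

# Hypothetical stand-ins for foreground and background embeddings.
bird = random_vector(rng, DIM)
water = random_vector(rng, DIM)
land = random_vector(rng, DIM)


def toy_scene_embedding(fg, bg, noise_scale=0.05):
    """Assumed additive model: scene = foreground + background + small residual."""
    noise = [noise_scale * x for x in random_vector(rng, DIM)]
    return add(add(fg, bg), noise)


scene_on_water = toy_scene_embedding(bird, water)
scene_on_land = toy_scene_embedding(bird, land)

# Same bird, different backgrounds: the raw embeddings only partly agree.
raw_similarity = cosine(scene_on_water, scene_on_land)

# Subtracting the background component leaves (approximately) the foreground.
debiased_similarity = cosine(sub(scene_on_water, water), sub(scene_on_land, land))

print(f"raw similarity:      {raw_similarity:.3f}")
print(f"debiased similarity: {debiased_similarity:.3f}")
```

In this toy setting the background vectors are known exactly; with a real encoder they would have to be estimated, and additivity would hold only approximately.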
