Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

Youssef Zaazou; Mark Thomas

arXiv:2605.11107·cs.CV·May 13, 2026

TL;DR

This paper introduces a pre-training method for vision-language models that uses the linear additivity of their embedding spaces to build background-invariant representations from synthetic data, improving robustness to background bias (over 90% worst-group accuracy on Waterbirds).

Contribution

It presents a novel pre-training approach exploiting linear structure in VLMs to achieve background invariance without real-world debiased data, surpassing 90% worst-group accuracy on Waterbirds.

Findings

01. Achieves over 90% worst-group accuracy on Waterbirds under perfect (100%) spurious correlation

02. Demonstrates strong transfer from synthetic to real-world data

03. Requires no access to real-world debiased datasets
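Worst-group accuracy is the headline metric here, so a small sketch of how it is typically computed may help. The function name, the toy labels, and the toy predictions below are assumptions made for illustration; they are not data or code from the paper. Groups are taken to be (bird class, background) pairs, as in the usual Waterbirds setup.

```python
from collections import defaultdict


def worst_group_accuracy(labels, backgrounds, predictions):
    """Return (worst accuracy, per-group accuracy) over (label, background) groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, b, p in zip(labels, backgrounds, predictions):
        total[(y, b)] += 1
        correct[(y, b)] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy data, invented for illustration: a classifier that leans on the background
# gets the one waterbird-on-land example wrong.
labels      = ["waterbird", "waterbird", "waterbird", "landbird", "landbird", "landbird"]
backgrounds = ["water",     "water",     "land",      "land",     "land",     "water"]
predictions = ["waterbird", "waterbird", "landbird",  "landbird", "landbird", "landbird"]

worst, per_group = worst_group_accuracy(labels, backgrounds, predictions)
average = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"average accuracy: {average:.2f}, worst-group accuracy: {worst:.2f}")
```

In this toy case the average accuracy looks high while the worst group is entirely misclassified, which is why the minimum over groups, rather than the mean, is reported for robustness to spurious correlations.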

Abstract

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples…
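The decomposition idea in the abstract can be illustrated with a toy sketch. It assumes an idealized encoder whose scene embedding is exactly the sum of a foreground vector and a background vector plus small noise; the function `toy_scene_embedding`, the random vectors, and the noise level are all hypothetical stand-ins, not a real VLM and not the paper's pre-training procedure, which this excerpt does not describe.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def subtract(u, v):
    return [a - b for a, b in zip(u, v)]


random.seed(0)
DIM = 16


def random_vector():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


def toy_scene_embedding(foreground, background, noise=0.01):
    # Assumption: additivity holds up to small noise; a real encoder is only
    # approximately additive.
    return [f + b + random.gauss(0.0, noise) for f, b in zip(foreground, background)]


bird = random_vector()       # hypothetical foreground component
water = random_vector()      # hypothetical background components
land = random_vector()

bird_on_water = toy_scene_embedding(bird, water)
bird_on_land = toy_scene_embedding(bird, land)

# Raw scene embeddings of the same bird differ because the backgrounds differ.
raw_similarity = cosine(bird_on_water, bird_on_land)

# Removing the background component leaves nearly the same vector in both scenes.
invariant_similarity = cosine(subtract(bird_on_water, water), subtract(bird_on_land, land))

print(f"raw: {raw_similarity:.3f}, background removed: {invariant_similarity:.3f}")
```

Under this additive assumption, subtracting the background component makes the two embeddings of the same bird almost identical regardless of scene, which is the intuition behind a background-invariant representation.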
