TL;DR
BiCLIP is a simple, low-parameter framework that applies structured geometric transformations to improve cross-domain alignment in vision-language models, outperforming existing methods on 11 benchmarks.
Contribution
The paper proposes BiCLIP, a method that uses geometric transformations for domain canonicalization to improve few-shot domain adaptation in vision-language models.
Findings
BiCLIP outperforms existing methods on 11 benchmarks.
The learned transformations exhibit orthogonality and specific angular distributions.
Structured geometric alignment is key to robust domain adaptation.
Abstract
Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its…
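The hypothesis that a geometric transformation between domains can be recovered from a small set of anchors can be illustrated with a toy example. The sketch below is not BiCLIP's procedure, which the truncated abstract does not specify; it substitutes the classical orthogonal Procrustes solution as a stand-in, and it runs on synthetic features rather than real VLM embeddings. The function name `fit_orthogonal_map` and all variable names are hypothetical.

```python
import numpy as np

def fit_orthogonal_map(source_anchors, target_anchors):
    """Orthogonal Procrustes: the orthogonal R minimizing ||source @ R - target||_F.

    Assumption: this is a stand-in for "recovering a transformation from
    anchors", not the estimator used in the paper.
    """
    # Cross-covariance between the two anchor sets, shape (d, d).
    cross = source_anchors.T @ target_anchors
    u, _, vt = np.linalg.svd(cross)
    return u @ vt

rng = np.random.default_rng(0)
d, n_anchors = 8, 16  # toy feature dimension and anchor count (assumed values)

# Synthetic ground-truth orthogonal map, taken from a QR decomposition.
true_r, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Synthetic anchors: "source domain" features and their transformed
# "target domain" counterparts.
source = rng.normal(size=(n_anchors, d))
target = source @ true_r

est_r = fit_orthogonal_map(source, target)

# Deviation from orthogonality and residual alignment error on the anchors.
orth_error = np.linalg.norm(est_r.T @ est_r - np.eye(d))
fit_error = np.linalg.norm(source @ est_r - target)
print(f"orthogonality error: {orth_error:.2e}, fit error: {fit_error:.2e}")
```

In this noise-free setting, with more anchors than feature dimensions, the estimated map matches the synthetic one up to floating-point error. With real features the anchors would be noisy, and the residual would indicate how well a purely orthogonal map explains the domain shift.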
