BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models
Sunny Gupta, Shounak Das, Amit Sethi

TL;DR
BiPrompt introduces a bilateral prompt optimization method that simultaneously debiases visual and textual modalities in vision-language models, improving robustness and domain invariance without retraining.
Contribution
It proposes a framework that pairs attention-guided erasure (visual side) with balanced prompt normalization (textual side) for test-time debiasing of both modalities.
Findings
Consistent accuracy improvements on real-world and synthetic benchmarks.
Enhanced robustness against spurious correlations and distribution shifts.
Achieves domain-invariant reasoning without retraining.
Abstract
Vision-language foundation models such as CLIP exhibit impressive zero-shot generalization, yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality, either visual or textual, leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly…
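The abstract describes both modules only at a high level, so the following is a hypothetical, simplified reading of them, not the authors' implementation. Assumptions: mean-pooling of patch features stands in for the CLIP image encoder; "erasure" is modeled as zeroing the top-`keep_ratio` attention patches (causal view) or the rest (spurious view); "orthogonal prediction consistency" is modeled as the squared cosine between mean-centered logit vectors of the two views; and balanced prompt normalization is modeled as subtracting `alpha` times the mean class embedding before L2 normalization, with `alpha` standing in for the learnable re-centering strength. All function and parameter names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Zero vectors stay zero after normalization, so their cosine is 0.
    return dot(l2_normalize(a), l2_normalize(b))


# --- Textual side: hypothetical balanced prompt normalization ---
def balanced_prompt_normalization(class_embs: Sequence[Vector], alpha: float) -> List[Vector]:
    """Subtract alpha * (mean class embedding), then L2-normalize each class.

    alpha=0 only normalizes; alpha=1 removes the direction shared by all classes,
    spreading the class embeddings out (a crude step toward isotropy).
    """
    dim = len(class_embs[0])
    mean = [sum(e[d] for e in class_embs) / len(class_embs) for d in range(dim)]
    return [l2_normalize([e[d] - alpha * mean[d] for d in range(dim)]) for e in class_embs]


# --- Visual side: hypothetical attention-guided erasure ---
def split_by_attention(patches: Sequence[Vector], attention: Sequence[float],
                       keep_ratio: float) -> Tuple[List[Vector], List[Vector]]:
    """Zero out patches to form a high-attention ("causal") view and its complement."""
    k = max(1, round(keep_ratio * len(patches)))
    order = sorted(range(len(patches)), key=lambda i: attention[i], reverse=True)
    top = set(order[:k])
    causal = [list(p) if i in top else [0.0] * len(p) for i, p in enumerate(patches)]
    spurious = [[0.0] * len(p) if i in top else list(p) for i, p in enumerate(patches)]
    return causal, spurious


def logits(patches: Sequence[Vector], class_embs: Sequence[Vector]) -> Vector:
    """Mean-pool patch features (stand-in for the image encoder) and score classes by cosine."""
    dim = len(patches[0])
    pooled = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    return [cosine(pooled, c) for c in class_embs]


def orthogonality_loss(z_causal: Vector, z_spurious: Vector) -> float:
    """Squared cosine between mean-centered logit vectors; 0 when they are orthogonal."""
    def center(z: Vector) -> Vector:
        m = sum(z) / len(z)
        return [x - m for x in z]
    return cosine(center(z_causal), center(z_spurious)) ** 2


if __name__ == "__main__":
    # Toy 2-D features: patches 0-1 play the "object", patches 2-3 the "background".
    patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    attention = [0.9, 0.8, 0.1, 0.2]
    classes = balanced_prompt_normalization([[1.0, 0.1], [0.7, 0.7], [0.1, 1.0]], alpha=0.5)
    causal_view, spurious_view = split_by_attention(patches, attention, keep_ratio=0.5)
    loss = orthogonality_loss(logits(causal_view, classes), logits(spurious_view, classes))
    print(f"orthogonality loss: {loss:.3f}")
```

In an actual test-time adaptation loop, one would presumably update the prompt parameters and `alpha` by gradient descent on an objective of this kind; this pure-Python sketch only evaluates the quantities once. With exactly two classes the centered logits are always opposite-signed pairs, so the squared cosine degenerates to 1 (or 0 for a zero vector); the toy example therefore uses three classes.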
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
