Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Md Zarif Hossain, Ahmed Imteaj

TL;DR
Sim-CLIP is an unsupervised adversarial fine-tuning method that improves the robustness of CLIP's vision encoder against adversarial perturbations while maintaining semantic quality, using a Siamese architecture with low computational cost.
Contribution
It introduces a novel Siamese adversarial training framework for CLIP that enhances robustness without large-batch contrastive learning or momentum encoders.
Findings
Outperforms existing robust CLIP variants in adversarial robustness.
Maintains or improves semantic fidelity under attack.
Requires low computational overhead for robust training.
Abstract
Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving overall semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and additional momentum encoders, enabling robust training with low…
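The abstract names the objective only in words, so the following is a minimal sketch of what a symmetric cosine-similarity loss between a clean and an adversarial embedding could look like. It is not the authors' implementation. The function names, the 0.5 averaging factor, and the exact placement of the stop-gradient are assumptions modeled on common Siamese training setups. Plain Python has no automatic differentiation, so the stop-gradient appears only as a comment marking which argument would be held constant in a real framework.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def symmetric_neg_cosine_loss(z_clean, z_adv):
    """Forward value of a symmetric negative cosine-similarity loss.

    Hypothetical helper: z_clean and z_adv stand for the encoder's
    embeddings of a clean image and its adversarially perturbed view.
    """
    # Assumption: in an autodiff framework, z_clean would be wrapped in a
    # stop-gradient here, so only the adversarial branch receives gradients.
    term_adv = -cosine_similarity(z_adv, z_clean)
    # Assumption: symmetrically, z_adv would be wrapped in a stop-gradient
    # here, so only the clean branch receives gradients.
    term_clean = -cosine_similarity(z_clean, z_adv)
    # Averaging the two terms keeps the loss in the range [-1, 1].
    return 0.5 * (term_adv + term_clean)


if __name__ == "__main__":
    aligned = symmetric_neg_cosine_loss([1.0, 0.0], [1.0, 0.0])
    orthogonal = symmetric_neg_cosine_loss([1.0, 0.0], [0.0, 1.0])
    print(aligned, orthogonal)
```

Without gradients the two terms are numerically identical, so the symmetry matters only during backpropagation, where each term updates a different branch. The loss reaches its minimum of -1 when the clean and adversarial embeddings point in the same direction, which is the alignment the abstract describes.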
