Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain, Ahmed Imteaj

TL;DR
This paper introduces Sim-CLIP+, a robust vision encoder, adversarially fine-tuned to defend large vision-language models (LVLMs) against jailbreak and adversarial attacks while maintaining clean accuracy.
Contribution
We propose Sim-CLIP+, a novel adversarial fine-tuning method for CLIP's vision encoder that improves robustness without modifying existing LVLM architectures.
Findings
Sim-CLIP+ effectively defends against gradient-based adversarial attacks.
The method maintains high accuracy on clean datasets.
It significantly reduces vulnerability to jailbreak techniques.
Abstract
Large Vision-Language Models (LVLMs), trained on large multimodal datasets, have significantly advanced AI by excelling in vision-language tasks. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems both from the inherent susceptibilities of the underlying large language models (LLMs) and from the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder using a Siamese architecture. This approach maximizes the cosine similarity between the encoder's representations of perturbed and clean samples, making the encoder resilient to adversarial manipulations. Sim-CLIP+ offers a plug-and-play solution: it integrates seamlessly into existing LVLM architectures as a robust vision encoder. Unlike…
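The abstract describes the training signal only at a high level: a shared (Siamese) encoder sees a clean image and an adversarially perturbed copy, and fine-tuning pulls the two representations together by maximizing their cosine similarity. The sketch below illustrates that idea under loudly stated assumptions. It is not the authors' implementation. It assumes a SimSiam-style symmetric negative-cosine loss with stop-gradient, an L∞ PGD attack with ε = 8/255 and a random start, and a toy linear encoder `W` standing in for CLIP's ViT. The function names (`pgd_cosine_attack`, `siamese_loss_and_grad`) and all hyperparameters are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad_wrt_second(a, b):
    """Gradient of cos(a, b) with respect to b, holding a fixed."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return a / (na * nb) - cosine(a, b) * b / nb**2

def pgd_cosine_attack(W, x, eps=8/255, alpha=2/255, steps=10, rng=None):
    """Hypothetical L-inf PGD that pushes f(x_adv) away from f(x).
    The encoder is an assumed toy linear map f(x) = W @ x."""
    rng = np.random.default_rng(0) if rng is None else rng
    z_clean = W @ x
    # Random start: at x_adv == x the cosine gradient is exactly zero.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        g = W.T @ cosine_grad_wrt_second(z_clean, W @ x_adv)  # d cos / d x_adv
        x_adv = x_adv - alpha * np.sign(g)        # step to lower the similarity
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid pixel range
    return x_adv

def siamese_loss_and_grad(W, x, x_adv):
    """Assumed symmetric negative-cosine loss with stop-gradient targets."""
    z, z_adv = W @ x, W @ x_adv
    loss = -cosine(z, z_adv)
    # Each half treats the other branch's embedding as a constant target.
    g_adv = cosine_grad_wrt_second(z, z_adv)    # d cos / d z_adv
    g_clean = cosine_grad_wrt_second(z_adv, z)  # d cos / d z (cos is symmetric)
    grad_W = -0.5 * (np.outer(g_adv, x_adv) + np.outer(g_clean, x))
    return loss, grad_W

# Usage: one inner attack and one outer fine-tuning step on random toy data.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))         # toy "encoder" weights
x = rng.uniform(0.0, 1.0, size=16)   # toy "image" with pixels in [0, 1]
x_adv = pgd_cosine_attack(W, x, rng=rng)
loss_before, grad_W = siamese_loss_and_grad(W, x, x_adv)
W_new = W - 0.05 * grad_W            # gradient step on the encoder weights
loss_after, _ = siamese_loss_and_grad(W_new, x, x_adv)
```

The inner loop finds the perturbation that most separates the two embeddings, and the outer step updates only the encoder weights to close that gap. This min-max structure is what lets such an encoder be swapped into an LVLM without changing the rest of the model. A real setup would run both steps over batches of images with CLIP's actual image tower and an autodiff framework rather than the hand-derived gradients used here.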
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Contrastive Language-Image Pre-training
