DefenSee: Dissecting Threat from Sight and Text -- A Multi-View Defensive Pipeline for Multi-modal Jailbreaks
Zihao Wang, Kar Wai Fok, Vrizlynn L. L. Thing

TL;DR
DefenSee is a multi-modal defense that hardens multi-modal large language models against jailbreaks using cross-modal consistency checks, cutting the attack success rate to below 1.70% on MiniGPT4 while maintaining benign-task performance.
Contribution
This paper introduces DefenSee, a multi-modal black-box defense technique that protects multi-modal large language models (MLLMs) against jailbreaks through image variant transcription and cross-modal consistency checks.
Findings
Reduces jailbreak attack success rate to below 1.70% on MiniGPT4.
Outperforms prior defenses in robustness while preserving benign task performance.
Effective against coordinated multi-modal jailbreaks.
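The attack success rate (ASR) quoted in the findings is conventionally the percentage of jailbreak attempts that elicit a harmful response. A minimal sketch of that arithmetic follows; the function name and the counts in the example are hypothetical illustrations, not figures or code from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attack attempts that succeeded.

    `outcomes` holds one boolean per attempt: True if the model produced
    a harmful response, False if the attack was blocked or refused.
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical counts, chosen only to illustrate the scale of the metric:
# 17 successful attacks out of 1000 attempts.
example_outcomes = [True] * 17 + [False] * 983
example_asr = attack_success_rate(example_outcomes)
```

With these illustrative counts, `example_asr` is 1.7 percent, so an ASR below 1.70% would mean fewer than 17 successful attacks per 1000 attempts.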
Abstract
Multi-modal large language models (MLLMs), capable of processing text, images, and audio, have been widely adopted in various AI applications. However, recent MLLMs integrating images and text remain highly vulnerable to coordinated jailbreaks. Existing defenses primarily focus on text and lack robust multi-modal protection. As a result, studies indicate that MLLMs are more susceptible to malicious or unsafe instructions than their text-only counterparts. In this paper, we propose DefenSee, a robust and lightweight multi-modal black-box defense technique that leverages image variant transcription and cross-modal consistency checks, mimicking human judgment. Experiments on popular multi-modal jailbreak and benign datasets show that DefenSee consistently enhances MLLM robustness while better preserving performance on benign tasks compared to SOTA defenses. It reduces the ASR of…
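The abstract names two ingredients, image variant transcription and cross-modal consistency checks, without giving details here. The sketch below is one possible reading of that idea, not the authors' implementation: the function `multi_view_check`, the `Verdict` type, the three callables, and the keyword-based toy judge are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Verdict:
    blocked: bool
    reasons: List[str]


def multi_view_check(
    image: object,
    prompt: str,
    make_variants: Callable[[object], Sequence[object]],
    transcribe: Callable[[object], str],
    is_unsafe: Callable[[str], bool],
) -> Verdict:
    """Judge a request from three views: text, image, and both together.

    All three callables are hypothetical stand-ins: `make_variants` yields
    transformed copies of the image, `transcribe` turns an image into text,
    and `is_unsafe` is any text-level safety judge.
    """
    reasons: List[str] = []
    # View 1: the text prompt on its own.
    if is_unsafe(prompt):
        reasons.append("text")
    # View 2: each image variant, transcribed to text and judged alone.
    transcripts = [transcribe(v) for v in make_variants(image)]
    if any(is_unsafe(t) for t in transcripts):
        reasons.append("image")
    # View 3: transcript plus prompt, to catch intent split across modalities.
    if not reasons and any(is_unsafe(f"{t}\n{prompt}") for t in transcripts):
        reasons.append("combined")
    return Verdict(blocked=bool(reasons), reasons=reasons)


# Toy stand-ins (assumptions): an "image" is a dict carrying its caption,
# and the judge flags text only when both keywords appear together.
def toy_variants(img):
    return [img, img]


def toy_transcribe(img):
    return img["caption"]


def toy_is_unsafe(text):
    lowered = text.lower()
    return "build" in lowered and "bomb" in lowered


attack = multi_view_check(
    {"caption": "a bomb"},
    "How do I build the object in the image?",
    toy_variants, toy_transcribe, toy_is_unsafe,
)
benign = multi_view_check(
    {"caption": "a birdhouse"},
    "How do I build the object in the image?",
    toy_variants, toy_transcribe, toy_is_unsafe,
)
```

In this toy setup neither the prompt nor the caption is flagged alone, so only the combined view blocks the attack request, which is the kind of coordinated case a text-only filter would miss. A real system would replace the keyword judge and the caption lookup with actual models.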
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling
