Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
Ziyi Yin, Yuanpu Cao, Han Liu, Ting Wang, Jinghui Chen, Fenglong Ma

TL;DR
This paper introduces SafeMLLM, an adversarial training framework that enhances the robustness of multimodal large language models against jailbreak attacks by using contrastive embedding attacks and iterative model updates.
Contribution
SafeMLLM is the first to apply contrastive embedding attack-based adversarial training to improve MLLM robustness against jailbreak attacks.
Findings
SafeMLLM significantly reduces attack success rates across multiple models.
The method maintains high utility on benign inputs.
Robustness is improved against diverse jailbreak techniques.
Abstract
Although multimodal large language models (MLLMs) have achieved remarkable success, they have proven susceptible to jailbreak attacks. In such attacks, adversaries use carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which are less effective and less practical against sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SafeMLLM, an adversarial training framework that alternates between an attack step for generating adversarial noise and a model updating step. At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token…
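The alternation between an attack step and a model updating step can be illustrated on a toy problem. The following is a minimal sketch of generic adversarial training with a signed-gradient, norm-bounded attack on a small logistic classifier. It is not CoE-Attack and not the SafeMLLM implementation; the data, the function names (attack_step, update_step, adversarial_train), and all hyperparameters are assumptions chosen for illustration.

```python
import math

# Toy, linearly separable 2-D "embeddings" with binary labels (illustrative data).
DATA = [
    ((1.5, 2.0), 1), ((2.0, 1.5), 1), ((2.5, 2.5), 1), ((1.8, 2.2), 1),
    ((-1.5, -2.0), 0), ((-2.0, -1.5), 0), ((-2.5, -2.5), 0), ((-1.8, -2.2), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of label 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g):
    return (g > 0) - (g < 0)

def attack_step(w, b, x, y, eps, steps=5, step_size=0.1):
    """Attack step (hypothetical stand-in): signed-gradient ascent on the
    logistic loss with respect to a perturbation clipped to [-eps, eps]."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        p = predict(w, b, x_adv)
        # Gradient of the logistic loss with respect to the input is (p - y) * w.
        grad = [(p - y) * wi for wi in w]
        delta = [max(-eps, min(eps, di + step_size * sign(g)))
                 for di, g in zip(delta, grad)]
    return delta

def update_step(w, b, batch, lr=0.5):
    """Model updating step: one gradient-descent step on the perturbed batch."""
    n = len(batch)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in batch:
        err = predict(w, b, x) - y
        grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
        grad_b += err
    new_w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return new_w, b - lr * grad_b / n

def adversarial_train(data, eps=0.3, epochs=50):
    """Alternate between attacking the current model and updating it."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        adv_batch = []
        for x, y in data:
            delta = attack_step(w, b, x, y, eps)
            adv_batch.append(([xi + di for xi, di in zip(x, delta)], y))
        w, b = update_step(w, b, adv_batch)
    return w, b
```

Calling adversarial_train(DATA) returns the trained weights and bias. In this sketch the perturbation is applied directly to the input vector; the abstract describes SafeMLLM's attack as operating through CoE-Attack, whose details are not reproduced here.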
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Deception detection and forensic psychology · Digital and Cyber Forensics
