Towards Robust Multimodal Large Language Models Against Jailbreak Attacks

Ziyi Yin; Yuanpu Cao; Han Liu; Ting Wang; Jinghui Chen; Fenglong Ma

arXiv:2502.00653 · cs.CR · February 4, 2025

TL;DR

This paper introduces SafeMLLM, an adversarial training framework that enhances the robustness of multimodal large language models against jailbreak attacks by using contrastive embedding attacks and iterative model updates.

Contribution

SafeMLLM is the first to apply contrastive embedding attack-based adversarial training to improve MLLM robustness against jailbreak attacks.

Findings

01. SafeMLLM significantly reduces attack success rates across multiple models.
02. The method maintains high utility on benign inputs.
03. Robustness is improved against diverse jailbreak techniques.

Abstract

While multimodal large language models (MLLMs) have achieved remarkable success, their susceptibility to jailbreak attacks has come to light. In such attacks, adversaries use carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which are less effective and less practical against sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SafeMLLM, an adversarial training framework that alternates between an attack step, which generates adversarial noise, and a model updating step. At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token…
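The alternation between an attack step and a model updating step can be illustrated on a toy problem. The sketch below is not SafeMLLM and does not implement CoE-Attack: it assumes a bias-free logistic regression model, a generic signed-gradient (PGD-style) perturbation of the input bounded by a hypothetical budget `epsilon`, and a hand-made dataset `TOY_DATA`. All function names and hyperparameters are illustrative assumptions and do not come from the paper.

```python
import math

# Toy, linearly separable data: (features, label). Purely illustrative.
TOY_DATA = [
    ((1.5, 1.0), 1), ((2.0, 1.5), 1), ((1.0, 2.0), 1),
    ((-1.5, -1.0), 0), ((-2.0, -1.5), 0), ((-1.0, -2.0), 0),
]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def loss(w, x, y):
    """Binary cross-entropy of a bias-free logistic model on one example."""
    p = sigmoid(dot(w, x))
    tiny = 1e-12  # guards against log(0)
    return -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))

def attack_step(w, x, y, epsilon=0.3, steps=5, step_size=0.1):
    """Attack step: signed gradient ascent on the loss over a bounded perturbation."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        p = sigmoid(dot(w, x_adv))
        grad_x = [(p - y) * wi for wi in w]  # d(loss)/d(x) for logistic regression
        # Move each coordinate uphill, then clip back into [-epsilon, epsilon].
        delta = [max(-epsilon, min(epsilon, di + step_size * sign(g)))
                 for di, g in zip(delta, grad_x)]
    return delta

def update_step(w, x_adv, y, lr=0.1):
    """Model updating step: one gradient descent step on the perturbed example."""
    p = sigmoid(dot(w, x_adv))
    return [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]

def adversarial_train(data, dim=2, epochs=50, epsilon=0.3, lr=0.1):
    """Alternate the attack step and the model updating step over the data."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            delta = attack_step(w, x, y, epsilon)
            x_adv = [xi + di for xi, di in zip(x, delta)]
            w = update_step(w, x_adv, y, lr)
    return w
```

Calling `adversarial_train(TOY_DATA)` returns a weight vector trained only on perturbed inputs. The key design point carried over from the abstract is the loop structure: the perturbation is recomputed against the current model before every update, so the model is always trained on noise tailored to its present weaknesses.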

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Deception detection and forensic psychology · Digital and Cyber Forensics