BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger

Yulin Chen; Haoran Li; Yirui Zhang; Zihao Zheng; Yangqiu Song; Bryan Hooi

arXiv:2408.09093 · cs.CR · April 23, 2025

Open Access

TL;DR

BaThe is a defense mechanism for multimodal large language models that treats harmful instructions as backdoor triggers, effectively mitigating jailbreak attacks with minimal performance impact.

Contribution

It introduces BaThe, a novel backdoor trigger-based defense method that uses virtual rejection prompts embedded in soft text embeddings to defend against jailbreak attacks in MLLMs.
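
As a rough illustration of what "embedded in soft text embeddings" could mean, the toy sketch below places a short sequence of soft-prompt vectors in front of the embeddings of the user input, so that a model would process both together. This is only a sketch under assumptions: the function name `prepend_soft_prompt`, the toy values, and the choice to put the soft prompt before the input are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def prepend_soft_prompt(soft_prompt: List[Vector],
                        input_embeddings: List[Vector]) -> List[Vector]:
    """Return soft-prompt vectors followed by the input embeddings.

    Hypothetical helper: the placement (before the input) is an assumption
    made for illustration, not the paper's confirmed design.
    """
    if soft_prompt and input_embeddings:
        if len(soft_prompt[0]) != len(input_embeddings[0]):
            raise ValueError("embedding dimensions must match")
    # Copy each vector so the caller's lists are not aliased.
    return [list(v) for v in soft_prompt] + [list(v) for v in input_embeddings]


# Toy values (assumed): 2 soft-prompt vectors and 3 input vectors, dimension 4.
soft_prompt = [[0.1, 0.0, -0.2, 0.3], [0.0, 0.5, 0.1, -0.1]]
input_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
sequence = prepend_soft_prompt(soft_prompt, input_embeddings)
```

In a soft-prompt setup of this kind, the soft-prompt vectors would typically be the trainable part while the input embeddings come from the model's existing embedding layers; whether BaThe trains them this way is not stated on this page.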

Findings

01. BaThe effectively mitigates various jailbreak attacks.

02. It is adaptable to unseen attack types.

03. It has minimal impact on model performance.

Abstract

Multimodal Large Language Models (MLLMs) have showcased impressive performance in a variety of multimodal tasks. On the other hand, the integration of an additional image modality may allow malicious users to inject harmful content into images for jailbreaking. Unlike text-based LLMs, where adversaries need to select discrete tokens to conceal their malicious intent using specific algorithms, the continuous nature of image signals provides a direct opportunity for adversaries to inject harmful intentions. In this work, we propose BaThe (Backdoor Trigger Shield), a simple yet effective jailbreak defense mechanism. Our work is motivated by recent research on jailbreak backdoor attacks and virtual prompt backdoor attacks in generative language models. Jailbreak backdoor attack uses harmful instructions combined with manually…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning