Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

Hee-Seon Kim; Minbeom Kim; Wonjun Lee; Kihyun Kim; Changick Kim

arXiv:2505.21556·cs.CV·May 29, 2025

TL;DR

This paper introduces Benign-to-Toxic jailbreaks for large vision-language models, which induce toxic responses from harmless prompts by optimizing adversarial images, revealing a new vulnerability in multimodal safety alignment.

Contribution

The paper proposes a novel Benign-to-Toxic (B2T) paradigm that effectively induces toxic outputs from benign inputs, outperforming prior methods and transferring in black-box settings.

Findings

01 B2T jailbreaks outperform prior approaches.

02 B2T jailbreaks remain effective in black-box transfer scenarios.

03 The results reveal a new vulnerability in multimodal safety alignment.

Abstract

Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored…
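The contrast the abstract draws between the two settings can be written as a pair of objectives. The following is only a schematic sketch: the abstract gives no formulas, so every symbol here is an assumption introduced for illustration ($p_\theta$ for the LVLM, $x$ for the clean image, $\delta$ for the perturbation, $\epsilon$ for a hypothetical $\ell_\infty$ budget, $t$ for a toxic token sequence, and $(c, y)$ for a benign conditioning paired with a toxic target). The paper's actual loss, constraint, and data pairing may differ.

```latex
% Toxic-Continuation (schematic): the conditioning t_{<i} is itself toxic,
% and the image is optimized under the usual next-token objective.
\mathcal{L}_{\mathrm{TC}}(\delta)
  = -\sum_{i=1}^{n} \log p_\theta\bigl(t_i \mid x + \delta,\; t_{<i}\bigr)

% Benign-to-Toxic (schematic): the conditioning c is benign, so the
% perturbed image alone has to shift the model toward the toxic target y.
\mathcal{L}_{\mathrm{B2T}}(\delta)
  = -\sum_{(c,\,y)} \log p_\theta\bigl(y \mid x + \delta,\; c\bigr)

% In both cases the perturbation is assumed to be norm-bounded.
\delta^{\ast} = \arg\min_{\|\delta\|_\infty \le \epsilon} \mathcal{L}(\delta)
```

Under this reading, the only difference between the two settings is what the model is conditioned on: in the first the text prefix already carries toxic signals, while in the second it carries none.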

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Hate Speech and Cyberbullying Detection