Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Honglin Mu; Han He; Yuxin Zhou; Yunlong Feng; Yang Xu; Libo Qin; Xiaoming Shi; Zeming Liu; Xudong Han; Qi Shi; Qingfu Zhu; Wanxiang Che

arXiv:2410.21083·cs.CL·March 7, 2025

TL;DR

This paper introduces a stealthy transfer attack method on large language models that uses benign data to train a mirror model, significantly reducing detectable malicious queries during jailbreak attempts.

Contribution

The authors propose a novel transfer attack approach that enhances stealth by training a mirror model with benign data, avoiding detection during the attack process.

Findings

01. Achieved an attack success rate of up to 92% on GPT-3.5 Turbo.

02. Reduced detectable jailbreak queries to an average of 1.5 per sample.

03. Demonstrated the need for stronger defenses against stealthy attacks.
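
The two headline figures above are averages over a set of attack samples. As a minimal sketch, assuming a hypothetical per-sample log that records a success flag and the number of detectable malicious queries sent to the target model, they could be computed as shown here. The record type, field names, and function names are illustrative assumptions; this summary does not give the paper's exact definitions or evaluation protocol.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SampleResult:
    """Hypothetical per-sample record; field names are illustrative, not from the paper."""
    succeeded: bool          # whether the attempt on this sample was judged successful
    malicious_queries: int   # detectable malicious queries sent to the target for this sample


def attack_success_rate(results: Sequence[SampleResult]) -> float:
    """Fraction of samples whose attempt was judged successful."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.succeeded for r in results) / len(results)


def mean_malicious_queries(results: Sequence[SampleResult]) -> float:
    """Average number of detectable malicious queries per sample."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.malicious_queries for r in results) / len(results)


# Toy data, invented purely for illustration (not results from the paper).
toy_results = [
    SampleResult(succeeded=True, malicious_queries=1),
    SampleResult(succeeded=True, malicious_queries=2),
    SampleResult(succeeded=False, malicious_queries=2),
    SampleResult(succeeded=True, malicious_queries=1),
]
print(attack_success_rate(toy_results))
print(mean_malicious_queries(toy_results))
```

Reporting both numbers together matters for a stealth claim: a high success rate is only informative about detectability when paired with how many identifiable malicious queries were needed to reach it.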

Abstract

Large language model (LLM) safety is a critical issue, with numerous studies employing red team testing to enhance model security. Among these, jailbreak methods explore potential vulnerabilities by crafting malicious prompts that induce model outputs contrary to safety alignments. Existing black-box jailbreak methods often rely on model feedback, repeatedly submitting queries with detectable malicious instructions during the attack search process. Although these approaches are effective, the attacks may be intercepted by content moderators during the search process. We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation. This method offers enhanced stealth, as it does not involve submitting identifiable malicious instructions to the target model during the…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Byte Pair Encoding · Layer Normalization · Residual Connection · Multi-Head Attention · Softmax