Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

Xiang Li; Chong Zhang; Jia Wang; Fangyu Wu; Yushi Li; Xiaobo Jin

arXiv:2506.17231·cs.CL·December 23, 2025

TL;DR

This paper presents a novel framework called Adversarial Prompt Distillation that transfers jailbreaking capabilities from large language models to smaller models, enabling efficient and stealthy attacks with high success rates.

Contribution

The paper introduces a new method combining masked language modeling, reinforcement learning, and temperature control to effectively distill jailbreak skills into smaller language models.

Findings

01 Outperforms existing methods in attack success rate

02 Reduces resource requirements for jailbreak attacks

03 Demonstrates cross-model transferability of jailbreak capabilities

Abstract

As the scale and complexity of jailbreaking attacks on large language models (LLMs) continue to escalate, their efficiency and practical applicability are constrained, posing a profound challenge to LLM security. Jailbreaking techniques have advanced from manual prompt engineering to automated methodologies. Recent advances have automated jailbreaking approaches that harness LLMs to generate jailbreak instructions and adversarial examples, delivering encouraging results. Nevertheless, these methods universally include an LLM generation phase, which, due to the complexities of deploying and reasoning with LLMs, impedes effective implementation and broader adoption. To mitigate this issue, we introduce Adversarial Prompt Distillation, an innovative framework that integrates masked language modeling, reinforcement learning, and dynamic temperature control to distill LLM…
