Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression
Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong

TL;DR
This paper explores the distillability of LLM security logic by developing a proxy model that predicts attack success rates, enabling more effective black-box jailbreak attacks through dense sampling and ranking regression.
Contribution
It introduces a framework that pairs an improved outline filling attack, used to densely sample the target model's security boundaries, with a ranking regression paradigm that trains a proxy model to predict which of two prompts yields the higher attack success rate (ASR).
Findings
Proxy model achieves 91.1% accuracy in predicting the relative ranking of average long response (ALR)
Proxy model achieves 69.2% accuracy in predicting attack success rate (ASR)
Confirms the predictability of jailbreak success patterns
Abstract
In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core security logic. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model's security boundaries. Furthermore, we introduce a ranking regression paradigm that replaces standard regression and trains the proxy model to predict which prompt yields a higher ASR. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response (ALR), and 69.2 percent in predicting ASR. These findings confirm the predictability and distillability of jailbreak…
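The ranking regression idea in the abstract (train a proxy to say which of two prompts yields the higher ASR, rather than regress the ASR value directly) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the linear scorer, the RankNet-style pairwise logistic loss, the three-dimensional feature vectors, and the synthetic ASR values are hypothetical stand-ins, not the paper's model, loss, or data.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, features):
    # Linear proxy score; a higher score means "predicted higher ASR".
    return sum(w * x for w, x in zip(weights, features))


def make_pairs(items):
    # items: list of (features, asr). Each pair is ordered (higher-ASR, lower-ASR).
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (xi, yi), (xj, yj) = items[i], items[j]
            if yi == yj:
                continue  # ties carry no ranking signal
            pairs.append((xi, xj) if yi > yj else (xj, xi))
    return pairs


def train_pairwise_ranker(pairs, dim, epochs=50, lr=0.1):
    # Minimise -log(sigmoid(score(hi) - score(lo))) by stochastic gradient descent.
    weights = [0.0] * dim
    for _ in range(epochs):
        for hi, lo in pairs:
            diff = [a - b for a, b in zip(hi, lo)]
            step = lr * (1.0 - sigmoid(score(weights, diff)))
            weights = [w + step * d for w, d in zip(weights, diff)]
    return weights


def pairwise_accuracy(weights, pairs):
    # Fraction of pairs whose order the proxy predicts correctly.
    hits = sum(score(weights, hi) > score(weights, lo) for hi, lo in pairs)
    return hits / len(pairs)


def synthetic_items(n, rng, hidden=(2.0, -1.0, 0.5)):
    # Toy stand-in for (prompt features, measured ASR); not real data.
    items = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in hidden]
        items.append((x, sigmoid(score(hidden, x))))
    return items


rng = random.Random(0)
train_pairs = make_pairs(synthetic_items(30, rng))
test_pairs = make_pairs(synthetic_items(30, rng))
weights = train_pairwise_ranker(train_pairs, dim=3)
print(f"held-out pairwise accuracy: {pairwise_accuracy(weights, test_pairs):.3f}")
```

The pairwise objective only needs the order of two ASR measurements to be right, not their exact values, which is one plausible reason a ranking formulation could be easier to fit than direct regression on noisy ASR estimates.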
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Information and Cyber Security
