Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Tianyu Zhang; Zihang Xi; Jingyu Hua; Sheng Zhong

arXiv:2511.22044 · cs.CR · December 1, 2025

TL;DR

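This paper asks whether an LLM's security logic can be distilled into a lightweight proxy model that predicts the attack success rate (ASR) of adversarial prompts in the black-box jailbreak setting. It samples the model's security boundaries densely with an improved outline filling attack and trains the proxy with ranking regression instead of standard regression.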

Contribution

The paper introduces a framework that combines an improved outline filling attack, used to densely sample the model's security boundaries, with a ranking regression paradigm that trains the proxy model to predict which prompt yields a higher ASR.

Findings

01 The proxy model achieves 91.1% accuracy in predicting the relative ranking of average long response (ALR)

02 The proxy model achieves 69.2% accuracy in predicting ASR

03 The results confirm the predictability of jailbreak success patterns

Abstract

In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core security logic. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model's security boundaries. Furthermore, we introduce a ranking regression paradigm that replaces standard regression and trains the proxy model to predict which prompt yields a higher ASR. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response (ALR), and 69.2 percent in predicting ASR. These findings confirm the predictability and distillability of jailbreak…
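The ranking regression idea in the abstract (train the proxy to say which prompt yields a higher ASR, rather than regress the ASR value directly) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state the paper's loss, features, or model, so this uses a RankNet-style pairwise logistic loss on a hypothetical linear scorer, with made-up two-dimensional prompt features and a synthetic "ASR". It is not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, features):
    """Hypothetical linear proxy: a higher score means a higher predicted ASR."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_logistic_loss(weights, feats_a, feats_b, a_wins):
    """RankNet-style loss for one pair; a_wins is 1.0 if prompt A had the higher ASR."""
    p_a = sigmoid(score(weights, feats_a) - score(weights, feats_b))
    eps = 1e-12  # guards against log(0)
    return -(a_wins * math.log(p_a + eps) + (1.0 - a_wins) * math.log(1.0 - p_a + eps))


def train(pairs, dim, lr=0.1, epochs=200, seed=0):
    """Plain SGD on the pairwise loss; returns the learned weight vector."""
    rng = random.Random(seed)
    weights = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for feats_a, feats_b, a_wins in pairs:
            p_a = sigmoid(score(weights, feats_a) - score(weights, feats_b))
            # Gradient of the loss w.r.t. weights: (p_a - a_wins) * (feats_a - feats_b)
            for i in range(dim):
                weights[i] -= lr * (p_a - a_wins) * (feats_a[i] - feats_b[i])
    return weights


def pairwise_accuracy(weights, pairs):
    """Fraction of pairs whose predicted ordering matches the labeled ordering."""
    correct = 0
    for feats_a, feats_b, a_wins in pairs:
        predicted_a = score(weights, feats_a) > score(weights, feats_b)
        correct += int(predicted_a == (a_wins == 1.0))
    return correct / len(pairs)


def make_toy_pairs(n=200, seed=1):
    """Synthetic data only: the 'ASR' here is an invented linear function of the features."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = [rng.random(), rng.random()]
        b = [rng.random(), rng.random()]
        asr_a = 0.8 * a[0] + 0.2 * a[1]
        asr_b = 0.8 * b[0] + 0.2 * b[1]
        pairs.append((a, b, 1.0 if asr_a > asr_b else 0.0))
    return pairs


if __name__ == "__main__":
    toy_pairs = make_toy_pairs()
    learned = train(toy_pairs, dim=2)
    print("pairwise ranking accuracy on toy pairs:", pairwise_accuracy(learned, toy_pairs))
```

One reason a pairwise objective of this kind may be attractive is that it only needs the ordering of two measured ASR values to be reliable, not their exact magnitudes. The pairwise accuracy computed here is one plausible reading of a "relative ranking" accuracy, not necessarily the metric the paper reports.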

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Information and Cyber Security