Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel

TL;DR
This paper introduces a method to extract surrogate safety classifiers from aligned LLMs, enabling more efficient jailbreak attacks and revealing vulnerabilities in safety alignment mechanisms.
Contribution
The paper presents a novel technique for extracting surrogate classifiers from LLMs to improve jailbreak attack efficiency and transferability, exposing alignment vulnerabilities.
Findings
The best surrogate classifiers agree with the LLM's safety classifier at an F1 score above 80% while using as little as 20% of the model.
Attacks on surrogate classifiers transfer effectively to the LLM.
Using surrogate classifiers reduces attack resource requirements significantly.
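One way to make the transfer finding concrete is a simple transfer rate: among adversarial inputs that flip the surrogate's decision, the fraction that also make the LLM comply. The sketch below is an illustrative assumption, not the paper's exact metric; the function name `transfer_rate` and the boolean encoding are hypothetical.

```python
from typing import Sequence


def transfer_rate(surrogate_fooled: Sequence[bool],
                  llm_complied: Sequence[bool]) -> float:
    """Fraction of surrogate-successful adversarial inputs that also succeed on the LLM.

    surrogate_fooled[i]: True if adversarial input i flipped the surrogate to "safe".
    llm_complied[i]:     True if the full LLM complied with adversarial input i.
    Both encodings are assumptions made for this illustration.
    """
    if len(surrogate_fooled) != len(llm_complied):
        raise ValueError("inputs must have the same length")
    # Keep only the LLM outcomes for inputs that succeeded against the surrogate.
    outcomes = [llm for sur, llm in zip(surrogate_fooled, llm_complied) if sur]
    if not outcomes:
        return 0.0  # no successful surrogate attacks, so nothing to transfer
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Toy data, not results from the paper.
    print(transfer_rate([True, True, False, True], [True, False, True, True]))
```

Conditioning on surrogate success separates how hard the surrogate is to attack from how well those attacks carry over to the LLM.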
Abstract
Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20%…
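The agreement measurement described in the abstract can be sketched as an F1 score that treats the LLM's own refuse/comply decisions as the reference labels and a candidate classifier's predictions as the estimates. This is a minimal illustration under assumed conventions (1 = refusal, 0 = compliance); the function name `f1_agreement` is hypothetical and the paper's exact labeling procedure may differ.

```python
from typing import Sequence


def f1_agreement(llm_labels: Sequence[int],
                 candidate_labels: Sequence[int]) -> float:
    """F1 score of a candidate classifier against the LLM's decisions.

    Assumed encoding for this sketch: 1 = refusal (input judged unsafe),
    0 = compliance. The LLM's decisions serve as the reference labels.
    """
    if len(llm_labels) != len(candidate_labels):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(llm_labels, candidate_labels))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # both refuse
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # candidate refuses, LLM complies
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # candidate complies, LLM refuses
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); define as 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    llm = [1, 1, 0, 0, 1]
    candidate = [1, 0, 0, 1, 1]
    print(f1_agreement(llm, candidate))
```

Running the same computation on adversarial inputs as well as benign ones would show whether a candidate still tracks the LLM's decisions under attack, which is the second setting the abstract mentions.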
Taxonomy
Topics: Natural Language Processing Techniques · Statistical and Computational Modeling · Artificial Intelligence in Law
Methods: LLaMA
