Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

Jean-Charles Noirot Ferrand; Yohan Beugin; Eric Pauley; Ryan Sheatsley; Patrick McDaniel

arXiv:2501.16534 · cs.CR · February 19, 2026

TL;DR

This paper introduces a method to extract surrogate safety classifiers from aligned LLMs, enabling more efficient jailbreak attacks and revealing vulnerabilities in safety alignment mechanisms.

Contribution

The paper presents a novel technique for extracting surrogate classifiers from LLMs to improve jailbreak attack efficiency and transferability, exposing alignment vulnerabilities.

Findings

01

The best surrogate classifiers agree with the LLM's safety classifier at an F1 score above 80% while using as little as 20% of the model.

02

Attacks on surrogate classifiers transfer effectively to the LLM.

03

Using surrogate classifiers reduces attack resource requirements significantly.

Abstract

Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20%…
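The agreement metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes binary labels where 1 means the LLM refused and 0 means it complied, and it treats the LLM's own decision as the ground truth that a candidate classifier is scored against. The function name `f1_agreement` and the toy labels are hypothetical and are not taken from the paper, whose exact evaluation protocol is not given in this listing.

```python
def f1_agreement(llm_labels, candidate_preds):
    """F1 score of a candidate's predictions against the LLM's decisions.

    Assumption: 1 = refusal (positive class), 0 = compliance.
    """
    if len(llm_labels) != len(candidate_preds):
        raise ValueError("label and prediction lists must have equal length")

    pairs = list(zip(llm_labels, candidate_preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # both refuse
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # candidate refuses, LLM complies
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # candidate complies, LLM refuses

    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only (not results from the paper).
llm_labels = [1, 1, 1, 0, 0, 1, 0, 1]
candidate_preds = [1, 1, 0, 0, 1, 1, 0, 1]
score = f1_agreement(llm_labels, candidate_preds)
```

On this toy data the candidate has 4 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.8 and the F1 score is 0.8.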

Code & Models

Datasets

jcnf/targeting-alignment
dataset · 75 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Statistical and Computational Modeling · Artificial Intelligence in Law

Methods: LLaMA