Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang; Andrew Estornell; David D. Baek; Bo Li; Xiaojun Xu

arXiv:2510.18081·cs.LG·October 22, 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu

PDF

Open Access 3 Reviews

TL;DR

This paper introduces Any-Depth Alignment (ADA), a simple inference-time method that enhances the safety of large language models by reintroducing assistant header tokens to prevent harmful outputs at any generation depth.

Contribution

ADA is a novel inference-time technique that reactivates innate safety priors in LLMs without modifying their parameters, significantly improving robustness against adversarial attacks.

Findings

01

ADA achieves near-100% refusal rate against adversarial prefill attacks.

02

ADA reduces success rates of prompt attacks to below 3%.

03

ADA maintains utility on benign tasks with minimal over-refusal.

Abstract

Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01Rating 6Confidence 3

Strengths

(1) Novel framing and insight: The paper advances a conceptually fresh hypothesis—that safety alignment priors already exist in model hidden states but remain “locked.” The discovery that assistant headers act as Safety Tokens is both empirically supported and intuitively plausible. (2) Strong empirical coverage: Evaluation spans 9 diverse model families, 4 harmfulness benchmarks (AdvBench, JailbreakBench, StrongReject, HEx-PHI), and multiple defense baselines. ADA (LP) outperforms both deep al

Weaknesses

(1) Mechanistic claims need deeper causal validation: While the authors show correlations between Safety-Token activations and refusal behavior, the causal mechanism remains partially speculative. The Transcoder neuron analysis is suggestive but does not yet establish that these neurons cause refusal rather than correlate with it. (2) Scope of evaluation is primarily safety-focused: The work measures safety robustness and over-refusal but does not test whether ADA affects reasoning quality, fac

Reviewer 02Rating 6Confidence 4

Strengths

1. The identification of assistant-header tokens as safety tokens that surface a strong, separable harmfulness signal is an insightful finding, which makes the proposed ADA well-motivated and empirically grounded. 2. The authors show strong empirical performance of ADA compared to existing methods: high refusal rates against harmful prompts, minimal over-refusal with benign prompts, and relatively efficient. Various robustness checks are also provided. 3. ADA doesn’t require fine-tuning and mode

Weaknesses

1. Leakage before cutoff: Because interventions trigger mid-stream, a small amount of harmful content can be emitted before the refusal fires; although the authors acknowledge this limitation, some quantification or examples of such leakage across tasks would strengthen the argument about the utility of ADA. 2. Dependence on the base model: the effectiveness of ADA, especially ADA-RK, fundamentally relies on the base model’s alignment strength (i.e., the model must already possess latent robust

Reviewer 03Rating 6Confidence 3

Strengths

Novelty and Significance: The paper introduces the important concept of "deep prefill attacks" to rigorously evaluate alignment depth and identifies a fundamental mechanism (unlocking innate alignment via Safety Tokens) rather than just proposing another fine-tuning method or external model. The idea of leveraging the model's own latent safety knowledge at inference time is highly significant. Effectiveness and Generality: ADA, especially the ADA-LP variant, demonstrates outstanding effectivene

Weaknesses

Reliance on Hidden State Access (ADA-LP): The most effective variant, ADA-LP, requires access to the model's internal hidden states. This limits its direct applicability to scenarios where users interact with closed APIs that only provide text outputs. While ADA-RK offers a training-free alternative for such cases, it is shown to be less consistently effective, particularly on models with weaker base alignment. Vulnerability in User-Controlled Environments: As acknowledged by the authors, ADA i

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)