Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu

TL;DR
This paper introduces Any-Depth Alignment (ADA), a simple inference-time method that enhances the safety of large language models by reintroducing assistant header tokens to prevent harmful outputs at any generation depth.
Contribution
ADA is a novel inference-time technique that reactivates innate safety priors in LLMs without modifying their parameters, significantly improving robustness against adversarial attacks.
Findings
ADA achieves near-100% refusal rate against adversarial prefill attacks.
ADA reduces success rates of prompt attacks to below 3%.
ADA maintains utility on benign tasks with minimal over-refusal.
Abstract
Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover…
Peer Reviews
Decision·ICLR 2026 Poster
(1) Novel framing and insight: The paper advances a conceptually fresh hypothesis—that safety alignment priors already exist in model hidden states but remain “locked.” The discovery that assistant headers act as Safety Tokens is both empirically supported and intuitively plausible. (2) Strong empirical coverage: Evaluation spans 9 diverse model families, 4 harmfulness benchmarks (AdvBench, JailbreakBench, StrongReject, HEx-PHI), and multiple defense baselines. ADA (LP) outperforms both deep al
(1) Mechanistic claims need deeper causal validation: While the authors show correlations between Safety-Token activations and refusal behavior, the causal mechanism remains partially speculative. The Transcoder neuron analysis is suggestive but does not yet establish that these neurons cause refusal rather than correlate with it. (2) Scope of evaluation is primarily safety-focused: The work measures safety robustness and over-refusal but does not test whether ADA affects reasoning quality, fac
1. The identification of assistant-header tokens as safety tokens that surface a strong, separable harmfulness signal is an insightful finding, which makes the proposed ADA well-motivated and empirically grounded. 2. The authors show strong empirical performance of ADA compared to existing methods: high refusal rates against harmful prompts, minimal over-refusal with benign prompts, and relatively efficient. Various robustness checks are also provided. 3. ADA doesn’t require fine-tuning and mode
1. Leakage before cutoff: Because interventions trigger mid-stream, a small amount of harmful content can be emitted before the refusal fires; although the authors acknowledge this limitation, some quantification or examples of such leakage across tasks would strengthen the argument about the utility of ADA. 2. Dependence on the base model: the effectiveness of ADA, especially ADA-RK, fundamentally relies on the base model’s alignment strength (i.e., the model must already possess latent robust
Novelty and Significance: The paper introduces the important concept of "deep prefill attacks" to rigorously evaluate alignment depth and identifies a fundamental mechanism (unlocking innate alignment via Safety Tokens) rather than just proposing another fine-tuning method or external model. The idea of leveraging the model's own latent safety knowledge at inference time is highly significant. Effectiveness and Generality: ADA, especially the ADA-LP variant, demonstrates outstanding effectivene
Reliance on Hidden State Access (ADA-LP): The most effective variant, ADA-LP, requires access to the model's internal hidden states. This limits its direct applicability to scenarios where users interact with closed APIs that only provide text outputs. While ADA-RK offers a training-free alternative for such cases, it is shown to be less consistently effective, particularly on models with weaker base alignment. Vulnerability in User-Controlled Environments: As acknowledged by the authors, ADA i
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
