Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Yinzhi Zhao; Ming Wang; Shi Feng; Xiaocui Yang; Daling Wang; Yifei Zhang

arXiv:2601.10543 · cs.AI · February 2, 2026

TL;DR

This paper introduces a decoding-time safety-awareness probing method that detects unsafe content in large language models by leveraging latent safety signals, improving defense against jailbreak attacks without harming utility.

Contribution

The paper presents a novel decoding-time safety probing approach that activates intrinsic safety signals in LLMs to better detect and prevent jailbreak-generated unsafe content.

Findings

01. Significantly improves safety against jailbreak attacks
02. Maintains low false positive rates on benign inputs
03. Preserves response quality during detection

Abstract

Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often failing to provide robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model's drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection