Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Pankayaraj Pathmanathan; Furong Huang

arXiv:2604.09665·cs.LG·April 17, 2026

TL;DR

This paper studies how well deliberative alignment works in large language models. It finds that unsafe behaviors retained from the base model persist after alignment, and it proposes a best-of-N (BoN) sampling method that improves safety with minimal utility loss.

Contribution

The paper shows that deliberative alignment does not fully mitigate unsafe behaviors, and it introduces an attribution-based sampling technique that improves safety across multiple benchmarks.

Findings

01

Deliberative alignment reduces unsafe responses but does not eliminate them.

02

BoN sampling that attributes unsafe behavior to the base model improves safety.

03

Safety improvements persist even after reinforcement learning training.
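The BoN idea in the findings above can be illustrated with a generic best-of-N selection loop. This is a minimal sketch, not the authors' method: the summary does not specify how unsafe behavior is attributed to the base model, so `unsafe_score` is a hypothetical stand-in for that attribution signal, and `generate` and `best_of_n` are likewise illustrative names.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    unsafe_score: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidate responses and return the one with the lowest unsafe score.

    `generate` draws one response from the aligned model (hypothetical).
    `unsafe_score` is an assumed placeholder for whatever signal attributes
    a response's unsafe behavior to the base model; lower means safer.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates for the same prompt.
    candidates = [generate(prompt) for _ in range(n)]
    # Keep the candidate judged least unsafe; ties go to the earliest sample.
    return min(candidates, key=lambda response: unsafe_score(prompt, response))
```

Because selection happens only at inference time, a loop like this leaves the model weights untouched; the cost is n forward generations per prompt instead of one.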

Abstract

While the wide adoption of refusal training in large language models (LLMs) has improved model safety, recent work has highlighted shortcomings that stem from the shallow nature of these alignment methods. To address this, deliberative alignment proposed distilling reasoning capabilities from stronger reasoning models, thereby instilling deeper safety in LLMs. In this work, we study the impact of deliberative alignment in language models. First, we show that although the teacher is larger in model size and stronger in safety capability, there exists an alignment gap between teacher and student language models, which affects both the safety and general utility of the student model. Furthermore, we show that models aligned through deliberative alignment can retain unsafe behaviors from the base model despite learning the reasoning patterns of larger reasoning models. Building…
