Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model
Pankayaraj Pathmanathan, Furong Huang

TL;DR
This paper investigates the effectiveness of deliberative alignment in large language models, shows that unsafe behaviors persist after alignment, and proposes a best-of-N (BoN) sampling method that improves safety with minimal utility loss.
Contribution
It demonstrates the limitations of deliberative alignment in fully mitigating unsafe behaviors and introduces a novel attribution-based sampling technique to enhance safety across multiple benchmarks.
Findings
Deliberative alignment reduces unsafe responses but does not eliminate them.
BoN sampling that attributes unsafe behavior to the base model improves safety.
Safety improvements persist even after reinforcement learning training.
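
As a rough illustration of the selection step in best-of-N (BoN) sampling, the sketch below draws no samples itself; it only picks the highest-scoring response from a list of candidates. It is a generic sketch, not the paper's method: the names `best_of_n`, `toy_safety_score`, and `FLAGGED` are hypothetical, and the keyword-based scorer is a stand-in for whatever attribution-based score the paper actually uses, which this page does not specify.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], safety_score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=safety_score)


# Hypothetical stand-in scorer: fewer flagged words means a higher score.
# A real system would replace this with a learned or attribution-based score.
FLAGGED = {"exploit", "weapon"}


def toy_safety_score(text: str) -> float:
    words = (w.strip(".,!?") for w in text.lower().split())
    return -float(sum(w in FLAGGED for w in words))


# Pretend these are N = 3 responses sampled from the aligned model.
samples = [
    "Here is how to build a weapon.",
    "I can't help with that, but here is some safety information.",
    "Use this exploit to get a weapon.",
]
chosen = best_of_n(samples, toy_safety_score)
print(chosen)
```

Passing the scorer as a plain callable keeps the selection logic independent of how safety is measured, so a different scoring rule can be swapped in without touching `best_of_n`.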
Abstract
While the wide adoption of refusal training in large language models (LLMs) has showcased improvements in model safety, recent works have highlighted shortcomings due to the shallow nature of these alignment methods. To this end, the work on deliberative alignment proposed distilling reasoning capabilities from stronger reasoning models, thereby instilling deeper safety in LLMs. In this work, we study the impact of deliberative alignment in language models. First, we show that although the teacher is larger in model size and stronger in safety capability, there exists an alignment gap between teacher and student language models, which affects both the safety and general utility of the student model. Furthermore, we show that models aligned through deliberative alignment can retain unsafe behaviors from the base model despite learning the reasoning patterns of larger reasoning models. Building…
Code & Models
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-0.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-1.5B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-0.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-7B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-0.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Llama-8B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-0.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-14B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-0.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-32B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-0.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Llama-70B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-0.5B-Instruct-DATASET-STAR-41K-DA-Filtered-QwQ-32B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-1.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-1.5B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-1.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-7B (model)
- 🤗 Pankayaraj/DA-SFT-MODEL-Qwen2.5-1.5B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Llama-8B (model)
- Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Llama-8B (dataset, 18 downloads)
- Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-14B (dataset, 21 downloads)
- Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-32B (dataset, 45 downloads)
- Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-1.5B (dataset, 20 downloads)
- Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-7B (dataset, 8 downloads)
- Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Llama-70B (dataset, 25 downloads)
- Pankayaraj/STAR-41K-DA-Filtered-QwQ-32B (dataset, 137 downloads)
