LLM Safety Alignment is Divergence Estimation in Disguise

Rajdeep Haldar; Ziyi Wang; Qifan Song; Guang Lin; Yue Xing

arXiv:2502.00657 · cs.LG · October 22, 2025

TL;DR

This paper introduces a theoretical framework that interprets LLM alignment methods, including RLHF and its variants, as divergence estimators. The framework explains why safe and harmful prompts separate in latent space after alignment, and it motivates KLDO, a new KL divergence-based alignment method validated empirically.

Contribution

The paper provides a divergence-based perspective on LLM alignment methods and introduces KLDO, a KL divergence-based alignment technique with empirical validation.

Findings

01. Alignment methods can be viewed as divergence estimators.

02. KLDO improves safety alignment performance.

03. A distance-based metric in the prompt representation space is a statistically significant indicator of model safety.
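
To make the first finding concrete, here is a minimal sketch of divergence estimation by optimization. It uses the Donsker-Varadhan bound, a standard variational form of KL divergence: KL(P || Q) = sup over T of E_P[T(x)] - log E_Q[exp(T(x))]. This summary does not state the exact KLDO objective, so the bound, the linear critic T(x) = a * x, the toy Gaussian data, and the name `dv_kl_estimate` are all assumptions chosen for illustration, not the authors' method.

```python
import numpy as np

def dv_kl_estimate(x_p, x_q, steps=500, lr=0.1):
    """Estimate KL(P || Q) from samples by maximizing the
    Donsker-Varadhan bound over a linear critic T(x) = a * x.
    (Hypothetical helper for illustration only.)"""
    a = 0.0
    for _ in range(steps):
        # Softmax weights over Q samples: gradient of log-mean-exp(a * x_q)
        w = np.exp(a * x_q)
        w /= w.sum()
        grad = x_p.mean() - np.sum(w * x_q)
        a += lr * grad  # gradient ascent; the objective is concave in a
    bound = a * x_p.mean() - np.log(np.mean(np.exp(a * x_q)))
    return bound, a

# Toy stand-ins for "aligned" (P) and "unaligned" (Q) distributions.
rng = np.random.default_rng(0)
x_p = rng.normal(1.0, 1.0, 20000)  # P = N(1, 1)
x_q = rng.normal(0.0, 1.0, 20000)  # Q = N(0, 1)

kl_hat, slope = dv_kl_estimate(x_p, x_q)
print(f"estimated KL: {kl_hat:.3f}, critic slope: {slope:.3f}")
```

For these two Gaussians the true KL divergence is 0.5 and the optimal critic is linear with slope 1, so the estimate should land close to both values. The point of the sketch is that training a scoring function to separate two sample sets yields an estimate of the divergence between them.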

Abstract

We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
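
The separation metric mentioned at the end of the abstract can be illustrated with a simple stand-in. The abstract does not define the metric, so the sketch below assumes one plausible form: the distance between the centroids of safe and harmful prompt representations, divided by the pooled within-class spread. The function name `separation_score` and the synthetic 16-dimensional "representations" are hypothetical; real use would substitute hidden states extracted from a model.

```python
import numpy as np

def separation_score(safe, harmful):
    """Centroid distance divided by pooled RMS within-class spread.
    (Assumed illustrative metric, not the paper's definition.)"""
    mu_s, mu_h = safe.mean(axis=0), harmful.mean(axis=0)
    centroid_gap = np.linalg.norm(mu_s - mu_h)
    # RMS distance of each point to its own class centroid, pooled
    spread = np.sqrt(
        (np.sum((safe - mu_s) ** 2) + np.sum((harmful - mu_h) ** 2))
        / (len(safe) + len(harmful))
    )
    return centroid_gap / spread

rng = np.random.default_rng(0)
dim = 16
safe = rng.normal(0.0, 1.0, (200, dim))
# Synthetic "before alignment": harmful prompts overlap with safe ones.
harmful_overlap = rng.normal(0.1, 1.0, (200, dim))
# Synthetic "after alignment": harmful prompts sit in a shifted cluster.
harmful_apart = rng.normal(1.5, 1.0, (200, dim))

score_overlap = separation_score(safe, harmful_overlap)
score_apart = separation_score(safe, harmful_apart)
print(f"overlapping: {score_overlap:.2f}, separated: {score_apart:.2f}")
```

A larger score means the two prompt classes occupy more distinct regions of representation space, which is the effect the abstract attributes to alignment.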

Code & Models

Repositories

rhaldarpurdue/kldo (PyTorch, official)

Videos

LLM Safety Alignment is Divergence Estimation in Disguise · SlidesLive
