BSO: Safety Alignment Is Density Ratio Matching

Tien-Phat Nguyen; Truong Nguyen; Thin Nguyen; Duy Minh Ho Nguyen; Ngoc-Thanh Dinh; Trung Le

arXiv:2605.12339·cs.LG·May 13, 2026


TL;DR

This paper introduces Bregman Safety Optimization (BSO), a principled single-stage method for safety alignment of language models. It improves the safety-helpfulness trade-off without auxiliary models or multi-stage procedures.

Contribution

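The paper shows that the likelihood ratio of the optimal safe policy admits a closed-form decomposition, which reduces safety alignment to density ratio matching. Minimizing Bregman divergences between the data and model ratios then yields BSO, a family of single-stage loss functions, each induced by a convex generator.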

Findings

01

BSO consistently improves safety-helpfulness trade-offs across benchmarks.

02

It recovers existing safety-aware methods as special cases.

03

BSO requires no auxiliary models and has only one additional hyperparameter.

Abstract

Aligning language models for both helpfulness and safety typically requires complex pipelines: separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond…
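To make "minimizing Bregman divergences between the data and model ratios" concrete, here is a minimal sketch of the generic Bregman divergence between two positive ratio values, B_f(r || r_model) = f(r) - f(r_model) - f'(r_model)(r - r_model), for a convex generator f. This is not the BSO loss itself: the exact objective is not given in this excerpt, and the function names (`bregman`, `mean_bregman`) and the two generators below are illustrative assumptions borrowed from the general density-ratio-estimation setting.

```python
import math

def bregman(f, df, r, r_model):
    """Pointwise Bregman divergence B_f(r || r_model) for a convex generator f
    with derivative df. Non-negative, and zero when r == r_model."""
    return f(r) - f(r_model) - df(r_model) * (r - r_model)

# Hypothetical generator 1: squared loss, f(t) = (t - 1)^2 / 2.
# Its Bregman divergence simplifies to (r - r_model)^2 / 2.
def f_sq(t):
    return 0.5 * (t - 1.0) ** 2

def df_sq(t):
    return t - 1.0

# Hypothetical generator 2: f(t) = t log t - t (unnormalized KL).
# Its Bregman divergence is r * log(r / r_model) - r + r_model.
def f_kl(t):
    return t * math.log(t) - t

def df_kl(t):
    return math.log(t)

def mean_bregman(f, df, data_ratios, model_ratios):
    """Average pointwise divergence over paired (data, model) ratio values."""
    pairs = list(zip(data_ratios, model_ratios))
    return sum(bregman(f, df, r, m) for r, m in pairs) / len(pairs)

# Toy usage with made-up positive ratios (illustration only, not paper data).
data_ratios = [0.5, 1.0, 2.0]
model_ratios = [0.6, 1.0, 1.5]
loss_sq = mean_bregman(f_sq, df_sq, data_ratios, model_ratios)
loss_kl = mean_bregman(f_kl, df_kl, data_ratios, model_ratios)
```

Each choice of generator gives a different loss with the same minimizer (model ratio equal to data ratio), which is the sense in which a single convex generator can index a family of objectives. In practice the data ratio is not observed directly, so a trainable objective would be written in terms of expectations over samples rather than known ratio values.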
