BSO: Safety Alignment Is Density Ratio Matching
Tien-Phat Nguyen, Truong Nguyen, Thin Nguyen, Duy Minh Ho Nguyen, Ngoc-Thanh Dinh, Trung Le

TL;DR
This paper introduces Bregman Safety Optimization (BSO), a principled, simple method for safety alignment in language models that improves safety-helpfulness trade-offs without complex procedures.
Contribution
The paper derives a closed-form decomposition of the optimal safe policy likelihood ratio, leading to a family of loss functions called BSO that simplifies safety alignment.
Findings
BSO consistently improves safety-helpfulness trade-offs across benchmarks.
It recovers existing safety-aware methods as special cases.
BSO requires no auxiliary models and has only one additional hyperparameter.
Abstract
Aligning language models for both helpfulness and safety typically requires complex pipelines: separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond…
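To make the phrase "Bregman divergences between the data and model ratios" concrete, here is a minimal sketch of the textbook pointwise Bregman divergence between two positive ratio values, D_f(r* || r) = f(r*) - f(r) - f'(r)(r* - r). It is not the paper's BSO loss: the two generators (a least-squares one and a logistic-style one) and all function names are illustrative assumptions, chosen only to show how different convex generators induce different matching losses that vanish when the model ratio equals the target ratio.

```python
import math

# Illustrative convex generators f(t) and their derivatives f'(t), for t > 0.
# These are assumptions for this sketch, not generators taken from the paper.

def f_ls(t):
    # Least-squares generator: f(t) = (t - 1)^2 / 2
    return 0.5 * (t - 1.0) ** 2

def df_ls(t):
    return t - 1.0

def f_lr(t):
    # Logistic-style generator: f(t) = t log t - (1 + t) log(1 + t)
    # Its second derivative is 1 / (t (1 + t)) > 0, so it is convex on t > 0.
    return t * math.log(t) - (1.0 + t) * math.log(1.0 + t)

def df_lr(t):
    return math.log(t) - math.log(1.0 + t)

def bregman(f, df, r_target, r_model):
    """Pointwise Bregman divergence D_f(r_target || r_model).

    Non-negative for convex f, and zero when r_model == r_target.
    """
    return f(r_target) - f(r_model) - df(r_model) * (r_target - r_model)

if __name__ == "__main__":
    r_target, r_model = 2.0, 0.5  # hypothetical ratio values
    print(bregman(f_ls, df_ls, r_target, r_model))
    print(bregman(f_lr, df_lr, r_target, r_model))
```

With the least-squares generator the divergence reduces to half the squared difference between the two ratios, so the choice of generator determines how mismatches between the data and model ratios are penalized.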
