Cat-DPO: Category-Adaptive Safety Alignment

Tiankai Yang; Yi Nian; Xinyuan Li; Ruiyao Xu; Kaize Ding; Yue Zhao

arXiv:2604.17299·cs.CL·April 22, 2026


TL;DR

Cat-DPO is a direct-preference-optimization method for safety alignment of large language models that keeps a separate adaptive safety margin for each harm category. It improves aggregate helpfulness and harmlessness while making safety more consistent across harm categories.

Contribution

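The paper casts safety alignment as a per-category constrained optimization problem and derives a category-adaptive algorithm that adjusts the safety margin for each harm category during training. This addresses the main limitation of uniform approaches, which apply a single safety scalar to every preference pair and leave a minority of harm categories relatively unsafe.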

Findings

01. Improves aggregate helpfulness and harmlessness across models.

02. Reduces safety variance and worst-case safety gaps.

03. Offers a scalable, per-category safety refinement method.

Abstract

Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning…
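The margin behavior described in the abstract can be illustrated with a small sketch. The abstract does not give the actual update rule or loss, so everything below is an assumption made for illustration: the function names `dpo_margin_loss` and `update_margins`, the parameters `beta`, `target`, and `step`, and the choice of a projected dual-ascent-style update are hypothetical and are not taken from the paper.

```python
import math

def dpo_margin_loss(logratio_chosen, logratio_rejected, margin, beta=0.1):
    """DPO-style logistic loss with an additive margin (illustrative only).

    logratio_* are log(pi_theta / pi_ref) for the chosen and rejected
    responses. A larger margin demands a larger implicit reward gap.
    """
    gap = beta * (logratio_chosen - logratio_rejected) - margin
    # -log(sigmoid(gap)) == softplus(-gap), written in a numerically stable form
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))

def update_margins(margins, unsafe_rates, target=0.05, step=0.5):
    """Assumed per-category update: tighten when a category is still unsafe.

    unsafe_rates maps category -> fraction of recent responses judged unsafe.
    The margin grows when the rate exceeds `target`, shrinks when it falls
    below, and is clipped at zero.
    """
    updated = dict(margins)
    for category, rate in unsafe_rates.items():
        current = margins.get(category, 0.0)
        updated[category] = max(0.0, current + step * (rate - target))
    return updated

# Hypothetical categories and rates, for illustration only.
margins = {"violence": 0.0, "privacy": 0.0}
margins = update_margins(margins, {"violence": 0.25, "privacy": 0.0})
tightened = margins["violence"]   # grows because 0.25 > target
margins = update_margins(margins, {"violence": 0.0, "privacy": 0.0})
relaxed = margins["violence"]     # shrinks once the category catches up
```

In this sketch, a category with a high unsafe rate gets a larger margin and therefore a stronger training signal on its preference pairs, while categories already under the target keep a margin at or near zero. How Cat-DPO actually couples the margin to the loss is specified in the paper, not here.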
