DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding
Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

TL;DR
DARC is a retraining-free, inference-time method that improves alignment robustness by reranking candidate responses under a KL-robust (entropic) objective, explicitly accounting for preference disagreement and risk during response selection.
Contribution
The paper introduces a retraining-free, risk-sensitive decoding method that explicitly accounts for disagreement among preferences, improving the alignment robustness of language models.
Findings
DARC reduces disagreement and tail risk in selected responses.
DARC maintains competitive average quality under noisy feedback.
DARC provides explicit risk controls at inference time.
Abstract
Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this…
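The reranking idea in the abstract can be illustrated with a minimal sketch. It assumes the standard entropic (KL-robust) certainty equivalent, -(1/β) log E[exp(-β r)], computed over a handful of preference scores per candidate. The function names, the `beta`, `lam`, and `budget` parameters, and the fallback rule are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from typing import Mapping, Optional, Sequence


def entropic_value(scores: Sequence[float], beta: float) -> float:
    """Entropic certainty equivalent: -(1/beta) * log(mean(exp(-beta * s)))."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not scores:
        raise ValueError("scores must be non-empty")
    # Shift by the minimum so every exponent is <= 0 (numerically stable).
    m = min(scores)
    total = sum(math.exp(-beta * (s - m)) for s in scores)
    return m - math.log(total / len(scores)) / beta


def risk_premium(scores: Sequence[float], beta: float) -> float:
    """Mean minus entropic value; non-negative by Jensen's inequality."""
    mean = sum(scores) / len(scores)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, mean - entropic_value(scores, beta))


def select(
    candidates: Mapping[str, Sequence[float]],
    beta: float,
    lam: float = 1.0,
    budget: Optional[float] = None,
) -> str:
    """Pick the candidate maximizing mean - lam * premium.

    lam = 1 recovers the entropic value itself; lam = 0 is plain mean
    maximization. If `budget` is set, candidates whose premium exceeds it
    are excluded (assumed "cap" control). If every candidate exceeds the
    budget, fall back to the lowest-premium one (an assumption).
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    premiums = {name: risk_premium(s, beta) for name, s in candidates.items()}
    allowed = [n for n in candidates if budget is None or premiums[n] <= budget]
    if not allowed:
        return min(premiums, key=premiums.get)

    def score(name: str) -> float:
        s = candidates[name]
        return sum(s) / len(s) - lam * premiums[name]

    return max(allowed, key=score)


# Hypothetical scores from three annotators (or disagreement proxies).
example = {
    "A": [0.9, 0.9, 0.1],  # higher mean, strong disagreement
    "B": [0.6, 0.6, 0.6],  # slightly lower mean, full agreement
}
```

With these made-up scores, `select(example, beta=5.0, lam=0.0)` picks the higher-mean candidate `"A"`, while the entropic objective (`lam=1.0`) or a cap such as `budget=0.1` picks the consensus candidate `"B"`. Folding the penalty and the cap into one function is a design convenience of this sketch; the paper may define its deployment controls differently.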
Taxonomy
Topics: Constraint Satisfaction and Optimization · Data Management and Algorithms · Recommender Systems and Techniques
