Risk-aware Direct Preference Optimization under Nested Risk Measure
Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei

TL;DR
This paper introduces Risk-aware Direct Preference Optimization (Ra-DPO), a method for fine-tuning language models that uses nested risk measures to balance alignment with human values against risk control. The authors report that it outperforms existing approaches.
Contribution
Ra-DPO is a novel risk-aware optimization framework that incorporates nested risk measures into preference optimization for better risk management during model fine-tuning.
Findings
Outperforms existing methods in balancing alignment and risk.
Effective risk control demonstrated on multiple datasets.
Open-source implementation available.
Abstract
When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods typically introduce KL divergence to constrain deviations between the trained model and the reference model; however, this may not be sufficient in certain applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy…
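For orientation, the Bradley-Terry preference likelihood that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard DPO loss, written over per-token log-probabilities summed to a sequence-level log-ratio; it is not the Ra-DPO objective, whose risk-aware terms are not given in the text above. The function names, the plain-list inputs, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def sequence_log_ratio(policy_token_logps, ref_token_logps):
    """Sum over tokens of log pi(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t)."""
    if len(policy_token_logps) != len(ref_token_logps):
        raise ValueError("policy and reference must score the same tokens")
    return sum(p - r for p, r in zip(policy_token_logps, ref_token_logps))


def dpo_loss(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is a list of per-token log-probabilities for the chosen
    or rejected response under the policy or the reference model.
    """
    chosen_ratio = sequence_log_ratio(policy_chosen, ref_chosen)
    rejected_ratio = sequence_log_ratio(policy_rejected, ref_rejected)
    # Bradley-Terry: P(chosen > rejected) = sigmoid(margin).
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Policy identical to the reference: both log-ratios are zero, so the
    # preference probability is 0.5 and the loss equals log(2).
    same = [-1.0, -2.0]
    print(dpo_loss(same, same, same, same))
```

When the policy assigns relatively more probability than the reference to the chosen response and less to the rejected one, the margin becomes positive and the loss falls below log(2). A risk-aware variant would replace the plain log-ratio with a risk-adjusted quantity; that step is deliberately left out here.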
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Multimodal Machine Learning Applications
Methods: ALIGN
