Risk-aware Direct Preference Optimization under Nested Risk Measure

Lijun Zhang; Lin Li; Yajie Qi; Huizhong Song; Yaodong Yang; Jun Wang; Wei Wei

arXiv:2505.20359 · cs.LG · May 30, 2025

TL;DR

This paper introduces Risk-aware Direct Preference Optimization (Ra-DPO), a method for fine-tuning language models that balances alignment with human values against risk control by using nested risk measures; the authors report that it outperforms existing approaches.

Contribution

Ra-DPO is a novel risk-aware optimization framework that incorporates nested risk measures into preference optimization for better risk management during model fine-tuning.
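The summary does not say which risk measures Ra-DPO nests or how it nests them. As a rough illustration only, the sketch below assumes conditional value-at-risk (CVaR) as the per-step risk measure and uses a small hypothetical decision tree; the names `cvar`, `Node`, `nested_cvar`, and `path_outcomes` are invented here. A nested risk measure applies the risk measure at every step, rho(r_1 + rho(r_2 + ...)), instead of once to the total return, and the two can give different values.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def cvar(outcomes: List[Tuple[float, float]], alpha: float) -> float:
    """Lower-tail CVaR_alpha: mean reward over the worst alpha-fraction of probability mass.

    outcomes holds (probability, reward) pairs whose probabilities sum to 1.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    remaining, total = alpha, 0.0
    for prob, value in sorted(outcomes, key=lambda pv: pv[1]):  # worst rewards first
        take = min(prob, remaining)  # partial mass for the atom straddling the alpha cutoff
        total += take * value
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha


@dataclass
class Node:
    """Hypothetical tree node; each child is (transition probability, step reward, subtree)."""
    children: List[Tuple[float, float, "Node"]] = field(default_factory=list)


def nested_cvar(node: Node, alpha: float) -> float:
    """Nested risk: apply CVaR at every step, rho(r_1 + rho(r_2 + ...))."""
    if not node.children:
        return 0.0
    return cvar([(p, r + nested_cvar(child, alpha)) for p, r, child in node.children], alpha)


def path_outcomes(node: Node, prob: float = 1.0, acc: float = 0.0) -> List[Tuple[float, float]]:
    """Flatten the tree into (path probability, total reward) pairs for a one-shot CVaR."""
    if not node.children:
        return [(prob, acc)]
    out: List[Tuple[float, float]] = []
    for p, r, child in node.children:
        out.extend(path_outcomes(child, prob * p, acc + r))
    return out


# Toy two-step tree: subtree A is risky, subtree B is safe.
subtree_a = Node([(0.5, 0.0, Node()), (0.5, 4.0, Node())])
subtree_b = Node([(0.5, 2.0, Node()), (0.5, 2.0, Node())])
root = Node([(0.5, 0.0, subtree_a), (0.5, 0.0, subtree_b)])

print(nested_cvar(root, alpha=0.5))          # 0.0
print(cvar(path_outcomes(root), alpha=0.5))  # 1.0
```

On this toy tree the nested value is 0.0, while CVaR of the total return is 1.0: nesting discards the favourable branch of subtree A before the root ever sees it, so the nested measure is stricter here.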

Findings

1. Ra-DPO outperforms existing methods at balancing alignment and risk.
2. It demonstrates effective risk control on multiple datasets.
3. An open-source implementation is available.

Abstract

When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce a KL-divergence term to constrain the deviation between the trained model and the reference model; however, this may not be sufficient for applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy…
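As its name indicates, Ra-DPO builds on Direct Preference Optimization (DPO), whose loss fits a Bradley-Terry preference model using the implicit reward beta * log(pi / pi_ref). DPO is derived from KL-regularized reward maximization, the kind of constraint the abstract calls insufficient for tight risk control. For orientation, the sketch below shows only the standard risk-neutral DPO loss, with the implicit reward written as a sum of per-token log-ratios. It does not reproduce Ra-DPO's token-level Bradley-Terry reformulation, its risk-aware advantage terms, or its nested risk measures, none of which the truncated abstract spells out; the function names, toy log-probabilities, and beta = 0.1 are illustrative assumptions.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(policy_logps: Sequence[float], ref_logps: Sequence[float], beta: float) -> float:
    """beta * log(pi(y|x) / pi_ref(y|x)), written as a sum of per-token log-ratios."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    return beta * sum(p - r for p, r in zip(policy_logps, ref_logps))


def dpo_loss(policy_chosen: Sequence[float], ref_chosen: Sequence[float],
             policy_rejected: Sequence[float], ref_rejected: Sequence[float],
             beta: float = 0.1) -> float:
    """Standard (risk-neutral) DPO loss for one pair: -log sigmoid(r_chosen - r_rejected)."""
    margin = (implicit_reward(policy_chosen, ref_chosen, beta)
              - implicit_reward(policy_rejected, ref_rejected, beta))
    return -log_sigmoid(margin)


# Toy per-token log-probabilities of each response under the policy and the frozen reference.
loss_same = dpo_loss([-1.0, -2.0], [-1.0, -2.0], [-1.5], [-1.5])
loss_better = dpo_loss([-1.0, -1.0], [-2.0, -2.0], [-2.0], [-2.0])
print(round(loss_same, 4))    # 0.6931
print(round(loss_better, 4))  # 0.5981
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the reference lowers the loss.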

Code & Models

Repositories

zlj123-max/ra-dpo (PyTorch, official)

Videos

Risk-aware Direct Preference Optimization under Nested Risk Measure (SlidesLive)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Multimodal Machine Learning Applications

Methods: ALIGN