PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment
Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran

TL;DR
PREFINE is a preference-based fine-tuning method that uses trajectory preferences to reduce constraint violations in reinforcement learning policies while maintaining high rewards.
Contribution
PREFINE adapts Direct Preference Optimization to continuous control, enabling safe policy fine-tuning with less data and computation.
Findings
Reduces constraint violations and failures by over 60%.
Maintains high reward performance after fine-tuning.
Improves data and computational efficiency over traditional methods.
Abstract
We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be encoded numerically, we consider the more general setting in which costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE (Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment), a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), which is now…
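To make the idea of trajectory-level preferences concrete, here is a minimal sketch of the standard DPO loss applied to whole trajectories, where a trajectory's log-probability is taken as the sum of its per-step action log-probabilities. This is an illustration under stated assumptions, not the exact PREFINE objective: the abstract is truncated and does not give the loss, and the function names, the `beta` value, and the choice of the reward-optimized policy as the frozen reference are all hypothetical.

```python
import math


def trajectory_logprob(step_logprobs):
    """Sum per-step action log-probabilities into one trajectory log-probability."""
    return sum(step_logprobs)


def dpo_trajectory_loss(policy_pref, policy_dispref, ref_pref, ref_dispref, beta=0.1):
    """Standard DPO loss on one (preferred, dispreferred) trajectory pair.

    Each argument is a list of per-step action log-probabilities:
    policy_* come from the policy being fine-tuned, ref_* from a frozen
    reference policy (assumed here to be the reward-optimized policy).
    beta is a hypothetical temperature, not a value from the paper.
    """
    # Implicit reward of a trajectory: beta * (log pi(tau) - log pi_ref(tau)).
    pref_gain = trajectory_logprob(policy_pref) - trajectory_logprob(ref_pref)
    dispref_gain = trajectory_logprob(policy_dispref) - trajectory_logprob(ref_dispref)
    margin = beta * (pref_gain - dispref_gain)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the fine-tuned policy is identical to the reference, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the preferred (low-cost) trajectory and away from the dispreferred (high-cost) one.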
