TL;DR
This paper introduces FRPO, a robust RLHF framework that makes large language models less prone to catastrophic forgetting during downstream fine-tuning by optimizing reward across a KL-bounded neighborhood of policies rather than at the current policy alone.
Contribution
FRPO is a method that aims to keep reward stable under policy shifts, reducing safety degradation and preserving task performance during downstream adaptation.
Findings
FRPO substantially reduces safety degradation across multiple models.
It preserves accuracy under subsequent fine-tuning.
No extra computation is required compared to existing methods.
Abstract
Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests that standard RLHF objectives do not guarantee robustness to future adaptation. To address this, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing forgetting requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key…
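To make "reward across a KL-bounded neighborhood" concrete, here is a minimal sketch of a worst-case expected reward over a KL ball, using the standard distributionally robust optimization dual. It is an illustration under assumptions, not the FRPO objective from the paper: the toy discrete response distribution, the KL direction KL(Q || P), the radius `rho`, and the function names `dual_value` and `worst_case_reward` are all hypothetical choices made for this example.

```python
import math

def dual_value(rewards, probs, rho, lam):
    """Dual objective -lam * log E_P[exp(-r / lam)] - lam * rho (assumed form).

    Evaluated with a log-sum-exp shift for numerical stability.
    """
    shifted = [-r / lam for r in rewards]
    m = max(shifted)
    lse = m + math.log(sum(p * math.exp(s - m) for p, s in zip(probs, shifted)))
    return -lam * lse - lam * rho

def worst_case_reward(rewards, probs, rho, lo=1e-4, hi=1e4, iters=200):
    """Approximate inf over Q with KL(Q || P) <= rho of E_Q[r].

    Maximizes the dual over lam > 0 with golden-section search on log(lam).
    The dual is concave in lam, hence unimodal in log(lam) as well.
    """
    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if dual_value(rewards, probs, rho, math.exp(c)) < dual_value(rewards, probs, rho, math.exp(d)):
            a = c  # maximum lies to the right of c
        else:
            b = d  # maximum lies to the left of d
    return dual_value(rewards, probs, rho, math.exp((a + b) / 2))

# Toy policy: a "safe" response (reward 1) with probability 0.9,
# an "unsafe" response (reward 0) with probability 0.1.
rewards = [1.0, 0.0]
probs = [0.9, 0.1]
nominal = sum(p * r for p, r in zip(probs, rewards))
robust_small = worst_case_reward(rewards, probs, rho=0.1)
robust_large = worst_case_reward(rewards, probs, rho=0.5)
print(nominal, robust_small, robust_large)
```

In this toy setting the robust value falls below the nominal expected reward and shrinks as the KL radius grows, which is the kind of brittleness a neighborhood-based objective is meant to penalize.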
