Robust Policy Optimization to Prevent Catastrophic Forgetting

Mahdi Sabbaghi; George Pappas; Adel Javanmard; Hamed Hassani

arXiv:2602.08813 · cs.LG · May 13, 2026

TL;DR

This paper introduces FRPO, a robust RLHF framework that makes large language models less prone to catastrophic forgetting during downstream fine-tuning by optimizing reward across a KL-bounded neighborhood of policies rather than at the current policy alone.

Contribution

FRPO trains the base policy so that its reward stays stable under the policy shifts caused by downstream adaptation, which reduces safety degradation and preserves task performance.

Findings

01  FRPO substantially reduces safety degradation across multiple models.

02  FRPO preserves task accuracy under subsequent fine-tuning.

03  FRPO requires no extra computation compared to existing methods.

Abstract

Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests that standard RLHF objectives do not guarantee robustness to future adaptation. To address this, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing forgetting requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key…
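The abstract is cut off before it states the method's key idea, so the sketch below is not the paper's algorithm. It only illustrates what "reward across a KL-bounded neighborhood" can mean in the simplest setting: the worst-case expected reward over all distributions q with KL(q || p) <= rho around a reference distribution p, computed with the standard dual from distributionally robust optimization. The function name `worst_case_reward`, the direction of the KL divergence, the radius `rho`, and the toy probabilities and rewards are all assumptions made for illustration.

```python
import math

def worst_case_reward(probs, rewards, rho, iters=200):
    """Hypothetical helper: min over q with KL(q || p) <= rho of E_q[r].

    Uses the standard dual form
        sup_{lam > 0}  -lam * log E_p[exp(-r / lam)] - lam * rho,
    which is concave in lam, so a ternary search over log(lam) suffices.
    """
    support = [(p, r) for p, r in zip(probs, rewards) if p > 0]

    def dual(lam):
        # Log-sum-exp with a shift so that exp() never overflows.
        shift = max(-r / lam for _, r in support)
        total = sum(p * math.exp(-r / lam - shift) for p, r in support)
        return -lam * (shift + math.log(total)) - lam * rho

    lo, hi = math.log(1e-6), math.log(1e6)  # search range for log(lam)
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if dual(math.exp(a)) < dual(math.exp(b)):
            lo = a
        else:
            hi = b
    return dual(math.exp((lo + hi) / 2))

# Toy example (made-up numbers): two responses, one rewarded and one not.
p = [0.5, 0.5]
r = [1.0, 0.0]
for rho in (0.0, 0.1, 10.0):
    print(rho, worst_case_reward(p, r, rho))
```

With rho = 0 the neighborhood contains only p, so the value is the ordinary expected reward; as rho grows, the value falls toward the lowest reward in the support of p. A policy whose reward is concentrated on a few brittle responses loses more under this worst case than one whose reward is spread out, which is the intuition behind optimizing over a neighborhood instead of a single point.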

Code & Models

Repositories

helloworld10011/FRPO (GitHub)