SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for Reinforcement Learning from Human Feedback (RLHF)

Dipan Maity

arXiv:2602.04651 · cs.LG · February 10, 2026

TL;DR

SAFE introduces a novel entropy-aware RLHF algorithm that improves training stability and reward performance over PPO by dynamically regulating KL divergence and employing a double critic for pessimistic value estimation.

Contribution

The paper presents SAFE, a new on-policy RLHF method with entropy-aware control, combining a double soft-min critic and adaptive KL regulation for stable, efficient alignment finetuning.
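To make "pessimistic value estimation" with a double critic concrete, here is a minimal sketch of one common soft-min form, the negative log-mean-exp of two value estimates. The summary does not give the paper's formula, so the function name `soft_min`, the temperature `tau`, and the log-mean-exp definition are all assumptions for illustration rather than the author's method.

```python
import math


def soft_min(q1: float, q2: float, tau: float = 1.0) -> float:
    """Smooth, pessimistic combination of two critic estimates.

    Assumed definition (not taken from the paper):
        -tau * log((exp(-q1 / tau) + exp(-q2 / tau)) / 2)
    The result lies between min(q1, q2) and the mean of q1 and q2.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    m = min(q1, q2)
    # Shift by the minimum so both exponents are <= 0 and cannot overflow.
    mean_exp = (math.exp(-(q1 - m) / tau) + math.exp(-(q2 - m) / tau)) / 2.0
    return m - tau * math.log(mean_exp)


# Hypothetical critic outputs for the same state.
value = soft_min(1.0, 3.0, tau=0.5)
```

Under this definition, a small `tau` pushes the estimate toward the hard minimum of the two critics, while a large `tau` moves it toward their mean, so `tau` sets how pessimistic the combined value is.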

Findings

01

SAFE outperforms PPO with +5.15% reward improvement

02

SAFE exhibits negligible reward crashes during training

03

SAFE achieves superior KL control and stability

Abstract

Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of Reinforcement Learning from Human Feedback (RLHF). PPO performs well empirically, but it has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad hoc manner. It also suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence, which require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework built on entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's…
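As a rough illustration of PID-style KL regulation, the sketch below adjusts a KL penalty coefficient from the gap between the observed and target KL divergence. The abstract is truncated and does not specify the control law, so everything here is an assumption: the class name `PIDKLController`, the gains `kp`, `ki`, `kd`, the choice to control a penalty coefficient `beta` rather than a threshold, and the clamping bounds are hypothetical. The entropy gate mentioned in the abstract is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class PIDKLController:
    """Hypothetical PID controller for a KL penalty coefficient."""

    target_kl: float
    base_beta: float = 0.1   # coefficient used when the error is zero
    kp: float = 1.0          # proportional gain (assumed value)
    ki: float = 0.0          # integral gain (assumed value)
    kd: float = 0.0          # derivative gain (assumed value)
    beta_min: float = 0.0
    beta_max: float = 10.0
    _integral: float = field(default=0.0, init=False)
    _prev_error: float = field(default=0.0, init=False)

    def update(self, observed_kl: float) -> float:
        """Return the clamped coefficient for the next training step."""
        error = observed_kl - self.target_kl
        self._integral += error
        derivative = error - self._prev_error
        self._prev_error = error
        beta = (
            self.base_beta
            + self.kp * error
            + self.ki * self._integral
            + self.kd * derivative
        )
        # Clamp so the penalty stays non-negative and bounded.
        return min(self.beta_max, max(self.beta_min, beta))


controller = PIDKLController(target_kl=0.05)
beta_high = controller.update(observed_kl=0.15)  # KL above target
beta_low = controller.update(observed_kl=0.01)   # KL below target
```

With these assumed settings, the coefficient rises above `base_beta` when the policy drifts past the target KL and falls below it when the policy stays too close to the reference, which is the qualitative behavior an adaptive KL scheme aims for.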

Taxonomy

Topics: Reinforcement Learning in Robotics · Real-Time Systems Scheduling · Model Reduction and Neural Networks