f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song

TL;DR
This paper introduces divergence-based reinforcement learning algorithms, $f$-GRPO and $f$-HAL, for general language model alignment, effectively combining preference supervision and scalar reward feedback.
Contribution
It extends divergence-based alignment methods to reinforcement learning with scalar rewards, proposing new algorithms that improve reward optimization and safety in language models.
Findings
$f$-GRPO outperforms GRPO on math-reasoning RLVR tasks.
$f$-HAL reduces reward hacking in safety alignment scenarios.
The proposed objectives estimate $f$-divergences between aligned and unaligned distributions.
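For reference, the last finding relies on the standard notion of an $f$-divergence. The definition and variational form below are textbook facts; the symbols $P$, $Q$, and $T$ are generic notation chosen here and may not match the paper's. For a convex $f$ with $f(1) = 0$ and $P$ absolutely continuous with respect to $Q$:

$$D_f(P \,\|\, Q) = \mathbb{E}_{y \sim Q}\left[ f\!\left( \frac{dP}{dQ}(y) \right) \right]$$

$$D_f(P \,\|\, Q) = \sup_{T} \; \mathbb{E}_{y \sim P}\left[ T(y) \right] - \mathbb{E}_{y \sim Q}\left[ f^{*}(T(y)) \right]$$

Here $f^{*}$ is the convex conjugate of $f$. Divergence estimators typically optimize a bound of the second kind; how the proposed objectives instantiate it is not stated in this summary.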
Abstract
Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) and unaligned (less-preferred) distributions, yielding a principled recipe for designing alignment losses. However, this view has so far been limited to preference-based supervision. We extend it to general LLM alignment, including reinforcement learning with verifiable rewards (RLVR), where alignment feedback is given only as scalar rewards. We introduce $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy RL objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), which combines on-policy reward optimization with off-policy preference supervision. We show that these objectives estimate $f$-divergences between reward-aligned and reward-unaligned distributions induced by above- and below-average reward responses, and prove expected reward improvement…
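The abstract refers to reward-aligned and reward-unaligned distributions induced by above- and below-average reward responses. A minimal Python sketch of that partition for one prompt's group of sampled responses follows. The function name `group_relative_split`, the `eps` constant, and the use of the population standard deviation are illustrative assumptions. This is not the paper's objective, only the group-mean split together with the standardized advantage commonly used in GRPO.

```python
from statistics import mean, pstdev


def group_relative_split(rewards, eps=1e-8):
    """Partition one prompt's sampled responses by the group-mean reward.

    Hypothetical helper for illustration; returns
    (advantages, aligned_idx, unaligned_idx).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # Standardized group-relative advantage, as commonly used in GRPO.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Above-average responses act as samples from the reward-aligned side,
    # below-average responses from the reward-unaligned side.
    aligned_idx = [i for i, r in enumerate(rewards) if r > mu]
    unaligned_idx = [i for i, r in enumerate(rewards) if r < mu]
    return advantages, aligned_idx, unaligned_idx


# Binary verifiable rewards for five sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
adv, aligned, unaligned = group_relative_split(rewards)
```

Responses whose reward equals the group mean fall into neither set in this sketch; in particular, a group with identical rewards yields two empty sets and all-zero advantages.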
