f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Rajdeep Haldar; Lantao Mei; Guang Lin; Yue Xing; Qifan Song

arXiv:2602.05946·cs.LG·May 12, 2026

TL;DR

This paper introduces two divergence-based reinforcement learning algorithms for general language model alignment: $f$-GRPO, an on-policy objective driven by scalar reward feedback, and $f$-HAL, which combines on-policy reward optimization with off-policy preference supervision.

Contribution

The paper extends the divergence-estimation view of alignment, previously limited to preference-based supervision, to reinforcement learning with scalar rewards, and proposes new algorithms that improve reward optimization and safety in language models.

Findings

01

$f$-GRPO outperforms GRPO on math-reasoning RLVR tasks.

02

$f$-HAL reduces reward hacking in safety alignment scenarios.

03

The proposed objectives estimate $f$-divergences between aligned and unaligned distributions.

Abstract

Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) and unaligned (less-preferred) distributions, yielding a principled recipe for designing alignment losses. However, this view has so far been limited to preference-based supervision. We extend it to general LLM alignment, including reinforcement learning with verifiable rewards (RLVR), where alignment feedback is given only as scalar rewards. We introduce $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy RL objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), which combines on-policy reward optimization with off-policy preference supervision. We show that these objectives estimate $f$-divergences between reward-aligned and reward-unaligned distributions induced by above- and below-average reward responses, and prove expected reward improvement…
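
To make the abstract's "above- and below-average reward" split concrete, here is a minimal Python sketch. It is not the paper's objective, which this page does not reproduce. It only illustrates two ingredients under stated assumptions: partitioning a group of sampled responses around the group-mean reward, and the standard variational lower bound $D_f(P \| Q) \ge \mathbb{E}_P[T] - \mathbb{E}_Q[f^*(T)]$, instantiated for KL with $f(u) = u \log u$ and $f^*(t) = e^{t-1}$. The function names and the idea of passing precomputed critic values are hypothetical choices made for illustration.

```python
import math
from statistics import mean


def split_by_group_mean(rewards):
    """Partition response indices around the group-mean reward.

    Assumption: responses whose reward equals the mean are left out
    of both sets. How the paper treats such ties is not stated here.
    """
    baseline = mean(rewards)
    above = [i for i, r in enumerate(rewards) if r > baseline]
    below = [i for i, r in enumerate(rewards) if r < baseline]
    return baseline, above, below


def kl_conjugate(t):
    """Convex conjugate of f(u) = u * log(u), i.e. f*(t) = exp(t - 1)."""
    return math.exp(t - 1.0)


def variational_f_bound(critic_on_aligned, critic_on_unaligned,
                        conjugate=kl_conjugate):
    """Monte Carlo estimate of E_P[T] - E_Q[f*(T)].

    critic_on_aligned:   values T(y) for samples from the aligned set (P)
    critic_on_unaligned: values T(y) for samples from the unaligned set (Q)
    """
    aligned_term = mean(critic_on_aligned)
    unaligned_term = mean([conjugate(t) for t in critic_on_unaligned])
    return aligned_term - unaligned_term


# Hypothetical group of four responses with binary verifiable rewards.
baseline, above, below = split_by_group_mean([1.0, 0.0, 1.0, 0.0])
print(baseline, above, below)
```

In this sketch the critic values are supplied by the caller. In an alignment objective they would typically be some function of the policy's log-probabilities, but that mapping is specific to the paper and is not guessed at here.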
