GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang

TL;DR
GIFT is a reward-matching approach to on-policy RL for LLMs. It replaces the single scalar KL coefficient with a prompt-dependent one, which the paper reports yields faster convergence and less overfitting.
Contribution
The paper proposes GIFT, a method that combines group sampling, implicit rewards, and advantage matching with an endogenous, prompt-adaptive KL coefficient, positioned as an improvement over prior reward-maximization techniques.
Findings
GIFT converges faster than GRPO, DAPO, and GSPO on various benchmarks.
GIFT overfits less on RLVR datasets such as GSM8K, MATH, and AIME.
GIFT achieves higher length-controlled win rates on RLHF evaluations.
Abstract
This paper investigates whether reward matching is a viable alternative to reward maximization for on-policy RL of LLMs. Group-Relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, a DPO-style implicit reward, and a UNA-style MSE between implicit and explicit advantages. Applying z-score standardization cancels the intractable partition function in the DPO implicit reward and eliminates the KL coefficient from the RLHF and RLVR objective. The population minimizers of the GIFT objective are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family, with a prompt-dependent, variance-determined KL coefficient. GIFT therefore solves the same parametric…
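The three ingredients named in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names `zscore` and `gift_style_loss`, the `eps` stabilizer, and the toy numbers are all assumptions made for the example. It assumes the usual DPO form of the implicit reward, beta * log(pi_theta / pi_ref) + beta * log Z(x), and shows why standardizing within a group removes both the shift beta * log Z(x) and the scale beta.

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Standardize along the last axis (one group of responses per prompt)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)  # eps is an assumed numerical stabilizer

def gift_style_loss(logp_policy, logp_ref, rewards):
    """Hypothetical helper: MSE between standardized implicit and explicit advantages.

    Inputs have shape (num_prompts, group_size): sequence log-probabilities
    under the policy and the reference model, and explicit rewards.
    """
    log_ratio = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    implicit_adv = zscore(log_ratio)  # per-prompt shift and scale cancel here
    explicit_adv = zscore(rewards)    # group-relative advantage from explicit rewards
    return float(np.mean((implicit_adv - explicit_adv) ** 2))

# Toy data (made up): one prompt with a group of four sampled responses.
logp_policy = np.array([[-12.0, -15.5, -9.8, -20.1]])
logp_ref = np.array([[-12.5, -14.0, -11.0, -19.0]])
rewards = np.array([[1.0, 0.0, 1.0, 0.0]])

loss = gift_style_loss(logp_policy, logp_ref, rewards)

# Invariance check: scaling by beta and adding beta * log Z(x) does not
# change the standardized implicit advantage (up to the eps term).
beta, log_z = 0.1, 3.7
log_ratio = logp_policy - logp_ref
invariant = np.allclose(zscore(beta * log_ratio + beta * log_z), zscore(log_ratio))
print(loss, invariant)
```

In a training loop the same computation would be written in an autodiff framework so that gradients flow through `logp_policy`; NumPy is used here only to keep the arithmetic visible.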
