GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

arXiv:2510.23868·cs.LG·May 15, 2026

TL;DR

GIFT introduces a reward-matching approach to on-policy RL for LLMs that replaces the scalar KL coefficient with a prompt-dependent one, leading to faster convergence and less overfitting.

Contribution

The paper proposes GIFT, a method that combines GRPO-style group sampling, a DPO-style implicit reward, and UNA-style advantage matching with an endogenous, prompt-adaptive KL coefficient, and that improves on prior reward-maximization techniques.

Findings

01

GIFT converges faster than GRPO, DAPO, and GSPO on various benchmarks.

02

GIFT overfits less on RLVR datasets such as GSM8K, MATH, and AIME.

03

GIFT achieves higher length-controlled win rates on RLHF evaluations.

Abstract

This paper investigates whether reward matching is a viable alternative to reward maximization for on-policy RL of LLMs. It proposes Group-Relative Implicit Fine-Tuning (GIFT), which combines GRPO-style group sampling, a DPO-style implicit reward, and a UNA-style MSE between implicit and explicit advantages. Applying z-score standardization cancels the intractable partition function $Z(x)$ in the DPO implicit reward and eliminates the KL coefficient $\beta$ from the RLHF and RLVR objective. The population minimizers of $L_{\mathrm{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $\pi_{\beta}^{*}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, e^{\frac{1}{\beta} r_{\phi}(x, y)}$, with a prompt-dependent, variance-determined KL coefficient $\beta(x) = \frac{\sigma_{\phi}(x)}{\hat{\sigma}_{\theta}(x)}$. GIFT therefore solves the same parametric…
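
The cancellation described in the abstract can be illustrated with a minimal sketch for a single prompt. It is not the paper's implementation. It assumes that the implicit reward is the log-ratio $\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)$, that $\hat{\sigma}_{\theta}(x)$ is the within-group standard deviation of that log-ratio, and that population (not sample) statistics are used. The names `group_std`, `zscore`, `gift_loss`, and `implied_beta` and the toy numbers are hypothetical.

```python
import math


def group_std(values):
    """Population standard deviation of one prompt's group of samples."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def zscore(values, eps=1e-8):
    """Standardize a group to zero mean and (approximately) unit std."""
    mean = sum(values) / len(values)
    std = group_std(values)
    return [(v - mean) / (std + eps) for v in values]


def gift_loss(logp_policy, logp_ref, rewards, eps=1e-8):
    """MSE between standardized implicit and explicit rewards (assumed form)."""
    # Implicit reward up to a positive scale and a per-prompt shift.
    # The shift (the log Z(x) term) and the scale (beta) both vanish
    # under z-scoring, so neither has to be known.
    log_ratio = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    implicit_adv = zscore(log_ratio, eps)
    explicit_adv = zscore(rewards, eps)
    squared = [(a - b) ** 2 for a, b in zip(implicit_adv, explicit_adv)]
    return sum(squared) / len(squared)


def implied_beta(logp_policy, logp_ref, rewards):
    """Ratio of reward std to log-ratio std, read here as beta(x)."""
    log_ratio = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return group_std(rewards) / group_std(log_ratio)


# Toy group of four responses to one prompt (illustrative numbers only).
rewards = [1.0, 0.0, 0.0, 1.0]
logp_ref = [-5.0, -6.0, -7.0, -8.0]
# Policy whose log-ratio is reward / beta + constant, with beta = 2.
logp_policy = [lr + 0.5 * r + 3.0 for lr, r in zip(logp_ref, rewards)]

loss = gift_loss(logp_policy, logp_ref, rewards)
beta = implied_beta(logp_policy, logp_ref, rewards)
print(f"loss={loss:.3e}, implied beta={beta}")
```

In this toy case the log-ratio is an affine function of the reward with slope 0.5, so the standardized quantities agree, the loss is zero up to the `eps` term, and the implied coefficient is 2. The sketch only evaluates the loss value. How gradients flow through the standardization is left open here.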
