Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs
Xingwei Gan, Ying Zhu

TL;DR
This paper combines a frozen reference policy (e.g., SFT) with a trainable policy via logit averaging inside GRPO, matching or exceeding the accuracy of KL-regularized GRPO without using KL regularization or a critic.
Contribution
The authors integrate a logit averaging technique into GRPO that couples a frozen reference policy with a trainable policy through the averaging structure itself rather than through KL regularization.
Findings
Achieves higher or at least comparable accuracy on MATH, cn-k12, and MMLU relative to canonical KL-regularized GRPO.
Eliminates the need for KL regularization or a critic in policy optimization.
Maintains the formatting advantage of supervised fine-tuning (SFT) while leveraging the reasoning expertise of the trainable policy.
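The "no critic" finding follows from how GRPO scores samples: each completion is compared against the other completions drawn for the same prompt, so no learned value function is needed. The sketch below shows that standard group-relative normalization only. It is not taken from the paper; the function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper: standard GRPO-style normalization, not the
    authors' code. The group mean plays the role a critic's baseline
    would otherwise play.
    """
    mean = statistics.fmean(rewards)
    # Population standard deviation is an assumption; some
    # implementations use the sample standard deviation instead.
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions with verifiable 0/1 rewards (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct completions receive positive advantages and incorrect ones negative advantages, and the advantages within a group sum to approximately zero.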
Abstract
We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve Kullback-Leibler (KL) regularization or a critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show accuracy higher than, or at least comparable to, that of the canonical KL-regularized GRPO.
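The coupling described in the abstract can be illustrated with a minimal sketch of logit averaging over a toy three-token vocabulary. The abstract does not give the exact formula, so the equal weighting (`weight=0.5`), the function names, and the example logits are all assumptions rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_policy(ref_logits, train_logits, weight=0.5):
    """Token distribution obtained by averaging two sets of logits.

    Hypothetical helper. `weight` is the share given to the frozen
    reference; 0.5 (a plain average) is an assumption. In training,
    only `train_logits` would receive gradients, since the reference
    policy is frozen.
    """
    mixed = [weight * r + (1.0 - weight) * t
             for r, t in zip(ref_logits, train_logits)]
    return softmax(mixed)


# Made-up logits: the reference prefers token 0, the trainable policy token 1.
ref_logits = [2.0, 0.0, -1.0]
train_logits = [0.0, 2.0, -1.0]
probs = averaged_policy(ref_logits, train_logits)
```

With equal weights, the averaged distribution is the normalized geometric mean of the two individual distributions, so a token needs support from both policies to receive high probability. This is one way to read how the reference can act as an anchor without an explicit KL penalty.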
