Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

Xingwei Gan; Ying Zhu

arXiv:2605.20555·cs.LG·May 21, 2026

TL;DR

This paper averages the logits of a frozen reference policy (e.g., an SFT model) and a trainable policy inside GRPO, and reports accuracy higher than or comparable to KL-regularized GRPO on MATH, cn-k12, and MMLU without using KL regularization or a critic.

Contribution

The authors integrate logit averaging into Group Relative Policy Optimization (GRPO): the trainable policy is coupled to a frozen reference policy (e.g., SFT) through the averaged logits rather than through a KL regularization term.
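To make the coupling concrete, here is a minimal sketch of logit averaging for a single next-token step. It assumes an equal-weight mean of the two logit vectors followed by a softmax; the function names and the toy logit values are hypothetical, and the paper's exact weighting and implementation may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_policy(ref_logits, train_logits):
    """Next-token distribution from the mean of the two logit vectors.

    Assumption: equal weights (0.5 each); this is an illustration,
    not the authors' code.
    """
    if len(ref_logits) != len(train_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    mixed = [0.5 * (r + t) for r, t in zip(ref_logits, train_logits)]
    return softmax(mixed)


# Toy 3-token vocabulary (made-up values).
ref_logits = [2.0, 0.0, -1.0]    # frozen reference, e.g. SFT
train_logits = [0.0, 2.0, -1.0]  # trainable policy
probs = averaged_policy(ref_logits, train_logits)
```

Under this equal-weight assumption, the resulting distribution is the normalized geometric mean of the two policies' distributions, so a token needs support from both the reference and the trainable policy to receive high probability.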

Findings

01

Achieves higher or comparable accuracy than canonical KL-regularized GRPO on the MATH, cn-k12, and MMLU datasets.

02

Eliminates the need for KL regularization or a critic in policy optimization.

03

Maintains the formatting advantage of supervised fine-tuning while leveraging the reasoning capabilities of the trainable policy.

Abstract

We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve Kullback-Leibler (KL) regularization or a critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show higher or at least comparable accuracy relative to the canonical KL-regularized GRPO.
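For readers unfamiliar with why GRPO needs no critic, the sketch below shows the standard group-relative advantage: each sampled response's reward is normalized against the other responses to the same prompt. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions for illustration; the abstract does not specify these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    The group mean plays the role of the baseline that a learned critic
    would otherwise provide. `eps` (an assumed stabilizer) avoids division
    by zero when every response receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy verifiable rewards for four sampled answers (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and the advantages sum to approximately zero within the group.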
