Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

Miguel Moura Ramos; Duarte M. Alves; André F. T. Martins

arXiv:2605.12227·cs.CL·May 13, 2026

TL;DR

This paper introduces a new training method combining on-policy reinforcement learning and distillation to improve long-context reasoning in large language models, supported by a synthetic dataset.

Contribution

The paper proposes Distilled Group Relative Policy Optimization (dGRPO), which adds dense guidance from a teacher to on-policy optimization to improve long-context performance.

Findings

1. dGRPO outperforms off-policy methods in long-context tasks.
2. The LongBlocks dataset enables effective evaluation of long-context reasoning.
3. Combining policy optimization with distillation improves stability and effectiveness.

Abstract

Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher…
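To make the three ingredients named in the abstract concrete, here is a minimal sketch of how a sparse sequence-level reward and dense token-level teacher guidance could be combined in one loss. It is an illustration under stated assumptions, not the paper's objective: the excerpt above does not give the exact dGRPO formulation, so the reverse-KL distillation term, the weight `beta`, and all function names are hypothetical. GRPO's clipped importance ratio is also left out for brevity.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def reverse_kl(student_logprobs, teacher_logprobs):
    """KL(student || teacher) at one token position, from log-probs over the vocabulary."""
    return sum(
        math.exp(s) * (s - t)
        for s, t in zip(student_logprobs, teacher_logprobs)
    )


def combined_loss(rewards, token_logprobs, student_dists, teacher_dists, beta=0.1):
    """Policy-gradient term plus a teacher-guidance term (assumed form, hypothetical beta).

    rewards[i]           scalar reward of completion i
    token_logprobs[i]    student log-probs of the tokens it sampled in completion i
    student_dists[i][t]  student log-probs over the vocabulary at position t
    teacher_dists[i][t]  teacher log-probs over the vocabulary at position t
    """
    advantages = group_relative_advantages(rewards)
    per_completion = []
    for adv, logps, s_dists, t_dists in zip(
        advantages, token_logprobs, student_dists, teacher_dists
    ):
        # Sparse signal: one advantage shared by every token of the completion.
        policy_term = -adv * mean(logps)
        # Dense signal: a separate penalty at each token position.
        distill_term = mean([reverse_kl(s, t) for s, t in zip(s_dists, t_dists)])
        per_completion.append(policy_term + beta * distill_term)
    return mean(per_completion)
```

Because the teacher is evaluated on completions the student itself sampled, the guidance stays on-policy. When every completion in a group receives the same reward, the advantages are all zero and only the distillation term contributes, which is one way dense guidance could help when rewards are sparse.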

Code & Models

Datasets

utter-project/LongBlocks
dataset · 269 downloads
