Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts

Maxime Heuillet; Yufei Cui; Boxing Chen; Audrey Durand; Prasanna Parthasarathi

arXiv:2508.10123·cs.LG·November 25, 2025

TL;DR

Nested-ReFT introduces an off-policy reinforcement learning framework for large language model fine-tuning that reduces computational costs while maintaining high performance on reasoning tasks.

Contribution

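The paper proposes Nested-ReFT, an off-policy reinforcement learning method for LLM fine-tuning in which a subset of the target model's layers, selected by dynamic layer skipping, acts as the behavior model that generates training completions, lowering the cost of rollouts.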

Findings

01 Significantly lower inference cost for generating completions during training.

02 Performance remains comparable to standard ReFT.

03 Higher throughput (tokens/sec) across multiple benchmarks.

Abstract

Advanced reasoning in LLMs in challenging domains such as mathematical reasoning can be tackled with reinforced fine-tuning (ReFT) based on verifiable rewards. In standard ReFT frameworks, a behavior model generates multiple completions per problem, each containing an answer, and a reward function then scores each answer. While such RL post-training methods deliver significant performance improvements across challenging reasoning domains, generating completions during training requires multiple inference steps, which makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, in which a subset of the layers of the target model acts as the behavior model and generates off-policy completions during training. The behavior model configured with dynamic layer skipping per batch…
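
The idea of a nested behavior model can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the layers are scalar functions standing in for transformer blocks, and the names `make_toy_layers`, `sample_kept_layers`, `forward`, and `verifiable_reward` are hypothetical. The abstract does not say how skipped layers are chosen, so uniform random sampling and always keeping the first and last layer are assumptions made here for illustration. The reward is assumed to be an exact-match check against a reference answer.

```python
import random
from typing import Callable, List, Optional, Sequence

Layer = Callable[[float], float]


def make_toy_layers(n: int) -> List[Layer]:
    """Hypothetical stand-in for transformer blocks: layer k adds 0.1 * (k + 1)."""
    return [lambda x, k=k: x + 0.1 * (k + 1) for k in range(n)]


def sample_kept_layers(n_layers: int, skip_ratio: float, rng: random.Random) -> List[int]:
    """Draw, once per batch, the layer indices the behavior model keeps.

    Assumption (not from the source): skipped layers are sampled uniformly,
    and the first and last layers are never skipped.
    """
    n_skip = int(skip_ratio * n_layers)
    candidates = list(range(1, n_layers - 1))
    skipped = set(rng.sample(candidates, min(n_skip, len(candidates))))
    return [i for i in range(n_layers) if i not in skipped]


def forward(layers: Sequence[Layer], x: float, kept: Optional[Sequence[int]] = None) -> float:
    """Run the full target model (kept=None) or the nested behavior model."""
    indices = range(len(layers)) if kept is None else kept
    for i in indices:
        x = layers[i](x)
    return x


def verifiable_reward(predicted: str, reference: str) -> float:
    """Assumed exact-match reward: 1.0 if the answer matches the reference."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    layers = make_toy_layers(8)
    for batch in range(3):
        kept = sample_kept_layers(len(layers), skip_ratio=0.25, rng=rng)
        print(batch, kept, forward(layers, 0.0, kept), forward(layers, 0.0))
```

Because the behavior model reuses a subset of the target model's own layers, no separate draft model has to be stored, and each rollout runs fewer layers than a full forward pass. The completions it produces are off-policy with respect to the full target model, which is the setting the abstract describes.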
