ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

Sining Zhoubian; Dan Zhang; Jie Tang

arXiv:2508.19576·cs.AI·September 9, 2025

TL;DR

ReST-RL is a reinforcement learning framework that combines optimized self-training with value-model-assisted decoding to improve the code reasoning accuracy of large language models. The authors report that it outperforms existing methods on multiple coding benchmarks.

Contribution

The paper presents ReST-RL, a unified RL paradigm that combines an improved GRPO algorithm (ReST-GRPO) with a value-model-assisted test-time decoding method (VM-MCTS) to improve the code reasoning ability of LLMs.

Findings

01. ReST-RL outperforms baseline methods on coding benchmarks.

02. Optimized data filtering improves training efficiency.

03. Decoding with VM-MCTS enhances reasoning accuracy.

Abstract

When it comes to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO often fails because of insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves LLMs' code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling and thus improving the effectiveness and efficiency of training. After the basic reasoning ability of LLM policy has been…
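
The reward-variance problem mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly described GRPO-style advantage, where each sampled reward is standardized within its group, so a group with identical rewards yields all-zero advantages and no learning signal. The function names, the use of the population standard deviation, and the `min_std` threshold are illustrative assumptions, not the paper's actual ReST-GRPO filtering rule.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style sketch).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_informative_groups(groups, min_std=0.1):
    """Hypothetical filter: keep prompts whose sampled rewards vary enough.

    This is a stand-in for the idea of raising reward variance through data
    selection; it is not the filtering procedure used in the paper.
    """
    return {prompt: rs for prompt, rs in groups.items() if pstdev(rs) >= min_std}


# Toy reward groups: four sampled solutions per prompt (made-up values).
groups = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # every sample scores the same
    "prompt_b": [0.0, 1.0, 0.0, 1.0],  # mixed outcomes
}

flat = group_relative_advantages(groups["prompt_a"])
mixed = group_relative_advantages(groups["prompt_b"])
kept = keep_informative_groups(groups)

print(flat)        # all zeros: no gradient signal from this group
print(mixed)       # roughly -1 and +1: a usable signal
print(list(kept))  # only the prompt with varied rewards remains
```

In this toy setting, the uniform group contributes nothing to the policy update, which is why selecting training data with more varied rewards could make sampling more effective.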

Taxonomy

Topics: Natural Language Processing Techniques