ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
Sining Zhoubian, Dan Zhang, Jie Tang

TL;DR
ReST-RL is a reinforcement learning framework that pairs optimized self-training (ReST-GRPO) with value-model-assisted decoding to improve the code reasoning accuracy of large language models. The authors report that it outperforms existing methods on multiple coding benchmarks.
Contribution
The paper presents ReST-RL, a unified RL paradigm combining an improved GRPO algorithm with a value model-assisted decoding method, advancing LLM code reasoning capabilities.
Findings
ReST-RL outperforms baseline methods on coding benchmarks.
Optimized data filtering improves training efficiency.
Decoding with VM-MCTS enhances reasoning accuracy.
Abstract
With respect to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO often fails because of insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves the code reasoning ability of LLMs by combining an improved GRPO algorithm with a meticulously designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling and thus improving the effectiveness and efficiency of training. After the basic reasoning ability of LLM policy has been…
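The reward-variance issue mentioned in the abstract can be illustrated with a small sketch. GRPO normalizes each sampled reward against the mean and standard deviation of its group, so a group with identical rewards produces zero advantages and no learning signal. The code below is not the paper's ReST-GRPO procedure. The function names, the `min_std` threshold, and the toy rewards are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: (r - mean) / (std + eps).

    Assumption: population standard deviation; eps only avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_by_reward_std(groups, min_std=0.1):
    """Keep prompts whose sampled rewards vary by at least min_std.

    Hypothetical filter for illustration only; the threshold is not from the paper.
    """
    return {prompt: rs for prompt, rs in groups.items() if pstdev(rs) >= min_std}


# Toy rewards for three prompts, four sampled solutions each (made-up values).
sampled = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "mixed": [1.0, 0.0, 0.5, 0.0],
}

kept = filter_by_reward_std(sampled)

for prompt, rewards in sampled.items():
    status = "kept" if prompt in kept else "dropped"
    print(prompt, status, [round(a, 3) for a in group_advantages(rewards)])
```

In this toy setup only the prompt with mixed outcomes survives the filter, which is the intuition behind selecting training data that raises reward variance during GRPO sampling.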
Taxonomy
Topics: Natural Language Processing Techniques
