Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards
Zetian Sun, Dongfang Li, Zhuoen Chen, Yuhuai Qin, Baotian Hu

TL;DR
This paper introduces Gated Reward Accumulation (G-RA), a reward shaping method, together with a unified SWE-oriented RL framework. Both target long-horizon, multi-turn reinforcement learning on software engineering tasks, where rewards are sparse and immediate rewards can be misaligned with long-term objectives.
Contribution
The paper proposes G-RA, a novel reward accumulation technique that enhances stability and performance in long-horizon RL, along with a comprehensive SWE-oriented RL framework supporting multi-turn interactions.
Findings
G-RA significantly increases task completion rates.
G-RA improves modification rates without policy degradation.
The framework supports multi-turn interaction, Docker-based execution, and customizable reward functions for SWE tasks.
Abstract
Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level…
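As a rough illustration of the gating idea, the sketch below adds stepwise rewards to a trajectory's return only when a gate on the high-level reward is open. The abstract excerpt above is cut off before it states the gating condition, so the condition used here (the high-level outcome reward must reach a threshold) is an assumption, not the paper's definition. The function name, the parameter names, and the default threshold are likewise hypothetical.

```python
from typing import Sequence


def gated_reward_accumulation(
    immediate_rewards: Sequence[float],
    outcome_reward: float,
    gate_threshold: float = 0.0,
) -> float:
    """Return a trajectory-level reward under an ASSUMED gating rule.

    Assumption (not taken from the paper): the gate opens when the
    high-level outcome reward is at least `gate_threshold`. When the gate
    is closed, stepwise rewards are dropped, so a policy cannot collect
    them while failing the long-term objective.
    """
    if outcome_reward >= gate_threshold:
        # Gate open: stepwise rewards are accumulated on top of the outcome.
        return outcome_reward + sum(immediate_rewards)
    # Gate closed: only the outcome reward counts.
    return outcome_reward


if __name__ == "__main__":
    steps = [0.25, 0.5]  # hypothetical stepwise critic rewards
    print(gated_reward_accumulation(steps, outcome_reward=1.0, gate_threshold=0.5))
    print(gated_reward_accumulation(steps, outcome_reward=-1.0, gate_threshold=0.5))
```

Under this assumed rule, a trajectory that earns stepwise rewards but fails the high-level check receives no credit for them, which is one way to discourage the reward hacking the abstract describes.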
