Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards

Zetian Sun; Dongfang Li; Zhuoen Chen; Yuhuai Qin; Baotian Hu

arXiv:2508.10548 · cs.LG · August 15, 2025

TL;DR

This paper introduces Gated Reward Accumulation (G-RA), a reward shaping method, together with a unified RL framework for software engineering (SWE) tasks. Both target long-horizon, multi-turn reinforcement learning, where rewards are sparse and immediate rewards can be misaligned with long-term objectives.

Contribution

The paper proposes G-RA, a novel reward accumulation technique that enhances stability and performance in long-horizon RL, along with a comprehensive SWE-oriented RL framework supporting multi-turn interactions.

Findings

1. G-RA significantly increases task completion rates.

2. G-RA improves modification rates without policy degradation.

3. The framework effectively supports multi-turn reasoning in SWE tasks.

Abstract

Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level…
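
The abstract is cut off before it states the gating condition, so the following is only a minimal sketch of the general idea of gating. It assumes the gate is a simple threshold on a high-level (outcome) reward. The names `Step`, `gated_return`, and `gate_threshold` are hypothetical and do not come from the paper, and the paper's actual formulation may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    # Immediate (stepwise) reward assigned to one turn of the trajectory.
    immediate_reward: float


def gated_return(steps: List[Step], outcome_reward: float, gate_threshold: float) -> float:
    """Return the trajectory reward under an assumed threshold gate.

    Assumption (not confirmed by the source): immediate rewards are added
    only when the high-level outcome reward reaches `gate_threshold`;
    otherwise they are discarded and only the outcome reward counts.
    """
    if outcome_reward >= gate_threshold:
        return outcome_reward + sum(s.immediate_reward for s in steps)
    return outcome_reward


trajectory = [Step(0.25), Step(0.5)]
gate_open = gated_return(trajectory, outcome_reward=1.0, gate_threshold=0.5)
gate_closed = gated_return(trajectory, outcome_reward=0.0, gate_threshold=0.5)
```

Under this assumed gate, a trajectory that collects stepwise rewards but fails the high-level objective gains nothing from them, which is one way to discourage the reward hacking the abstract describes.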
