Co-Evolution of Policy and Internal Reward for Language Agents

Xinyu Wang; Hanwei Wu; Jingwei Song; Shuyuan Zhang; Jiayi Zhang; Fanqi Kong; Tung Sum Thomas Kwok; Xiao-Wen Chang; Yuyu Luo; Chenglin Wu; Bang Liu

arXiv:2604.03098 · cs.LG · April 6, 2026

TL;DR

This paper introduces Self-Guide, an internal reward mechanism for language agents that enhances both inference guidance and policy training, leading to significant performance improvements.

Contribution

It proposes a co-evolving loop where internal reward and policy improve together, enabling better long-horizon learning for language agents.

Findings

01 Inference-time self-guidance improves performance.

02 Joint evolution of policy and internal reward yields 8% gains.

03 Internal reward learning enhances long-term policy optimization.

Abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal…
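
The dual use of the guidance signal described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `guide_fn`, `act_fn`, `guidance_to_reward`, and `step_rewards`, the three guidance labels, and the ±0.5 reward values are all hypothetical, chosen only to show one signal steering the next action at inference time and then being reused as a step-level reward alongside a sparse terminal reward.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    guidance: str           # short self-guidance text produced before acting
    action: str             # action chosen after reading the guidance
    internal_reward: float  # step-level reward derived from the same guidance


def guidance_to_reward(guidance: str) -> float:
    """Hypothetical mapping from a guidance label to a scalar reward.

    The label set and the values are assumptions made for illustration.
    """
    table = {"progress": 0.5, "neutral": 0.0, "regress": -0.5}
    return table.get(guidance, 0.0)


def rollout(
    guide_fn: Callable[[str], str],
    act_fn: Callable[[str, str], str],
    observations: List[str],
) -> List[Step]:
    """Run one episode: generate guidance, then act conditioned on it."""
    steps = []
    for obs in observations:
        guidance = guide_fn(obs)        # inference-time guidance
        action = act_fn(obs, guidance)  # guidance steers the next action
        steps.append(Step(guidance, action, guidance_to_reward(guidance)))
    return steps


def step_rewards(steps: List[Step], terminal_reward: float) -> List[float]:
    """Densify a sparse terminal reward with the per-step internal rewards."""
    rewards = [s.internal_reward for s in steps]
    if rewards:
        rewards[-1] += terminal_reward  # environment reward arrives at the end
    return rewards
```

In this sketch `guide_fn` and `act_fn` stand in for calls to the same agent, so a policy update driven by `step_rewards` would also change the guidance produced in the next rollout; that shared dependence is the co-evolving loop the abstract refers to.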
