TL;DR
This paper introduces BEACON, a milestone-guided policy learning framework that improves training of long-horizon language agents by addressing credit misattribution and sample inefficiency, leading to significant performance gains.
Contribution
BEACON leverages task milestones for precise credit assignment, enhancing learning efficiency and success rates in long-horizon language agent tasks.
Findings
BEACON achieves 92.9% success on ALFWorld long-horizon tasks.
This nearly doubles the success rate of previous methods.
Sample utilization improves from 23.7% to 82.0% with BEACON.
Abstract
While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and…
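The three mechanisms named in the abstract (partitioning at milestone boundaries, temporal reward shaping within segments, and advantage estimation at two scales) can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so every function name, the discount `gamma`, the 0/1 segment score, and the additive combination weighted by `local_weight` are assumptions chosen for illustration. The trajectory-scale term is a GRPO-style group normalization of terminal success.

```python
from statistics import mean, pstdev


def split_at_milestones(milestone_flags):
    """Split step indices into (start, end) ranges, end exclusive.

    Each segment ends at a step where a milestone is reached; leftover
    steps after the last milestone form a trailing segment.
    """
    segments, start = [], 0
    for t, hit in enumerate(milestone_flags):
        if hit:
            segments.append((start, t + 1))
            start = t + 1
    if start < len(milestone_flags):
        segments.append((start, len(milestone_flags)))
    return segments


def shaped_rewards(milestone_flags, gamma=0.9):
    """Assumed shaping rule: inside a segment that reaches its milestone,
    a step that is k steps before the milestone receives gamma**k.
    Steps in a segment that never reaches a milestone receive 0."""
    rewards = [0.0] * len(milestone_flags)
    for start, end in split_at_milestones(milestone_flags):
        if milestone_flags[end - 1]:
            for t in range(start, end):
                rewards[t] = gamma ** (end - 1 - t)
    return rewards


def group_normalize(values):
    """Zero-mean, unit-variance normalization; all zeros if there is no spread."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def dual_scale_advantages(group_flags, group_success, local_weight=1.0):
    """Assumed combination: per-step advantage = trajectory-scale term
    (normalized terminal success) + local_weight * segment-scale term
    (normalized 0/1 score for whether the step's segment hit its milestone)."""
    traj_adv = group_normalize([float(s) for s in group_success])

    seg_scores, seg_index = [], []
    for i, flags in enumerate(group_flags):
        for start, end in split_at_milestones(flags):
            seg_scores.append(1.0 if flags[end - 1] else 0.0)
            seg_index.append((i, start, end))
    seg_adv = group_normalize(seg_scores)

    advantages = [[0.0] * len(flags) for flags in group_flags]
    for (i, start, end), a in zip(seg_index, seg_adv):
        for t in range(start, end):
            advantages[i][t] = traj_adv[i] + local_weight * a
    return advantages


# Two rollouts of the same task: A completes both milestones and succeeds,
# B completes only the first milestone and fails at the end.
flags_a = [False, True, False, True]
flags_b = [False, True, False, False]
adv = dual_scale_advantages([flags_a, flags_b], [1, 0])
```

In this toy group, the early steps of the failed rollout B receive a higher advantage than its later steps, because their segment still reached a milestone. That is the kind of separation between local progress and terminal failure the abstract describes, under the assumptions stated above.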
