TL;DR
This paper introduces SERL, a reinforcement learning framework for multi-turn LLM agents that selectively uses environment feedback to raise success rates on complex, long-horizon tasks.
Contribution
The paper presents SERL, a selective environment-reweighted learning method. The task reward sets the direction of each update, while feedback from several environment sources determines where in the trajectory the update is applied and how strongly.
Findings
SERL achieves 90.0% success on ALFWorld and 80.1% on WebShop.
SERL outperforms strong RL and distillation baselines.
Grounded, action-relevant feedback improves learning.
Abstract
Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals and do not fully exploit per-step environment feedback. This is especially true in multi-turn agent settings, which remain underexplored even though feedback there can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine the update direction, while environment feedback adjusts the placement and magnitude of the update, concentrating it on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, respectively, outperforming strong RL and distillation baselines. Analysis…
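The division of labor described in the abstract, with the reward deciding direction and feedback deciding placement and magnitude, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how feedback is turned into weights. Everything here is an assumption. The function name `serl_step_advantages`, the per-step `feedback_scores` in [0, 1], the `floor` clip, and the mean-one normalization are all hypothetical choices for illustration.

```python
from typing import List


def serl_step_advantages(
    trajectory_reward: float,
    baseline: float,
    feedback_scores: List[float],
    floor: float = 0.1,
) -> List[float]:
    """Hypothetical per-step advantages in the spirit of SERL (not the paper's code).

    Direction: the sign comes only from the task reward relative to a baseline.
    Placement/magnitude: assumed per-step feedback scores in [0, 1] redistribute
    that advantage across steps but can never flip its sign.
    """
    if not 0.0 < floor <= 1.0:
        raise ValueError("floor must be in (0, 1]")
    if not feedback_scores:
        return []
    # Direction is set solely by the sparse task reward.
    advantage = trajectory_reward - baseline
    # Clip scores into [floor, 1] so no step receives exactly zero credit.
    raw = [min(max(s, floor), 1.0) for s in feedback_scores]
    total = sum(raw)
    n = len(raw)
    # Normalize weights to average 1, so the summed update matches
    # uniform credit assignment (n * advantage) and only its placement changes.
    weights = [n * w / total for w in raw]
    return [advantage * w for w in weights]


if __name__ == "__main__":
    # A successful trajectory: the step with the strongest feedback gets the most credit.
    print(serl_step_advantages(1.0, 0.5, [0.0, 1.0, 0.5]))
    # A failed trajectory: every step is pushed down, most strongly where feedback is high.
    print(serl_step_advantages(0.0, 0.5, [0.2, 0.9]))
```

Keeping the normalized weights at an average of 1 is one way to let feedback reshape credit without inflating or shrinking the overall update. Other schemes (softmax weighting, feedback-gated masks) would fit the same description equally well.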
