TL;DR
AgentHER applies hindsight experience replay to failed LLM agent trajectories, converting them into training data. On WebArena and ToolBench it improves success rates by 7.6-11.4% over success-only SFT and achieves 2x sample efficiency.
Contribution
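This work adapts HER to natural-language agent trajectories and introduces a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts failed trajectories into SFT, DPO, and ShareGPT training data.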
Findings
AgentHER improves success rates by 7.6-11.4% over success-only SFT.
AgentHER achieves 2x sample efficiency on WebArena and ToolBench.
Its robustness mechanisms reduce label noise from 5.9% to 2.9%.
Abstract
LLM-agent training pipelines routinely discard failed trajectories even though GPT-4o achieves only 14-20% on WebArena and below 55% pass@1 on ToolBench; even specialised systems at 50-65% leave the majority of trajectories unused. We introduce AgentHER, which recovers this lost signal by adapting Hindsight Experience Replay (HER) to natural-language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. AgentHER realises this through a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts discarded failures into SFT, DPO, and ShareGPT training data. On WebArena and ToolBench under a strict task-disjoint held-out protocol, AgentHER improves over success-only SFT by +7.6-11.4% across four model families (GPT-4o, Qwen2.5-72B/7B,…
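The four stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Trajectory` record, the function names, the `min_confidence` threshold of 0.8, the output field names, and the `toy_relabeler` stand-in for an LLM call are all hypothetical, and each stage is reduced to its simplest possible form.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trajectory:
    goal: str          # goal the agent was originally given
    steps: list[str]   # natural-language actions and observations
    success: bool      # whether the original goal was met


# Hypothetical relabeler signature: (steps, outcome) -> (new_goal, confidence in [0, 1]).
Relabeler = Callable[[list[str], str], tuple[str, float]]


def is_relabelable(traj: Trajectory) -> bool:
    """Stage 1 (failure classification), reduced here to: failed and non-empty."""
    return (not traj.success) and len(traj.steps) > 0


def extract_outcome(traj: Trajectory) -> str:
    """Stage 2 (outcome extraction), reduced here to: take the final step."""
    return traj.steps[-1]


def relabel(traj: Trajectory, relabeler: Relabeler,
            min_confidence: float = 0.8) -> Optional[dict]:
    """Stages 3 and 4: relabel with a confidence gate, then package one record."""
    if not is_relabelable(traj):
        return None
    outcome = extract_outcome(traj)
    new_goal, confidence = relabeler(traj.steps, outcome)
    if confidence < min_confidence:
        return None  # gate: drop relabels the relabeler is unsure about
    # Assumed SFT-style record layout; real formats will differ.
    return {
        "prompt": new_goal,
        "completion": "\n".join(traj.steps),
        "original_goal": traj.goal,
        "confidence": confidence,
    }


def toy_relabeler(steps: list[str], outcome: str) -> tuple[str, float]:
    """Stand-in for an LLM call: longer trajectories get higher confidence."""
    confidence = 0.9 if len(steps) >= 2 else 0.3
    return f"Reach this outcome: {outcome}", confidence


failed = Trajectory(
    goal="Book a flight to Paris",
    steps=["search flights to Paris", "open the Lyon listing",
           "book the flight to Lyon"],
    success=False,
)
record = relabel(failed, toy_relabeler)
short = relabel(Trajectory("Find a hotel", ["open homepage"], False), toy_relabeler)
```

Here the trajectory that failed goal A (a Paris booking) is kept as a demonstration for an alternative goal B (the Lyon booking it actually completed), while the one-step trajectory is dropped by the confidence gate.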
