Revisiting DAgger in the Era of LLM-Agents
Changhao Li, Rushi Qiang, Jiawei Huang, Chenxiao Gao, Chao Zhang, Niao He, Bo Dai

TL;DR
This paper revisits the DAgger algorithm for training long-horizon language model agents, demonstrating it effectively mitigates covariate shift and improves performance on software engineering benchmarks.
Contribution
The paper adapts DAgger for multi-turn LM agents, combining on-policy interaction with supervised learning to enhance training effectiveness.
Findings
DAgger improves performance over baseline models on SWE-bench Verified.
A 4B-scale DAgger agent outperforms some larger models on software engineering tasks.
An 8B-scale DAgger agent surpasses existing SWE systems and approaches the performance of larger models.
Abstract
Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories, while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but receives only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic…
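The data-collection loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the one-dimensional toy environment, the lookup-table "student", the `GOAL` constant, and the geometric schedule for the mixing probability `beta` are all assumptions chosen only to keep the example self-contained. The two ideas taken from the abstract are that each turn's action comes from either the teacher or the student, and that every visited state is labeled with the teacher's action.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Policy = Callable[[int], int]
Sample = Tuple[int, int]  # (state, teacher_action)

GOAL = 5  # hypothetical target position on a line of states 0..9


def teacher(state: int) -> int:
    """Toy expert (assumption): step right below GOAL, otherwise step left."""
    return 1 if state < GOAL else -1


def env_step(state: int, action: int) -> int:
    """Toy dynamics (assumption): move by the action, clipped to [0, 9]."""
    return max(0, min(9, state + action))


def fit_lookup(dataset: List[Sample], default: int = -1) -> Policy:
    """Stand-in for supervised training: a majority-vote table over the
    aggregated data, with a fixed default action for unseen states."""
    votes = defaultdict(Counter)
    for state, label in dataset:
        votes[state][label] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, default)


def collect_trajectory(env_step, init_state, student, teacher,
                       beta, horizon, rng) -> List[Sample]:
    """Roll out a turn-level mixture: at each turn act with the teacher
    with probability beta, else with the student. Every visited state is
    recorded with the teacher's action as its supervised label."""
    samples = []
    state = init_state
    for _ in range(horizon):
        label = teacher(state)
        samples.append((state, label))
        action = label if rng.random() < beta else student(state)
        state = env_step(state, action)
    return samples


def dagger(teacher, env_step, init_state, n_iters, horizon,
           beta0, decay, seed=0):
    """Aggregate teacher-labeled data over iterations and refit the
    student on the full dataset after each rollout."""
    rng = random.Random(seed)
    dataset: List[Sample] = []
    student = fit_lookup(dataset)
    for i in range(n_iters):
        beta = beta0 * decay ** i  # assumed schedule, not from the paper
        dataset.extend(collect_trajectory(
            env_step, init_state, student, teacher, beta, horizon, rng))
        student = fit_lookup(dataset)
    return student, dataset
```

In this sketch, `beta = 1.0` reduces to collecting teacher-only trajectories, which is the off-policy data that supervised fine-tuning uses, while `beta = 0.0` rolls out the student alone and still labels each state with the teacher. Starting from state 3, a teacher-only rollout in this toy visits only states 3, 4, and 5, so a student fit on it never sees the states its own mistakes lead to. Rolling out the student instead adds teacher labels for those states.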
