TL;DR
This paper introduces TCOD, a curriculum-based on-policy distillation method that stabilizes training and improves performance of multi-turn autonomous agents by controlling trajectory exposure.
Contribution
The paper proposes a simple curriculum that limits how much of each trajectory the student is exposed to, mitigating trajectory-level KL instability in multi-turn on-policy distillation and improving agent performance.
Findings
TCOD reduces KL divergence escalation during training.
TCOD improves multi-turn agent success rates by up to 18 points.
TCOD can outperform the teacher model and generalize to challenging tasks.
Abstract
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed…
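The abstract is cut off before it describes how TCOD schedules trajectory depth, so the following is only a minimal sketch of the general idea of a depth curriculum, not the authors' implementation. The linear schedule, the function names (`max_depth`, `truncate_trajectory`, `sampled_reverse_kl`), and the parameters `start_depth` and `full_depth` are all hypothetical. The KL helper uses the standard sampled estimate of KL(student || teacher) over tokens drawn from the student.

```python
def max_depth(step, total_steps, start_depth=1, full_depth=10):
    """Hypothetical linear schedule: number of turns exposed at a training step."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end keep the full depth.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_depth + int(frac * (full_depth - start_depth))


def truncate_trajectory(turns, depth):
    """Keep only the first `depth` turns of a student rollout."""
    return turns[:depth]


def sampled_reverse_kl(student_logps, teacher_logps):
    """Mean of (log p_student - log p_teacher) over student-sampled tokens."""
    if len(student_logps) != len(teacher_logps) or not student_logps:
        raise ValueError("need two non-empty, equal-length sequences")
    diffs = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(diffs) / len(diffs)


# Usage: early in training only shallow prefixes are supervised,
# so later turns, where compounded errors are likeliest, are left out.
rollout = ["turn_%d" % i for i in range(10)]
early = truncate_trajectory(rollout, max_depth(step=0, total_steps=100))
late = truncate_trajectory(rollout, max_depth(step=100, total_steps=100))
```

Under this assumed schedule the student sees one turn at the start of training and all ten turns by the end. Any monotone schedule could be substituted for the linear one.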
