TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Jiaqi Wang; Wenhao Zhang; Weijie Shi; Yaliang Li; James Cheng

arXiv:2604.24005 · cs.LG · April 30, 2026

TL;DR

This paper introduces TCOD, a curriculum-based on-policy distillation method that stabilizes training and improves performance of multi-turn autonomous agents by controlling trajectory exposure.

Contribution

The paper proposes a simple temporal curriculum that mitigates trajectory-level KL instability in multi-turn on-policy distillation, which in turn improves agent performance.

Findings

01 TCOD reduces KL divergence escalation during training.

02 TCOD improves multi-turn agent success rates by up to 18 points.

03 TCOD can outperform the teacher model and generalize to challenging tasks.

Abstract

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed…
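
The idea of limiting how many turns of a trajectory the student is exposed to can be illustrated with a small sketch. The abstract is cut off before it describes the actual schedule, so everything below is an assumption made for illustration: the linear schedule, the names `depth_limit`, `token_kl`, and `truncated_trajectory_kl`, and the choice of reverse KL (student relative to teacher) are hypothetical and are not taken from the paper or its repository.

```python
import math


def depth_limit(step, total_steps, min_turns=1, max_turns=10):
    """Hypothetical linear curriculum: how many turns are exposed at a step.

    Starts at min_turns and reaches max_turns at the final step.
    This schedule is an assumption, not the paper's.
    """
    if total_steps <= 1:
        return max_turns
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return min_turns + int(round(frac * (max_turns - min_turns)))


def token_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL(student || teacher) for a single next-token distribution.

    Using reverse KL here is an assumption about the distillation loss.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def truncated_trajectory_kl(turn_kls, step, total_steps, max_turns):
    """Mean per-turn KL over only the turns the curriculum exposes."""
    k = depth_limit(step, total_steps, 1, max_turns)
    exposed = turn_kls[:k]
    if not exposed:
        return 0.0
    return sum(exposed) / len(exposed)


if __name__ == "__main__":
    # Toy per-turn KL values that grow with depth, mimicking error compounding.
    turn_kls = [0.1, 0.5, 2.0]
    for step in (0, 50, 99):
        print(step, truncated_trajectory_kl(turn_kls, step, 100, max_turns=3))
```

In this toy setup, early steps only average over the shallow, low-KL turns, and deeper turns enter the loss as training progresses. Whether TCOD grows depth linearly, in stages, or adaptively is not stated in the visible part of the abstract.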

Code & Models

Repositories

kokolerk/TCOD (GitHub)