HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

TL;DR
HINT-SD is a targeted self-distillation method for training long-horizon agents. It applies feedback only to failure-relevant actions rather than to every turn, which improves both training efficiency and task performance.
Contribution
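The paper introduces HINT-SD, a framework that uses full-trajectory hindsight to select which actions receive feedback. It addresses two weaknesses of previous methods: generating feedback at every turn is inefficient, and feedback applied at a fixed or misaligned turn often misses the actions that contributed to the failure.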
Findings
HINT-SD outperforms dense feedback baselines by up to 18.80%.
It reduces training step time by a factor of 2.26.
Targeted feedback selection is crucial for long-horizon training.
Abstract
Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld…
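The idea of applying distillation only on targeted action spans can be illustrated with a small masked-loss sketch. This is not the authors' implementation: the function names, the use of a mean KL divergence, and the token-index span format are all assumptions made for illustration. The sketch assumes a teacher distribution (the model conditioned on feedback) and a student distribution (the model without feedback) are available per token, and that a hindsight step has already produced the failure-relevant spans.

```python
import math
from typing import List, Sequence, Tuple


def build_span_mask(num_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[int]:
    """Return a 0/1 mask with 1 for tokens inside any [start, end) span.

    The spans stand in for the failure-relevant action spans that a
    hindsight step would select (hypothetical format).
    """
    mask = [0] * num_tokens
    for start, end in spans:
        for i in range(max(0, start), min(num_tokens, end)):
            mask[i] = 1
    return mask


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def targeted_distillation_loss(
    teacher_probs: Sequence[Sequence[float]],
    student_probs: Sequence[Sequence[float]],
    mask: Sequence[int],
) -> float:
    """Mean KL(teacher || student) over masked tokens only.

    Tokens outside the targeted spans contribute nothing, so turns that
    were already successful or neutral are left unsupervised.
    """
    selected = [
        kl_divergence(t, s)
        for t, s, m in zip(teacher_probs, student_probs, mask)
        if m
    ]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


# Toy trajectory of 4 tokens over a 2-word vocabulary.
teacher = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
student = [[0.1, 0.9], [0.9, 0.1], [0.2, 0.8], [0.9, 0.1]]

# Only tokens 1 and 2 are targeted; the student already matches there.
mask = build_span_mask(len(teacher), [(1, 3)])
loss = targeted_distillation_loss(teacher, student, mask)
```

In this toy case the student disagrees with the teacher only outside the targeted span, so the masked loss is zero, whereas a dense loss over all tokens would be positive. How spans are actually selected, and which divergence is used, is left to the paper.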
