What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Xiaozhe Li; Tianyi Lyu; Yang Li; Yichuan Ma; Peiji Li; Linyang Li; Qipeng Guo; Dahua Lin; Kai Chen

arXiv:2605.19447 · cs.AI · May 20, 2026

TL;DR

This paper introduces SERL, a framework for multi-turn agents that selectively uses environment feedback to improve reinforcement learning success rates in complex tasks.

Contribution

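The paper presents SERL, a selective environment-reweighted learning method for multi-turn agents. The task reward sets the direction of each update, while environment feedback determines where the update is applied and how strongly. The design is informed by a systematic study of five feedback sources and two insertion granularities.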

Findings

01 SERL achieves 90.0% success on ALFWorld and 80.1% on WebShop.

02 SERL outperforms strong RL and distillation baselines on both benchmarks.

03 Grounded, action-relevant feedback improves learning.

Abstract

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals and do not fully exploit per-step environment feedback. Multi-turn agent settings, where that feedback can include error messages, page changes, observations, or reference trajectories, remain underexplored. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine the update direction, while environment feedback adjusts the placement and magnitude of updates, focusing them on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, respectively, outperforming strong RL and distillation baselines. Analysis…
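The division of labor described in the abstract (task reward sets the direction, environment feedback sets placement and magnitude) can be illustrated with a toy reweighting rule. This is a minimal sketch under stated assumptions, not SERL's actual objective: the function name `reweight_step_advantages`, the fixed `baseline`, the non-negative per-step `feedback_scores`, and the mean-one normalization are all hypothetical choices made for illustration.

```python
from typing import List


def reweight_step_advantages(
    task_reward: float,
    feedback_scores: List[float],
    baseline: float = 0.5,
) -> List[float]:
    """Toy per-step credit assignment (hypothetical, not the paper's formula).

    The sign of every per-step advantage comes only from the trajectory-level
    task reward. The per-step feedback scores only change how much of that
    signal each action receives.
    """
    if not feedback_scores:
        return []
    if any(score < 0 for score in feedback_scores):
        raise ValueError("feedback scores are assumed to be non-negative")

    # Direction: one scalar for the whole trajectory (assumed fixed baseline).
    trajectory_advantage = task_reward - baseline

    # Placement and magnitude: normalize scores so the weights average to 1,
    # which keeps the total update size equal to the unweighted case.
    num_steps = len(feedback_scores)
    total = sum(feedback_scores)
    if total == 0:
        # No informative feedback: fall back to uniform credit.
        weights = [1.0] * num_steps
    else:
        weights = [num_steps * score / total for score in feedback_scores]

    return [trajectory_advantage * weight for weight in weights]


if __name__ == "__main__":
    # A successful 3-step episode where the last action drew the most feedback.
    print(reweight_step_advantages(1.0, [0.0, 1.0, 3.0]))
```

For a successful episode (`task_reward=1.0`) with scores `[0.0, 1.0, 3.0]`, this sketch returns `[0.0, 0.375, 1.125]`: every step keeps the positive direction given by the task reward, but most of the credit lands on the action with the strongest feedback. With `task_reward=0.0` the same weights are applied with a negative sign.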

Code & Models

Repositories

oliverleexz/SERL (GitHub)
