Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections
Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da

TL;DR
This paper introduces on-policy expert corrections (OECs), a novel data generation method inspired by DAgger, to improve multi-turn language model training by addressing covariate shift, showing significant performance gains in software engineering tasks.
Contribution
The paper proposes OECs, a new on-policy data collection technique for multi-turn LM training that mitigates covariate shift and enhances performance over traditional imitation learning.
Findings
OEC trajectories outperform traditional imitation learning by 13-14% in software engineering tasks.
Combining expert demonstrations with on-policy data improves multi-turn LM agent training.
Experiments validate the effectiveness of OECs in multi-turn language modeling scenarios.
Abstract
A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a…
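The data generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `CounterEnv`, `generate_oec`, `loss_mask`, and the string-to-string policy interface are hypothetical names invented for this example. The uniform random switch point, the masking of student-generated turns, and the verifier-gated acceptance follow the description given in the peer reviews; they are assumptions about the setup, not confirmed details.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical policy interface: maps an observation string to an action string.
Policy = Callable[[str], str]


@dataclass
class Turn:
    observation: str
    action: str
    by_expert: bool  # True if the expert produced this action


class CounterEnv:
    """Toy stand-in for a multi-turn task: succeed by issuing 'inc' `target` times."""

    def __init__(self, target: int = 3) -> None:
        self.target = target
        self.value = 0

    def reset(self) -> str:
        self.value = 0
        return str(self.value)

    def step(self, action: str):
        if action == "inc":
            self.value += 1
        return str(self.value), self.value >= self.target

    def passed(self) -> bool:
        # Plays the role of a verifier (e.g. unit tests in a SWE task).
        return self.value >= self.target


def generate_oec(env, student: Policy, expert: Policy,
                 max_turns: int, switch_turn: int) -> Optional[List[Turn]]:
    """Roll out the student until `switch_turn`, then let the expert finish."""
    obs = env.reset()
    turns: List[Turn] = []
    for t in range(max_turns):
        use_expert = t >= switch_turn
        action = (expert if use_expert else student)(obs)
        turns.append(Turn(obs, action, use_expert))
        obs, done = env.step(action)
        if done:
            break
    # Verifier-gated acceptance: discard trajectories that do not solve the task.
    return turns if env.passed() else None


def loss_mask(turns: List[Turn]) -> List[int]:
    """1 for expert turns (trained on), 0 for student turns (masked out)."""
    return [1 if turn.by_expert else 0 for turn in turns]


if __name__ == "__main__":
    student: Policy = lambda obs: "noop"  # a student that gets stuck
    expert: Policy = lambda obs: "inc"    # an expert that solves the task
    max_turns = 6
    switch = random.randrange(max_turns)  # assumed: uniform random switch point
    traj = generate_oec(CounterEnv(), student, expert, max_turns, switch)
    print(switch, None if traj is None else loss_mask(traj))
```

In this toy setup a late switch leaves the expert too few turns to finish, so the trajectory is rejected by the verifier check. The student prefix keeps the visited states on-policy, while only the expert's corrective turns contribute to the fine-tuning loss.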
Peer Reviews
Decision·Submitted to ICLR 2026
The paper identifies an important problem that hasn’t been addressed yet (covariate shift in behavior cloning for LLMs), and proposes a well-motivated and easy-to-understand method. The experiment design is appropriate for the claims in the paper, and there’s evidence that (i) there is covariate shift between the learner’s and the expert’s trajectory distributions, and (ii) OECs improve performance on SWE tasks for 7B and 32B models derived from the Qwen2.5-Coder-Instruct series. They include additional exper
The paper introduces OEC as a novel data-generation method, but this strategy of learning from trajectories that use on-policy behavior until a certain timestep and then switch to the expert policy for the rest of the episode has been studied before. For example, this family of strategies is the focus of “Learning to Search Better than Your Teacher” by Chang et al. (ICML 2015). I think the paper should cite and discuss previous work on OEC in the related works section, and appropriately fram
- The paper empirically quantifies turn-wise divergence between student and expert, motivating partial on-policy corrections.
- OEC is simple to implement (random switch, mask student tokens, verifier-gated acceptance) and yields consistent gains over BC and fully on-policy data under the stated setup.
- The study isolates two actionable levers, on-policy masking and repetition filtering, and shows they materially affect results, especially at larger model scales.
- Findings like “later switche
1. **Limited novelty relative to established interactive IL.** **Problem:** OEC’s core idea (on-policy rollouts with expert corrections) substantially overlaps with DAgger-style data aggregation and related learning-to-search methods; the paper’s unique elements are mainly engineering choices (random switch policy and masking) rather than a new principle. At ICLR, novelty beyond well-known IL frameworks is expected; otherwise significance hinges on breadth and rigor of validation. **Action:**
* The idea of the approach is simple and easy to follow.
* Empirical results show improvements over prior work on an open benchmark.
* The analysis of covariate shift is nicely done.
* The author mentions the potential of using RL with verifier rewards, but does not compare with any baselines using RL.
* Based on results, later switching improves the model more, but the method uses uniform sampling to determine the switch time. This seems to be a bit contradictory.
* The authors mention the no-regret guarantee of DAgger is violated by using OECs, but do not provide any additional theoretical insights.
* There are no explanations on the results provided in Table 3, so I am a
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Software Engineering Research
