Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

Bing Wang; Shaotian Yan; Chen Shen; Kaiyuan Liu; Sinan Fan; Ximing Li; Rui Miao; Xiaosong Yuan; Zhanming Shen; Jieping Ye

arXiv:2605.19433·cs.CL·May 20, 2026

TL;DR

The paper introduces MOTAB, a distillation pipeline for LLM reasoning that backtracks dynamically to correct errors in the student's reasoning, mitigating dual exposure biases and improving reasoning performance.

Contribution

The paper proposes a method that monitors student reasoning trajectories and backtracks them when they stray, addressing both the off-policy and the on-policy exposure bias in LLM distillation.
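The summary on this page does not say how MOTAB monitors a trajectory or where it backtracks to, so the following is only a toy sketch of a generic monitor-and-backtrack rollout, not the authors' implementation. Every name in it (`rollout_with_backtracking`, `student_step`, `teacher_step`, `has_strayed`) is hypothetical, and the "reasoning task" is a stand-in: counting upward from 1, with a student that errs on every third step.

```python
from typing import Callable, List, Tuple

Step = int
Prefix = List[Step]


def rollout_with_backtracking(
    student_step: Callable[[Prefix], Step],
    teacher_step: Callable[[Prefix], Step],
    has_strayed: Callable[[Prefix], bool],
    num_steps: int,
) -> Tuple[Prefix, List[str]]:
    """Build a trajectory from student steps, backtracking when the monitor fires.

    All arguments are hypothetical stand-ins chosen for illustration;
    the paper's actual monitor and backtracking rule may differ.
    """
    trajectory: Prefix = []
    sources: List[str] = []
    while len(trajectory) < num_steps:
        # The student proposes the next step from its own prefix (on-policy).
        candidate = student_step(trajectory)
        if has_strayed(trajectory + [candidate]):
            # Backtrack: discard the proposed step and resume from the
            # last accepted prefix with a teacher-provided step instead.
            candidate = teacher_step(trajectory)
            sources.append("teacher")
        else:
            sources.append("student")
        trajectory.append(candidate)
    return trajectory, sources


# Toy task (an assumption for this sketch): step i of the trajectory should equal i.
def toy_student(prefix: Prefix) -> Step:
    nxt = len(prefix) + 1
    # The toy student makes a mistake on every third step.
    return nxt + 10 if nxt % 3 == 0 else nxt


def toy_teacher(prefix: Prefix) -> Step:
    return len(prefix) + 1


def toy_monitor(prefix: Prefix) -> bool:
    # Flags the prefix as off track when its latest step is wrong.
    return prefix[-1] != len(prefix)


trajectory, sources = rollout_with_backtracking(
    toy_student, toy_teacher, toy_monitor, num_steps=6
)
print(trajectory)
print(sources)
```

In this sketch the student's mistakes never enter the accepted prefix, so later steps are conditioned on contexts that are mostly student-generated yet still on track. That is one plausible reading of how backtracking could address both biases, offered here as an assumption.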

Findings

1. MOTAB achieves an improvement of approximately 3% on reasoning tasks.

2. The approach alleviates the dual exposure biases in LLM reasoning distillation.

3. Experimental results on the LIMO-v2 and AceReason datasets validate the method's effectiveness.

Abstract

Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive…
