Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation
Bing Wang, Shaotian Yan, Chen Shen, Kaiyuan Liu, Sinan Fan, Ximing Li, Rui Miao, Xiaosong Yuan, Zhanming Shen, Jieping Ye

TL;DR
The paper introduces MOTAB, a distillation pipeline for LLM reasoning that monitors the student's reasoning trajectory and backtracks when it strays, mitigating the dual exposure biases of off-policy and on-policy distillation and improving reasoning performance.
Contribution
It proposes a new method that monitors and backtracks student reasoning trajectories to address both off-policy and on-policy exposure biases in LLM distillation.
Findings
MOTAB improves performance on reasoning tasks by approximately 3%.
The approach effectively alleviates dual exposure biases in LLM reasoning distillation.
Experimental results on LIMO-v2 and AceReason datasets validate the method's effectiveness.
Abstract
Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive…
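The error-cascade concern in long CoT reasoning can be made concrete with a toy calculation. The sketch below is not the paper's method or analysis: it assumes, purely for illustration, that each reasoning step goes wrong independently with a fixed probability, and the function name `clean_trajectory_prob` and the 1% error rate are hypothetical choices. Under that assumption, the chance that a trajectory contains no erroneous step shrinks geometrically with its length.

```python
def clean_trajectory_prob(step_error: float, num_steps: int) -> float:
    """Probability that a trajectory has no erroneous step.

    Illustrative assumption (not from the paper): every step fails
    independently with the same probability `step_error`.
    """
    if not 0.0 <= step_error <= 1.0:
        raise ValueError("step_error must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return (1.0 - step_error) ** num_steps


# Hypothetical 1% per-step error rate at increasing trajectory lengths.
for steps in (10, 100, 1000):
    print(f"{steps:>5} steps -> P(no error) = {clean_trajectory_prob(0.01, steps):.6f}")
```

Even a small per-step error rate leaves very few long trajectories error-free under this simplified model, which is one way to see why a student trained only on teacher-generated contexts can drift during long generations.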
