MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants

Dongyi Ding; Tiannan Wang; Chenghao Zhu; Meiling Tao; Yuchen Eleanor Jiang; Wangchunshu Zhou

arXiv:2507.01887·cs.CL·July 3, 2025


Open Access · 3 Reviews

TL;DR

MiCoTA introduces intermediate-sized models and reasoning sequences to enhance small language models' ability to learn long-chain reasoning, significantly improving their performance on complex reasoning benchmarks.

Contribution

The paper proposes MiCoTA, a novel distillation framework using intermediate models and reasoning sequences to bridge the learnability gap for small language models.

Findings

01. SLMs distilled with MiCoTA outperform baseline models on reasoning benchmarks.

02. MiCoTA produces training data that is more closely aligned with the small models' own output distributions.

03. MiCoTA yields significant improvements in reasoning scores on multiple benchmarks.

Abstract

Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the "SLMs Learnability Gap". To address this, we introduce Mid-CoT Teacher Assistant Distillation (MiCoTA), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

- Clear problem motivation with concrete experimental evidence
- Well-structured paper with logical flow from problem identification to solution
- Addresses a practical problem in deploying reasoning-capable SLMs

Weaknesses

- No rigorous theoretical explanation for why intermediate-length CoT should help beyond the intuitive capacity/length gap argument
- Only evaluated on math reasoning tasks (AIME, AMC, Olympiad, MATH-500, GSM8K)
- Missing analysis of what happens with different TA sizes

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The framework design is reasonable and has a certain novelty: The paper addresses a practical issue in Large Language Model (LLM) knowledge distillation and proposes a clear solution with sufficient motivation. The concept of "learning capability gap" relatively accurately summarizes the problem, and the proposed "teacher-assistant-student" pipeline is a reasonable and logically consistent approach to bridge this gap.
2. The experimental results are convincing and show good performance: The m

Weaknesses

1. Domain Generalization: All evaluations are conducted on mathematical reasoning tasks. It remains unclear whether MiCoTA generalizes to other reasoning-heavy domains (e.g., code reasoning, legal text analysis, multi-hop QA).
2. Faithfulness and Error Propagation in Mid-CoT: This methodology relies on a Teacher Assistant that is not perfect, leading to the risk of error propagation that has not been fully addressed. If the intermediate CoT generated by the TA has flaws, omissions, or misunderstan

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Clear problem definition. This paper clearly identifies the SLM learnability gap as a crucial issue in long-CoT distillation, framing it along both capacity and reasoning-length dimensions.
2. Intuitive idea. The “half-size, half-length” strategy is intuitive and well-grounded. Combining teacher-assistant distillation with model-merging (DARE + TIES) to generate mid-length CoTs is novel in the CoT context.
3. Thorough experimental validation. The experimental results are across multiple mode

Weaknesses

1. Lack of theoretical grounding. The key novelty is the combination of teacher-assistant distillation and model merging to obtain “half-size, half-length” CoTs. The paper does not provide a principled criterion for how much to shorten CoTs or why the proposed merge yields half length beyond an anecdotal trend and a qualitative claim about “approximately half” tokens.
2. Evaluation is narrow on math. All five core benchmarks are mathematical or math-heavy. This makes it unclear whether MiCoTA generalizes to non-mathematical reasoning. The current ev
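
Reviewer 03 refers to DARE + TIES model merging as the mechanism for obtaining mid-length CoTs. For readers unfamiliar with those two techniques, the following is a minimal NumPy sketch of the generic operations on flat weight vectors, not the authors' implementation. Which checkpoints MiCoTA merges and with what hyperparameters is not stated in this summary, so the function names (`dare`, `ties_merge`, `merge`) and the `drop_rate`, `density`, and `scale` parameters and their defaults are illustrative assumptions.

```python
import numpy as np


def dare(delta, drop_rate, rng):
    """DARE step: randomly drop entries of a task vector (fine-tuned minus
    base weights) and rescale the survivors by 1 / (1 - drop_rate)."""
    keep = rng.random(delta.shape) >= drop_rate
    return np.where(keep, delta / (1.0 - drop_rate), 0.0)


def ties_merge(deltas, density):
    """TIES step: trim each task vector to its largest-magnitude entries,
    elect one sign per parameter, then average only the agreeing entries."""
    trimmed = []
    for d in deltas:
        k = max(1, int(round(density * d.size)))      # entries to keep
        threshold = np.sort(np.abs(d).ravel())[-k]    # k-th largest magnitude
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)
    elected = np.sign(stacked.sum(axis=0))            # sign with more total mass
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(axis=0)
    total = np.where(agree, stacked, 0.0).sum(axis=0)
    # Parameters where no task vector agrees with the elected sign stay at 0.
    return np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)


def merge(base, finetuned, drop_rate=0.5, density=0.5, scale=1.0, seed=0):
    """Hypothetical wrapper: apply DARE to each task vector, combine them
    with TIES, and add the scaled result back onto the base weights."""
    rng = np.random.default_rng(seed)
    deltas = [dare(w - base, drop_rate, rng) for w in finetuned]
    return base + scale * ties_merge(deltas, density)
```

In this sketch, `merge(base, [weights_a, weights_b])` takes one flat array of base weights and a list of same-shaped fine-tuned weight arrays. A real merge would apply the same steps tensor by tensor across a model's state dict.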


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)