The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
Ruichen Zhang, Rana Muhammad Shahroz Khan, Zhen Tan, Dawei Li, Song Wang, Tianlong Chen

TL;DR
This paper introduces DC-CoT, a comprehensive benchmark for evaluating data-centric chain-of-thought distillation methods in large language models, focusing on data manipulation impacts on reasoning performance across various models and datasets.
Contribution
It presents the first systematic benchmark for data-centric CoT distillation, assessing data manipulation effects on model reasoning and generalization from multiple perspectives.
Findings
Data manipulation significantly affects model reasoning performance.
Best practices for data-centric CoT distillation improve model generalization.
The benchmark enables systematic evaluation of data techniques across models and datasets.
Abstract
Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark that systematically assesses the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model, and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to…
Peer Reviews
Decision: ICLR 2026 Poster
1. DC-CoT addresses a timely research gap by focusing on data-centric strategies in CoT distillation. 2. The evaluation structure is comprehensive, spanning different teacher models, various student model scales, and multiple evaluation settings (IID, OOD, Transfer). 3. The study directly confronts the challenge of making robust reasoning accessible to smaller models, which is paramount for reducing inference costs in real-world applications.
1. The exploration of specific data-centric strategies (augmentation, selection, mixing) appears superficial. The current implementation is limited to basic methods, failing to incorporate or evaluate more advanced, state-of-the-art (SOTA) data selection or augmentation techniques (e.g., based on model uncertainty or influence). 2. The study fails to clearly articulate the generalizability of the optimal data strategies. The results seem highly task-dependent, yet the authors do not provide a
1. In general, this work contributes to the community by providing a controllable and comprehensive benchmark. I feel that, in the long term, it will support data-centric distillation for reasoning. 2. Figures explaining different components of data-centric distillation are very clear. 3. The experiments are designed in a fine-grained and comprehensive manner.
1. Teacher models are improving very quickly, so the observations in this paper might not hold. 2. Following from the previous point, it would be great if the authors could consider expanding the diversity of teacher models. For example, clustering teacher models by architectural differences such as MoE/full activation, linear/full/sparse attention, etc. The observations might be more generalizable if we analyze from a perspective of architectural differences, instead of picking the strongest t
1. **High practical relevance**: Efficient reasoning via CoT distillation is a critical direction for deploying capable yet lightweight LLMs. The focus on *data-centric* methods—rather than architectural or training tricks—is timely and underexplored. 2. **Systematic design**: DC-CoT offers a unified framework to compare augmentation, selection, and mixing strategies across modalities, tasks, and model pairs—something missing in prior work. 3. **Empirical rigor**: The paper includes large-scale
1. **Significance of Research Question** The core problem—improving CoT distillation via data manipulation—is undeniably important for efficient reasoning. However, the paper does not clearly demarcate what constitutes a *data-centric* CoT distillation method versus general data synthesis or instruction tuning. For instance, “question augmentation” blurs the line with self-instruct-style data generation. This conceptual fuzziness slightly weakens the benchmark’s focus. 2. **Problematic constru
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Focus
