The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation

Ruichen Zhang; Rana Muhammad Shahroz Khan; Zhen Tan; Dawei Li; Song Wang; Tianlong Chen

arXiv:2505.18759·cs.AI·February 9, 2026

Open Access·3 Reviews

TL;DR

This paper introduces DC-CoT, a comprehensive benchmark for evaluating data-centric chain-of-thought distillation methods in large language models, focusing on data manipulation impacts on reasoning performance across various models and datasets.

Contribution

It presents the first systematic benchmark for data-centric CoT distillation, assessing data manipulation effects on model reasoning and generalization from multiple perspectives.

Findings

01 Data manipulation significantly affects model reasoning performance.

02 Best practices for data-centric CoT distillation improve model generalization.

03 Benchmark enables systematic evaluation of data techniques across models and datasets.

Abstract

Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark that systematically assesses the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to…
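To make the three data operations named in the abstract concrete, here is a minimal Python sketch of augmentation, selection, and mixing applied to a pool of teacher-generated CoT examples. It is illustrative only and not the benchmark's implementation: the `CoTExample` fields, the `rewrite` callable standing in for a teacher model, the answer-match filter, and the ratio-based sampler are all assumptions introduced here.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class CoTExample:
    # Hypothetical record layout, not taken from the paper.
    question: str
    rationale: str  # teacher-generated chain of thought
    answer: str     # teacher's final answer
    gold: str       # reference answer
    source: str     # which teacher or dataset produced the example


def augment(pool: List[CoTExample], rewrite: Callable[[str], str]) -> List[CoTExample]:
    # Augmentation (assumed form): add one rewritten-question variant per example.
    # In practice `rewrite` would call a teacher model; here it is any callable.
    return pool + [replace(ex, question=rewrite(ex.question)) for ex in pool]


def select(pool: List[CoTExample]) -> List[CoTExample]:
    # Selection (assumed form): keep examples whose teacher answer matches gold.
    return [ex for ex in pool if ex.answer.strip() == ex.gold.strip()]


def mix(pool_a: List[CoTExample], pool_b: List[CoTExample],
        ratio_a: float, size: int, seed: int = 0) -> List[CoTExample]:
    # Mixing (assumed form): sample from two pools at a target ratio, then shuffle.
    rng = random.Random(seed)
    n_a = min(round(size * ratio_a), len(pool_a))
    n_b = min(size - n_a, len(pool_b))
    mixed = rng.sample(pool_a, n_a) + rng.sample(pool_b, n_b)
    rng.shuffle(mixed)
    return mixed


# Toy data: the second teacher_a example has a wrong final answer.
teacher_a = [
    CoTExample("2+3?", "2 plus 3 is 5.", "5", "5", "teacher_a"),
    CoTExample("4*2?", "4 times 2 is 6.", "6", "8", "teacher_a"),
]
teacher_b = [CoTExample("10-7?", "10 minus 7 is 3.", "3", "3", "teacher_b")]

clean_a = select(teacher_a)
bigger_a = augment(clean_a, lambda q: "Compute: " + q)
train = mix(bigger_a, teacher_b, ratio_a=0.5, size=2)
```

The resulting `train` list is what a student model would be fine-tuned on. Keeping each operation as a separate function means their effects can be compared one at a time, which is the kind of controlled comparison the abstract describes.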

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

1. DC-CoT addresses a timely research gap by focusing on data-centric strategies in CoT distillation.
2. The evaluation structure is comprehensive, spanning different teacher models, various student model scales, and multiple evaluation settings (IID, OOD, Transfer).
3. The study directly confronts the challenge of making robust reasoning accessible to smaller models, which is paramount for reducing inference costs in real-world applications.

Weaknesses

1. The exploration of specific data-centric strategies (augmentation, selection, mixing) appears superficial. The current implementation is limited to basic methods, failing to incorporate or evaluate more advanced, state-of-the-art (SOTA) data selection or augmentation techniques (e.g., based on model uncertainty or influence).
2. The study fails to clearly articulate the generalizability of the optimal data strategies. The results seem highly task-dependent, yet the authors do not provide a

Reviewer 02·Rating 8·Confidence 4

Strengths

1. In general, this work contributes to the community by providing a controllable and comprehensive benchmark. I feel it will, in the long term, support data-centric distillation for reasoning.
2. Figures explaining different components of data-centric distillation are very clear.
3. The experiments are designed in a fine-grained and comprehensive manner.

Weaknesses

1. Teacher models are improving very quickly, so the observations found in this paper might not hold.
2. Following on from the previous point, it would be great if the authors could consider expanding the diversity of teacher models. For example, clustering teacher models by architectural differences such as MoE/full activation, linear/full/sparse attention, etc. The observations might be more generalizable if we analyze from a perspective of architectural differences, instead of picking the strongest t

Reviewer 03·Rating 6·Confidence 4

Strengths

1. **High practical relevance**: Efficient reasoning via CoT distillation is a critical direction for deploying capable yet lightweight LLMs. The focus on *data-centric* methods, rather than architectural or training tricks, is timely and underexplored.
2. **Systematic design**: DC-CoT offers a unified framework to compare augmentation, selection, and mixing strategies across modalities, tasks, and model pairs, something missing in prior work.
3. **Empirical rigor**: The paper includes large-scale

Weaknesses

1. **Significance of Research Question** The core problem (improving CoT distillation via data manipulation) is undeniably important for efficient reasoning. However, the paper does not clearly demarcate what constitutes a *data-centric* CoT distillation method versus general data synthesis or instruction tuning. For instance, “question augmentation” blurs the line with self-instruct-style data generation. This conceptual fuzziness slightly weakens the benchmark’s focus.
2. **Problematic constru

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Focus