The Challenge of Teaching Reasoning to LLMs Without RL or Distillation

Wei Du; Branislav Kisacanin; George Armstrong; Shubham Toshniwal; Ivan Moshkov; Alexan Ayrapetyan; Sadegh Mahdavi; Dan Zhao; Shizhe Diao; Dragan Masulovic; Marius Stanean; Advaith Avadhanam; Max Wang; Ashmit Dutta; Shitij Govil; Sri Yanamandara; Mihir Tandon; Sriram Ananthakrishnan; Vedant Rathi; David Zhang; Joonseok Kang; Leon Luo; Titu Andreescu; Boris Ginsburg; and Igor Gitman

arXiv:2507.09850 · cs.AI · July 17, 2025

TL;DR

This paper investigates whether long Chain-of-Thought (CoT) reasoning can be induced in a base language model through prompting or light fine-tuning alone. Fine-tuning Qwen2.5-32B on just 20 long CoT examples from QwQ-32B-Preview yields a model that outperforms the much larger Qwen2.5-Math-72B-Instruct.

Contribution

The paper demonstrates that a small number of carefully curated long CoT examples from a reasoning model can substantially improve a base model's reasoning, without reinforcement learning or large-scale distillation.

Findings

01

Light fine-tuning on as few as 20 high-quality long CoT examples improves reasoning.

02

CoT data from non-reasoning models or humans is less effective than traces from a reasoning model.

03

Properties of the reasoning data, such as problem difficulty, diversity, and answer length, influence how well reasoning is acquired.

Abstract

Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, previous works demonstrate that even short CoT prompting without fine-tuning is able to improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model QwQ-32B-Preview, we lightly fine-tune the base model Qwen2.5-32B. The resulting model outperforms the much larger Qwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human…
