Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks

Prateek Munjal; Clement Christophe; Ronnie Rajan; Praveenkumar Kanithi

arXiv:2601.13244 · cs.LG · January 21, 2026

TL;DR

This paper critically evaluates whether instruction tuning genuinely enhances reasoning in large language models, revealing that gains are inconsistent, heavily prompt-dependent, and brittle under domain shifts and perturbations.

Contribution

The study provides empirical evidence that instruction tuning's benefits are unstable and often superficial, especially under distribution shifts and varied evaluation settings.

Findings

01

In zero-shot CoT settings on GSM8K, base models consistently outperform their instruction-tuned variants, with drops as high as 32.67% (Llama3-70B).

02

Instruction-tuned models match or exceed base-model performance only when given few-shot exemplars, which suggests a reliance on specific prompting patterns rather than intrinsic reasoning.

03

Gains from instruction tuning are brittle under distribution shift and do not carry over reliably to structurally perturbed or domain-shifted benchmarks.

Abstract

Instruction finetuning is standard practice for improving LLM performance, yet it remains unclear whether it enhances reasoning or merely induces surface-level pattern matching. We investigate this by evaluating base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our analysis highlights two key (often overlooked) limitations of instruction tuning. First, the performance advantage is unstable and depends heavily on evaluation settings. In zero-shot CoT settings on GSM8K, base models consistently outperform instruction-tuned variants, with drops as high as 32.67% (Llama3-70B). Instruction-tuned models only match or exceed this performance when provided with few-shot exemplars, suggesting a reliance on specific prompting patterns rather than intrinsic reasoning. Second, tuning gains are brittle under distribution shift.…
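To make the contrast between the two evaluation settings concrete, here is a minimal sketch of how a zero-shot CoT prompt and a few-shot prompt might be built and scored on a GSM8K-style question. Everything in it is an assumption for illustration, not the paper's harness: the prompt templates, the "Let's think step by step." trigger, the single exemplar, the helper names (`zero_shot_cot_prompt`, `few_shot_prompt`, `extract_final_number`, `accuracy`), and the rule of taking the last number in a completion as the predicted answer.

```python
import re

# Hypothetical exemplar written for this sketch; it is not taken from the paper.
FEW_SHOT_EXEMPLARS = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples and buys 2 more, so 3 + 2 = 5. The answer is 5.",
    ),
]


def zero_shot_cot_prompt(question):
    """Zero-shot CoT: no worked examples, only a reasoning trigger phrase."""
    return f"Question: {question}\nAnswer: Let's think step by step."


def few_shot_prompt(question, exemplars=FEW_SHOT_EXEMPLARS):
    """Few-shot: worked question/answer pairs precede the target question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(completion):
    """Assumed scoring rule: the last number in the completion is the answer."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(completions, gold_answers):
    """Fraction of completions whose extracted number equals the gold answer."""
    correct = sum(
        extract_final_number(c) == g for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)
```

Under these assumptions, the same model completions would be generated once per prompt style and passed to `accuracy`, so that any gap between base and instruction-tuned checkpoints can be compared setting by setting. The extraction rule matters in practice: a lenient or strict parser can shift the measured gap, which is one reason results of this kind depend on evaluation settings.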

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare