Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

Tingchen Fu; Jiawei Gu; Yafu Li; Xiaoye Qu; Yu Cheng

arXiv:2505.14810·cs.CL·May 27, 2025

TL;DR

This paper evaluates how large reasoning models balance mathematical reasoning capabilities with adherence to natural language instructions, revealing a trade-off between reasoning performance and controllability.

Contribution

Introduces MathIF, a benchmark for assessing instruction-following in mathematical reasoning, and analyzes the tension between reasoning ability and instruction adherence in LLMs.

Findings

01. Scaling reasoning models often reduces instruction compliance.

02. Training with long chains-of-thought or reinforcement learning degrades obedience.

03. Simple interventions can improve instruction-following at the expense of reasoning quality.

Abstract

Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- Well-designed testbed, combining different inference task types with deterministic control features.
- Comprehensive and systematic empirical analysis and interpretation.
- Clear empirical signal across different base models and tasks.
- Target properties of practical relevance.

Weaknesses

- Provides further systematic corroborating evidence for a previously known phenomenon, but does not shed light on the deeper mechanisms.

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The main strength of this paper lies in its introduction of **MathIF**, a reasoning-oriented instruction-following dataset specifically designed for the mathematical domain. This benchmark extends existing instruction-following datasets by focusing on tasks that require complex reasoning chains rather than simple directive compliance. Moreover, it introduces meaningful evaluation criteria, such as hard and soft accuracy, that better capture how well models can balance reasoning performance with
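The hard and soft accuracy criteria mentioned in this review can be illustrated with a minimal sketch. It assumes the common definitions: a response counts toward hard accuracy only if it satisfies every constraint attached to its prompt, while soft accuracy averages the fraction of constraints satisfied per response. The helper names (`max_words`, `ends_with`, `score`) are hypothetical and are not taken from the official MathIF repository; the benchmark's actual constraint set and scoring code may differ.

```python
from typing import Callable, Dict, List

# A constraint is a deterministic check on the response text.
# (Hypothetical type alias for illustration only.)
Constraint = Callable[[str], bool]


def max_words(limit: int) -> Constraint:
    """Constraint: the response contains at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def ends_with(phrase: str) -> Constraint:
    """Constraint: the response (ignoring surrounding whitespace) ends with `phrase`."""
    return lambda response: response.strip().endswith(phrase)


def score(responses: List[str], constraints: List[List[Constraint]]) -> Dict[str, float]:
    """Compute hard and soft accuracy under the assumed definitions.

    hard accuracy: share of responses that satisfy ALL of their constraints.
    soft accuracy: mean, over responses, of the fraction of constraints satisfied.
    """
    hard_total = 0.0
    soft_total = 0.0
    for response, checks in zip(responses, constraints):
        results = [check(response) for check in checks]
        hard_total += float(all(results))
        soft_total += sum(results) / len(results)
    n = len(responses)
    return {"hard_accuracy": hard_total / n, "soft_accuracy": soft_total / n}


if __name__ == "__main__":
    responses = [
        "The answer is 4. Done",
        "The answer is forty two because six times seven equals forty two",
    ]
    constraints = [
        [max_words(10), ends_with("Done")],  # both satisfied
        [max_words(5), ends_with("two")],    # only the second is satisfied
    ]
    print(score(responses, constraints))
```

Under these assumptions, the second response earns partial credit in soft accuracy but none in hard accuracy, which is why the two numbers can diverge when prompts carry several constraints at once.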

Weaknesses

### **1. Unclear and inconsistent reasoning–base model pairing**

The selection of base and reasoning-enhanced models is questionable. For example, DS-R1-distill-LLaMA is not a direct counterpart of Llama-3.3-70B-Instruct, as it is a distilled model derived from DeepSeek-R1. Other well-established base–reasoning model pairs such as Gemini–Gemini-1.5-Pro, Claude–Claude 3 Opus, or GPT-4o–GPT-4o-mini are ignored. This inconsistent model pairing undermines the validity of the comparisons and weaken

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is easy to follow, and the authors provide comprehensive experiments and analyses to support their claim.
2. MathIF is the first IF benchmark specifically targeted at reasoning models that isolates IF ability from domain mismatch. The setups are well-curated and sound.
3. The authors use 25 reasoning models, which provides sound experimental support, and further analysis in Section 5 reconfirms the limitations of training methods for reasoning models.

Weaknesses

1. My biggest concern is the lack of novelty. The underperformance of reasoning models in instruction following is a well-known phenomenon in the community, and it can already be observed with existing benchmarks, as the authors show in the first paragraph of Section 3. Leaving that aside, MathIF is an IF benchmark for the math domain, where most of the settings follow existing benchmarks. If the authors want to truly isolate domain mismatch, they should conduct experiments on other

Reviewer 04 · Rating 2 · Confidence 5

Strengths

- novel evaluation and benchmark: MathIF is a nice systematic and domain-specific framework for measuring instruction adherence in mathematical reasoning tasks.
- good evaluation: 23 LRMs, offering a robust empirical foundation across model sizes and architectures.
- useful study: It identifies and quantifies the reasoning adherence trade-off, demonstrating that scaling reasoning capabilities often undermines instruction-following.

Weaknesses

- Domain limitation: MathIF is confined to the mathematical domain; results may not generalise to broader reasoning contexts, for instance, commonsense or multimodal reasoning.
- limited training scope: The study primarily evaluates models trained via GRPO-based RL, limiting insights into other training paradigms.
- mitigation strategies: The interventions (repeating constraints) offer partial and ad hoc solutions rather than principled training methods.
- interpretability analysis missing: T

Code & Models

Repositories

tingchenfu/mathif
Official

Datasets

haritzpuerto/math-if
dataset· 17 dl

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning