ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

Qianyu He; Siyu Yuan; Xuefeng Li; Mingxuan Wang; Jiangjie Chen

arXiv:2508.18773·cs.CL·August 27, 2025

3 Reviews

TL;DR

ThinkDial is an open-source framework for controllable reasoning in large language models: the model switches between discrete operational modes that trade computational effort against performance.

Contribution

This paper introduces the first open-source end-to-end system implementing gpt-oss-style reasoning control with discrete modes using novel training paradigms.

Findings

01

Achieves 50-75% token reduction with minimal performance loss

02

Enables seamless switching between reasoning modes

03

Demonstrates strong generalization on out-of-distribution tasks

Abstract

Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance…
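
The mode targets quoted in the abstract can be read as simple arithmetic checks against High mode. The sketch below is illustrative only: the function name `meets_target`, the token counts, and the scores are hypothetical, and it assumes that "performance degradation" is measured relative to the High-mode score, which the abstract does not state.

```python
def meets_target(high_tokens, high_score, mode_tokens, mode_score,
                 min_token_reduction, max_degradation):
    """Check one mode against a (token reduction, degradation) target.

    Assumption: degradation is relative to the High-mode score.
    """
    token_reduction = 1 - mode_tokens / high_tokens
    degradation = (high_score - mode_score) / high_score
    return token_reduction >= min_token_reduction and degradation < max_degradation


# Hypothetical numbers, not results from the paper.
high_tokens, high_score = 10_000, 80.0

# Medium target: 50% fewer tokens, under 10% degradation.
medium_ok = meets_target(high_tokens, high_score, 5_000, 73.0, 0.50, 0.10)

# Low target: 75% fewer tokens, under 15% degradation.
low_ok = meets_target(high_tokens, high_score, 2_500, 66.0, 0.75, 0.15)

print(medium_ok, low_ok)
```

With these made-up inputs the Medium run passes (8.75% relative degradation) while the Low run fails (17.5%), which shows how the two thresholds act independently of the token budget.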

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- **Timely and well-targeted.** The work reproduces a widely requested user affordance, intuitive reasoning-effort control, in an open stack. Figure 1 clearly illustrates this behavior.
- **Strong observations and ablations.** The paper identifies a realistic failure mode (reasoning leaking into the answer), quantifies it, and proposes a Leak Penalty that cuts total tokens while maintaining accuracy. Each component’s ablation shows measurable degradation, underscoring the necessity of t

Weaknesses

**[Major] Missing baselines.** The paper omits head-to-head comparisons with open controllability methods such as Shorter RL (e.g., L1, ThinkLess, CoT-Valve, TokenSkip, O1-Pruner, LightThinker) and binary gating approaches (AdaCoT, AdaptThink). Many are cited but not evaluated. It therefore remains unclear whether these simpler methods can provide far better accuracy-vs-length curves than ThinkDial, and whether the three-mode design incurs unnecessary trade-offs, e.g. if LightThinker provides

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* The work introduces a new framework for gpt-oss-style discrete conditioning in the text space. This is the first open reproduction of this idea.
* The end-to-end training pipeline, based on SFT and RL, is compatible with common training pipelines for reasoning models, and can be easily integrated with existing recipes.
* Despite being trained mainly on math reasoning data, it generalizes well to out-of-distribution tasks.

Weaknesses

* The SFT data is generated by truncating at $r_\text{med}$ and $r_\text{low}$ for medium and low conditioning regimes, respectively. Won't that lead to hallucination if the truncation happens when important steps have not yet finished?
* DAPO, the framework used by ThinkDial, normalizes the rewards with a std term. This is kept in L194 in the ThinkDial paper. However, this normalization will cause the length penalty to be amplified if the answers are all correct or all incorrect in the sampling gro
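
The normalization concern raised in this review can be illustrated with a small numerical sketch. It assumes a GRPO/DAPO-style group advantage of the form (reward − group mean) / (group std + ε) and a hypothetical reward of the form "correctness minus a small length penalty"; the penalty form, the weight, the budget, and all function names are assumptions for illustration, not the paper's actual reward.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - mean) / (std + eps).

    Uses the population std for simplicity (an assumption of this sketch).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def total_reward(correct, length, budget=1000, penalty_weight=0.05):
    """Hypothetical reward: 1/0 correctness minus a small over-budget penalty."""
    overshoot = max(0, length - budget) / budget
    return (1.0 if correct else 0.0) - penalty_weight * overshoot


lengths = [1000, 1000, 1200, 1200]

# Mixed group: correctness dominates the spread of rewards.
mixed = [total_reward(c, n) for c, n in zip([True, False, True, False], lengths)]
# All-correct group: the only spread left comes from the length penalty.
all_correct = [total_reward(True, n) for n in lengths]

adv_mixed = group_normalized_advantages(mixed)
adv_all = group_normalized_advantages(all_correct)

# Advantage gap between a short and a long *correct* answer in each group.
gap_mixed = adv_mixed[0] - adv_mixed[2]
gap_all = adv_all[0] - adv_all[2]
print(f"mixed group gap: {gap_mixed:.3f}, all-correct group gap: {gap_all:.3f}")
```

In this toy setting the same 0.01 reward difference between a short and a long correct answer yields an advantage gap of roughly 0.02 in the mixed group but roughly 2 in the all-correct group, because dividing by a near-zero std rescales the length penalty to unit magnitude.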

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- This paper provides a guideline on how to reproduce models like gpt-oss, enabling controllability of token length and allowing users to choose between latency and correctness.
- The idea is simple and works well on multiple benchmarks.
- The paper is easy to follow and read, and the evaluation is reliable as results are averaged across multiple runs.

Weaknesses

- The paper’s novelty mainly lies in providing controllability beyond the binary mode, but I’m not sure whether adding just one more option (a medium-length response) is truly meaningful.
- The ACT metric definition is heuristic, and I also wonder how robust the results are to different α values.
- There are some missing details on SFT data: how is truncation performed for medium and lower modes? What does the data look like? There are some examples in the Appendix, but it’s hard to understand whe
