Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Wei-Lin Chen; Liqian Peng; Tian Tan; Chao Zhao; Blake JianHang Chen; Ziqian Lin; Alec Go; Yu Meng

arXiv:2602.13517·cs.CL·February 17, 2026

TL;DR

This paper measures reasoning effort in large language models by identifying deep-thinking tokens. The proportion of such tokens in a generated sequence correlates with accuracy and enables more efficient inference through early rejection of unpromising outputs.

Contribution

The authors propose the deep-thinking ratio as a new metric for reasoning effort and develop Think@n, a scaling strategy that improves efficiency and performance in LLM reasoning tasks.
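The summary here does not spell out how Think@n works, so the following is only a hypothetical sketch of the general idea of early rejection guided by a deep-thinking score. It assumes each candidate carries a score estimated from a short prefix and a callable that finishes generation; the function name `early_reject_and_vote`, the `keep_frac` parameter, and the final majority vote are all illustrative assumptions, not the authors' procedure.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical candidate: (prefix_score, finish), where prefix_score is a
# deep-thinking ratio estimated on a short prefix and finish() completes
# the generation and returns the final answer. Both are assumptions.
Candidate = Tuple[float, Callable[[], str]]

def early_reject_and_vote(candidates: List[Candidate], keep_frac: float = 0.5) -> str:
    """Keep the top keep_frac of candidates by prefix score, finish only
    those, and return the most common answer among the survivors."""
    if not candidates:
        raise ValueError("need at least one candidate")
    k = max(1, math.ceil(keep_frac * len(candidates)))
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    # Rejected candidates are never finished, which is where the
    # inference cost would be saved.
    answers = [finish() for _, finish in ranked[:k]]
    return Counter(answers).most_common(1)[0][0]
```

In this sketch the saving comes from never calling `finish()` on the rejected candidates; how large a prefix is needed for a reliable score is left open.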

Findings

01. Deep-thinking ratio correlates positively with accuracy across benchmarks.

02. Think@n reduces inference costs while maintaining or improving performance.

03. Deep-thinking tokens provide a more reliable measure of reasoning effort than length or confidence.

Abstract

Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and…
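To make the definition concrete, here is a minimal sketch of how a deep-thinking ratio could be computed. It assumes that, for every generated token, a next-token distribution is available at each layer (for example from projecting intermediate hidden states onto the vocabulary). The Jensen-Shannon divergence, the tolerance `tol`, and the depth cutoff `depth_frac` are illustrative assumptions; the abstract does not give the authors' exact criterion.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def settling_layer(layer_dists, tol=0.1):
    """Earliest layer from which every later layer's prediction stays
    within tol of the final layer's prediction (assumed criterion)."""
    final = layer_dists[-1]
    settle = len(layer_dists) - 1
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) <= tol:
            settle = i
        else:
            break
    return settle

def is_deep_thinking(layer_dists, tol=0.1, depth_frac=0.75):
    """Flag a token whose prediction only settles in the deepest layers."""
    return settling_layer(layer_dists, tol) >= depth_frac * (len(layer_dists) - 1)

def deep_thinking_ratio(token_layer_dists, tol=0.1, depth_frac=0.75):
    """Proportion of deep-thinking tokens in a generated sequence."""
    if not token_layer_dists:
        return 0.0
    flags = [is_deep_thinking(t, tol, depth_frac) for t in token_layer_dists]
    return sum(flags) / len(flags)

# Toy example: 4 layers, 2-word vocabulary.
easy = [[0.9, 0.1]] * 4                       # stable from the first layer
hard = [[0.9, 0.1]] * 3 + [[0.1, 0.9]]        # revised only at the last layer
print(deep_thinking_ratio([easy, hard]))      # 0.5
```

The toy input contains one token whose prediction never changes and one that flips in the final layer, so half of the tokens count as deep-thinking under these assumed thresholds.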

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education