Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model
Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming, Tao Lin

TL;DR
This paper investigates the overthinking dilemma in large reasoning models, in which thinking tokens trigger unnecessary reflection and backtracking, and proposes an optimization algorithm, DuP-PO, to improve token efficiency and reasoning performance.
Contribution
It introduces Dual Policy Preference Optimization (DuP-PO), an algorithm that reduces unnecessary thinking tokens and improves reasoning efficiency in large reasoning models.
Findings
DuP-PO improves token efficiency on math reasoning benchmarks.
The method enhances reasoning performance while reducing overthinking.
Experimental results show significant gains over baseline models.
Abstract
Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., wait, however). These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) A rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) A fine-grained advantage control technique to dynamically regulate the prediction…
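The idea of "balanced exposure to responses with and without thinking tokens" can be illustrated with a small sketch. This is not the paper's rollout sampler. It is a minimal post-hoc selection over an already sampled pool of responses, and the names `THINKING_TOKENS`, `count_thinking_tokens`, and `balanced_group` are hypothetical. The token list contains only the two examples given in the abstract.

```python
import random
import re
from typing import List, Sequence

# Assumption: only the two thinking tokens named in the abstract are listed here.
THINKING_TOKENS = ("wait", "however")
_THINKING_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, THINKING_TOKENS)) + r")\b",
    re.IGNORECASE,
)


def count_thinking_tokens(response: str) -> int:
    """Count whole-word, case-insensitive occurrences of thinking tokens."""
    return len(_THINKING_RE.findall(response))


def balanced_group(pool: Sequence[str], group_size: int,
                   rng: random.Random) -> List[str]:
    """Select group_size responses: half with thinking tokens, half without.

    Hypothetical helper for illustration; it only filters an existing pool
    and does not reproduce how DuP-PO generates its rollouts.
    """
    if group_size % 2 != 0:
        raise ValueError("group_size must be even")
    with_tokens = [r for r in pool if count_thinking_tokens(r) > 0]
    without_tokens = [r for r in pool if count_thinking_tokens(r) == 0]
    half = group_size // 2
    if len(with_tokens) < half or len(without_tokens) < half:
        raise ValueError("pool cannot supply a balanced group")
    return rng.sample(with_tokens, half) + rng.sample(without_tokens, half)
```

A caller would pass a pool of sampled responses and an even group size, for example `balanced_group(pool, 4, random.Random(0))`. The function raises an error rather than returning a skewed group, since the point of the sketch is that both kinds of response are always represented.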
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics · Intelligent Tutoring Systems and Adaptive Learning
Methods: Balanced Selection
