Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

Siyuan Gan; Jiaheng Liu; Boyan Wang; Tianpei Yang; Runqing Miao; Yuyao Zhang; Fanyu Meng; Junlan Feng; Linjian Meng; Jing Huo; Yang Gao

arXiv:2601.04805 · cs.AI · January 9, 2026

TL;DR

This paper introduces Thinking-Based Non-Thinking (TNT), a novel method that reduces computational costs and mitigates reward hacking in hybrid reasoning models by adaptively setting token limits without supervised fine-tuning.

Contribution

TNT is the first approach to address reward hacking in hybrid reasoning models without relying on supervised fine-tuning, using adaptive token limits based on reasoning complexity.

Findings

01 TNT reduces token usage by around 50%.

02 TNT significantly improves accuracy over baseline methods.

03 Reward hacking probability remains below 10% with TNT.

Abstract

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking based on the complexity of the query. Unfortunately, RL training suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem. In this paper, we propose…
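To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of a reward that enforces a uniform token limit on non-thinking responses. It is not the TNT method and not taken from the paper: the `Response` fields, the function name `uniform_limit_reward`, the 0/1 reward values, and the 256-token cap are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # All fields are hypothetical; they stand in for whatever the trainer records.
    declared_mode: str  # "thinking" or "non-thinking", as declared by the model
    num_tokens: int     # length of the response in tokens
    is_correct: bool    # whether the final answer matches the reference


def uniform_limit_reward(resp: Response, max_non_thinking_tokens: int = 256) -> float:
    """Illustrative reward with one fixed token cap for every query (assumed values)."""
    over_limit = resp.num_tokens > max_non_thinking_tokens
    if resp.declared_mode == "non-thinking" and over_limit:
        # A long "non-thinking" response is treated as hidden thinking: no reward.
        return 0.0
    # Otherwise the reward depends only on answer correctness.
    return 1.0 if resp.is_correct else 0.0


# A response that declares non-thinking but runs long is not rewarded.
long_hidden = Response(declared_mode="non-thinking", num_tokens=300, is_correct=True)
# A response that hides a short chain of thought under the cap is still rewarded.
short_hidden = Response(declared_mode="non-thinking", num_tokens=200, is_correct=True)

reward_long = uniform_limit_reward(long_hidden)
reward_short = uniform_limit_reward(short_hidden)
```

Under these assumptions, a single cap cannot distinguish a genuinely direct answer from a short chain of thought that fits under the limit, which is one way to read the abstract's remark that uniform limits yield only limited mitigation.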

Taxonomy

Topics: Advanced Graph Neural Networks · Multimodal Machine Learning Applications · IoT and Edge/Fog Computing