Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning
Siyuan Gan, Jiaheng Liu, Boyan Wang, Tianpei Yang, Runqing Miao, Yuyao Zhang, Fanyu Meng, Junlan Feng, Linjian Meng, Jing Huo, Yang Gao

TL;DR
This paper introduces Thinking-Based Non-Thinking (TNT), a novel method that reduces computational costs and mitigates reward hacking in hybrid reasoning models by adaptively setting token limits without supervised fine-tuning.
Contribution
TNT is the first approach to address reward hacking in hybrid reasoning models without relying on supervised fine-tuning, using adaptive token limits based on reasoning complexity.
Findings
TNT reduces token usage by around 50%.
TNT significantly improves accuracy over baseline methods.
Reward hacking probability remains below 10% with TNT.
Abstract
Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increase computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, using RL will suffer the the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem. In this paper, we propose…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Graph Neural Networks · Multimodal Machine Learning Applications · IoT and Edge/Fog Computing
