The Hallucination Tax of Reinforcement Finetuning
Linxin Song, Taiwei Shi, Jieyu Zhao

TL;DR
Reinforcement finetuning improves reasoning but causes models to hallucinate more on unanswerable questions, which can be mitigated by incorporating a small amount of synthetic unanswerable data.
Contribution
This work identifies the hallucination tax as a side effect of RFT and proposes a simple data augmentation method to restore refusal behavior in LLMs.
Findings
RFT reduces refusal rates by over 80%, increasing hallucinations.
Adding 10% SUM data restores refusal behavior with minimal accuracy loss.
Improved uncertainty reasoning enhances out-of-domain and factual question answering.
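The second finding describes a data-mixing step, which can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `mix_unanswerable`, the record layout, and the choice to hold the total training-set size fixed while swapping in unanswerable examples are all assumptions made for the example.

```python
import random


def mix_unanswerable(answerable, unanswerable, ratio=0.1, seed=0):
    """Return a shuffled training set in which `ratio` of the items are unanswerable.

    Assumption (not from the paper): the total size stays equal to
    len(answerable), and unanswerable items replace a random subset of it.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    rng = random.Random(seed)
    n_total = len(answerable)
    n_unanswerable = round(ratio * n_total)
    if n_unanswerable > len(unanswerable):
        raise ValueError("not enough unanswerable examples for this ratio")
    # Keep a random subset of answerable items and fill the rest with unanswerable ones.
    kept = rng.sample(answerable, n_total - n_unanswerable)
    added = rng.sample(unanswerable, n_unanswerable)
    mixed = kept + added
    rng.shuffle(mixed)
    return mixed


# Toy records; the field names are hypothetical.
answerable = [{"question": f"a{i}", "answerable": True} for i in range(100)]
unanswerable = [{"question": f"u{i}", "answerable": False} for i in range(30)]
mixed = mix_unanswerable(answerable, unanswerable, ratio=0.1)
```

With `ratio=0.1` and 100 answerable items, the mixed set contains 90 answerable and 10 unanswerable examples. Whether the original work replaces or appends examples is not stated in this summary, so treat the fixed-size choice as one option among several.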
Abstract
Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior that causes models to confidently produce hallucinated answers to unanswerable questions. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe models' ability to recognize an unanswerable question by reasoning from insufficient or ambiguous information. Our results show that standard RFT training can reduce model refusal rates by more than 80%, which significantly increases models' tendency to hallucinate. We further demonstrate that incorporating…
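To make the refusal-rate metric concrete, the sketch below computes a refusal rate over a batch of model responses and the drop between two checkpoints. Everything here is an assumption for illustration: the refusal marker string, the substring-matching rule, and the helper names are hypothetical, and the responses are made-up toy data rather than results from the paper. The summary also does not say whether the 80% figure is a relative or an absolute reduction; the helper computes a relative drop only as one possible reading.

```python
def is_refusal(response, marker="i don't know"):
    """Treat a response as a refusal if it contains the marker (hypothetical rule)."""
    return marker in response.lower()


def refusal_rate(responses):
    """Fraction of responses that are refusals."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(is_refusal(r) for r in responses) / len(responses)


def relative_drop(before, after):
    """Relative reduction from `before` to `after`, e.g. 0.5 -> 0.1 gives 0.8."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Toy responses to unanswerable questions (fabricated for the example only).
base_responses = ["I don't know."] * 5 + ["The answer is 42."] * 5
rft_responses = ["I don't know."] * 1 + ["The answer is 42."] * 9

base_rate = refusal_rate(base_responses)
rft_rate = refusal_rate(rft_responses)
drop = relative_drop(base_rate, rft_rate)
```

A real evaluation would likely need a more robust refusal detector than substring matching, for example parsing a structured final answer.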
Taxonomy
Topics: Fatigue and fracture mechanics · Advanced machining processes and optimization · Infrastructure Maintenance and Monitoring
