Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation
Mohamad Amin Mohamadi, Tianhao Wang, Zhiyuan Li

TL;DR
This paper introduces Reinforced Hesitation, a training method for language models that encourages them to abstain when uncertain, improving trustworthiness by reducing hallucinations and enabling better risk management.
Contribution
The paper proposes Reinforced Hesitation, which trains models with ternary rewards so that they learn to abstain, and introduces inference strategies that use abstention to produce safer, more trustworthy responses.
Findings
Models trained with RH can effectively balance accuracy and abstention.
Abstention strategies outperform majority voting in reducing errors.
Reinforced Hesitation creates models that are more honest about their limitations.
Abstract
Modern language models fail a fundamental requirement of trustworthy intelligence: knowing when not to answer. Despite achieving impressive accuracy on benchmarks, these models produce confident hallucinations, even when wrong answers carry catastrophic consequences. Our evaluations on GSM8K, MedQA and GPQA show frontier models almost never abstain despite explicit warnings of severe penalties, suggesting that prompts cannot override training that rewards any answer over no answer. As a remedy, we propose Reinforced Hesitation (RH): a modification to Reinforcement Learning from Verifiable Rewards (RLVR) to use ternary rewards (+1 correct, 0 abstention, -λ error) instead of binary. Controlled experiments on logic puzzles reveal that varying λ produces distinct models along a Pareto frontier, where each training penalty yields the optimal model for its corresponding risk…
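The ternary reward described in the abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, and the error penalty is written as `lam` (λ), an assumed name for the penalty magnitude. The sketch also shows a consequence that follows from the reward alone: a model that is correct with probability p gains p - (1 - p)·λ in expectation by answering and 0 by abstaining, so answering pays off only when p > λ / (1 + λ).

```python
def ternary_reward(outcome: str, lam: float) -> float:
    """Ternary reward: +1 for a correct answer, 0 for abstaining, -lam for an error.

    `outcome` labels ("correct", "abstain", "wrong") and `lam` are
    illustrative names, not taken from the paper's code.
    """
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return 0.0
    if outcome == "wrong":
        return -lam
    raise ValueError(f"unknown outcome: {outcome!r}")


def expected_reward_of_answering(p_correct: float, lam: float) -> float:
    """Expected reward when answering with probability p_correct of being right."""
    return p_correct * 1.0 - (1.0 - p_correct) * lam


def answer_threshold(lam: float) -> float:
    """Confidence above which answering beats abstaining (expected reward > 0)."""
    return lam / (1.0 + lam)


def should_answer(p_correct: float, lam: float) -> bool:
    """Answer only if doing so has higher expected reward than abstaining (0)."""
    return expected_reward_of_answering(p_correct, lam) > 0.0
```

With λ = 1 the break-even confidence is 0.5; raising λ to 2 moves it to 2/3, so a model that is 60% confident would answer under the first penalty and abstain under the second.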
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Topic Modeling
