Rewarding Intellectual Humility: Learning When Not to Answer in Large Language Models
Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot

TL;DR
This paper explores a reinforcement learning approach that reduces hallucinations by explicitly rewarding models for abstaining, and reports its effect on MedMCQA and Hendrycks Math with Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct.
Contribution
It introduces a verifiable reward framework for training LLMs to abstain, improving factual accuracy and reliability in open-ended and multiple-choice tasks.
Findings
Moderate abstention rewards reduce incorrect answers.
Larger models are more robust to abstention incentives.
Supervised abstention training helps mitigate exploration limitations.
Abstract
Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (, r_abs, 1) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models…
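The ternary reward described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the abstention phrases, and the exact-match correctness check are all assumptions, and the reward for an incorrect answer is left as an explicit parameter (`r_wrong`) because its value is not recoverable from this excerpt.

```python
# Hypothetical abstention markers; the paper's actual detection rule is not given here.
ABSTAIN_PHRASES = ("i don't know", "i do not know")


def ternary_reward(response: str, gold: str, r_abs: float, r_wrong: float) -> float:
    """Return 1.0 for a correct answer, r_abs for an abstention, r_wrong otherwise.

    Correctness is checked by case-insensitive exact match, which is an
    assumption that suits multiple-choice labels such as "A" to "D".
    """
    text = response.strip().lower()
    if any(phrase in text for phrase in ABSTAIN_PHRASES):
        return r_abs
    return 1.0 if text == gold.strip().lower() else r_wrong


def break_even_confidence(r_abs: float, r_wrong: float) -> float:
    """Probability of being correct at which answering and abstaining tie.

    Expected reward for answering with confidence p is p * 1 + (1 - p) * r_wrong;
    abstaining always yields r_abs. Setting the two equal and solving for p
    gives (r_abs - r_wrong) / (1 - r_wrong).
    """
    if r_wrong >= 1.0:
        raise ValueError("r_wrong must be below the correct-answer reward of 1.0")
    return (r_abs - r_wrong) / (1.0 - r_wrong)
```

Under this sketch, raising `r_abs` raises the confidence a model needs before answering pays off in expectation, which is one way to read why the abstention reward acts as a tunable knob between accuracy and caution.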
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
