Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs
Chelsea Zou, Yiheng Yao, Basant Khalil

TL;DR
This paper introduces a reinforcement learning framework that uses confidence and entropy signals to detect and reduce hallucinations in LLMs, enhancing reasoning stability and faithfulness.
Contribution
It presents a novel self-correcting RL approach that leverages fine-grained uncertainty signals to improve LLM reasoning and reduce hallucinations.
Findings
Improves final answer accuracy in LLMs.
Enhances reasoning calibration and faithfulness.
Validates the individual contribution of each uncertainty signal.
Abstract
This project develops a self-correcting framework for large language models (LLMs) that detects and mitigates hallucinations during multi-step reasoning. Rather than relying solely on final answer correctness, our approach leverages two fine-grained uncertainty signals, (1) self-assessed confidence alignment and (2) token-level entropy spikes, to detect unreliable and unfaithful reasoning in real time. We design a composite reward function that penalizes unjustified high confidence and entropy spikes while encouraging stable and accurate reasoning trajectories. These signals guide a reinforcement learning (RL) policy that makes the model more introspective and shapes its generation behavior through confidence-aware reward feedback, improving not just outcome correctness but also the coherence and faithfulness of its intermediate reasoning steps. Experiments show that our method improves…
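To make the two signals in the abstract concrete, here is a minimal sketch of how a token-level entropy spike detector and a composite reward might fit together. It is not the authors' implementation: the function names, the Brier-style confidence penalty, the spike threshold, and the weights `w_conf` and `w_ent` are all assumptions chosen for illustration, since the abstract does not specify the exact form of the reward.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_spikes(entropies, threshold=1.0):
    """Indices where entropy rises by more than `threshold` over the previous token.

    The jump-based definition and the threshold value are assumptions.
    """
    return [i for i in range(1, len(entropies))
            if entropies[i] - entropies[i - 1] > threshold]

def composite_reward(correct, confidence, entropies,
                     w_conf=1.0, w_ent=0.1, threshold=1.0):
    """Hypothetical reward: correctness minus two uncertainty penalties."""
    outcome = 1.0 if correct else 0.0
    # Brier-style term: large when self-assessed confidence disagrees with the outcome.
    conf_penalty = (confidence - outcome) ** 2
    # One penalty unit per detected entropy spike in the reasoning trajectory.
    spike_penalty = len(entropy_spikes(entropies, threshold))
    return outcome - w_conf * conf_penalty - w_ent * spike_penalty

# Toy trajectory: a confident token, an uncertain token, then a confident token.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
entropies = [token_entropy(peaked), token_entropy(flat), token_entropy(peaked)]

print(entropy_spikes(entropies))
print(composite_reward(True, 0.9, entropies))
print(composite_reward(False, 0.9, entropies))
```

Under these assumptions, a wrong answer stated with high confidence scores lower than the same wrong answer stated with low confidence, which is the "unjustified high confidence" penalty the abstract describes. In an RL setup, a scalar like this would be fed back to the policy as the per-trajectory reward.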
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
