TL;DR
This paper introduces a reinforcement learning approach that improves LLMs' ability to abstain from answering questions they are uncertain about, using fine-grained semantic confidence as the training signal to produce more reliable responses.
Contribution
The paper proposes a novel semantic confidence reward framework that guides LLMs to abstain more accurately based on sample-specific semantic clustering.
Findings
Enhanced abstention accuracy in in-domain and out-of-distribution tests
Improved reliability of LLM responses with semantic confidence guidance
Introduction of a new metric for evaluating abstention reliability
Abstract
Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on a semantic confidence reward, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those…
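The sampling-and-clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `normalize` helper is a hypothetical stand-in for real semantic clustering (it only matches answers up to case, periods, and whitespace), the confidence of a cluster is assumed to be its share of the sampled answers, and the `threshold` value of 0.5 is an arbitrary illustrative choice.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    # Hypothetical stand-in for semantic equivalence: answers that match
    # after lowercasing and stripping periods/extra spaces share a cluster.
    return " ".join(answer.lower().replace(".", "").split())


def cluster_confidences(samples):
    # Group sampled answers into clusters and score each cluster by the
    # fraction of samples it contains (an assumed confidence definition).
    clusters = defaultdict(list)
    for sample in samples:
        clusters[normalize(sample)].append(sample)
    total = len(samples)
    return {key: len(members) / total for key, members in clusters.items()}


def retain_or_abstain(samples, threshold=0.5):
    # Label each sampled answer: keep it if its cluster is high-confidence,
    # otherwise mark it as a candidate for abstention.
    confidences = cluster_confidences(samples)
    labelled = []
    for sample in samples:
        score = confidences[normalize(sample)]
        decision = "retain" if score >= threshold else "abstain"
        labelled.append((sample, score, decision))
    return labelled


samples = ["Paris", "paris.", "Paris", "Lyon", "Marseille"]
for answer, score, decision in retain_or_abstain(samples):
    print(f"{answer!r}: confidence={score:.1f} -> {decision}")
```

In this toy run the three "Paris" variants form one cluster with confidence 0.6 and are retained, while "Lyon" and "Marseille" each fall in a cluster with confidence 0.2 and are marked for abstention. In the paper's setting these per-sample labels would feed a reinforcement learning reward rather than being printed.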
Peer Reviews
Decision·Submitted to ICLR 2026
* The motivation is clear and the problem is significant.
* The writing is clear and I can understand this work.

* A key baseline is missing: TTRL [1], a reward design based on majority vote. Essentially, the method proposed in this paper, clustering answers semantically and determining confidence based on the number of answers in each cluster, is fundamentally an extension of majority voting. In other words, answers that appear more frequently across repeated samples are assigned higher confidence.
* The exploration of reward combinations is insufficient. Some studies have shown that omitting the format r
1. The primary strength is moving from sentence- or response-level uncertainty to semantic chunk-level confidence. This allows the model to differentiate between known and unknown information within a single output, promoting more nuanced abstention.
2. The method operationalizes the metacognitive ability (self-assessed confidence) into a quantifiable reward signal for RL, offering a direct path to behavioral modification.

1. The approach relies on the model simultaneously generating the answer and its confidence. As evidenced by recent work on metacognitive decoupling (e.g., the Answer-Free Confidence Estimation (AFCE) framework [1]), eliciting the answer and confidence simultaneously can introduce a strong cognitive bias, leading to overconfidence. If the confidence signal itself is biased, the resulting RL reward (and thus the trained policy) will be flawed.
2. The "fine-grained" semantic confidence reward is f
The fine-grained reward moves beyond coarse, aggregated uncertainty metrics, further optimizing the model's abstention behavior. The F1_rel proposed by the authors reflects a pursuit of balance between helpfulness and truthfulness, and their discussion on the effectiveness of F1_rel compared to existing metrics also provides valuable insight. The experimental section is thorough and well-designed, comparing the method to strong baselines and verifying its effectiveness, especially its generaliza
1. The effectiveness and necessity of the confidence reward are questionable. (a) Effectiveness: The confidence reward essentially guides the model to be more confident and consistent in its output. This might incorrectly encourage hallucinations. For instance, given a complex question the model cannot answer, it might obtain multiple different incorrect answers across samples. The confidence reward could end up rewarding the most frequent of these incorrect outputs, thereby causing the model to
