Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward

Hao An; Yang Xu

arXiv:2510.24020·cs.CL·October 29, 2025

TL;DR

This paper introduces a reinforcement learning approach that improves LLMs' ability to abstain from answering uncertain questions by using fine-grained semantic confidence, leading to more reliable responses.

Contribution

The paper proposes a novel semantic confidence reward framework that guides LLMs to abstain more accurately based on sample-specific semantic clustering.

Findings

01

Enhanced abstention accuracy in in-domain and out-of-distribution tests

02

Improved reliability of LLM responses with semantic confidence guidance

03

Introduction of a new metric for evaluating abstention reliability

Abstract

Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on Fine-grained Semantic Confidence Reward (FiSCoRe), which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those…
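The sampling-and-clustering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `normalize` is a hypothetical stand-in for semantic equivalence (the paper's actual clustering procedure is not given here), confidence is assumed to be the share of samples falling in a cluster, and the `threshold` of 0.5 is an arbitrary assumption.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic equivalence.

    Treats answers as equivalent if they match after lowercasing,
    dropping periods, and collapsing whitespace. A real system would
    likely use a learned equivalence check instead.
    """
    return " ".join(answer.lower().replace(".", "").split())


def cluster_confidences(samples: list[str]) -> dict[str, float]:
    """Group sampled answers into clusters and score each cluster.

    Assumption: a cluster's confidence is the fraction of all sampled
    answers that fall into it.
    """
    keys = [normalize(s) for s in samples]
    counts = Counter(keys)
    total = len(keys)
    return {key: count / total for key, count in counts.items()}


def retain_or_abstain(samples: list[str], threshold: float = 0.5) -> list[str]:
    """Label each sample by the confidence of its own cluster.

    Assumption: samples in clusters at or above `threshold` are kept,
    and the rest are treated as cases where the model should abstain.
    """
    conf = cluster_confidences(samples)
    return [
        "retain" if conf[normalize(s)] >= threshold else "abstain"
        for s in samples
    ]


if __name__ == "__main__":
    sampled = ["Paris", "paris.", "Paris", "Lyon", "Marseille"]
    print(cluster_confidences(sampled))
    print(retain_or_abstain(sampled))
```

Because each sample is labeled by the confidence of its own cluster, the signal is per sample rather than a single score for the whole question, which is the sense of "sample-specific" used in the abstract.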

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* The motivation is clear and the problem is significant.
* The writing is clear and I can understand this work.

Weaknesses

* A key baseline is missing: TTRL [1], a reward design based on majority vote. Essentially, the method proposed in this paper (clustering answers semantically and determining confidence based on the number of answers in each cluster) is fundamentally an extension of majority voting. In other words, answers that appear more frequently across repeated samples are assigned higher confidence.
* The exploration of reward combinations is insufficient. Some studies have shown that omitting the format r

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The primary strength is moving from sentence- or response-level uncertainty to semantic chunk-level confidence. This allows the model to differentiate between known and unknown information within a single output, promoting more nuanced abstention.
2. The method operationalizes the metacognitive ability (self-assessed confidence) into a quantifiable reward signal for RL, offering a direct path to behavioral modification.

Weaknesses

1. The approach relies on the model simultaneously generating the answer and its confidence. As evidenced by recent work on metacognitive decoupling (e.g., the Answer-Free Confidence Estimation (AFCE) framework [1]), eliciting the answer and confidence simultaneously can introduce a strong cognitive bias, leading to overconfidence. If the confidence signal itself is biased, the resulting RL reward (and thus the trained policy) will be flawed.
2. The "fine-grained" semantic confidence reward is f

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The fine-grained reward moves beyond coarse, aggregated uncertainty metrics, further optimizing the model's abstention behavior. The F1_rel proposed by the authors reflects a pursuit of balance between helpfulness and truthfulness, and their discussion on the effectiveness of F1_rel compared to existing metrics also provides valuable insight. The experimental section is thorough and well-designed, comparing the method to strong baselines and verifying its effectiveness, especially its generaliza

Weaknesses

1. The effectiveness and necessity of the confidence reward are questionable. (a) Effectiveness: The confidence reward essentially guides the model to be more confident and consistent in its output. This might incorrectly encourage hallucinations. For instance, given a complex question the model cannot answer, it might obtain multiple different incorrect answers across samples. The confidence reward could end up rewarding the most frequent of these incorrect outputs, thereby causing the model to
