Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency
Haoran Wang, Maryam Khalid, Qiong Wu, Jian Gao, Cheng Cao

TL;DR
This paper introduces PCC, a framework that improves fact-checking in large language models by estimating their confidence through probabilistic certainty and consistency, enabling adaptive retrieval and enhancing accuracy and efficiency.
Contribution
The paper presents PCC, a novel confidence estimation method that guides adaptive fact-checking in LLMs, outperforming existing baselines and generalizing across models.
Findings
PCC achieves superior uncertainty quantification compared to verbalized confidence.
PCC outperforms strong LLM-based fact-checking baselines across benchmarks.
PCC generalizes well across various large language models.
Abstract
Large language models (LLMs) are increasingly used in applications requiring factual accuracy, yet their outputs often contain hallucinated responses. While fact-checking can mitigate these errors, existing methods typically retrieve external evidence indiscriminately, overlooking the model's internal knowledge and potentially introducing irrelevant noise. Moreover, current systems lack targeted mechanisms to resolve specific uncertainties in the model's reasoning. Inspired by how humans fact-check, we argue that LLMs should adaptively decide whether to rely on internal knowledge or initiate retrieval based on their confidence in a given claim. We introduce Probabilistic Certainty and Consistency (PCC), a framework that estimates factual confidence by jointly modeling an LLM's probabilistic certainty and reasoning consistency. These confidence signals enable an adaptive verification…
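To make the two signals concrete, here is a minimal sketch and not the authors' implementation. It assumes certainty is the normalized probability mass on the predicted TRUE/FALSE label, consistency is the fraction of sampled reasoning pairs that an NLI model judges to agree, and the two are combined with a harmonic mean, as the reviews quoted further down describe. All function names and the label token sets are hypothetical.

```python
import math

def label_certainty(token_logprobs, true_tokens=("true", "yes"), false_tokens=("false", "no")):
    """Map several surface tokens to TRUE/FALSE and return (label, certainty).

    token_logprobs: dict of token -> log-probability at the answer position.
    The token sets are illustrative assumptions, not taken from the paper.
    """
    p_true = sum(math.exp(lp) for tok, lp in token_logprobs.items()
                 if tok.strip().lower() in true_tokens)
    p_false = sum(math.exp(lp) for tok, lp in token_logprobs.items()
                  if tok.strip().lower() in false_tokens)
    total = p_true + p_false
    if total == 0.0:
        return "UNKNOWN", 0.0
    p_true /= total  # renormalize over the two labels only
    label = "TRUE" if p_true >= 0.5 else "FALSE"
    return label, max(p_true, 1.0 - p_true)

def consistency_from_judgments(judgments):
    """Fraction of sampled reasoning pairs judged consistent (e.g. by an NLI model)."""
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if j) / len(judgments)

def harmonic_confidence(certainty, consistency):
    """Harmonic mean of the two signals; low if either signal is low."""
    if certainty + consistency == 0.0:
        return 0.0
    return 2.0 * certainty * consistency / (certainty + consistency)
```

The harmonic mean penalizes disagreement between the signals: a claim with certainty 0.8 but consistency 0.5 scores about 0.62, below the arithmetic mean of 0.65.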
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
Concrete approach with direct and obvious technical applications to today's customer-facing industrial models. The applied methods seem designed to be robust to variation in a model's output (use of NLI, and mapping of multiple tokens to TRUE/FALSE outputs). Reproducibility efforts are made, with prompts provided and good method descriptions. Extensive and relevant use of models and datasets. Comparison against multiple baselines.
The specific problem-driven pipeline leads the authors to make strong assumptions: > Lexical uncertainty vs. semantic uncertainty is not directly addressed. As discussed by Kuhn et al. (2023), the model could output multiple different (inconsistent) tokens that nonetheless have the same meaning. Is there a reason why your two metrics would not fall prey to this? > In this specific context of optimising RAG retrieval, I notice multiple runs are required, with extra sampling steps. It would be r
1. The Probabilistic Certainty and Consistency (PCC) framework is conceptually and methodologically sound, and is empirically shown to be superior to single-signal methods. 2. The adaptive verification pipeline, which dynamically routes claims to different strategies based on the two-dimensional PCC confidence profile, is a logical way to optimize efficiency and resource use.
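The routing idea can be sketched as a simple two-threshold rule. This is an illustrative assumption only: the source does not specify the strategies or the threshold values, so the three routes, the names `Route` and `route_claim`, and the default values of `alpha` and `beta` below are hypothetical.

```python
from enum import Enum

class Route(Enum):
    # Hypothetical strategy names; the paper's actual strategies are not given here.
    INTERNAL_KNOWLEDGE = "answer from internal knowledge"
    TARGETED_CHECK = "resolve the specific uncertainty"
    FULL_RETRIEVAL = "retrieve external evidence"

def route_claim(certainty, consistency, alpha=0.8, beta=0.8):
    """Pick a verification strategy from the two-dimensional confidence profile.

    alpha thresholds certainty and beta thresholds consistency; both defaults
    are placeholders rather than values reported by the authors.
    """
    high_certainty = certainty >= alpha
    high_consistency = consistency >= beta
    if high_certainty and high_consistency:
        return Route.INTERNAL_KNOWLEDGE
    if high_certainty or high_consistency:
        # Exactly one signal is weak, so only that uncertainty needs resolving.
        return Route.TARGETED_CHECK
    return Route.FULL_RETRIEVAL
```

Because the thresholds are fixed, the same profile always takes the same route, which is the property several reviewers question when they ask for a selection procedure and a sensitivity analysis.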
**Novelty**: The key components of PCC are log-probability based certainty and NLI-based consistency, both of which are well-established uncertainty estimation methods. The novelty is mainly in how they are integrated and adapted; no fundamentally new theoretical concept or learning framework has been proposed. Therefore, the overall novelty appears to be quite light. **Fixed threshold for routing decision**: Although the authors articulated this issue as a limitation in the Appendix, in my op
- The motivation is clear and addresses a real problem. The paper identifies that verbal confidence is poorly calibrated across models and provides a reasonable human analogy for why combining decisiveness with reasoning stability might help. - The proposed two-dimensional confidence framework offers conceptual value. Decomposing confidence into internal certainty and reasoning consistency is interpretable and could potentially reveal different types of model uncertainty. - The paper is well-org
I do have some concerns with the paper that I believe should be addressed: - The authors mention that thresholds α and β are chosen empirically based on the distributions, but they provide no concrete procedure for selecting these values, no sensitivity analysis showing how performance varies with different thresholds, and no solution to the acknowledged generalization problem, which makes me question how we would actually deploy this method in new domains. - The paper uses only 187 samples fro
1. The method is tested on a good number of datasets (3) and LLMs (5, including frontier LLMs), and against 3 baselines. 2. The paper promises to release code upon acceptance for reproducibility.
Weaknesses, in order of magnitude: 1. Some statements in the paper are false: 1. "Across all datasets and models, PCC consistently yields the lowest ECE (Figure 3)", but in Figure 3, PCC is the best in only 4/15 cases. 2. "harmonically combining them [internal certainty and reasoning consistency] yields consistently better calibration", but reasoning consistency actually has strictly better ECE than the harmonic combination (PCC?) in 8/15 cases. 3. "Across Sci-Fact, FeLMWk, and HoVER, P
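The calibration claims disputed above are stated in terms of expected calibration error (ECE). As a reference point, the standard equal-width binned ECE can be sketched as follows. The bin count of 10 and the function name are assumptions, and the paper's exact binning scheme is not given in this excerpt.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    # Per bin: [sample count, sum of confidences, number correct]
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += 1 if ok else 0
    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece
```

Lower is better, and a score of 0 means that within every bin the stated confidence matches the observed accuracy. ECE depends on the binning, so comparisons such as the 4/15 and 8/15 counts above only hold for a fixed bin scheme.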
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
