Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Yi Chen; Daiwei Chen; Sukrut Madhav Chikodikar; Caitlyn Heqi Yin; Ramya Korlakai Vinayak

arXiv:2603.16817·cs.AI·March 18, 2026

TL;DR

This paper evaluates the robustness and usefulness of conformal factuality filtering for RAG-based LLMs. It finds that the factuality guarantees are fragile under distribution shifts and distractors, and that lightweight entailment-based verifiers are more efficient than LLM-based confidence scorers.

Contribution

It introduces novel informativeness-aware metrics, systematically analyzes conformal factuality's limitations, and compares lightweight verifiers to LLM-based scorers for improved reliability and efficiency.

Findings

01. Conformal filtering has low usefulness at high factuality levels due to vacuous outputs.

02. Factuality guarantees are fragile under distribution shifts and distractors.

03. Lightweight entailment-based verifiers outperform LLM-based confidence scorers in efficiency.

Abstract

Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three…
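The filtering step described in the abstract (score atomic claims, then keep those above a threshold calibrated on held-out data) can be illustrated with a minimal sketch of generic split-conformal calibration. This is not taken from the paper: the function names `calibrate_threshold` and `filter_claims`, the input layout, and the choice of "highest score given to a false claim" as the per-response calibration score are all assumptions made for illustration.

```python
import math


def calibrate_threshold(calibration_sets, alpha):
    """Pick a score threshold from held-out, labeled claims (hypothetical helper).

    calibration_sets: one list per response, each holding (score, is_true) pairs.
    alpha: target error rate, e.g. 0.1 for a 90% factuality level.
    """
    # Per response: the highest score assigned to a false claim. Keeping only
    # claims scored strictly above this value leaves no false claims.
    # Responses with no false claims contribute -inf.
    per_response = sorted(
        max((score for score, is_true in claims if not is_true), default=-math.inf)
        for claims in calibration_sets
    )
    n = len(per_response)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed conformal quantile rank
    if rank > n:
        # Too few calibration responses for this alpha: remove every claim.
        return math.inf
    return per_response[rank - 1]


def filter_claims(claims, threshold):
    """Keep the text of claims whose score is strictly above the threshold."""
    return [text for text, score in claims if score > threshold]


# Toy calibration data (made up for illustration).
calibration = [
    [(0.9, True), (0.2, False)],
    [(0.8, True), (0.6, False)],
    [(0.7, True)],
    [(0.95, True), (0.4, False)],
]
threshold = calibrate_threshold(calibration, alpha=0.25)
kept = filter_claims([("claim A", 0.9), ("claim B", 0.5)], threshold)
```

In this sketch, a stricter factuality target raises the threshold, and once the threshold is high enough `filter_claims` returns an empty list. That is the sense in which a guarantee on factuality alone says nothing about how informative the filtered output is.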

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)