Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control

Yuanchang Ye

arXiv:2508.10022·cs.CL·August 15, 2025

TL;DR

This paper presents a conformal prediction framework with p-value testing to improve the reliability and factual accuracy of large language models in multiple-choice question answering, ensuring statistically rigorous uncertainty quantification.

Contribution

It introduces a novel integration of conformal prediction and significance testing for LLMs in MCQA, providing provable risk control and improved trustworthiness.

Findings

01

Empirical miscoverage rates meet the user-specified risk levels.

02

Average prediction set size decreases as the risk level increases, supporting its use as an uncertainty metric.

03

Demonstrates effectiveness on MMLU benchmarks with off-the-shelf LLMs.
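
The first two findings refer to two evaluation quantities: the fraction of test questions whose prediction set misses the correct option, and the mean size of the prediction sets. A minimal sketch of how these could be computed is given below. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
def empirical_miscoverage(pred_sets, true_labels):
    """Fraction of questions whose prediction set omits the correct option."""
    misses = sum(1 for s, y in zip(pred_sets, true_labels) if y not in s)
    return misses / len(true_labels)


def average_set_size(pred_sets):
    """Mean number of options per prediction set."""
    return sum(len(s) for s in pred_sets) / len(pred_sets)


# Toy data (hypothetical): four questions with options A-D.
toy_sets = [["A"], ["A", "B"], ["C"], ["B", "D"]]
toy_labels = ["A", "B", "D", "D"]

print(empirical_miscoverage(toy_sets, toy_labels))
print(average_set_size(toy_sets))
```

Under a valid conformal procedure, the first quantity should stay near or below the chosen risk level on average, while the second typically shrinks as the risk level grows.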

Abstract

This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs' black-box nature, subsequently constructing prediction sets via null hypothesis testing…
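
The pipeline described in the abstract (resample answers, turn option frequencies into scores, compute a $p$-value per option, keep the options that are not rejected) can be sketched as follows. This is a generic conformal $p$-value construction under stated assumptions, not necessarily the paper's exact procedure: the nonconformity score is assumed to be the fraction of sampled answers that disagree with an option, and the names `disagreement_score`, `conformal_p_value`, and `prediction_set` are hypothetical.

```python
def disagreement_score(samples, option):
    """Assumed nonconformity score: share of sampled answers that differ from `option`."""
    return sum(1 for s in samples if s != option) / len(samples)


def conformal_p_value(cal_scores, test_score):
    """Standard conformal p-value: rank of the test score among calibration scores."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (1 + at_least) / (len(cal_scores) + 1)


def prediction_set(cal_scores, samples, options, alpha):
    """Keep every option whose null hypothesis ("this is the correct answer")
    is not rejected at risk level alpha."""
    return [
        opt for opt in options
        if conformal_p_value(cal_scores, disagreement_score(samples, opt)) > alpha
    ]


# Toy calibration scores (hypothetical): the disagreement score of the TRUE
# option on nine calibration questions, each answered with 10 sampled responses.
cal_scores = [k / 10 for k in (0, 1, 2, 2, 3, 4, 5, 6, 9)]

# Ten sampled answers for one test question.
test_samples = ["A"] * 7 + ["B"] * 2 + ["C"]

print(prediction_set(cal_scores, test_samples, "ABCD", alpha=0.25))
print(prediction_set(cal_scores, test_samples, "ABCD", alpha=0.1))
```

When calibration and test questions are exchangeable, thresholding this $p$-value at `alpha` bounds the marginal probability of excluding the correct option by `alpha`. A larger `alpha` rejects more options, so the sets get smaller.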
