Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control
Yuanchang Ye

TL;DR
This paper presents a conformal prediction framework with p-value testing to improve the reliability and factual accuracy of large language models in multiple-choice question answering, ensuring statistically rigorous uncertainty quantification.
Contribution
It introduces a novel integration of conformal prediction and significance testing for LLMs in MCQA, providing provable risk control and improved trustworthiness.
Findings
Empirical miscoverage rates meet the user-specified target risk levels.
Prediction set size decreases as the risk level increases, supporting set size as an uncertainty metric.
Demonstrates effectiveness on MMLU benchmarks with off-the-shelf LLMs.
Abstract
This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve the trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates p-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs' black-box nature, subsequently constructing prediction sets via null hypothesis testing…
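The pipeline described in the abstract (resample answers, turn option frequencies into scores, compute p-values, keep the options that are not rejected) can be illustrated with a minimal sketch. It is not the paper's implementation. The function names, the nonconformity score (1 minus the sampled frequency of an option), and the toy calibration scores are all assumptions made for illustration; the p-value formula is the standard split-conformal one.

```python
from collections import Counter


def option_frequencies(samples, options):
    """Relative frequency of each option among repeated (self-consistency) samples."""
    counts = Counter(samples)
    n = len(samples)
    return {opt: counts.get(opt, 0) / n for opt in options}


def nonconformity(freqs, option):
    """Assumed score: options the model rarely samples are less conforming."""
    return 1.0 - freqs[option]


def conformal_p_value(cal_scores, test_score):
    """Standard conformal p-value: share of calibration scores (computed on the
    true answers of held-out questions) at least as large as the test score,
    with the usual +1 finite-sample correction."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(freqs, cal_scores, alpha):
    """Keep every option whose null hypothesis ("this option is correct")
    is not rejected at risk level alpha."""
    return [
        opt for opt in freqs
        if conformal_p_value(cal_scores, nonconformity(freqs, opt)) > alpha
    ]


# Toy data (hypothetical): ten sampled answers for one question, and ten
# calibration scores standing in for scores of true answers on held-out items.
options = ["A", "B", "C", "D"]
samples = ["A"] * 7 + ["B"] * 2 + ["C"]
cal_scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

freqs = option_frequencies(samples, options)
for alpha in (0.1, 0.2, 0.3):
    print(alpha, prediction_set(freqs, cal_scores, alpha))
```

On this toy input the prediction set shrinks as alpha grows, which is the qualitative behaviour the findings describe. In practice the calibration scores would come from running the same resampling procedure on a labelled calibration split.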
