Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction

Jon-Paul Cacioli

arXiv:2604.17716 · cs.CL · April 21, 2026

TL;DR

This study tests whether a three-tier validity screen for LLM confidence signals (Valid, Indeterminate, Invalid) predicts selective prediction performance, measured as Type 2 AUROC, across 20 frontier LLMs from seven families.

Contribution

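It provides empirical evidence that the validity screen's classifications predict selective prediction performance: the tiers order monotonically by mean AUROC, and the effect holds under split-half cross-validation (median d = 1.77).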

Findings

01

Valid models have a higher mean Type 2 AUROC (.624) than Invalid models (.357), with Cohen's d = 2.81 and p = .002.

02

The three-tier classification accounts for 47% of the variance in AUROC.

03

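The screen matters for selective prediction: keeping only the most confident answers can hurt when the confidence signal is poor, as when DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage.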

Abstract

The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen's d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. For selective prediction, the screen matters.
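The two criterion quantities named in the abstract, Type 2 AUROC and accuracy at a given coverage, can be illustrated with a minimal Python sketch. This is not the author's code: the function names `type2_auroc` and `accuracy_at_coverage`, the tie handling (ties count as 0.5), the rounding of the retained item count, and the toy data are all assumptions made for illustration. Type 2 AUROC is taken here as the probability that a correct answer receives higher confidence than an incorrect one, and accuracy at coverage as the accuracy on the most confident fraction of items.

```python
from typing import Sequence


def type2_auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct item outranks an incorrect item on confidence.

    Ties contribute 0.5 (an assumption; the paper's tie rule is not stated here).
    """
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at_coverage(
    confidence: Sequence[float], correct: Sequence[bool], coverage: float
) -> float:
    """Accuracy on the top `coverage` fraction of items, ranked by confidence."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Keep at least one item; rounding the count is an illustrative choice.
    k = max(1, round(coverage * len(confidence)))
    ranked = sorted(zip(confidence, correct), key=lambda t: t[0], reverse=True)
    kept = ranked[:k]
    return sum(1 for _, ok in kept if ok) / k


# Toy data (hypothetical): confidence is highest on the wrong answers.
conf = [0.9, 0.8, 0.3, 0.2, 0.1]
right = [False, False, True, True, True]

print(type2_auroc(conf, right))                # 0.0
print(accuracy_at_coverage(conf, right, 1.0))  # 0.6
print(accuracy_at_coverage(conf, right, 0.4))  # 0.0
```

In the toy data, overall accuracy is 60%, but keeping only the two most confident answers yields 0% accuracy. This is the same qualitative pattern as the DeepSeek-R1 result in the abstract, where accuracy falls as coverage shrinks, and it corresponds to an AUROC below .5.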
