Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals

Jon-Paul Cacioli

arXiv:2604.17714·cs.CL·April 21, 2026

TL;DR

This paper introduces a portable protocol, inspired by clinical assessment, for checking whether an LLM confidence signal carries meaningful item-level information before it is used.

Contribution

It adapts clinical validity screening principles into a benchmark-based protocol with specific indices and a three-tier classification system for LLM confidence data.

Findings

01

Of 20 frontier LLMs evaluated on 524 items, four are classified as Invalid and two as Indeterminate.

02

Valid-profile models show mean r = .18, significant in 15 of 16 cases; Invalid-profile models show mean r = -.20 (d = 2.48).

03

Cross-benchmark validation confirms the protocol's transferability.

Abstract

LLM confidence signals are used for abstention, routing, and safety-critical decisions. No standard practice exists for checking whether a confidence signal carries item-level information before building on it. We transfer the validity screening principle from clinical personality assessment (PAI, MMPI-3) as a portable protocol for benchmark-based LLM confidence data. The protocol specifies three core indices (L, Fp, RBS), a structural indicator (TRIN), and an item-sensitivity statistic, computed from a single 2x2 contingency table. A three-tier classification system (Invalid, Indeterminate, Valid) draws on four clinical traditions. In validation on 20 frontier LLMs across 524 items, four models are classified Invalid and two Indeterminate. Valid-profile models show mean r = .18 (15/16 significant). Invalid-profile models show mean r = -.20 (d = 2.48). Cross-benchmark validation on 18 models…
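The abstract states that the indices are computed from a single 2x2 contingency table, but it does not define them here, so the sketch below does not attempt to reproduce L, Fp, RBS, or TRIN. It only illustrates the general idea under stated assumptions: that the table crosses item correctness with a binarized confidence judgment, and that an item-level association can be summarized with the phi coefficient (the Pearson correlation for two binary variables). The function names and the toy data are hypothetical and are not taken from the paper or its repository.

```python
import math


def contingency_table(correct, confident):
    """Count the four cells of a 2x2 table from paired boolean lists.

    Assumed layout (an illustration, not the paper's definition):
        a = correct and confident      b = correct and not confident
        c = incorrect and confident    d = incorrect and not confident
    """
    if len(correct) != len(confident):
        raise ValueError("correct and confident must have the same length")
    a = b = c = d = 0
    for is_correct, is_confident in zip(correct, confident):
        if is_correct and is_confident:
            a += 1
        elif is_correct:
            b += 1
        elif is_confident:
            c += 1
        else:
            d += 1
    return a, b, c, d


def phi_coefficient(a, b, c, d):
    """Phi coefficient of a 2x2 table; None when a row or column is empty."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # A constant signal (e.g. always confident) carries no item-level
        # information, so the association is undefined rather than zero.
        return None
    return (a * d - b * c) / denominator


# Toy data, invented purely for illustration.
correct = [True, True, True, False, False, False]
confident = [True, True, False, False, False, True]

table = contingency_table(correct, confident)
print(table)                              # (2, 1, 1, 2)
print(round(phi_coefficient(*table), 3))  # 0.333
```

Returning None for a degenerate table is a design choice in this sketch: a model that reports the same confidence on every item gives nothing to correlate, which is the kind of case a screening step would want to flag before any further interpretation.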

Code & Models

Repositories

synthiumjp/validity-scaling-llm (GitHub)
