Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

Jon-Paul Cacioli

arXiv:2604.27405·cs.CL·May 1, 2026

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

Jon-Paul Cacioli

PDF

TL;DR

This paper introduces a method adapted from clinical psychology to detect reliable, item-level changes in large language models, revealing nuanced performance shifts overlooked by aggregate metrics.

Contribution

It applies the Reliable Change Index to LLM evaluation, providing a detailed analysis of item-level changes and highlighting limitations of traditional aggregate accuracy measures.

Findings

01

Most items showed no reliable change (79% and 72%)

02

Over half the items were floor or ceiling effects

03

Significant bidirectional change with large effect sizes

Abstract

We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were floor/ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.