When LLMs get significantly worse: A statistical approach to detect model degradations

Jonas Kübler; Kailash Budhathoki; Matthäus Kleindessner; Xiong Zhou; Junming Yin; Ashish Khetan; George Karypis

arXiv:2602.10144·stat.ML·May 7, 2026

TL;DR

This paper introduces a statistical hypothesis testing framework using McNemar's test to reliably detect model degradations in large language models, ensuring accuracy changes are genuine rather than noise.

Contribution

It proposes a novel, statistically sound method for identifying model degradation by comparing sample-level scores, with aggregation techniques for multiple benchmarks, implemented on open-source tools.

Findings

01

The method correctly flags degraded models while ignoring harmless optimizations.

02

It can detect accuracy degradations as small as 0.3%.

03

The approach guarantees a controlled false positive rate.

Abstract

Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods as well as methods without accuracy guarantees, such as quantization. In all of these cases it is crucial to ensure that model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust to theoretically lossless optimizations, because of numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test that efficiently detects model degradations while guaranteeing a controlled rate of false positives. The crucial insight is that we have to…
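
To make the role of McNemar's test concrete, here is a minimal sketch of the textbook exact, one-sided version applied to paired per-sample correctness of a baseline and an optimized model. It is an illustration under stated assumptions, not the authors' implementation: the function name `mcnemar_degradation_pvalue` and its inputs are hypothetical, and the paper's aggregation across multiple benchmarks is not covered. The test uses only the discordant samples, i.e. those where exactly one of the two models is correct.

```python
from math import comb
from typing import Sequence


def mcnemar_degradation_pvalue(
    baseline_correct: Sequence[bool],
    optimized_correct: Sequence[bool],
) -> float:
    """Exact one-sided McNemar p-value for 'the optimized model is worse'.

    Hypothetical helper for illustration. Inputs are per-sample correctness
    flags for the same evaluation samples, in the same order.
    """
    if len(baseline_correct) != len(optimized_correct):
        raise ValueError("both models must be scored on the same samples")

    # b: baseline right, optimized wrong (evidence of degradation)
    # c: baseline wrong, optimized right (evidence of improvement)
    b = sum(1 for x, y in zip(baseline_correct, optimized_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, optimized_correct) if not x and y)
    n = b + c
    if n == 0:
        # No discordant samples: no evidence of any change.
        return 1.0

    # Under the null of no degradation, each discordant sample is equally
    # likely to fall on either side, so b ~ Binomial(n, 1/2).
    # The p-value is the upper tail P(X >= b).
    tail = sum(comb(n, k) for k in range(b, n + 1))
    return tail / 2**n


if __name__ == "__main__":
    baseline = [True, True, True, True, False, True]
    optimized = [True, False, False, False, True, True]
    # b = 3, c = 1, so the p-value is P(X >= 3) for X ~ Binomial(4, 1/2).
    print(mcnemar_degradation_pvalue(baseline, optimized))
```

A degradation would be flagged when the p-value falls below a chosen significance level, which bounds the false positive rate. Samples on which both models agree do not affect the result, which is why a paired test can be more sensitive than comparing two aggregate accuracies.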
