Quantifying Uncertainty in Error Consistency: Towards Reliable Behavioral Comparison of Classifiers

Thomas Klein; Sascha Meyen; Wieland Brendel; Felix A. Wichmann; Kristof Meding

arXiv:2507.06645·q-bio.NC·November 10, 2025

TL;DR

This paper makes error consistency (EC), a metric used to benchmark the behavior of ML models, more reliable by adding bootstrapped confidence intervals and a model that relates EC to response copying.

Contribution

The paper proposes a bootstrapping method for computing confidence intervals for EC and introduces a model linking EC to response copying, improving the reliability of behavioral benchmarking in ML.

Findings

01. Many reported differences between deep vision models and humans are not statistically significant.

02. The new methodology allows for more conclusive and reliable behavioral comparisons.

03. Researchers can now design experiments with sufficient power to detect true behavioral differences.

Abstract

Benchmarking models is a key factor for the rapid progress in machine learning (ML) research. Thus, further progress depends on improving benchmarking metrics. A standard metric to measure the behavioral alignment between ML models and human observers is error consistency (EC). EC allows for more fine-grained comparisons of behavior than other metrics such as accuracy, and has been used in the influential Brain-Score benchmark to rank different DNNs by their behavioral consistency with humans. Previously, EC values have been reported without confidence intervals. However, empirically measured EC values are typically noisy -- thus, without confidence intervals, valid benchmarking conclusions are problematic. Here we improve on standard EC in two ways: First, we show how to obtain confidence intervals for EC using a bootstrapping technique, allowing us to derive significance tests for EC.…
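The abstract's first step, attaching a confidence interval to an EC value, can be illustrated with a minimal sketch. It assumes the usual definition of EC as Cohen's kappa computed on the trial-wise correctness of two observers, and it uses a plain percentile bootstrap over trials. The paper's exact resampling scheme is not given in the text above, so this is an assumption and not the authors' procedure. The names `error_consistency`, `bootstrap_ci`, `n_boot`, `alpha`, and `seed` are hypothetical.

```python
import numpy as np


def error_consistency(correct_a, correct_b):
    """Cohen's kappa on the trial-wise correctness of two observers."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("both observers must be scored on the same trials")
    # Observed agreement: fraction of trials where both are right or both are wrong.
    c_obs = np.mean(a == b)
    # Agreement expected from two independent observers with these accuracies.
    p_a, p_b = a.mean(), b.mean()
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if np.isclose(c_exp, 1.0):
        return float("nan")  # undefined when both are always right or always wrong
    return float((c_obs - c_exp) / (1 - c_exp))


def bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for EC (assumed scheme: resample trials)."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    rng = np.random.default_rng(seed)
    n = a.size
    stats = np.empty(n_boot)
    for i in range(n_boot):
        # Resample trial indices with replacement, keeping the pairing intact.
        idx = rng.integers(0, n, size=n)
        stats[i] = error_consistency(a[idx], b[idx])
    # Ignore resamples in which EC is undefined.
    lo, hi = np.nanquantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


if __name__ == "__main__":
    # Toy correctness vectors (1 = correct, 0 = error); not data from the paper.
    model = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
    human = [1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
    print("EC:", error_consistency(model, human))
    print("95% CI:", bootstrap_ci(model, human))
```

With only ten trials the interval is very wide, which illustrates the abstract's point that empirically measured EC values are noisy and hard to interpret without an interval.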

Videos

Quantifying Uncertainty in Error Consistency: Towards Reliable Behavioral Comparison of Classifiers (SlidesLive)

Taxonomy

Topics: EEG and Brain-Computer Interfaces · Explainable Artificial Intelligence (XAI) · Neural and Behavioral Psychology Studies