When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

Wenzhang Du

arXiv:2511.19794 · cs.LG · November 26, 2025

TL;DR

The paper introduces a conservative evaluation protocol that combines paired multi-seed runs, BCa bootstrap confidence intervals, and a sign-flip permutation test to judge whether small reported gains in machine learning reflect real improvements or run-to-run noise, guarding against over-claiming.

Contribution

It proposes a simple, PC-friendly evaluation method, based on paired multi-seed runs, bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas, to make claims of small gains more reliable.

Findings

01

Single-run comparisons often make small improvements look like significant gains.

02

With three seeds, the paired protocol rarely declares significance, which reflects its intentionally conservative design.

03

The protocol is effective across multiple datasets (CIFAR-10, CIFAR-10N, AG News) and synthetic scenarios (no improvement, small gain, medium gain).
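
One arithmetic reason a three-seed paired test is so conservative can be sketched as follows. This is an illustration, not a statement of the paper's exact procedure: it assumes an exact two-sided sign-flip test whose statistic is the absolute mean of nonzero per-seed deltas, and `min_two_sided_p` is a hypothetical helper name. With n deltas there are 2^n sign patterns, and the observed pattern and its mirror image are always at least as extreme as the observation, so the p-value cannot fall below 2 / 2^n.

```python
def min_two_sided_p(n_seeds: int) -> float:
    """Smallest p-value an exact two-sided sign-flip test can return.

    Assumption (not taken from the paper): the statistic is
    |mean of per-seed deltas| and all 2**n sign patterns are enumerated,
    so the observed pattern and its mirror image always count as
    "at least as extreme".
    """
    if n_seeds < 1:
        raise ValueError("need at least one seed")
    return 2 / 2 ** n_seeds


# Print the p-value floor for a few seed budgets.
for n in (3, 5, 6, 10):
    print(n, min_two_sided_p(n))
```

Under these assumptions the floor is 0.25 with three seeds, so the permutation test alone cannot reach the conventional 0.05 threshold; at least six seeds would be needed (floor 0.03125).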

Abstract

Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain…
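
The three ingredients named in the abstract (paired per-seed deltas, a BCa bootstrap interval, and a sign-flip permutation test) can be sketched in standard-library Python as below. This is a minimal sketch under stated assumptions, not the author's implementation: the function names, the default of 5000 resamples, the example scores, and in particular the decision rule (declare a gain only if the interval excludes zero and the p-value is below alpha) are all hypothetical choices made for illustration.

```python
import itertools
import random
from statistics import NormalDist, mean


def sign_flip_pvalue(deltas):
    """Exact two-sided sign-flip test on the mean of per-seed deltas."""
    deltas = list(deltas)
    observed = abs(mean(deltas))
    hits = total = 0
    for signs in itertools.product((1.0, -1.0), repeat=len(deltas)):
        flipped = abs(mean([s * d for s, d in zip(signs, deltas)]))
        hits += flipped >= observed - 1e-12  # tolerance for float ties
        total += 1
    return hits / total


def bca_interval(deltas, n_boot=5000, alpha=0.05, seed=0):
    """BCa bootstrap confidence interval for the mean delta (needs n >= 2)."""
    deltas = list(deltas)
    n = len(deltas)
    rng = random.Random(seed)
    norm = NormalDist()
    theta_hat = mean(deltas)
    boots = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_boot))

    # Bias correction z0: share of bootstrap means below the estimate,
    # clipped away from 0 and 1 so the inverse CDF stays finite.
    frac = sum(b < theta_hat for b in boots) / n_boot
    frac = min(max(frac, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = norm.inv_cdf(frac)

    # Acceleration a from leave-one-out (jackknife) means.
    jack = [mean(deltas[:i] + deltas[i + 1:]) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted(q):
        z = norm.inv_cdf(q)
        return norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        return boots[min(n_boot - 1, max(0, int(q * n_boot)))]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))


def evaluate(baseline, candidate, alpha=0.05):
    """Pair runs by seed, then apply both checks (assumed decision rule)."""
    deltas = [c - b for b, c in zip(baseline, candidate)]
    lo, hi = bca_interval(deltas, alpha=alpha)
    p = sign_flip_pvalue(deltas)
    significant = (lo > 0 or hi < 0) and p < alpha
    return {"mean_delta": mean(deltas), "ci": (lo, hi),
            "p_value": p, "significant": significant}


# Hypothetical accuracies (%) for three shared seeds; not results from the paper.
baseline_acc = [91.2, 90.8, 91.5]
candidate_acc = [92.1, 92.0, 92.4]
result = evaluate(baseline_acc, candidate_acc)
print(result)
```

In this toy run every per-seed delta is positive, so the interval lies above zero, yet the exact test returns p = 0.25 and the assumed rule does not declare a significant gain. Pairing matters here because both systems share each seed, so the comparison is made on within-seed differences rather than on two independent sets of noisy scores.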

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)