When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements
Wenzhang Du

TL;DR
The paper introduces a conservative evaluation protocol using paired bootstrap and permutation tests to reliably assess small improvements in machine learning, addressing issues of variability and over-claiming.
Contribution
It proposes a simple, PC-friendly evaluation method based on paired multi-seed runs and bootstrap confidence intervals to improve the reliability of small gain claims.
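As a rough illustration of the bootstrap half of such a protocol, the sketch below computes a bias-corrected and accelerated (BCa) confidence interval for the mean of paired per-seed deltas (new method minus baseline, same seed). It is a minimal standard-library sketch, not the paper's implementation: the function name `bca_interval`, the resample count, the tie handling in the bias correction, and the example deltas are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def bca_interval(deltas, n_boot=2000, alpha=0.05, seed=0):
    """BCa bootstrap CI for the mean of paired per-seed deltas (hypothetical helper).

    `deltas` is a list with at least two entries, one per seed.
    """
    rng = random.Random(seed)
    nd = NormalDist()
    n = len(deltas)
    theta_hat = mean(deltas)

    # Bootstrap distribution: resample seeds with replacement, take the mean.
    boots = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_boot))

    # Bias correction z0 from the share of bootstrap means below the estimate.
    # Ties count as one half (an assumption; it matters for very small n).
    below = sum(b < theta_hat for b in boots)
    ties = sum(b == theta_hat for b in boots)
    frac = (below + 0.5 * ties) / n_boot
    frac = min(max(frac, 1 / (n_boot + 1)), n_boot / (n_boot + 1))  # keep z0 finite
    z0 = nd.inv_cdf(frac)

    # Acceleration from leave-one-out (jackknife) means.
    jack = [mean(deltas[:i] + deltas[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted(q):
        # Map a nominal quantile level to its BCa-adjusted level.
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    def pick(q):
        # Index into the sorted bootstrap means, clamped to a valid position.
        return boots[min(n_boot - 1, max(0, int(q * n_boot)))]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))


# Made-up accuracy deltas in percentage points, one per seed.
example_deltas = [0.8, 1.1, 0.9, 1.3, 1.0]
low, high = bca_interval(example_deltas)
print(f"95% BCa CI for the mean delta: [{low:.3f}, {high:.3f}]")
```

Under this sketch, a gain would only be claimed when the whole interval lies above zero. With very few seeds the bootstrap distribution has few distinct values, so the interval is coarse, which is consistent with treating the protocol as a guardrail rather than a precise estimator.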
Findings
Single-run comparisons often suggest significant gains when the underlying improvement is small.
With only three seeds, the paired protocol rarely declares significance, reflecting its intentionally conservative design.
The protocol is effective across multiple datasets and scenarios.
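One plausible contributor to the three-seed finding above is simple counting. A sign-flip permutation test on per-seed deltas (the test named in the abstract) has only 2^n sign assignments for n seeds, so its smallest possible two-sided p-value is 2/2^n, which is 0.25 for n = 3. The sketch below is an assumed, minimal exact version of such a test; the function name `sign_flip_pvalue` and the example deltas are hypothetical and not taken from the paper.

```python
from itertools import product


def sign_flip_pvalue(deltas):
    """Exact two-sided sign-flip permutation test on paired deltas (hypothetical helper).

    Under the null of no improvement, each per-seed delta is equally likely
    to be positive or negative, so every sign assignment is enumerated.
    """
    n = len(deltas)
    observed = abs(sum(deltas))
    extreme = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, deltas)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n


# Three seeds that all favour the new method (made-up deltas).
print(sign_flip_pvalue([1.0, 1.2, 0.9]))

# Ten seeds that all favour the new method (made-up deltas).
print(sign_flip_pvalue([0.5] * 10))
```

Even when all three deltas point the same way, the exact test cannot go below 0.25, so it cannot reach a conventional 0.05 threshold. Full enumeration is only practical for small n; for larger n a random sample of sign assignments would be the usual substitute.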
Abstract
Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)
