Accounting for Variance in Machine Learning Benchmarks

Xavier Bouthillier; Pierre Delaunay; Mirko Bronzi; Assya Trofimov; Brennan Nichyporuk; Justin Szeto; Naz Sepah; Edward Raff; Kanika Madan; Vikram Voleti; Samira Ebrahimi Kahou; Vincent Michalski; Dmitriy Serdyuk; Tal Arbel; Chris Pal; Gaël Varoquaux; Pascal Vincent

arXiv:2103.03098·cs.LG·March 5, 2021·40 cites

Open Access

TL;DR

This paper models the variance in machine learning benchmarking, demonstrating how accounting for sources of variation improves comparison accuracy and proposing cost-effective methods for performance evaluation.

Contribution

It introduces a comprehensive model of benchmarking variance, analyzes comparison methods, and proposes recommendations to improve the reliability of performance assessments.

Findings

1. Variance from data sampling, parameter initialization, and hyperparameter choice markedly affects benchmark results.
2. Adding more sources of variation to an imperfect estimator brings it closer to the ideal estimator, at a 51-fold reduction in compute cost.
3. The proposed recommendations improve the detection of true performance improvements.

Abstract

Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approximates the ideal estimator, at a 51-fold reduction in compute cost. Building on these results, we study the error rate of detecting improvements, on five different deep-learning…
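The intuition behind randomizing more sources of variation can be illustrated with a toy simulation. This is a minimal sketch and not the paper's benchmarking model: it assumes performance is a true mean plus independent Gaussian effects from each source, and the names `SOURCE_STD`, `TRUE_MEAN`, `estimate_mean`, and `estimator_std`, along with all numeric values, are hypothetical choices made for illustration. A source that is held fixed across runs contributes its full variance to the error of the averaged estimate, while a source that is re-drawn on every run has its contribution averaged down.

```python
import numpy as np

# Hypothetical per-source standard deviations (illustrative, not from the paper).
SOURCE_STD = {"data": 0.02, "init": 0.01, "hpo": 0.005}
TRUE_MEAN = 0.90  # hypothetical true expected performance


def estimate_mean(rng, randomized, k):
    """Average k simulated runs.

    Sources listed in `randomized` are re-drawn for every run; all other
    sources are drawn once and shared by the k runs (held fixed).
    """
    scores = np.full(k, TRUE_MEAN)
    for name, std in SOURCE_STD.items():
        if name in randomized:
            scores += rng.normal(0.0, std, size=k)  # fresh draw per run
        else:
            scores += rng.normal(0.0, std)  # single draw, same for all runs
    return scores.mean()


def estimator_std(randomized, k=50, repeats=2000, seed=0):
    """Spread of the averaged estimate over many repeated benchmarks."""
    rng = np.random.default_rng(seed)
    estimates = [estimate_mean(rng, randomized, k) for _ in range(repeats)]
    return float(np.std(estimates))


std_init_only = estimator_std({"init"})
std_init_data = estimator_std({"init", "data"})
std_all = estimator_std({"init", "data", "hpo"})

print("randomize init only:      ", std_init_only)
print("randomize init and data:  ", std_init_data)
print("randomize all sources:    ", std_all)
```

Under these assumptions, the spread of the estimate shrinks as more sources are randomized, because only the randomized sources benefit from averaging over the k runs. The toy model ignores the cost of hyperparameter optimization and any interaction between sources, both of which the paper's analysis is concerned with.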

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning