Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study
Hannah Schulz-Kümpel, Sebastian Fischer, Roman Hornung, Anne-Laure Boulesteix, Thomas Nagler, Bernd Bischl

TL;DR
This benchmark study evaluates 13 methods for constructing confidence intervals for the generalization error across diverse machine learning problems, assessing their reliability and efficiency and offering practical recommendations.
Contribution
The paper offers the first large-scale empirical comparison of CI methods for generalization error, including a unified review and benchmarking datasets for future research.
Findings
Identifies a subset of reliable CI methods based on coverage and width.
Provides a benchmarking suite and code for reproducibility and further studies.
Highlights the trade-offs between runtime and accuracy among methods.
Abstract
When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct a large-scale study comparing CIs for the generalization error, the first one of such size, where we empirically evaluate 13 different CI methods on a total of 19 tabular regression and classification problems, using seven different inducers and a…
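To make the idea of pairing a resampling procedure with a variance estimate concrete, here is a minimal sketch of about the simplest such combination: a single holdout split with a normal-approximation (Wald) interval around the mean test loss. This is an illustration only and is not taken from the paper's code. The function name `wald_ci` and the simulated 0/1 losses are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def wald_ci(losses, alpha=0.05):
    """Normal-approximation CI for the mean of per-observation test losses.

    Hypothetical helper for illustration; it treats the losses as i.i.d.
    draws given one fixed, already-trained model.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two losses to estimate a variance")
    point = mean(losses)
    # Standard error of the mean, using the sample standard deviation.
    se = stdev(losses) / math.sqrt(n)
    # Two-sided standard normal quantile, about 1.96 for alpha = 0.05.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return point - z * se, point, point + z * se


# Simulated stand-in data (an assumption): 0/1 losses of a classifier
# on 500 held-out observations with a true error rate of 0.2.
rng = random.Random(0)
losses = [1.0 if rng.random() < 0.2 else 0.0 for _ in range(500)]

lower, estimate, upper = wald_ci(losses)
print(f"error = {estimate:.3f}, {95}% CI = [{lower:.3f}, {upper:.3f}]")
```

An interval like this describes the error of the one model fitted on the one training split. Methods built on cross-validation or bootstrapping reuse the data across many splits, which makes the per-observation losses dependent and is why they need different variance estimators.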
Taxonomy
Topics: Scientific Measurement and Uncertainty Evaluation · Aerospace and Aviation Technology
