Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study

Hannah Schulz-Kümpel; Sebastian Fischer; Roman Hornung; Anne-Laure Boulesteix; Thomas Nagler; Bernd Bischl

arXiv:2409.18836·stat.ML·January 16, 2025·3 cites

TL;DR

This benchmark study evaluates 13 methods for constructing confidence intervals for the generalization error across diverse machine learning problems, assessing their reliability and efficiency and deriving practical recommendations.

Contribution

The paper offers the first large-scale empirical comparison of CI methods for the generalization error, together with a unified review of the methods and benchmark datasets for future research.

Findings

1. Identifies a subset of reliable CI methods based on coverage and width.
2. Provides a benchmarking suite and code for reproducibility and further studies.
3. Highlights the trade-offs between runtime and accuracy among methods.

Abstract

When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct a large-scale study comparing CIs for the generalization error, the first one of such size, where we empirically evaluate 13 different CI methods on a total of 19 tabular regression and classification problems, using seven different inducers and a…
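As a rough illustration of the kind of interval the abstract describes, the sketch below computes a simple holdout-based, normal-approximation CI for the mean squared error in Python. It is not taken from the paper or from mlr3inferr, and it is not claimed to be one of the 13 benchmarked methods; the function name `holdout_ci`, the toy data, and the through-origin linear model are all assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def holdout_ci(losses, alpha=0.05):
    """Normal-approximation CI for the mean of pointwise test losses.

    Hypothetical helper for illustration only. It treats the test losses
    as independent draws, which holds for a single holdout split.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two test losses")
    estimate = mean(losses)
    std_error = stdev(losses) / math.sqrt(n)       # sample std. dev. / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)        # two-sided normal quantile
    return estimate - z * std_error, estimate, estimate + z * std_error


# Toy regression data (assumed for the example): y = 2x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(500)]
ys = [2 * x + rng.gauss(0, 0.5) for x in xs]

# Single holdout split: 300 training points, 200 test points.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

# "Inducer": least-squares slope for a line through the origin.
slope = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)

# Pointwise squared-error losses on the held-out data.
losses = [(y - slope * x) ** 2 for x, y in zip(test_x, test_y)]

lower, estimate, upper = holdout_ci(losses)
print(f"MSE estimate: {estimate:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```

Resampling-based methods such as cross-validation or bootstrapping replace the single split with repeated splits. The resulting losses are then no longer independent, which is why the choice of variance estimation technique matters.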

Code & Models

Repositories

mlr-org/mlr3inferr
Official

Taxonomy

Topics: Scientific Measurement and Uncertainty Evaluation · Aerospace and Aviation Technology