Is Cross-Validation the Gold Standard to Evaluate Model Performance?
Garud Iyengar, Henry Lam, Tianyu Wang

TL;DR
This paper critically examines the statistical advantages of cross-validation over simple plug-in methods for model evaluation, revealing that CV often does not outperform plug-in in bias and coverage, especially in nonparametric settings.
Contribution
The paper provides a theoretical comparison between cross-validation and plug-in methods, showing CV's limitations and introducing a novel higher-order Taylor analysis for evaluation.
Findings
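K-fold CV provably cannot outperform plug-in in asymptotic bias or interval coverage accuracy, regardless of the model's convergence rate.
Leave-one-out CV can have a smaller bias than plug-in, but the improvement is negligible compared to the variability of the evaluation.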
Numerical results confirm plug-in's competitive performance across examples.
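To make the three estimators concrete, here is a minimal sketch that computes the plug-in, 5-fold CV, and leave-one-out CV estimates of mean squared error on synthetic data. It does not reproduce the paper's analysis or experiments. The choice of ordinary least squares, the data-generating process, and the helper names (`fit_ols`, `plug_in_estimate`, `kfold_cv_estimate`) are all assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    # Least-squares coefficients (illustrative model choice, not from the paper).
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def mse(coef, X, y):
    return float(np.mean((X @ coef - y) ** 2))

def plug_in_estimate(X, y):
    # Plug-in: train on all the data, then evaluate on the same data.
    return mse(fit_ols(X, y), X, y)

def kfold_cv_estimate(X, y, k, rng):
    # K-fold CV: each point is scored by a model trained without its fold.
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    sq_errors = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        coef = fit_ols(X[train_mask], y[train_mask])
        sq_errors[test_idx] = (X[test_idx] @ coef - y[test_idx]) ** 2
    return float(sq_errors.mean())

# Synthetic linear data with unit-variance noise (hypothetical setup).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)

plug_in = plug_in_estimate(X, y)
cv5 = kfold_cv_estimate(X, y, 5, rng)
loo = kfold_cv_estimate(X, y, n, rng)  # leave-one-out is K-fold with K = n

print(f"plug-in: {plug_in:.4f}  5-fold CV: {cv5:.4f}  LOOCV: {loo:.4f}")
```

A single run like this shows only point estimates. The comparison described above concerns the bias of each estimator relative to its variability and the coverage of the associated intervals, which would require repeating the experiment over many independent samples.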
Abstract
Cross-Validation (CV) is the default choice for evaluating the performance of machine learning models. Despite its wide usage, its statistical benefits have remained half-understood, especially in challenging nonparametric regimes. In this paper we fill in this gap and show that in fact, for a wide spectrum of models, CV does not statistically outperform the simple "plug-in" approach where one reuses training data for testing evaluation. Specifically, in terms of both the asymptotic bias and coverage accuracy of the associated interval for out-of-sample evaluation, K-fold CV provably cannot outperform plug-in regardless of the rate at which the parametric or nonparametric models converge. Leave-one-out CV can have a smaller bias as compared to plug-in; however, this bias improvement is negligible compared to the variability of the evaluation, and in some important cases…
Taxonomy
Topics: Evaluation and Performance Assessment
