Does Data Splitting Improve Prediction?
Julian J. Faraway

TL;DR
This paper compares data splitting and full data strategies for constructing predictive distributions, analyzing their performance and proposing a hybrid estimator called SAFE to optimize prediction accuracy.
Contribution
It decomposes the full data score into model selection, parameter estimation, and data reuse costs, and proposes the SAFE hybrid estimator, which uses one part of the data for model selection but both parts for parameter estimation.
Findings
Data splitting is preferred when data reuse costs are high.
Full data strategy can outperform splitting when data reuse costs are low.
The SAFE estimator improves predictive performance by selecting the model on one part of the data and estimating its parameters on both parts.
Abstract
Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation but we use this part for estimating the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator called SAFE that uses one part for model selection but both parts for estimation. We discuss the choice to use a split data analysis versus…
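The three strategies described in the abstract can be sketched as follows. This is only an illustration, not the paper's implementation. It rests on several assumptions that are not in the source: Gaussian linear models, BIC as the selection criterion, a random 50/50 split, and a plug-in normal predictive distribution scored by the mean log density on new data (higher is better). All function names are hypothetical.

```python
import numpy as np

def fit_gaussian_lm(X, y):
    """Least-squares fit; returns coefficients and the ML estimate of the error sd."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = np.sqrt(np.mean((y - X @ beta) ** 2))
    return beta, sigma

def bic(X, y):
    """BIC of a Gaussian linear model (assumed selection criterion)."""
    n, p = X.shape
    _, sigma = fit_gaussian_lm(X, y)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma ** 2) + 1)
    return -2 * loglik + (p + 1) * np.log(n)

def select_model(X, y, candidates):
    """Pick the candidate column subset with the smallest BIC."""
    return min(candidates, key=lambda cols: bic(X[:, cols], y))

def log_score(X_new, y_new, cols, beta, sigma):
    """Mean log density of new responses under a plug-in normal predictive."""
    mu = X_new[:, cols] @ beta
    return np.mean(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (y_new - mu) ** 2 / (2 * sigma ** 2))

def compare_strategies(X, y, X_new, y_new, candidates, rng):
    """Score the full data, split data, and SAFE-style strategies on new data."""
    n = len(y)
    idx = rng.permutation(n)
    sel, est = idx[: n // 2], idx[n // 2:]  # assumed 50/50 split

    # Full data: select and estimate on all observations (data are reused).
    cols_full = select_model(X, y, candidates)
    full = log_score(X_new, y_new, cols_full,
                     *fit_gaussian_lm(X[:, cols_full], y))

    # Split: select on one part, estimate on the other part only.
    cols_split = select_model(X[sel], y[sel], candidates)
    split = log_score(X_new, y_new, cols_split,
                      *fit_gaussian_lm(X[est][:, cols_split], y[est]))

    # SAFE-style hybrid: select on one part, estimate on both parts.
    safe = log_score(X_new, y_new, cols_split,
                     *fit_gaussian_lm(X[:, cols_split], y))

    return {"full": full, "split": split, "safe": safe}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    make_X = lambda n: np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    beta_true = np.array([1.0, 0.5, 0.0, 0.0])  # made-up simulation setting
    X, X_new = make_X(100), make_X(5000)
    y = X @ beta_true + rng.normal(size=100)
    y_new = X_new @ beta_true + rng.normal(size=5000)
    candidates = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
    print(compare_strategies(X, y, X_new, y_new, candidates, rng))
```

A single run says little about which strategy is better. Averaging the scores over many simulated datasets would be needed to compare them, and the ranking depends on the scenario.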
