
TL;DR
This paper derives an optimal ratio for splitting data into training and testing sets in linear regression: √p:1 (training to testing), where p is the number of model parameters.
Contribution
It introduces a theoretical formula for the optimal train-test split ratio based on model complexity, specifically for linear regression models.
Findings
Optimal split ratio is approximately √p:1, where p is the number of parameters.
Provides a theoretical basis for data splitting in linear models.
Guides practitioners on how to allocate data for training and testing.
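As a rough illustration of the √p:1 rule above, the sketch below converts a parameter count p into training and testing fractions and sample counts. The function names `split_fractions` and `split_sizes` are hypothetical and not from the paper, and how p is counted (for example, whether the intercept is included) is left to the user as an assumption.

```python
import math


def split_fractions(p: int) -> tuple[float, float]:
    """Return (train_fraction, test_fraction) for a sqrt(p):1 train:test ratio.

    Hypothetical helper: p is the number of parameters in the linear
    regression model, however the user chooses to count them.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    root = math.sqrt(p)
    # A ratio of sqrt(p):1 means the training share is sqrt(p) / (sqrt(p) + 1).
    train = root / (root + 1.0)
    return train, 1.0 - train


def split_sizes(n: int, p: int) -> tuple[int, int]:
    """Return (n_train, n_test) for n samples, rounding the training count."""
    train_fraction, _ = split_fractions(p)
    n_train = round(n * train_fraction)
    return n_train, n - n_train


if __name__ == "__main__":
    for p in (1, 4, 9, 16):
        train, test = split_fractions(p)
        print(f"p={p}: train={train:.3f}, test={test:.3f}")
```

Under these assumptions, p = 9 gives a 3:1 ratio, that is 75% training and 25% testing, so 1000 samples would be divided into 750 and 250.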
Abstract
It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show that the optimal splitting ratio is √p:1, where p is the number of parameters in a linear regression model that explains the data well.
Taxonomy
Methods: Linear Regression
