Optimal Ratio for Data Splitting

V. Roshan Joseph

arXiv:2202.03326 · stat.ML · June 10, 2022

TL;DR

This paper derives the optimal training-to-testing split ratio for linear regression and shows that it is √p:1, where p is the number of model parameters. The share of data held out for testing therefore shrinks as the model grows more complex.

Contribution

The paper derives a formula for the optimal train-test split ratio as a function of model complexity, measured by the number of parameters p in a linear regression model.

Findings

01

The optimal training-to-testing ratio is approximately √p:1, where p is the number of parameters in the linear regression model.

02

Provides a theoretical basis for data splitting in linear models.

03

Guides practitioners on how to allocate data for training and testing.

Abstract

It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show that the optimal splitting ratio is $\sqrt{p} : 1$, where $p$ is the number of parameters in a linear regression model that explains the data well.
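
As a rough illustration of how the ratio could be applied, a training-to-testing ratio of $\sqrt{p} : 1$ corresponds to a testing fraction of $1/(\sqrt{p} + 1)$. The sketch below turns that fraction into integer set sizes. The function name `split_sizes`, the rounding rule, and the guard that keeps at least one row in each set are assumptions made for this example and are not taken from the paper. The value of `p` is assumed to be known in advance.

```python
import math
from typing import Tuple


def split_sizes(n: int, p: int) -> Tuple[int, int]:
    """Return (n_train, n_test) for a sqrt(p):1 training-to-testing ratio.

    n: total number of rows in the dataset.
    p: number of parameters in the linear regression model (assumed known).
    """
    if n < 2 or p < 1:
        raise ValueError("need n >= 2 and p >= 1")

    # A ratio of sqrt(p):1 means the test set gets 1 part out of sqrt(p) + 1.
    test_fraction = 1.0 / (math.sqrt(p) + 1.0)

    # Rounding to the nearest integer is an illustrative choice.
    n_test = round(n * test_fraction)

    # Keep at least one row in each set (also an illustrative choice).
    n_test = min(max(n_test, 1), n - 1)
    return n - n_test, n_test


if __name__ == "__main__":
    # p = 9 gives a 3:1 ratio, so 25% of the rows go to testing.
    print(split_sizes(1000, 9))  # (750, 250)
```

With p = 1 the split is 50/50, and the testing share falls as p increases. For p = 9, for example, one quarter of the data is held out.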

Taxonomy

Methods: Linear Regression