TL;DR
This paper derives a theoretical upper bound on regression model performance (R²) imposed by experimental noise in the response variable, validates the bound with Monte Carlo simulations, and applies it to biological datasets to estimate the best performance a model can reach.
Contribution
The paper introduces an upper bound on R² that depends only on the noise and variance of the response variable, so it can be applied in any field where response variables are measured with noise.
Findings
The upper bound accurately predicts the maximum achievable R².
Monte Carlo simulations validate the theoretical estimate.
Application to biological data demonstrates practical utility.
Abstract
A challenge in developing machine learning regression models is that it is difficult to know whether maximal performance has been reached on a particular dataset, or whether further model improvement is possible. In biology this problem is particularly pronounced, as sample labels (response variables) are typically obtained through experiments and therefore carry experimental noise. Such label noise places a fundamental limit on the performance attainable by regression models. We address this challenge by deriving a theoretical upper bound for the coefficient of determination (R²) for regression models. This theoretical upper bound depends only on the noise associated with the response variable in a dataset as well as its variance. The upper bound estimate was validated via Monte Carlo simulations and then used as a tool to bootstrap performance of regression models…
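The idea of a noise-limited R² can be illustrated with a small Monte Carlo sketch. The paper's exact expression is not reproduced on this page, so the code below assumes the standard approximation that an "oracle" model, one that predicts the noise-free values exactly, scores about 1 - (noise variance) / (variance of the observed labels) when evaluated against noisy labels. The function names, the Gaussian signal and noise, and the parameter values are all hypothetical choices made for illustration, not taken from the paper.

```python
import random
import statistics


def r_squared(y_obs, y_pred):
    """Coefficient of determination of predictions against observed labels."""
    mean_obs = statistics.fmean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def simulate_oracle_r2(n_samples, signal_sd, noise_sd, n_trials, seed=0):
    """Average R² of a perfect predictor scored against noisy labels.

    Assumption: noise-free values and label noise are independent Gaussians.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        y_true = [rng.gauss(0.0, signal_sd) for _ in range(n_samples)]
        y_obs = [t + rng.gauss(0.0, noise_sd) for t in y_true]
        # The oracle predicts y_true exactly; only label noise is left over.
        scores.append(r_squared(y_obs, y_true))
    return statistics.fmean(scores)


def approx_r2_ceiling(noise_sd, observed_label_variance):
    """Assumed approximation: 1 - noise variance / observed label variance."""
    return 1.0 - noise_sd ** 2 / observed_label_variance


signal_sd, noise_sd = 1.0, 0.5  # hypothetical values
simulated = simulate_oracle_r2(2000, signal_sd, noise_sd, n_trials=50)
# For independent signal and noise, the observed label variance is their sum.
expected = approx_r2_ceiling(noise_sd, signal_sd ** 2 + noise_sd ** 2)
print(f"simulated oracle R2: {simulated:.3f}")
print(f"approximate ceiling: {expected:.3f}")
```

Under these assumptions the simulated oracle score and the approximate ceiling should agree closely. A real model scoring well below that level may still have room to improve, while one scoring near it is likely limited by label noise rather than by the model.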
