Estimating the theoretical error rate for prediction
Herman Chernoff, Shaw-Hwa Lo, Tian Zheng, Adeline Lo

TL;DR
This paper introduces a theoretical framework for estimating the error rate in prediction tasks. It emphasizes the importance of interactions among variables and proposes bias corrections that improve predictivity estimates, especially in high-dimensional data such as genome-wide association studies (GWAS).
Contribution
It defines theoretical predictivity, relates it to the statistic I used for variable selection, and develops bias corrections that improve estimates of predictivity in datasets with many variables.
Findings
Statistic I correlates better with predictivity than significance levels.
Bias corrections improve estimates of potential predictivity.
Application to GWAS data reduced the error rate from 30% to 8%.
Abstract
Prediction for very large data sets is typically carried out in two stages: variable selection and pattern recognition. Ordinarily, variable selection involves seeing how well individual explanatory variables are correlated with the dependent variable. This practice neglects the possible interactions among the variables. Simulations have shown that a statistic, I, that we used for variable selection is much better correlated with predictivity than significance levels are. We explain this by defining theoretical predictivity and showing how I is related to predictivity. We calculate the biases of the overoptimistic training estimate of predictivity and of the pessimistic out-of-sample estimate. Corrections for the bias lead to improved estimates of the potential predictivity using small groups of possibly interacting variables. These results support the use of I in the variable selection phase of…
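The two biases mentioned in the abstract can be seen in a small simulation. The sketch below is a toy illustration, not the paper's statistic I or its bias correction. It assumes a discrete setting in which a small group of variables partitions subjects into cells, equal numbers of cases and controls, and a majority-vote rule per cell. The cell probabilities `P_CASE` and `P_CTRL` and the function names are hypothetical, chosen only for illustration. Under these assumptions the best achievable correct-classification rate is half the sum over cells of the larger of the two cell probabilities; the training estimate plugs in sample frequencies and tends to overshoot that value, while the true accuracy of the rule learned from the sample can never exceed it.

```python
import random

# Hypothetical cell probabilities for cases and controls (each sums to 1).
P_CASE = [0.20, 0.15, 0.15, 0.10, 0.10, 0.10, 0.10, 0.10]
P_CTRL = [0.10, 0.10, 0.10, 0.10, 0.10, 0.15, 0.15, 0.20]


def bayes_rate(p_case, p_ctrl):
    """Best achievable correct-classification rate with equal priors."""
    return 0.5 * sum(max(a, b) for a, b in zip(p_case, p_ctrl))


def draw_counts(probs, n, rng):
    """Draw n subjects and count how many fall into each cell."""
    counts = [0] * len(probs)
    for cell in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts


def simulate(p_case, p_ctrl, n, reps=2000, seed=0):
    """Average the training and out-of-sample rates over many training sets.

    Each training set has n cases and n controls. The learned rule predicts
    the majority class in each cell; ties are broken by a fair coin.
    """
    rng = random.Random(seed)
    train_sum = 0.0
    oos_sum = 0.0
    for _ in range(reps):
        case = draw_counts(p_case, n, rng)
        ctrl = draw_counts(p_ctrl, n, rng)
        # Training estimate: share of the 2n training subjects that the
        # majority rule classifies correctly.
        train_sum += sum(max(a, b) for a, b in zip(case, ctrl)) / (2 * n)
        # Out-of-sample rate: true accuracy of the learned rule on new data,
        # computed exactly from the known cell probabilities.
        correct = 0.0
        for a, b, pa, pb in zip(case, ctrl, p_case, p_ctrl):
            if a > b:
                correct += pa
            elif b > a:
                correct += pb
            else:
                correct += 0.5 * (pa + pb)
        oos_sum += 0.5 * correct
    return train_sum / reps, oos_sum / reps


if __name__ == "__main__":
    train, oos = simulate(P_CASE, P_CTRL, n=30)
    print("theoretical rate:", bayes_rate(P_CASE, P_CTRL))
    print("mean training estimate:", train)
    print("mean out-of-sample rate:", oos)
```

In this toy setting the theoretical rate lies between the two averages, and the gap on each side shrinks as `n` grows relative to the number of cells, which is why bias matters most when many cells are estimated from few subjects.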
Taxonomy
Topics: Gene expression and cancer classification · Bioinformatics and Genomic Networks · Metabolomics and Mass Spectrometry Studies
