Random feature baselines provide distributional performance and feature selection benchmarks for clinical and 'omic machine learning
Randall J. Ellis, Audrey Airaud, Chirag J. Patel

TL;DR
This study demonstrates that random feature baselines can serve as effective benchmarks for distributional performance and feature selection in high-dimensional biomedical machine learning, challenging traditional feature importance assumptions.
Contribution
The paper introduces random feature baselines (RFBs), sets of randomly chosen features, and evaluates them across 607 disease outcomes in the UK Biobank, showing how they can serve as benchmarks for selected features.
Findings
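In two published case studies (dementia and hip fracture), RFBs perform similarly to the published proteins of interest when the same number of proteins is chosen at random.
For 114 of 607 outcomes, 5 randomly chosen features gave a higher mean AUROC than a model using all proteins in the dataset.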
Using RFBs can inform feature selection and target discovery practices.
Abstract
Identifying predictive features from high-dimensional datasets is a major task in biomedical research. However, it is difficult to determine the robustness of selected features. Here, we investigate the performance of randomly chosen features, what we term "random feature baselines" (RFBs), in the context of disease risk prediction from blood plasma proteomics data in the UK Biobank. We examine two published case studies predicting diagnosis of (1) dementia and (2) hip fracture. RFBs perform similarly to published proteins of interest (using the same number, randomly chosen). We then measure the performance of RFBs for all 607 disease outcomes in the UK Biobank, with various numbers of randomly chosen features, as well as all proteins in the dataset. 114/607 outcomes showed a higher mean AUROC when choosing 5 random features than using all proteins, and the absolute difference in mean…
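The procedure the abstract describes (draw k features at random, fit a model, record AUROC, repeat to obtain a distribution, then compare with a chosen panel) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the mean-difference scorer is a toy stand-in for whatever model the paper uses, and the `selected` panel and all function names are hypothetical.

```python
import random
import statistics


def auroc(scores, labels):
    """AUROC as the probability that a random case outscores a random control (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference(X, y, cols):
    """Toy linear scorer (an assumption): weight each column by case mean minus control mean."""
    weights = {}
    for c in cols:
        case = [row[c] for row, label in zip(X, y) if label == 1]
        ctrl = [row[c] for row, label in zip(X, y) if label == 0]
        weights[c] = statistics.fmean(case) - statistics.fmean(ctrl)
    return weights


def score(X, weights):
    """Weighted sum over the chosen columns for every row."""
    return [sum(w * row[c] for c, w in weights.items()) for row in X]


def random_feature_baseline(X_train, y_train, X_test, y_test, k, n_draws, rng):
    """Held-out AUROC for n_draws independent random subsets of k features."""
    n_features = len(X_train[0])
    aurocs = []
    for _ in range(n_draws):
        cols = rng.sample(range(n_features), k)
        weights = fit_mean_difference(X_train, y_train, cols)
        aurocs.append(auroc(score(X_test, weights), y_test))
    return aurocs


def make_synthetic(n, n_features, rng):
    """Synthetic stand-in data: every feature carries the same weak signal."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(0.3 * label, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(300, 50, rng)
X_test, y_test = make_synthetic(300, 50, rng)

# Distribution of AUROCs over 100 random draws of 5 features.
rfb = random_feature_baseline(X_train, y_train, X_test, y_test, k=5, n_draws=100, rng=rng)

# Hypothetical "selected" panel of the same size, scored the same way.
selected = [0, 1, 2, 3, 4]
selected_auroc = auroc(score(X_test, fit_mean_difference(X_train, y_train, selected)), y_test)

# Fraction of random draws that match or beat the selected panel.
frac_at_least = sum(a >= selected_auroc for a in rfb) / len(rfb)

print(f"RFB mean AUROC: {statistics.fmean(rfb):.3f}")
print(f"Selected panel AUROC: {selected_auroc:.3f}")
print(f"Random draws matching or beating the panel: {frac_at_least:.2f}")
```

Reporting where the selected panel falls within the RFB distribution, rather than its AUROC alone, is what makes the comparison a baseline: a panel that many random draws match or beat offers little evidence that its specific features matter.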
Taxonomy
Topics: Gene expression and cancer classification
Methods: Feature Selection
