Random feature baselines provide distributional performance and feature selection benchmarks for clinical and 'omic machine learning

Randall J. Ellis; Audrey Airaud; Chirag J. Patel

arXiv:2411.10574 · q-bio.QM · November 28, 2024

TL;DR

This study demonstrates that random feature baselines can serve as effective benchmarks for distributional performance and feature selection in high-dimensional biomedical machine learning, challenging traditional feature importance assumptions.

Contribution

It introduces the concept of random feature baselines (RFBs) and evaluates their performance across numerous disease prediction tasks in the UK Biobank, highlighting their utility as benchmarks.

Findings

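01. In the dementia and hip fracture case studies, RFBs perform similarly to the published proteins of interest when the same number of features is chosen at random.

02. For 114 of 607 disease outcomes, 5 randomly chosen features gave a higher mean AUROC than a model using all proteins.

03. Using RFBs can inform feature selection and target discovery practices.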
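
One way to use an RFB as a distributional benchmark, sketched here as an assumption rather than as the paper's procedure: place a candidate panel's AUROC within the distribution of AUROCs obtained from random panels of the same size. The function name, the add-one correction, and every number below are hypothetical and chosen only for illustration.

```python
def empirical_p_value(baseline_aurocs, candidate_auroc):
    """Share of random panels that match or beat the candidate panel.

    The add-one correction (a common permutation-test convention, assumed
    here) keeps the estimate above zero for a finite number of draws.
    """
    if not baseline_aurocs:
        raise ValueError("baseline_aurocs must not be empty")
    n_at_least = sum(a >= candidate_auroc for a in baseline_aurocs)
    return (n_at_least + 1) / (len(baseline_aurocs) + 1)


# Hypothetical AUROCs from 10 random panels, plus one hypothetical candidate.
baseline = [0.70, 0.72, 0.68, 0.74, 0.71, 0.69, 0.73, 0.75, 0.70, 0.72]
candidate = 0.73

p = empirical_p_value(baseline, candidate)
print(f"{p:.3f} of random panels do at least as well as the candidate")
```

A large value suggests the candidate panel is hard to distinguish from a random draw of the same size; a small value suggests it carries signal beyond what random features provide.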

Abstract

Identifying predictive features from high-dimensional datasets is a major task in biomedical research. However, it is difficult to determine the robustness of selected features. Here, we investigate the performance of randomly chosen features, what we term "random feature baselines" (RFBs), in the context of disease risk prediction from blood plasma proteomics data in the UK Biobank. We examine two published case studies predicting diagnosis of (1) dementia and (2) hip fracture. RFBs perform similarly to published proteins of interest (using the same number, randomly chosen). We then measure the performance of RFBs for all 607 disease outcomes in the UK Biobank, with various numbers of randomly chosen features, as well as all proteins in the dataset. 114/607 outcomes showed a higher mean AUROC when choosing 5 random features than using all proteins, and the absolute difference in mean…
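
A minimal sketch of the procedure the abstract describes, under stated assumptions: draw k features at random, fit a model on them, record the test AUROC, and repeat to obtain a distribution. The synthetic data, the mean-difference scoring model, and all function names are illustrative stand-ins; they are not the authors' pipeline, their model, or UK Biobank data.

```python
import random
import statistics


def auroc(labels, scores):
    """Rank-based AUROC: P(case score > control score), ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference(X, y, cols):
    """Stand-in model (an assumption): weight = mean(cases) - mean(controls)."""
    weights = []
    for c in cols:
        cases = [row[c] for row, label in zip(X, y) if label == 1]
        controls = [row[c] for row, label in zip(X, y) if label == 0]
        weights.append(statistics.fmean(cases) - statistics.fmean(controls))
    return weights


def predict(X, cols, weights):
    """Linear score over the selected columns only."""
    return [sum(w * row[c] for w, c in zip(weights, cols)) for row in X]


def random_feature_baseline(X_train, y_train, X_test, y_test, k, n_draws, rng):
    """Test AUROC for n_draws models, each fit on k randomly chosen features."""
    n_features = len(X_train[0])
    aurocs = []
    for _ in range(n_draws):
        cols = rng.sample(range(n_features), k)
        weights = fit_mean_difference(X_train, y_train, cols)
        aurocs.append(auroc(y_test, predict(X_test, cols, weights)))
    return aurocs


def make_synthetic(n, p, rng, shift=0.3):
    """Toy cohort in which every feature is weakly shifted in cases."""
    y = [rng.randint(0, 1) for _ in range(n)]
    X = [[rng.gauss(shift * label, 1.0) for _ in range(p)] for label in y]
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(300, 50, rng)
X_test, y_test = make_synthetic(300, 50, rng)

rfb_aurocs = random_feature_baseline(
    X_train, y_train, X_test, y_test, k=5, n_draws=20, rng=rng
)
print(f"mean AUROC over {len(rfb_aurocs)} random draws: "
      f"{statistics.fmean(rfb_aurocs):.3f}")
print(f"range: {min(rfb_aurocs):.3f} to {max(rfb_aurocs):.3f}")
```

In this toy setup every feature carries weak signal, so random panels score well above 0.5. That is the kind of setting in which a selected panel should be compared against the spread of random draws rather than against chance alone.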

Code & Models

Repositories

RandallJEllis/ml4h_2024 (Official)

Taxonomy

Topics: Gene expression and cancer classification

Methods: Feature Selection