Finding Important Genes from High-Dimensional Data: An Appraisal of Statistical Tests and Machine-Learning Approaches
Chamont Wang, Jana Gevertz, Chaur-Chin Chen, Leonardo Auslender

TL;DR
This paper critically evaluates statistical and machine-learning methods for identifying important genes in high-dimensional data, highlights their limitations and potential misuses, and proposes stochastic gradient boosting as an effective approach.
Contribution
It assesses the reliability of existing tools and proposes stochastic gradient boosting as a more effective method for gene selection in high-dimensional datasets.
Findings
Models with 100% accuracy often select different gene sets.
Some models classify data perfectly without using disease-related variables.
With moderate sample sizes, stochastic gradient boosting outperforms the other methods evaluated.
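For readers unfamiliar with the technique named in the last finding, here is a minimal sketch of stochastic gradient boosting: each round fits a small base learner to the current residuals on a random subsample drawn without replacement, then adds a shrunken copy of it to the ensemble. This is an illustration under stated assumptions, not the authors' implementation. The data are synthetic, the names fit_stump, boost, and predict are hypothetical, squared-error loss on 0/1 labels stands in for whatever loss the paper uses, and column 3 is made informative by construction.

```python
import random
from collections import Counter

def fit_stump(X, r, rows):
    """Least-squares decision stump fitted to targets r on the given row indices.

    Returns (feature, threshold, left_value, right_value), or None if no
    feature varies within the rows.
    """
    best = None
    for j in range(len(X[0])):
        order = sorted(rows, key=lambda i: X[i][j])
        for k in range(1, len(order)):
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between identical values
            left, right = order[:k], order[k:]
            lm = sum(r[i] for i in left) / len(left)
            rm = sum(r[i] for i in right) / len(right)
            sse = (sum((r[i] - lm) ** 2 for i in left)
                   + sum((r[i] - rm) ** 2 for i in right))
            if best is None or sse < best[0]:
                best = (sse, j, (lo + hi) / 2, lm, rm)
    return None if best is None else best[1:]

def boost(X, y, rounds=20, rate=0.1, subsample=0.5, seed=0):
    """Gradient boosting with squared loss; each round sees a random subsample."""
    rng = random.Random(seed)
    n = len(y)
    base = sum(y) / n
    pred = [base] * n
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is the plain residual.
        residual = [y[i] - pred[i] for i in range(n)]
        # The "stochastic" step: sample rows without replacement.
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        stump = fit_stump(X, residual, rows)
        if stump is None:
            continue
        j, t, lm, rm = stump
        stumps.append(stump)
        for i in range(n):
            pred[i] += rate * (lm if X[i][j] <= t else rm)
    return base, rate, stumps

def predict(model, x):
    base, rate, stumps = model
    return base + rate * sum(lm if x[j] <= t else rm for j, t, lm, rm in stumps)

# Synthetic data (an assumption for illustration): 40 samples, 10 "genes",
# nine of them pure noise and column 3 tied to the class label.
rng = random.Random(1)
n, p = 40, 10
y = [i % 2 for i in range(n)]
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
for i in range(n):
    X[i][3] = 1.0 if y[i] else -1.0

model = boost(X, y)
accuracy = sum((predict(model, X[i]) > 0.5) == bool(y[i]) for i in range(n)) / n
counts = Counter(j for j, _, _, _ in model[2])  # how often each feature was split on
```

Counting how often each feature is chosen across rounds gives a crude importance ranking. Note that the accuracy computed here is training accuracy, which is exactly the kind of figure the findings above warn against trusting on its own.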
Abstract
Over the past decades, statisticians and machine-learning researchers have developed literally thousands of new tools for the reduction of high-dimensional data in order to identify the variables most responsible for a particular trait. These tools have applications in a plethora of settings, including data analysis in the fields of business, education, forensics, and biology (such as microarray, proteomics, brain imaging), to name a few. In the present work, we focus our investigation on the limitations and potential misuses of certain tools in the analysis of the benchmark colon cancer data (2,000 variables; Alon et al., 1999) and the prostate cancer data (6,033 variables; Efron, 2010, 2008). Our analysis demonstrates that models that produce 100% accuracy measures often select different sets of genes and cannot stand the scrutiny of parameter estimates and model stability.…
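The observation that equally accurate models select different gene sets can be made concrete with a set-overlap measure. The sketch below uses Jaccard similarity, which is one common choice and not necessarily the one used in the paper. The model names and gene indices are invented for illustration and do not come from the colon or prostate cancer data.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two gene sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_overlap(selections):
    """Jaccard similarity for every pair of models' selected gene sets."""
    return {
        (m1, m2): jaccard(selections[m1], selections[m2])
        for m1, m2 in combinations(sorted(selections), 2)
    }

# Hypothetical selections, invented for illustration only.
selections = {
    "model_a": {14, 249, 377, 493},
    "model_b": {14, 66, 1772},
    "model_c": {765, 1423},
}
overlaps = pairwise_overlap(selections)
```

Values near 0 mean two models agree on almost none of their selected genes, even if both classify the samples perfectly.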
Taxonomy
Topics: Gene expression and cancer classification · Bioinformatics and Genomic Networks · Genetics, Bioinformatics, and Biomedical Research
