The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures
Anne-Claire Haury (CBIO), Pierre Gestraud, Jean-Philippe Vert (CBIO)

TL;DR
This study systematically compares 32 feature selection methods on four public gene expression datasets for breast cancer prognosis. It finds that simple filter methods such as Student's t-test generally outperform more complex embedded or wrapper methods when signatures are judged on accuracy, stability, and interpretability.
Contribution
The paper provides a systematic evaluation of 32 feature selection methods for biomarker discovery, showing that simple filter methods are generally more effective than more complex embedded or wrapper techniques.
Findings
Simple filter methods generally outperform more complex embedded or wrapper methods in accuracy and stability.
Ensemble feature selection generally has no positive effect on the results.
Overall, a simple Student's t-test appears to give the best performance.
Abstract
Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods.
Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce.
Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to…
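As a rough illustration of the kind of evaluation the abstract describes, the sketch below ranks features with a two-sample Student's t-statistic (a simple filter method) and estimates signature stability as the mean pairwise Jaccard overlap between signatures selected on random subsamples. This is a minimal sketch under stated assumptions, not the authors' protocol: the Jaccard-based stability score, the 80% subsampling scheme, and the names `t_scores`, `select_top_k`, `signature_stability`, and `make_toy_data` are hypothetical choices made for illustration, and the data are synthetic rather than the breast cancer datasets used in the paper.

```python
import numpy as np


def t_scores(X, y):
    """Absolute two-sample Student t-statistic (pooled variance) per feature.

    X: (n_samples, n_features) array; y: binary labels in {0, 1}.
    """
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * a.var(axis=0, ddof=1) + (nb - 1) * b.var(axis=0, ddof=1)
    ) / (na + nb - 2)
    se = np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    # Guard against zero variance so constant features score 0, not NaN.
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.maximum(se, 1e-12)


def select_top_k(X, y, k):
    """Filter selection: indices of the k features with the largest |t|."""
    order = np.argsort(-t_scores(X, y))
    return set(order[:k].tolist())


def signature_stability(X, y, k, n_resamples=20, frac=0.8, seed=0):
    """Mean pairwise Jaccard overlap of signatures from random subsamples.

    Assumed stability measure for illustration; 1.0 means every subsample
    yields the same signature. Assumes each subsample keeps at least two
    samples per class.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    signatures = []
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        signatures.append(select_top_k(X[idx], y[idx], k))
    overlaps = [
        len(s & t) / len(s | t)
        for i, s in enumerate(signatures)
        for t in signatures[i + 1:]
    ]
    return float(np.mean(overlaps))


def make_toy_data(n_per_class=50, n_features=200, n_informative=5,
                  shift=3.0, seed=0):
    """Synthetic two-class data; only the first n_informative features differ."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, n_features))
    X[y == 1, :n_informative] += shift
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("signature:", sorted(select_top_k(X, y, k=5)))
    print("stability:", signature_stability(X, y, k=5))
```

On real expression data the class separation is far weaker than in this toy setup, so signatures drawn from different subsamples would typically overlap much less. Comparing that overlap across selection methods is the kind of stability question the study addresses.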
