The All Relevant Feature Selection using Random Forest
Miron B. Kursa, Witold R. Rudnicki

TL;DR
This paper evaluates random forest-based algorithms for all relevant feature selection, demonstrating their effectiveness on synthetic and real gene expression data, and identifying both known and new relevant features.
Contribution
It compares two recently proposed random forest wrapper algorithms for all relevant feature selection on synthetic data sets, then applies one of them to semi-synthetic data sets and a well-known gene expression data set to assess its practical effectiveness.
Findings
Heuristic algorithms perform close to the reference ideal algorithm on synthetic data.
The algorithms identify relevant features with reasonable accuracy.
New relevant genes were discovered in gene expression data.
Abstract
In this paper we examine the application of the random forest classifier to the all relevant feature selection problem. To this end we first examine two recently proposed all relevant feature selection algorithms, both random forest wrappers, on a series of synthetic data sets of varying size. We show that reasonable prediction accuracy can be achieved and that heuristic algorithms designed to handle the all relevant problem perform close to the reference ideal algorithm. Then we apply one of the algorithms to four families of semi-synthetic data sets to assess how the properties of a particular data set influence the results of feature selection. Finally, we test the procedure on a well-known gene expression data set. The relevance of nearly all previously established important genes was confirmed; moreover, the relevance of several new…
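Example
The abstract does not spell out how a random forest wrapper for all relevant feature selection works. The sketch below shows one common design, assumed here for illustration only: each real feature is compared against permuted "shadow" copies, and a feature is kept if it repeatedly scores higher than the best shadow. Two simplifications are swapped in and are not taken from the paper: scikit-learn's impurity-based `feature_importances_` stands in for whatever importance measure the examined algorithms use, and a fixed hit-rate threshold stands in for a statistical test. The function name `shadow_feature_selection`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_selection(X, y, n_iter=10, min_hit_rate=0.8, seed=0):
    """Return indices of features that repeatedly beat the best shadow feature.

    Illustrative sketch only: impurity-based importance and a fixed
    hit-rate threshold are assumptions, not the paper's procedure.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)

    for i in range(n_iter):
        # Shadow features: each column is permuted independently, which keeps
        # its marginal distribution but breaks any relation to the label.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + i)
        forest.fit(np.hstack([X, shadows]), y)

        importances = forest.feature_importances_
        best_shadow = importances[n_features:].max()
        # A real feature scores a hit when it is more important than
        # every shadow feature in this iteration.
        hits += importances[:n_features] > best_shadow

    return [j for j in range(n_features) if hits[j] / n_iter >= min_hit_rate]


# Hypothetical toy data: features 0 and 1 determine the label,
# features 2 to 5 are pure noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

selected = shadow_feature_selection(X, y)
print(selected)
```

On the toy data, features 0 and 1 should be selected in essentially every run, while a noise feature is kept only if it outranks all shadows in at least 8 of 10 iterations. The goal differs from minimal-optimal selection: every feature that carries information is retained, including redundant ones.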
Taxonomy
Topics: Gene expression and cancer classification · Evolutionary Algorithms and Applications · Machine Learning and Data Classification
