On feature selection in double-imbalanced data settings: a Random Forest approach
Fabio Demaria

TL;DR
This paper introduces a new minimal depth-based feature selection method for Random Forests tailored to double-imbalanced high-dimensional data, improving stability and accuracy of variable importance rankings.
Contribution
It proposes a novel thresholding scheme based on minimal depth to enhance feature selection stability and interpretability in double-imbalanced settings.
Findings
More parsimonious variable subsets achieved
Improved accuracy over traditional methods
Validated on simulated and real datasets
Abstract
Feature selection is a critical step in high-dimensional classification tasks, particularly under challenging conditions of double imbalance, namely settings characterized by both class imbalance in the response variable and dimensional asymmetry in the data. In such scenarios, traditional feature selection methods applied to Random Forests (RF) often yield unstable or misleading importance rankings. This paper proposes a novel thresholding scheme for feature selection based on minimal depth, which exploits the tree topology to assess variable relevance. Extensive experiments on simulated and real-world datasets demonstrate that the proposed approach produces more parsimonious and accurate subsets of variables compared to conventional minimal depth-based selection. The method provides a practical and interpretable solution for variable selection in RF under double imbalance…
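To make the notion of minimal depth concrete, the following is a minimal sketch, not the paper's method: it computes, for each feature, the depth of its shallowest split in every tree of a scikit-learn forest and averages over trees. The helper names (`minimal_depths`, `mean_minimal_depth`), the synthetic data, the convention for unused features, and above all the cut-off (the average of the per-feature means) are assumptions made for illustration; the thresholding scheme proposed in the paper is not described in this summary and is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def minimal_depths(tree, n_features):
    """Depth of the shallowest split on each feature in one fitted tree.

    Assumed convention: a feature never used in the tree is assigned
    (deepest node depth + 1).
    """
    t = tree.tree_
    depths = np.full(n_features, np.inf)
    deepest = 0
    stack = [(0, 0)]  # (node id, depth); the root sits at depth 0
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # scikit-learn marks leaves with child id -1
            continue
        feat = t.feature[node]
        depths[feat] = min(depths[feat], depth)
        stack.append((left, depth + 1))
        stack.append((right, depth + 1))
    depths[np.isinf(depths)] = deepest + 1
    return depths


def mean_minimal_depth(forest, n_features):
    """Average the per-tree minimal depths over all trees in the forest."""
    per_tree = [minimal_depths(est, n_features) for est in forest.estimators_]
    return np.mean(per_tree, axis=0)


# Synthetic stand-in for a double-imbalanced setting (assumed here to mean
# a rare positive class and more features than samples).
X, y = make_classification(
    n_samples=100, n_features=200, n_informative=5, n_redundant=0,
    weights=[0.9, 0.1], shuffle=False, random_state=0,
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

md = mean_minimal_depth(forest, X.shape[1])
threshold = md.mean()  # placeholder cut-off, NOT the paper's scheme
selected = np.flatnonzero(md < threshold)
print(f"{selected.size} of {X.shape[1]} features fall below the cut-off")
```

Features that tend to split close to the root receive a small mean minimal depth, which is why a threshold on this quantity can act as a selection rule. The choice of that threshold is exactly what the paper addresses, so the cut-off above should be read only as a stand-in.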
Taxonomy
Topics: Imbalanced Data Classification Techniques · Face and Expression Recognition · Financial Distress and Bankruptcy Prediction
Methods: Feature Selection
