Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem
Yunbi Nam, Sunwoo Han

TL;DR
This paper investigates how class balancing techniques affect Random Forest variable importance measures and introduces a new feature selection algorithm that improves prediction accuracy in imbalanced classification tasks.
Contribution
The paper studies the impact of class balancing on Random Forest (RF) variable importance and proposes a novel variable selection algorithm that uses RF variable importance together with its confidence interval.
Findings
Over-sampling improves importance measurement in small, imbalanced datasets.
Under-sampling fails to distinguish important variables from non-informative ones.
Proposed algorithm enhances prediction performance with optimal feature sets.
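To make the over-sampling finding concrete, here is a minimal sketch of random over-sampling, one common class balancing technique. This summary does not say which over-sampling variant the paper uses, so the function name `random_oversample`, the toy data, and the choice of duplicating minority rows until every class matches the majority count are all assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen rows of each
    smaller class until every class has as many rows as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class in the original (unbalanced) data.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (made up): four majority rows, one minority row.
rows = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.4, 0.8], [2.0, 1.0]]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

The balanced rows would then be passed to a Random Forest in place of the original sample. Under-sampling is the mirror image: it discards majority rows instead, which shrinks an already small dataset.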
Abstract
Random Forest (RF) is a machine learning method that offers many advantages, including the ability to easily measure variable importance. Class balancing is a well-known technique for dealing with the class imbalance problem. However, its effect on RF variable importance has not been actively studied. In this paper, we study the effect of class balancing on RF variable importance. Our simulation results show that over-sampling is effective in correctly measuring variable importance in class-imbalanced situations with small sample sizes, while under-sampling fails to differentiate important and non-informative variables. We then propose a variable selection algorithm that utilizes RF variable importance and its confidence interval. Through an experimental study using many real and artificial datasets, we demonstrate that our proposed algorithm efficiently selects an optimal feature set, leading to…
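The abstract says the proposed algorithm uses RF variable importance and its confidence interval, but this summary does not give the selection rule. The sketch below shows one plausible reading, not the authors' algorithm: given importance scores from repeated forest fits, keep a variable only if the lower bound of its confidence interval is above zero. The function name `select_by_importance_ci`, the normal-approximation interval with `z=1.96`, the zero threshold, and the numbers in `runs` are all assumptions.

```python
import math
import statistics


def select_by_importance_ci(importance_runs, z=1.96):
    """Hypothetical rule: keep features whose importance confidence
    interval (mean +/- z * standard error) lies entirely above zero.

    importance_runs maps a feature name to a list of importance values,
    one per repeated Random Forest fit.
    """
    selected = {}
    for name, values in importance_runs.items():
        mean = statistics.mean(values)
        se = statistics.stdev(values) / math.sqrt(len(values))
        lower, upper = mean - z * se, mean + z * se
        if lower > 0:
            selected[name] = (lower, upper)
    return selected


# Made-up importance values from five repeated fits per feature.
runs = {
    "x1": [0.12, 0.10, 0.11, 0.13, 0.09],   # consistently positive
    "x2": [0.01, -0.02, 0.00, 0.02, -0.01], # scattered around zero
    "x3": [0.05, 0.06, 0.04, 0.05, 0.07],   # consistently positive
}

selected = select_by_importance_ci(runs)
print(sorted(selected))
```

The intuition is that permutation-style importance for a non-informative variable fluctuates around zero, so its interval tends to straddle zero, while an informative variable's interval stays positive.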
Taxonomy
Topics: Imbalanced Data Classification Techniques · Grey System Theory Applications · Financial Distress and Bankruptcy Prediction
