Trees, forests, and impurity-based variable importance
Erwan Scornet (CMAP)

TL;DR
This paper analyzes the theoretical foundations of the Mean Decrease Impurity (MDI) variable importance in random forests, clarifying what it estimates and how it behaves when input variables are independent, dependent, or interacting.
Contribution
It shows rigorously that MDI yields a variance decomposition of the output when input variables are independent and do not interact, and it explores the limitations of MDI when variables are dependent or interact.
Findings
MDI provides a clear variance decomposition when variables are independent.
MDI's interpretation becomes problematic with dependent variables or interactions.
Using forests can have benefits over single trees in variable importance analysis.
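To make these findings concrete, here is a minimal sketch, not the paper's own experiment. It assumes the usual definition of MDI: for each variable, sum the impurity (variance) decrease of every node that splits on it, weighted by the fraction of samples reaching that node, then average over trees. Everything in it is an illustrative assumption: the synthetic data, the simplified forest (no bootstrap, one randomly chosen candidate feature per node), and the names `best_split`, `grow`, and `forest_mdi`.

```python
import random

MAX_DEPTH = 6   # assumption: shallow trees keep the sketch fast
MIN_LEAF = 10   # assumption: minimum number of samples in each child

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(rows, ys, j):
    """Return (gain, threshold) of the best variance-reducing split on feature j."""
    n = len(ys)
    order = sorted(range(n), key=lambda i: rows[i][j])
    total, total_sq = sum(ys), sum(y * y for y in ys)
    parent = total_sq / n - (total / n) ** 2
    left_sum = left_sq = 0.0
    best = (0.0, None)
    for k in range(1, n):
        y = ys[order[k - 1]]
        left_sum += y
        left_sq += y * y
        if k < MIN_LEAF or n - k < MIN_LEAF:
            continue
        lo, hi = rows[order[k - 1]][j], rows[order[k]][j]
        if lo == hi:
            continue  # cannot split between identical values
        var_l = left_sq / k - (left_sum / k) ** 2
        var_r = (total_sq - left_sq) / (n - k) - ((total - left_sum) / (n - k)) ** 2
        gain = parent - (k * var_l + (n - k) * var_r) / n
        if gain > best[0]:
            best = (gain, (lo + hi) / 2)
    return best

def grow(rows, ys, n_total, depth, rng, mdi):
    """Grow one tree recursively, adding weighted impurity decreases to mdi."""
    if depth == MAX_DEPTH or len(ys) < 2 * MIN_LEAF:
        return
    j = rng.randrange(len(rows[0]))  # one random candidate feature per node
    gain, thr = best_split(rows, ys, j)
    if thr is None:
        return
    mdi[j] += len(ys) / n_total * gain
    for side in (True, False):
        idx = [i for i in range(len(ys)) if (rows[i][j] <= thr) == side]
        grow([rows[i] for i in idx], [ys[i] for i in idx],
             n_total, depth + 1, rng, mdi)

def forest_mdi(rows, ys, n_trees=20, seed=0):
    """Average the (unnormalized) MDI of each feature over n_trees random trees."""
    rng = random.Random(seed)
    mdi = [0.0] * len(rows[0])
    for _ in range(n_trees):
        grow(rows, ys, len(ys), 0, rng, mdi)
    return [v / n_trees for v in mdi]

data_rng = random.Random(1)
n = 1000
# Independent inputs, additive model Y = 2*X0 + X1; X2 plays no role.
X_ind = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y_ind = [2 * x[0] + x[1] for x in X_ind]
# Dependent inputs: X2 becomes a noisy copy of X0; the model is unchanged.
X_dep = [[x[0], x[1], x[0] + 0.1 * data_rng.gauss(0, 1)] for x in X_ind]
y_dep = [2 * x[0] + x[1] for x in X_dep]

mdi_ind = forest_mdi(X_ind, y_ind)
mdi_dep = forest_mdi(X_dep, y_dep)
print("independent:", [round(v, 3) for v in mdi_ind])
print("dependent:  ", [round(v, 3) for v in mdi_dep])
```

In the independent case the importances should rank X0 above X1 above X2, and they cannot sum to more than the empirical variance of the output. In the dependent case X2 takes over part of the importance of X0 even though it does not appear in the model, which is the kind of interpretation problem the findings describe.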
Abstract
Tree ensemble methods such as random forests [Breiman, 2001] are very popular for handling high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable since enlightened decisions require an in-depth comprehension of the algorithm prediction process. Unfortunately, random forests are not intrinsically interpretable since their prediction results from averaging several hundreds of decision trees. A classic approach to gain knowledge on this so-called black-box algorithm is to compute variable importances, which are used to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a great role in data analysis. Nevertheless, there is no justification to use…
Taxonomy
Topics: Neural Networks and Applications · Face and Expression Recognition · Statistical Methods and Inference
