Trees, forests, and impurity-based variable importance

Erwan Scornet (CMAP)

arXiv:2001.04295 · math.ST · December 28, 2021 · 27 cites

TL;DR

This paper analyzes the theoretical foundations of the Mean Decrease Impurity (MDI) variable importance in random forests, clarifying what MDI estimates and how it behaves when input variables are independent, dependent, or interacting.

Contribution

The paper gives a rigorous variance decomposition of MDI when input variables are independent, and examines the limitations of MDI when variables are dependent or interact.

Findings

01. MDI provides a clear variance decomposition when variables are independent.

02. MDI's interpretation becomes problematic with dependent variables or interactions.

03. Using forests can have benefits over single trees in variable importance analysis.
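
To make the first finding concrete, here is a minimal sketch of how MDI is computed for a single regression tree, using only the Python standard library. It is not the paper's code. The helper names (`variance`, `best_split`, `tree_mdi`), the toy model `y = 2*x1 + x2 + noise` with an irrelevant third input, and the depth limit are all illustrative assumptions. The sketch shows an identity that holds for any such tree: the MDI values summed over variables, plus the sample-weighted within-leaf variance, equal the empirical variance of the response.

```python
import random

def variance(values):
    """Empirical (biased) variance, used here as the node impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(X, y, idx):
    """Find the axis-aligned split of node `idx` with the largest variance decrease."""
    parent = variance([y[i] for i in idx])
    best = None
    for j in range(len(X[0])):
        cuts = sorted({X[i][j] for i in idx})
        for lo, hi in zip(cuts, cuts[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [i for i in idx if X[i][j] <= t]
            right = [i for i in idx if X[i][j] > t]
            if not left or not right:
                continue
            children = (len(left) * variance([y[i] for i in left])
                        + len(right) * variance([y[i] for i in right])) / len(idx)
            decrease = parent - children
            if best is None or decrease > best[0]:
                best = (decrease, j, left, right)
    return best  # None when the node cannot be split

def tree_mdi(X, y, max_depth):
    """Grow one tree and return (MDI per variable, weighted within-leaf variance)."""
    n, p = len(y), len(X[0])
    importances = [0.0] * p
    residual = 0.0
    stack = [(list(range(n)), 0)]
    while stack:
        idx, depth = stack.pop()
        split = best_split(X, y, idx) if depth < max_depth else None
        if split is None or split[0] <= 0:
            # Leaf: its remaining variance is not attributed to any variable.
            residual += len(idx) / n * variance([y[i] for i in idx])
            continue
        decrease, j, left, right = split
        # MDI: impurity decrease weighted by the fraction of samples in the node.
        importances[j] += len(idx) / n * decrease
        stack.append((left, depth + 1))
        stack.append((right, depth + 1))
    return importances, residual

# Hypothetical toy data: three independent uniform inputs, the third is irrelevant.
random.seed(0)
n = 300
X = [[random.random() for _ in range(3)] for _ in range(n)]
y = [2 * x[0] + x[1] + random.gauss(0, 0.1) for x in X]

importances, residual = tree_mdi(X, y, max_depth=4)
print("MDI per variable:", importances)
print("sum(MDI) + leaf variance:", sum(importances) + residual)
print("empirical variance of y: ", variance(y))
```

The last two printed values should agree up to floating-point error. That is the empirical counterpart of a variance decomposition. How each MDI value should be interpreted once inputs are dependent or interact is the question the paper addresses, and this sketch does not settle it. A forest would average these per-tree importances over many randomized trees.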

Abstract

Tree ensemble methods such as random forests [Breiman, 2001] are widely used to handle high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedure may not be reasonable, since enlightened decisions require an in-depth understanding of the algorithm's prediction process. Unfortunately, random forests are not intrinsically interpretable, since their predictions result from averaging several hundred decision trees. A classic approach to gaining knowledge about this so-called black-box algorithm is to compute variable importances, which are used to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a major role in data analysis. Nevertheless, there is no justification to use…

Taxonomy

Topics: Neural Networks and Applications · Face and Expression Recognition · Statistical Methods and Inference