MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA
Clément Bénard (LPSM (UMR_8001)), Sébastien da Veiga, Erwan Scornet (CMAP)

TL;DR
This paper analyzes the statistical properties of the mean decrease accuracy (MDA) in random forests, reveals its limitations in dependent covariate settings, and introduces the Sobol-MDA as a more reliable importance measure with practical benefits.
Contribution
The authors rigorously analyze MDA's asymptotic behavior, identify its flaws under dependence, and propose the Sobol-MDA as a novel, consistent importance measure for random forests.
Findings
The MDA converges to different quantities depending on the implementation.
The original MDA does not reliably detect influential covariates when features are dependent.
Sobol-MDA outperforms existing importance measures in variable selection tasks.
Abstract
Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. This asymptotic analysis reveals that these MDA versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures…
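As a rough illustration of the permutation idea behind the MDA, the sketch below shuffles one covariate at a time and records the increase in mean squared error. It is a toy setup under stated assumptions, not any of the implementations the paper formalizes: the `predict` function is a hypothetical stand-in for a fitted forest, the data are synthetic Gaussians with `x2` strongly correlated with `x1`, and the names `permutation_mda`, `mse`, and `rho` are illustrative only.

```python
import random


def mse(predict, X, y):
    """Mean squared error of `predict` on rows X against targets y."""
    return sum((yi - predict(row)) ** 2 for row, yi in zip(X, y)) / len(y)


def permutation_mda(predict, X, y, j, rng):
    """Increase in MSE after shuffling column j (a generic permutation importance)."""
    column = [row[j] for row in X]
    rng.shuffle(column)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    return mse(predict, X_perm, y) - mse(predict, X, y)


rng = random.Random(0)
rho = 0.9   # assumed correlation between x1 and x2
n = 2000

X = []
for _ in range(n):
    x1 = rng.gauss(0, 1)
    x2 = rho * x1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)  # dependent on x1
    x3 = rng.gauss(0, 1)                                      # unused by the model
    X.append([x1, x2, x3])


def predict(row):
    # Hypothetical stand-in for a fitted forest: uses x1 and x2, ignores x3.
    return row[0] + row[1]


y = [predict(row) for row in X]  # noise-free targets keep the baseline error at 0

mda = [permutation_mda(predict, X, y, j, rng) for j in range(3)]
print(mda)
```

In this toy setting, the permutation score for `x1` is close to twice its variance (about 2), whereas the variance of `x1` that remains once `x2` is known is only `1 - rho**2 = 0.19`. Shuffling breaks the dependence between `x1` and `x2`, so the score reflects more than the covariate's own contribution. This gap is the kind of discrepancy that motivates comparing permutation-based limits with Sobol-type indices.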
Taxonomy
Topics: Probabilistic and Robust Engineering Design · Markov Chains and Monte Carlo Methods · Statistical Methods and Inference
