Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests
Merle Behr, Yu Wang, Xiao Li, and Bin Yu

TL;DR
This paper gives a theoretical account of how Random Forests (RF) discover Boolean feature interactions: it introduces the Locally Spiky Sparse (LSS) model and proves that the LSSFind method recovers interactions consistently under that model.
Contribution
It introduces the LSS model to capture biological thresholding behavior and proves that the LSSFind algorithm consistently recovers Boolean interactions from RF ensembles under this model.
Findings
Bounds on the Depth-Weighted Prevalence, DWP(S), of a feature set S characterize Boolean interactions in an RF tree ensemble
LSSFind recovers interactions consistently as sample size grows
Simulation confirms robustness even with assumption violations
Abstract
Random Forests (RF) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative Random Forests (iRF) use a tree ensemble from iteratively modified RF to obtain predictive and stable non-linear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a novel discontinuous nonlinear regression model, called the Locally Spiky Sparse (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble,…
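To make the LSS description concrete, here is a minimal sketch of a regression function built as a linear combination of piecewise constant Boolean interaction terms. It assumes each term is a product of threshold indicators of the form X_k <= threshold; the function name, the direction of the inequalities, the thresholds, the coefficients, and the noise level are all illustrative choices and are not taken from the paper.

```python
import numpy as np

def lss_regression_function(X, intercept, coefs, interactions, thresholds):
    """Evaluate an LSS-style mean function (illustrative, hypothetical helper).

    Each interaction is a tuple of feature indices. Its Boolean term equals 1
    only when every listed feature is at or below its threshold, and the mean
    is the intercept plus a weighted sum of these terms.
    """
    X = np.asarray(X, dtype=float)
    mean = np.full(X.shape[0], float(intercept))
    for beta, features in zip(coefs, interactions):
        active = np.ones(X.shape[0], dtype=bool)
        for k in features:
            # Assumed indicator form: 1(X_k <= threshold_k)
            active &= X[:, k] <= thresholds[k]
        mean += beta * active
    return mean

# Simulated example: 10 features, two disjoint interactions (all values assumed).
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 10))
interactions = [(0, 1), (2, 3, 4)]
thresholds = {0: 0.5, 1: 0.5, 2: 0.7, 3: 0.7, 4: 0.7}
coefs = [2.0, -1.0]

mean = lss_regression_function(X, 0.0, coefs, interactions, thresholds)
y = mean + rng.normal(scale=0.1, size=X.shape[0])  # noisy responses
```

The mean function is constant on each region where the set of active interactions is fixed, which is the piecewise constant behavior that axis-aligned tree splits are suited to pick up. Data generated this way could be passed to any Random Forest implementation to see which features tend to appear together along decision paths.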
