Variable importance in binary regression trees and forests

Hemant Ishwaran

arXiv:0711.2434·stat.ML·September 29, 2009

TL;DR

This paper develops a theoretical framework for variable importance (VIMP) and pairwise variable associations in binary regression trees and forests. The framework is built around the node mean squared error of a maximal subtree, and it bears on variable screening in high-dimensional data.

Contribution

The paper gives a theoretical characterization of variable importance in binary regression trees and extends it to ensembles such as random forests, an area where very little theory previously existed.

Findings

01 The theory extends from single trees to ensembles and applies to random forests, which helps in interpreting their variable importance values.

02 It provides a theoretical basis for the common practice of screening variables in high-dimensional data.

03 It clarifies how pairwise variable associations arise in tree-based models.

Abstract

We characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees. A key component involves the node mean squared error for a quantity we refer to as a maximal subtree. The theory naturally extends from single trees to ensembles of trees and applies to methods like random forests. This is useful because, while importance values from random forests are used to screen variables (for example, to filter high-throughput genomic data in bioinformatics), very little theory exists about their properties.
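The abstract refers to node mean squared error without defining it on this page. As a rough illustration only, the sketch below computes the usual node MSE of a regression node (responses compared with the node mean) and searches for the binary split that minimizes the size-weighted MSE of the two daughter nodes, in the style of CART. The function names and toy data are hypothetical, and the paper's own definitions of VIMP and of a maximal subtree are not reproduced here.

```python
from statistics import fmean


def node_mse(y):
    """MSE of a node that predicts the mean of the responses it contains."""
    mu = fmean(y)
    return fmean((v - mu) ** 2 for v in y)


def best_binary_split(x, y):
    """Find the threshold t (split rule: x <= t) that minimizes the
    size-weighted MSE of the two daughter nodes.

    Returns (t, score); t is None if no split improves on the parent node.
    """
    n = len(y)
    best = (None, node_mse(y))
    # The largest distinct value is skipped so the right daughter is never empty.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Toy data (assumed for illustration): y depends on x_signal only.
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
x_signal = [1, 2, 3, 4, 5, 6]
x_noise = [3, 1, 6, 2, 5, 4]

parent_mse = node_mse(y)
signal_split = best_binary_split(x_signal, y)
noise_split = best_binary_split(x_noise, y)
```

On this toy data, splitting on `x_signal` separates the two response levels exactly, while the best split on `x_noise` leaves most of the parent node's MSE in place. That gap is the kind of intuition importance measures try to formalize.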
