Cross-Cluster Weighted Forests
Maya Ramchandran, Rajarshi Mukherjee, and Giovanni Parmigiani

TL;DR
This paper introduces Cross-Cluster Weighted Forests, an ensemble method that trains a Random Forest on each cluster of the training data and combines the resulting forests with weights, improving predictive accuracy and robustness, especially on heterogeneous biological datasets.
Contribution
The paper proposes a novel ensemble approach that trains Random Forests on clusters within the training data, found by algorithms such as k-means, and improves on the traditional Random Forest algorithm when the feature distribution is heterogeneous.
Findings
Significant improvements in accuracy and generalizability over the traditional Random Forest algorithm.
Robustness examined empirically across various data-generating scenarios and outcome models.
Effective application to cancer gene expression data.
Abstract
Adapting machine learning algorithms to better handle the presence of clusters or batch effects within training datasets is important across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a dataset with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We begin with a theoretical exploration of the benefits of our novel approach, denoted as the Cross-Cluster Weighted Forest, and subsequently empirically examine its robustness to various data-generating scenarios and outcome models. Furthermore, we explore the influence of the data partitioning and ensemble weighting…
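The procedure the abstract describes (partition the training data with k-means, fit one Random Forest per cluster, then combine the forests' predictions) can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the function names, the regression setting, and the default equal weights are all assumptions, and the weighting schemes the paper actually studies are cut off in the abstract above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def fit_cluster_forests(X, y, n_clusters=3, n_trees=100, seed=0):
    """Partition the training rows with k-means and fit one forest per cluster.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    forests = []
    for k in range(n_clusters):
        mask = labels == k  # rows assigned to cluster k
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        forest.fit(X[mask], y[mask])
        forests.append(forest)
    return forests


def predict_ensemble(forests, X, weights=None):
    """Combine the per-cluster forests' predictions as a weighted sum.

    Equal weights are a placeholder assumption; any weight vector with one
    entry per forest can be passed instead.
    """
    preds = np.column_stack([f.predict(X) for f in forests])  # (n_samples, n_forests)
    if weights is None:
        weights = np.full(len(forests), 1.0 / len(forests))
    return preds @ np.asarray(weights)
```

Every forest predicts on every new sample, regardless of which cluster that sample would fall into; only the training data is partitioned. Replacing the equal weights with weights learned from the data is the natural next step, and the `weights` argument is left open for that.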
Taxonomy
Topics: Gene expression and cancer classification · Machine Learning and Data Classification · Statistical Methods and Inference
