Cross-Cluster Weighted Forests

Maya Ramchandran, Rajarshi Mukherjee, and Giovanni Parmigiani

arXiv:2105.07610 · stat.ML · April 1, 2025 · 1 cites

TL;DR

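This paper introduces the Cross-Cluster Weighted Forest, an ensemble of Random Forests that are each trained on one cluster of the training data, with clusters found by algorithms such as k-means. On datasets with heterogeneous feature distributions, which are common in biological applications, the ensemble is more accurate and generalizes better than the traditional Random Forest algorithm.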

Contribution

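The paper proposes an ensemble approach that partitions the training data into clusters, trains a Random Forest on each cluster, and combines the forests with ensemble weights. It supports the approach with a theoretical analysis and with empirical comparisons against the traditional Random Forest algorithm on heterogeneous datasets.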

Findings

01 Significant accuracy improvements over standard Random Forests.

02 Robustness demonstrated across various data scenarios.

03 Effective application to cancer gene expression data.

Abstract

Adapting machine learning algorithms to better handle the presence of clusters or batch effects within training datasets is important across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a dataset with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We begin with a theoretical exploration of the benefits of our novel approach, denoted as the Cross-Cluster Weighted Forest, and subsequently empirically examine its robustness to various data-generating scenarios and outcome models. Furthermore, we explore the influence of the data partitioning and ensemble weighting…
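The procedure the abstract describes (cluster the training data, fit one forest per cluster, combine the forests with weights) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function names `fit_cluster_forests` and `predict_cluster_forests` are hypothetical, the outcome is assumed to be continuous, and the weighting scheme (non-negative least squares on training-set predictions, rescaled to sum to one) is an assumption chosen for simplicity, since the listing above truncates the abstract before it says how the ensemble weights are chosen.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def fit_cluster_forests(X, y, n_clusters=3, n_trees=50, seed=0):
    """Fit one Random Forest per k-means cluster and learn ensemble weights.

    Hypothetical helper: the weighting step is an assumed stacking scheme,
    not necessarily the one used in the paper.
    """
    # Partition the training rows by k-means on the features.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)

    # Train a separate forest on the rows of each cluster.
    forests = []
    for k in range(n_clusters):
        mask = labels == k
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        forest.fit(X[mask], y[mask])
        forests.append(forest)

    # One column of training-set predictions per forest.
    P = np.column_stack([f.predict(X) for f in forests])

    # Assumed weighting: non-negative least squares of y on the predictions,
    # rescaled so the weights sum to one (falls back to equal weights).
    weights, _ = nnls(P, y)
    if weights.sum() > 0:
        weights = weights / weights.sum()
    else:
        weights = np.full(n_clusters, 1.0 / n_clusters)
    return forests, weights


def predict_cluster_forests(forests, weights, X):
    """Predict with the weighted combination of the cluster-specific forests."""
    P = np.column_stack([f.predict(X) for f in forests])
    return P @ weights


if __name__ == "__main__":
    # Synthetic data with three shifted feature clusters (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 1.0, size=(40, 4)) for c in (-5.0, 0.0, 5.0)])
    y = X[:, 0] + rng.normal(0.0, 0.1, size=X.shape[0])
    forests, weights = fit_cluster_forests(X, y)
    print(weights, predict_cluster_forests(forests, weights, X[:5]))
```

Every forest predicts for every test point, so a forest trained on one cluster still contributes to predictions in the others; the weights determine how much each one counts.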

Taxonomy

Topics: Gene expression and cancer classification · Machine Learning and Data Classification · Statistical Methods and Inference