Multi-Study Boosting: Theoretical Considerations for Merging vs. Ensembling

Cathy Shyr; Pragya Sur; Giovanni Parmigiani; Prasad Patil

arXiv:2207.04588 · stat.ML · July 14, 2022 · 1 citation

TL;DR

This paper explores the theoretical and practical considerations of merging versus ensembling in multi-study boosting, providing guidelines based on heterogeneity and bias-variance trade-offs for better model generalizability.

Contribution

The paper derives a theoretical transition point for choosing between merging and ensembling when boosting with linear learners, and supports it with simulations and a real-data application.

Findings

01. Theoretical transition point guides merging vs. ensembling decisions.

02. Bias-variance analysis clarifies error components in boosting.

03. Simulation and breast cancer data validate the guidelines.

Abstract

Cross-study replicability is a powerful model evaluation criterion that emphasizes generalizability of predictions. When training cross-study replicable prediction models, it is critical to decide between merging and treating the studies separately. We study boosting algorithms in the presence of potential heterogeneity in predictor-outcome relationships across studies and compare two multi-study learning strategies: 1) merging all the studies and training a single model, and 2) multi-study ensembling, which involves training a separate model on each study and ensembling the resulting predictions. In the regression setting, we provide theoretical guidelines based on an analytical transition point to determine whether it is more beneficial to merge or to ensemble for boosting with linear learners. In addition, we characterize a bias-variance decomposition of estimation error for boosting…
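The two strategies compared in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the component-wise L2 boosting routine `boost_linear`, the equal ensemble weights, the simulated data with study-specific coefficient perturbations (`sigma_beta`), and all function and variable names.

```python
import numpy as np

def boost_linear(X, y, n_iter=100, step=0.1):
    """Component-wise L2 boosting with linear base learners (illustrative).

    Each iteration fits every single predictor to the current residuals by
    least squares, picks the one that reduces the residual sum of squares
    the most, and takes a shrunken step in that direction.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    col_ss = (X ** 2).sum(axis=0)  # squared norm of each column
    for _ in range(n_iter):
        coefs = X.T @ resid / col_ss      # univariate least-squares fits
        gains = coefs ** 2 * col_ss       # RSS reduction per candidate
        j = int(np.argmax(gains))
        beta[j] += step * coefs[j]
        resid -= step * coefs[j] * X[:, j]
    return beta

def simulate_studies(n_studies=4, n=100, p=5, sigma_beta=0.5, seed=0):
    """Hypothetical data: each study perturbs a shared coefficient vector."""
    rng = np.random.default_rng(seed)
    beta0 = rng.normal(size=p)
    studies = []
    for _ in range(n_studies):
        X = rng.normal(size=(n, p))
        beta_k = beta0 + rng.normal(scale=sigma_beta, size=p)
        y = X @ beta_k + rng.normal(size=n)
        studies.append((X, y))
    return beta0, studies

beta0, studies = simulate_studies()

# Strategy 1: merge all studies and train a single boosted model.
X_all = np.vstack([X for X, _ in studies])
y_all = np.concatenate([y for _, y in studies])
beta_merged = boost_linear(X_all, y_all)

# Strategy 2: train one boosted model per study, then average the
# predictions (equal weights are an assumption of this sketch).
betas = [boost_linear(X, y) for X, y in studies]
beta_ens = np.mean(betas, axis=0)

# Evaluate both on a simulated held-out study drawn the same way.
rng = np.random.default_rng(1)
X_test = rng.normal(size=(500, 5))
beta_test = beta0 + rng.normal(scale=0.5, size=5)
y_test = X_test @ beta_test + rng.normal(size=500)

mse_merged = float(np.mean((y_test - X_test @ beta_merged) ** 2))
mse_ens = float(np.mean((y_test - X_test @ beta_ens) ** 2))
print(f"merged MSE:   {mse_merged:.3f}")
print(f"ensemble MSE: {mse_ens:.3f}")
```

Because the base learners are linear, averaging the per-study coefficient vectors gives the same predictions as averaging the per-study predictions. Varying `sigma_beta` in this toy setup is one way to explore how between-study heterogeneity shifts the comparison between the two strategies.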

Code & Models

Repositories

wangcathy/multi-study-boosting (Official)

Taxonomy

Topics: Statistical Methods and Inference · Advanced Causal Inference Techniques · Gene expression and cancer classification