Efficient Modeling of Surrogates to Improve Multi-source High-dimensional Biobank Studies

Yue Liu; Molei Liu; Zijian Guo; Tianxi Cai

arXiv:2302.04970 · stat.ME · September 4, 2023

TL;DR

The paper introduces SASH, a semi-supervised method that leverages unlabeled data and surrogate variables from multiple sites to improve high-dimensional biobank modeling, especially when gold labels are scarce.

Contribution

SASH is a novel semi-supervised approach that combines surrogate-assisted modeling with bias correction and data aggregation to enhance accuracy in high-dimensional biobank studies.

Findings

01 Outperforms existing methods in simulations.

02 Effectively integrates multi-site surrogate data.

03 Successfully applied to diabetes genetic risk modeling.

Abstract

Surrogate variables in electronic health records (EHR) and biobank data play an important role in biomedical studies due to the scarcity or absence of chart-reviewed gold standard labels. We develop a novel approach named SASH for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. It is a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites, to improve the learning accuracy of the small gold-labeled data. To facilitate stable and efficient knowledge extraction from the surrogates, our method first obtains a preliminary supervised estimator, and then uses it to assist training a regularized single index model (SIM) for the surrogates. Interestingly, through a chain of convex and properly penalized sparse regressions that approximate the SIM loss…
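The general idea of borrowing strength from surrogate outcomes can be illustrated with a toy simulation. The sketch below is not the SASH procedure; it is a deliberately simplified stand-in that assumes a single site, a Gaussian design, and a surrogate that follows a single index model sharing its direction with the gold-label model. All names (`lasso_ista`, `beta_sup`, `gamma_hat`, `beta_assisted`), sample sizes, and penalty levels are hypothetical choices made for illustration. It fits a supervised Lasso on the small labeled set, fits a sparse regression of the surrogate on the large unlabeled set to estimate the index direction, and then uses the labeled data only to calibrate a one-dimensional scale.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

# --- Simulated data (all settings are illustrative assumptions) ---
rng = np.random.default_rng(0)
p, n_lab, n_unlab = 100, 40, 5000
beta = np.zeros(p)
beta[:5] = [1.0, -1.0, 1.0, -1.0, 1.0]  # sparse true coefficient vector

# Small gold-labeled sample: Y = X beta + noise
X_lab = rng.standard_normal((n_lab, p))
y_lab = X_lab @ beta + rng.standard_normal(n_lab)

# Large unlabeled sample with an error-prone surrogate S = g(X beta) + noise
X_unlab = rng.standard_normal((n_unlab, p))
s_unlab = np.tanh(0.5 * X_unlab @ beta) + 0.5 * rng.standard_normal(n_unlab)

# Step 1: preliminary supervised estimator from the labeled data alone
beta_sup = lasso_ista(X_lab, y_lab, lam=0.3)

# Step 2: sparse regression of the surrogate on X; under a Gaussian design
# this targets the single-index direction up to an unknown scale
gamma_hat = lasso_ista(X_unlab, s_unlab, lam=0.03)

# Step 3: calibrate the unknown scale with a one-dimensional least-squares
# fit of the gold labels on the estimated index
index_lab = X_lab @ gamma_hat
scale = (index_lab @ y_lab) / (index_lab @ index_lab)
beta_assisted = scale * gamma_hat

err_sup = np.linalg.norm(beta_sup - beta)
err_assisted = np.linalg.norm(beta_assisted - beta)
print(f"supervised-only L2 error:    {err_sup:.3f}")
print(f"surrogate-assisted L2 error: {err_assisted:.3f}")
```

In this toy setting the high-dimensional estimation burden is shifted to the large surrogate sample, and the scarce gold labels are spent on a single scale parameter. The method summarized in the abstract additionally involves bias correction and aggregation across multiple sites, none of which is reproduced here.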

Taxonomy

Topics: Statistical Methods and Inference · Advanced Causal Inference Techniques