Efficient Modeling of Surrogates to Improve Multi-source High-dimensional Biobank Studies
Yue Liu, Molei Liu, Zijian Guo, Tianxi Cai

TL;DR
The paper introduces SASH, a semi-supervised method that leverages unlabeled data and surrogate variables from multiple sites to improve high-dimensional biobank modeling, especially when gold labels are scarce.
Contribution
SASH is a novel semi-supervised approach that combines surrogate-assisted modeling with bias correction and data aggregation to enhance accuracy in high-dimensional biobank studies.
Findings
Outperforms existing methods in simulations.
Effectively integrates multi-site surrogate data.
Successfully applied to diabetes genetic risk modeling.
Abstract
Surrogate variables in electronic health records (EHR) and biobank data play an important role in biomedical studies due to the scarcity or absence of chart-reviewed gold standard labels. We develop a novel approach named SASH for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. It is a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites, to improve the learning accuracy of the small gold-labeled data. To facilitate stable and efficient knowledge extraction from the surrogates, our method first obtains a preliminary supervised estimator, and then uses it to assist training a regularized single index model (SIM) for the surrogates. Interestingly, through a chain of convex and properly penalized sparse regressions that approximate the SIM loss…
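The general idea of borrowing strength from a surrogate can be illustrated with a toy example. The sketch below is not the SASH algorithm: it is a simplified stand-in with a single site and no bias correction or data aggregation, and it uses the labeled data only to calibrate a scale, which is cruder than the assisted SIM training the abstract describes. It assumes a Gaussian design, a continuous outcome, a surrogate that follows a single index model with a tanh link, and hand-picked penalty levels. All names (`lasso_ista`, `beta_init`, `gamma_hat`, `beta_assisted`) and all sample sizes are hypothetical choices made for illustration.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Lasso via proximal gradient (ISTA): min_b (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y) / n)                    # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b

def cosine(a, b):
    """Cosine similarity between two coefficient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
p, n_lab, n_unlab = 50, 40, 2000          # assumed toy dimensions
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -1.0, 0.5]          # assumed sparse truth

# Small gold-labeled sample: Y observed.
X_lab = rng.standard_normal((n_lab, p))
Y_lab = X_lab @ beta_true + 0.5 * rng.standard_normal(n_lab)

# Large unlabeled sample: only an error-prone surrogate S is observed.
# S depends on X through the same index X @ beta_true (assumed SIM, tanh link).
X_unlab = rng.standard_normal((n_unlab, p))
S = np.tanh(X_unlab @ beta_true) + 0.3 * rng.standard_normal(n_unlab)

# Step 1: preliminary supervised estimator from the labeled data alone.
beta_init = lasso_ista(X_lab, Y_lab, lam=0.1)

# Step 2: sparse regression of the surrogate on X. Under a Gaussian design
# this targets the index direction up to an unknown positive scale.
gamma_hat = lasso_ista(X_unlab, S, lam=0.02)

# Step 3: recover the scale with a one-dimensional least-squares fit
# of the gold labels on the estimated index.
u = X_lab @ gamma_hat
beta_assisted = (u @ Y_lab) / (u @ u) * gamma_hat

print("direction cosine:", cosine(gamma_hat, beta_true))
print("error, labeled only:", np.linalg.norm(beta_init - beta_true))
print("error, surrogate-assisted:", np.linalg.norm(beta_assisted - beta_true))
```

In this toy setting the surrogate regression pins down the direction of the coefficient vector from the large unlabeled sample, so the scarce gold labels only need to identify a single scale parameter.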
Taxonomy
Topics: Statistical Methods and Inference · Advanced Causal Inference Techniques
