Distributed Multivariate Regression Modeling For Selecting Biomarkers Under Data Protection Constraints
Daniela Zöller, Harald Binder

TL;DR
This paper introduces a distributed multivariate regression method for biomarker selection that operates under strict data protection constraints, enabling joint analysis without sharing individual-level data.
Contribution
It presents a novel iterative variable selection approach using only aggregated data, with a heuristic variant reducing data calls, implemented within the DataSHIELD framework.
Findings
Method achieves results equivalent to pooled data analysis with local standardization.
Heuristic reduces data calls from over 10 to 3 in typical scenarios.
Minimal information loss with local standardization in simulations.
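To make the idea of variable selection from aggregated data concrete, here is a minimal sketch. It is not the authors' algorithm and does not use the DataSHIELD API. It assumes a linear model and uses componentwise boosting as a stand-in selection procedure. The function names `site_summaries` and `pooled_componentwise_boosting` and all simulated data are hypothetical. Each site standardizes its own data and releases only cross-product matrices; the coordinating server then selects variables without seeing individual-level records.

```python
import numpy as np

def site_summaries(X, y):
    """Runs at one site: standardize locally, release only aggregated statistics."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # local standardization (assumption)
    yc = y - y.mean()
    return Xs.T @ Xs, Xs.T @ yc, len(y)

def pooled_componentwise_boosting(summaries, steps=50, nu=0.1):
    """Runs at the server: greedy componentwise updates using pooled summaries only."""
    G = sum(s[0] for s in summaries)            # pooled X'X
    c = sum(s[1] for s in summaries)            # pooled X'y
    beta = np.zeros(len(c))
    for _ in range(steps):
        grad = c - G @ beta                     # X'(y - X beta), from aggregates alone
        score = grad**2 / np.diag(G)            # residual sum-of-squares reduction per variable
        j = int(np.argmax(score))               # select the best single variable
        beta[j] += nu * grad[j] / G[j, j]       # small step toward its least-squares fit
    return beta

# Simulated example: two sites, 20 candidate biomarkers, two of them informative.
rng = np.random.default_rng(0)
true_beta = np.zeros(20)
true_beta[0], true_beta[3] = 2.0, -1.5
sites = []
for n in (200, 200):
    X = rng.normal(size=(n, 20))
    y = X @ true_beta + rng.normal(size=n)
    sites.append((X, y))

beta = pooled_componentwise_boosting([site_summaries(X, y) for X, y in sites])
print(np.flatnonzero(beta))                     # indices of selected variables
```

Because the pooled cross-products are plain sums over sites, the sketch gives what the same procedure would give on the stacked, locally standardized data. It calls each site only once, but it transfers a full p-by-p matrix per site, which is only practical for a modest number of biomarkers.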
Abstract
The discovery of clinical biomarkers requires large patient cohorts and is aided by a pooled data approach across institutions. In many countries, data protection constraints, especially in the clinical environment, forbid the exchange of individual-level data between different research institutes, impeding joint analyses. To circumvent this problem, only non-disclosive aggregated data is exchanged. This is often done manually and requires explicit permission before transfer, so the number of data calls and the amount of data should be limited. This does not allow for more complex tasks such as variable selection, as only simple aggregated summary statistics are typically transferred. Other methods have been proposed that require more complex aggregated data or use input data perturbation, but these methods either cannot deal with a high number of biomarkers or…
Taxonomy
Topics: Artificial Intelligence in Healthcare · Statistical Methods in Clinical Trials · Biomedical Text Mining and Ontologies
