DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications

Zebin Wang; Ziming Gan; Weijing Tang; Zongqi Xia; Tianrun Cai; Tianxi Cai; Junwei Lu

arXiv:2511.02754 · stat.ME · November 5, 2025

TL;DR

This paper introduces DANIEL, a distributed, scalable, and privacy-preserving framework that learns global representations from large-scale EHR data by fitting a low-rank Ising model with bi-factored gradient descent, improving performance on clinical tasks such as phenotyping and clustering.

Contribution

The paper develops a distributed bi-factored gradient descent method for scalable, privacy-preserving representation learning with Ising models on high-dimensional EHR data.

Findings

01. Superior performance in clinical tasks like phenotyping and clustering

02. Effective handling of high-dimensional, multi-institutional EHR data

03. Enhanced scalability and privacy preservation in representation learning

Abstract

Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating…
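
To make the optimization idea in the abstract concrete, here is a minimal sketch of bi-factored gradient descent for a low-rank Ising model, with gradients aggregated across sites. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the surrogate loss or the communication protocol, so the sketch assumes a logistic pseudo-likelihood loss, a factorization Θ = UVᵀ with the diagonal masked out, and a simple scheme in which each site returns only summed gradients and its sample count. The names `simulate_site`, `local_gradients`, and `distributed_bfgd` are hypothetical.

```python
import numpy as np


def simulate_site(n, d, flip=0.2, seed=0):
    """Hypothetical toy data: one latent binary factor drives all d codes, with random flips."""
    rng = np.random.default_rng(seed)
    z = rng.choice([-1.0, 1.0], size=(n, 1))
    noise = np.where(rng.random((n, d)) < flip, -1.0, 1.0)
    return z * noise  # entries in {-1, +1}, shape (n, d)


def local_gradients(X, U, V):
    """Summed pseudo-likelihood loss and gradients for one site's data X in {-1,+1}^(n x d).

    Assumed surrogate: sum_ij log(1 + exp(-2 * x_ij * (X @ Theta)_ij)),
    with Theta = (U @ V.T) and self-interactions (the diagonal) excluded.
    """
    d = U.shape[0]
    off_diag = 1.0 - np.eye(d)
    theta = (U @ V.T) * off_diag
    margins = 2.0 * X * (X @ theta)
    loss = np.sum(np.logaddexp(0.0, -margins))
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), computed stably.
    dloss = -np.exp(-np.logaddexp(0.0, margins))
    # Chain rule through m = 2 * x * (X @ Theta), then re-apply the diagonal mask.
    G = (X.T @ (2.0 * X * dloss)) * off_diag
    # Gradients with respect to the two factors of Theta = U V^T.
    return G @ V, G.T @ U, loss, X.shape[0]


def distributed_bfgd(sites, rank, steps=300, lr=0.02, seed=0):
    """Coordinator loop: broadcast (U, V), collect summed gradients, take one step."""
    rng = np.random.default_rng(seed)
    d = sites[0].shape[1]
    U = 0.1 * rng.standard_normal((d, rank))
    V = 0.1 * rng.standard_normal((d, rank))
    history = []
    for _ in range(steps):
        # Each site sees only its own X; only d x rank gradients leave the site.
        results = [local_gradients(X, U, V) for X in sites]
        n_total = sum(r[3] for r in results)
        grad_U = sum(r[0] for r in results) / n_total
        grad_V = sum(r[1] for r in results) / n_total
        history.append(sum(r[2] for r in results) / n_total)  # loss before the update
        U, V = U - lr * grad_U, V - lr * grad_V
    return U, V, history
```

Calling `distributed_bfgd([simulate_site(300, 8, seed=1), simulate_site(200, 8, seed=2)], rank=2)` returns the two factors and the per-step average loss. Because the loss is a sum over patients, the summed site gradients equal the gradient on the pooled data, which is why no patient-level rows need to be shared in this sketch. Working with the d × r factors rather than the full d × d matrix is what keeps both the update and the per-round communication small when r is much smaller than d.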

Taxonomy

Topics: Machine Learning in Healthcare · Privacy-Preserving Technologies in Data · Generative Adversarial Networks and Image Synthesis