Representation learning to advance multi-institutional studies with electronic health record data from US and France

Doudou Zhou; Han Tong; Linshanshan Wang; Suqi Liu; Xin Xiong; Ziming Gan; Romain Griffier; Boris Hejblum; Yun-Chung Liu; Chuan Hong; Clara-Lea Bonzel; Tianrun Cai; Kevin Pan; Yuk-Lam Ho; Lauren Costa; Vidul A. Panickan; J. Michael Gaziano; Kenneth Mandl; Vianney Jouhet; Rodolphe Thiebaut; Zongqi Xia; Kelly Cho; Katherine Liao; Tianxi Cai

arXiv:2502.08547·cs.AI·April 7, 2026

TL;DR

This paper presents a graph-based, privacy-preserving framework for harmonizing heterogeneous electronic health record data across multiple institutions using scalable representation learning techniques.

Contribution

The paper introduces a method that integrates institution-specific summary statistics, curated biomedical knowledge graphs, and semantic information from large language models to learn a shared semantic space without manual mappings.

Findings

1. The framework harmonized data across seven institutions and two languages.

2. It improved the consistency of clinical concept representations across sites.

3. The approach supports collaborative clinical research without sharing patient-level data.

Abstract

The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This…
