Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling

Nicholas I-Hsien Kuo; Blanca Gallego; Louisa Jorm

arXiv:2510.22878·cs.LG·October 28, 2025

TL;DR

This study tests whether generative pre-training can produce clinically coherent structured electronic medical record (EMR) trajectories under irregular sampling. The models achieve local realism, reproducing per-feature distributions, but do not preserve cross-feature clinical relationships.

Contribution

The paper shows that autoregressive models trained on EMR data reproduce individual feature distributions yet struggle to maintain clinical coherence across features, which underscores the need for domain-specific validation.

Findings

01

Models reproduce feature distributions accurately.

02

Models fail to preserve cross-feature clinical relationships.

03

Patient-trajectory synthesis serves as a test of both distributional and correlational fidelity.
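
The gap between the first two findings can be illustrated with a toy check. The sketch below is not the paper's evaluation code: the two features, the choice of a two-sample Kolmogorov-Smirnov statistic for marginals, and the use of Pearson correlation for cross-feature structure are all assumptions made for illustration. The "synthetic" data is built by shuffling one real column, which keeps every marginal distribution intact while breaking the link between features.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical "real" cohort: feature_b is tightly coupled to feature_a.
rng = random.Random(0)
real_a = [rng.gauss(0, 1) for _ in range(500)]
real_b = [a + rng.gauss(0, 0.3) for a in real_a]

# Hypothetical "synthetic" cohort: same values per feature, but the
# pairing between features is destroyed by shuffling one column.
synth_a = list(real_a)
synth_b = list(real_b)
rng.shuffle(synth_b)

marginal_gap = ks_statistic(real_b, synth_b)  # per-feature (distributional) check
real_corr = pearson(real_a, real_b)           # cross-feature check on real data
synth_corr = pearson(synth_a, synth_b)        # cross-feature check on synthetic data

print(f"KS gap on feature_b:   {marginal_gap:.3f}")
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synth_corr:.3f}")
```

A per-feature check alone would pass this synthetic cohort, since its marginals are identical to the real ones. Only the cross-feature comparison shows that the relationship between the two features is gone, which is why both kinds of check are worth running side by side.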

Abstract

Foundation models refer to architectures trained on vast datasets using autoregressive pre-training from natural language processing to capture intricate patterns and motifs. They were originally developed to transfer such learned knowledge to downstream predictive tasks. Recently, however, some studies repurpose these learned representations for phenotype discovery without rigorous validation, risking superficially realistic but clinically incoherent embeddings. To test this mismatch, we trained two autoregressive models -- a sequence-to-sequence LSTM and a reduced Transformer -- on longitudinal ART for HIV and Acute Hypotension datasets. Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete. Patient-trajectory synthesis evaluated distributional and correlational fidelity. Both reproduced feature distributions but failed to…
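
As a rough illustration of the training setup described in the abstract, the sketch below removes visits at random from a complete trajectory so that the remaining visits are separated by irregular gaps. The abstract does not specify the mechanism, so the independent per-visit drop probability, the `add_random_gaps` helper, and the `t` and `value` field names are all hypothetical.

```python
import random

def add_random_gaps(trajectory, drop_prob, rng):
    """Return a copy of `trajectory` with visits randomly removed.

    Assumption: each visit after the first is dropped independently with
    probability `drop_prob`. The first visit is always kept so that every
    sequence has a starting point.
    """
    if not trajectory:
        return []
    kept = [trajectory[0]]
    for visit in trajectory[1:]:
        if rng.random() >= drop_prob:
            kept.append(visit)
    return kept

def inter_visit_gaps(trajectory):
    """Time elapsed between consecutive retained visits."""
    return [b["t"] - a["t"] for a, b in zip(trajectory, trajectory[1:])]

# Hypothetical complete trajectory: one visit per time step.
rng = random.Random(42)
complete = [{"t": t, "value": 100 - t} for t in range(12)]

# Training-time view: irregular gaps. Test-time view: the complete sequence.
irregular = add_random_gaps(complete, drop_prob=0.4, rng=rng)

print("complete gaps: ", inter_visit_gaps(complete))
print("irregular gaps:", inter_visit_gaps(irregular))
```

Keeping the original time index on each retained visit means the gap length stays visible to whatever consumes the sequence, rather than being silently collapsed into adjacent steps.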
