Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling
Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

TL;DR
This study tests the limits of generative pre-training for producing clinically coherent structured electronic medical record (EMR) trajectories under irregular sampling. The models achieve local realism but fail to preserve cross-feature clinical relationships.
Contribution
The paper shows that autoregressive models trained on EMR data reproduce individual features realistically but struggle to maintain clinical coherence across features, which underscores the need for domain-specific validation.
Findings
Models reproduce feature distributions accurately.
Models fail to preserve cross-feature clinical relationships.
Patient-trajectory synthesis serves as a test of both distributional and correlational fidelity.
Abstract
Foundation models refer to architectures trained on vast datasets using autoregressive pre-training from natural language processing to capture intricate patterns and motifs. They were originally developed to transfer such learned knowledge to downstream predictive tasks. Recently, however, some studies repurpose these learned representations for phenotype discovery without rigorous validation, risking superficially realistic but clinically incoherent embeddings. To test this mismatch, we trained two autoregressive models -- a sequence-to-sequence LSTM and a reduced Transformer -- on longitudinal ART for HIV and Acute Hypotension datasets. Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete. Patient-trajectory synthesis evaluated distributional and correlational fidelity. Both reproduced feature distributions but failed to…
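The distinction the abstract draws between distributional and correlational fidelity can be illustrated with a small sketch. This is not the authors' code: the function names `drop_visits`, `marginal_gap`, and `correlation_gap` are hypothetical, the visit-dropping scheme is only one assumed way to create random inter-visit gaps, and the "synthetic" data is a toy stand-in (independently shuffled columns) for a generator that matches per-feature distributions while losing cross-feature structure.

```python
import numpy as np

def drop_visits(seq, drop_prob, rng):
    """Randomly drop visits (rows) to create irregular inter-visit gaps.

    Assumed scheme: each visit is dropped independently with probability
    drop_prob; the first visit is always kept. Returns the kept rows and
    their original time indices.
    """
    keep = rng.random(seq.shape[0]) >= drop_prob
    keep[0] = True
    idx = np.flatnonzero(keep)
    return seq[idx], idx

def marginal_gap(real, synth):
    """Largest absolute difference between per-feature means."""
    return float(np.max(np.abs(real.mean(axis=0) - synth.mean(axis=0))))

def correlation_gap(real, synth):
    """Largest absolute difference between feature correlation matrices."""
    r = np.corrcoef(real, rowvar=False)
    s = np.corrcoef(synth, rowvar=False)
    return float(np.max(np.abs(r - s)))

rng = np.random.default_rng(0)

# Toy "real" data: two strongly correlated features.
x = rng.normal(size=2000)
real = np.column_stack([x, x + 0.1 * rng.normal(size=2000)])

# Toy "synthetic" data: each column shuffled independently, so the
# per-feature distributions are unchanged but the correlation is destroyed.
synth = np.column_stack([rng.permutation(real[:, 0]),
                         rng.permutation(real[:, 1])])

m_gap = marginal_gap(real, synth)
c_gap = correlation_gap(real, synth)

# Irregular sampling on a toy 10-visit trajectory with 2 features.
trajectory = np.arange(20, dtype=float).reshape(10, 2)
kept, kept_idx = drop_visits(trajectory, drop_prob=0.5, rng=rng)
gaps = np.diff(kept_idx)  # inter-visit gaps, in original time steps
```

In this toy setup `m_gap` is essentially zero while `c_gap` is large, which is the kind of mismatch a marginal-only evaluation would miss.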
