Tokenization Tradeoffs in Structured EHR Foundation Models
Lin Lawrence Guo, Santiago Eduardo Arciniegas, Joseph Jihyung Lee, Adam Paul Yan, George Tomlinson, Jason Fries, Lillian Sung

TL;DR
This study investigates how tokenization strategies affect the performance and efficiency of transformer models trained on structured EHR data. Joint event encoding and positional time encoding improve downstream results, and joint encoding also lowers pretraining compute.
Contribution
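It systematically evaluates tokenization choices in structured EHR foundation models under a factorial design (event encoding, time encoding, and workflow annotation), showing that joint event encoding and positional time encoding improve performance across 74 clinical prediction tasks and that joint encoding reduces pretraining compute.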
Findings
Joint event encoding and positional time encoding outperform their alternatives on 73 and 71 of 74 clinical prediction tasks, respectively, as measured by AUROC.
Joint encoding reduces pretraining floating-point operations by approximately 39.5%.
The encoding advantage generalizes across different patient cohorts despite vocabulary differences.
Abstract
Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization -- how these timelines are converted into discrete model inputs -- determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. Here, we pretrained a transformer on pediatric EHR data under a factorial design, varying tokenization along event encoding, time encoding, and workflow annotation. We evaluated area-under-the-receiver-operating-characteristic curve across 74 clinical prediction tasks. Joint event encoding and positional time encoding outperformed their alternatives (73/74 and 71/74 tasks)…
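To make the tokenization axes concrete, here is a minimal Python sketch of how one short patient timeline could be turned into model inputs under different event and time encodings. It is an illustration under stated assumptions, not the authors' implementation: the event schema, the function names, the token formats, and the example codes are all hypothetical, and the "factorized" and "time as tokens" variants are only plausible guesses at what the alternatives to joint and positional encoding might look like.

```python
from typing import List, Optional, Tuple

# Hypothetical event schema: (day offset, clinical code, attribute bucket).
Event = Tuple[int, str, str]


def joint_event_tokens(events: List[Event]) -> List[str]:
    """One token per event: code and attribute fused into one vocabulary entry."""
    return [f"{code}|{attr}" for _, code, attr in events]


def factorized_event_tokens(events: List[Event]) -> List[str]:
    """Two tokens per event: code and attribute kept as separate entries."""
    tokens: List[str] = []
    for _, code, attr in events:
        tokens.extend([code, attr])
    return tokens


def time_as_tokens(events: List[Event]) -> List[str]:
    """Put time in the token stream: add a gap token when the day offset changes."""
    tokens: List[str] = []
    prev_day: Optional[int] = None
    for day, code, attr in events:
        if prev_day is not None and day != prev_day:
            tokens.append(f"GAP_{day - prev_day}d")
        tokens.append(f"{code}|{attr}")
        prev_day = day
    return tokens


def time_as_positions(events: List[Event]) -> Tuple[List[str], List[int]]:
    """Keep time out of the token stream: return day offsets as a parallel
    sequence that a model could consume through its positional encoding."""
    return joint_event_tokens(events), [day for day, _, _ in events]


# Made-up timeline, for illustration only.
timeline: List[Event] = [
    (0, "LAB/glucose", "high"),
    (0, "MED/insulin", "given"),
    (3, "LAB/glucose", "normal"),
]

print(len(joint_event_tokens(timeline)))       # 3 tokens
print(len(factorized_event_tokens(timeline)))  # 6 tokens
print(time_as_tokens(timeline))                # 4 tokens, including "GAP_3d"
print(time_as_positions(timeline)[1])          # [0, 0, 3]
```

In this toy setup, the joint variant yields a shorter sequence over a larger vocabulary, while the factorized variant yields a longer sequence over a smaller one. Sequence length is one way a tokenization choice can affect compute, though this sketch does not establish that it accounts for the savings reported here.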
Taxonomy
Topics: Machine Learning in Healthcare · Electronic Health Records Systems · Genomics and Rare Diseases
