Multimodal Medical Code Tokenizer

Xiaorui Su; Shvat Messica; Yepeng Huang; Ruth Johnson; Lukas Fesser; Shanghua Gao; Faryad Sahneh; Marinka Zitnik

arXiv:2502.04397 · cs.CL · July 1, 2025 · 2 cites

Open Access

TL;DR

MedTok is a multimodal tokenizer for medical codes that leverages textual descriptions and relational data, significantly enhancing the performance of EHR models and medical QA systems.

Contribution

Introduces MedTok, a novel multimodal tokenizer combining text and relational information for medical codes, improving EHR model performance and enabling better clinical reasoning.

Findings

01. AUPRC improved by up to 11.32% across datasets.
02. Enhanced performance in drug recommendation and diagnosis tasks.
03. Effective integration with medical QA systems.
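
For readers unfamiliar with the metric in the first finding, AUPRC (area under the precision-recall curve) is commonly estimated as average precision. The sketch below is only an illustration of that estimator on made-up labels and scores. Whether the paper computes AUPRC this way is an assumption, and the function name `average_precision` is hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Estimate AUPRC as the mean precision at each positive, ranked by score.

    Assumption: binary labels (0 or 1). Tied scores are kept in input order
    (stable sort), which is a simplification.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positives")

    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Made-up toy data: positives are ranked 1st and 3rd.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)  # (1/1 + 2/3) / 2
```

Under this estimator, a relative gain such as the one reported means the model ranks true positives higher on average, which matters most when positives are rare.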

Abstract

Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We…
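
The quantization step described in the abstract can be illustrated with a minimal sketch. It assumes a plain nearest-neighbour vector quantizer over a single shared codebook. The function names, the toy 2-D embeddings, and the four-entry codebook are all hypothetical, and MedTok's actual encoders, codebook layout, and training objective are not reproduced here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry nearest to the embedding."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(embedding, codebook[i]),
    )


def tokenize_code(text_emb: Vector, graph_emb: Vector,
                  codebook: List[Vector]) -> List[int]:
    """Map one medical code to discrete tokens from a shared vocabulary.

    Assumption: the text and graph embeddings already live in the same
    space. In the paper they come from a language model encoder and a
    graph encoder; here they are hand-written toy vectors.
    """
    return [quantize(text_emb, codebook), quantize(graph_emb, codebook)]


# Hypothetical shared codebook with four entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy embeddings standing in for one code's description and graph context.
text_emb = [0.9, 0.1]
graph_emb = [0.2, 0.8]

tokens = tokenize_code(text_emb, graph_emb, codebook)
```

The point of the sketch is only that both modalities end up as indices into one vocabulary, so a downstream EHR model can consume them as ordinary tokens.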

Taxonomy

Topics: Electronic Health Records Systems · Artificial Intelligence in Healthcare