Representation Learning of Structured Data for Medical Foundation Models
Vijay Prakash Dwivedi, Viktor Schlegel, Andy T. Liu, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Jeng Wei, Wei-Hsian Yin, Stefan Winkler, Robby T. Tan

TL;DR
This paper introduces UniStruct, an architecture for a multimodal medical foundation model that improves how LLMs represent structured medical codes by adapting subword tokenization to them, with reported gains of up to 23% in evaluation metrics.
Contribution
The paper presents UniStruct, a new architecture that effectively integrates structured medical codes with unstructured text in LLMs through specialized tokenization, addressing a key limitation in medical AI.
Findings
Achieves up to a 23% improvement in evaluation metrics.
Attributes around 2% of the gain to the new tokenization.
Improves performance on over 42% of downstream tasks on the EHRSHOT benchmark.
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various domains, including healthcare. However, their ability to effectively represent structured non-textual data, such as the alphanumeric medical codes used in records like ICD-10 or SNOMED-CT, is limited and has been particularly exposed in recent research. This paper examines the challenges LLMs face in processing medical codes due to the shortcomings of current tokenization methods. As a result, we introduce the UniStruct architecture to design a multimodal medical foundation model of unstructured text and structured data, which addresses these challenges by adapting subword tokenization techniques specifically for the structured medical codes. Our approach is validated through model pre-training on both an extensive internal medical database and a public repository of structured medical records. Trained…
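One possible reading of "adapting subword tokenization techniques" to structured codes is to treat each whole medical code as an atomic symbol and apply byte-pair-encoding-style merges over sequences of codes, so that frequently co-occurring codes become a single token. The sketch below illustrates that idea only. It is an assumption made for illustration, not the UniStruct implementation; the function names, the `+` join marker, and the toy visit data are all hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent pair of symbols, or None if no pair repeats."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def merge_pair(seq, pair):
    """Replace each adjacent occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])  # hypothetical join marker
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Greedy BPE-style merge learning where the base symbols are whole codes."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:  # stop when no adjacent pair occurs more than once
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs


# Toy data (hypothetical): each inner list is one patient visit of ICD-10 codes.
visits = [
    ["E11.9", "I10", "E78.5"],
    ["E11.9", "I10", "N18.3"],
    ["J45.909", "E11.9", "I10"],
]

merges, tokenized = learn_merges(visits, num_merges=5)
print(merges)
print(tokenized)
```

In this toy run the pair of codes that appears together in every visit is merged into one symbol, while codes that co-occur only once stay separate. A text-oriented subword tokenizer would instead split a code such as `J45.909` into character fragments with no clinical meaning, which is the limitation the abstract describes.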
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning in Healthcare · Neural Networks and Applications
