Semantic Decomposition Improves Learning of Large Language Models on EHR Data
David A. Bloore, Romane Gauriau, Anna L. Decker, Jacob Oppenheim

TL;DR
This paper introduces H-BERT, a novel method that decomposes hierarchical medical codes in EHR data using semantic units and graph structures, significantly enhancing disease prediction accuracy and patient representation.
Contribution
The paper presents H-BERT, a new approach that incorporates complete hierarchical graph expansions of medical codes into BERT, improving predictive performance and phenotypic differentiation.
Findings
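Improved prediction of patient membership in over 500 medical diagnosis classes.
More distinct representations of patients in closely related but clinically distinct phenotypes.
Significant gains in aggregated AUC and APS across those diagnosis classes.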
Abstract
Electronic health records (EHR) are widely believed to hold a profusion of actionable insights, encrypted in an irregular, semi-structured format, amidst a loud noise background. To simplify learning patterns of health and disease, medical codes in EHR can be decomposed into semantic units connected by hierarchical graphs. Building on earlier synergy between Bidirectional Encoder Representations from Transformers (BERT) and Graph Attention Networks (GAT), we present H-BERT, which ingests complete graph tree expansions of hierarchical medical codes, as opposed to only ingesting the leaves, and pushes patient-level labels down to each visit. This methodology significantly improves prediction of patient membership in over 500 medical diagnosis classes, as measured by aggregated AUC and APS, and creates distinct representations of patients in closely related but clinically distinct phenotypes.
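The two ideas in the abstract, expanding each code into its full hierarchy rather than keeping only the leaf, and copying patient-level labels onto every visit, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `expand_code` and `push_labels_to_visits` are hypothetical, and the hierarchy is approximated by prefix truncation of ICD-10-style codes (three-character category, then one extra character per level), which may differ from the ontology graph the paper actually uses.

```python
from typing import Dict, List


def expand_code(code: str) -> List[str]:
    """Return the root-to-leaf path of an ICD-10-style code.

    Assumption: ancestors are approximated by prefix truncation, starting
    at the three-character category (e.g. "E11") and adding one character
    per level. A real ontology graph may define parents differently.
    """
    compact = code.replace(".", "")
    if len(compact) < 3:
        # Too short to truncate; treat the code as its own single node.
        return [code]
    path = []
    for end in range(3, len(compact) + 1):
        prefix = compact[:end]
        # Re-insert the dot after the three-character category.
        path.append(prefix if end == 3 else prefix[:3] + "." + prefix[3:])
    return path


def push_labels_to_visits(
    visits: List[List[str]], patient_labels: List[int]
) -> List[Dict[str, List]]:
    """Expand every code in every visit and attach the patient's labels
    to each visit, so that each visit becomes a labelled example."""
    examples = []
    for visit in visits:
        tokens = [node for code in visit for node in expand_code(code)]
        examples.append({"tokens": tokens, "labels": list(patient_labels)})
    return examples


# Hypothetical patient with two visits and two diagnosis-class labels.
visits = [["E11.65"], ["I10", "E11.9"]]
examples = push_labels_to_visits(visits, patient_labels=[1, 0])
for example in examples:
    print(example)
```

In this sketch the leaf code `E11.65` contributes its ancestors `E11` and `E11.6` as additional tokens, so codes that share a parent also share input tokens. How H-BERT combines those nodes (for example through graph attention) is not shown here.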
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Biomedical Text Mining and Ontologies
