Enhancing cause of death prediction: development and validation of machine learning models using multimodal data across multiple health-care sites
Mohammed Al-Garadi, Rishi J Desai, Kerry Ngan, Michele LeNoue-Newton, Ruth M Reeves, Daniel Park, Jose J Hernández-Muñoz, Shirley V Wang, Judith C Maro, Candace C Fuller, Joshua Lin Kueiyu, Aida Kuzucan, Kevin Coughlin, Haritha Pillai, Melissa McPheeters, Jill Whitaker

TL;DR
Researchers developed machine learning models to predict cause of death from health records and found that combining structured and unstructured data improves accuracy within an institution, but that the models generalize poorly across health-care systems.
Contribution
The study introduces a novel approach combining structured and unstructured EHR data for cause of death prediction and highlights generalizability challenges across institutions.
Findings
XGBoost models using structured EHR data alone achieved weighted AUCs of 0.86 and 0.80 at VUMC and MGB, respectively.
Adding unstructured clinical notes raised the AUCs to 0.90 and 0.92 at VUMC and MGB, respectively.
Cross-institutional validation showed significant performance degradation, indicating limited generalizability.
Abstract
The objective was to develop and validate machine learning (ML) models that predict probable cause of death (CoD) using structured electronic health record (EHR) data, unstructured clinical notes, and publicly available sources. This multi-institutional retrospective study was conducted across Vanderbilt University Medical Center (VUMC) and Massachusetts General Brigham (MGB), including deceased patients with encounters between October 1, 2015, and January 1, 2021, and confirmed death records. The cohort included 13 708 patients from VUMC and 34 839 from MGB. The primary outcome was underlying CoD categorized into the top 15 National Center for Health Statistics rankable causes, with all other causes grouped as “Other.” Performance was assessed using weighted area under the receiver operating characteristic curve (AUC) and F-measure. The XGBoost model using structured EHR data alone achieved weighted AUCs of 0.86…
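The abstract reports a weighted AUC over a multi-class outcome but, in the portion available here, does not say how the weighting was done. The sketch below shows one common reading, offered as an assumption: compute a one-vs-rest AUC for each cause-of-death category and average them weighted by each category's share of the true labels (the same idea as scikit-learn's `roc_auc_score` with `multi_class="ovr"` and `average="weighted"`). The function names, the category labels, and the toy probabilities are hypothetical and are not taken from the study.

```python
from collections import Counter


def binary_auc(labels, scores):
    """AUC for 0/1 labels using the rank-sum (Mann-Whitney U) formulation.

    Tied scores receive their average rank, so a tie counts as half a win.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the end of the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lbl for _, lbl in pairs[i:j])
        i = j

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def weighted_ovr_auc(y_true, proba):
    """One-vs-rest AUC per class, averaged with weights = class prevalence.

    y_true: list of true category labels.
    proba:  list of dicts mapping category label -> predicted probability.
    """
    counts = Counter(y_true)
    total = len(y_true)
    result = 0.0
    for cls, n in counts.items():
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [p.get(cls, 0.0) for p in proba]
        result += (n / total) * binary_auc(labels, scores)
    return result


if __name__ == "__main__":
    # Toy data with three hypothetical categories (not study data).
    y_true = ["heart disease", "heart disease", "cancer", "Other"]
    proba = [
        {"heart disease": 0.70, "cancer": 0.20, "Other": 0.10},
        {"heart disease": 0.25, "cancer": 0.65, "Other": 0.10},
        {"heart disease": 0.20, "cancer": 0.60, "Other": 0.20},
        {"heart disease": 0.30, "cancer": 0.30, "Other": 0.40},
    ]
    print(round(weighted_ovr_auc(y_true, proba), 3))
```

With this weighting, frequent categories dominate the summary number, so a model can score well overall while ranking rare causes poorly; per-category AUCs are worth inspecting alongside the weighted figure.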
Taxonomy
Topics: Machine Learning in Healthcare · Sepsis Diagnosis and Treatment · Electronic Health Records Systems
