Can Large Language Models Logically Predict Myocardial Infarction? Evaluation based on UK Biobank Cohort
Yuxing Zhi, Yuan Guo, Kai Yuan, Hesong Wang, Heng Xu, Haina Yao, Albert C Yang, Guangrui Huang, Yuping Duan

TL;DR
This study assesses whether state-of-the-art large language models like ChatGPT and GPT-4 can accurately predict myocardial infarction risk from UK Biobank data, revealing current limitations in clinical decision support applications.
Contribution
It provides a quantitative evaluation of LLMs' ability to predict MI risk using real-world medical data and compares their performance with traditional models and medical indices.
Findings
LLMs currently lack sufficient accuracy for clinical MI prediction
Chain of Thought prompting helps evaluate logical inference in LLMs
Future medical LLMs should integrate domain knowledge for better performance
Abstract
Background: Large language models (LLMs) have seen extraordinary advances with applications in clinical decision support. However, high-quality evidence is urgently needed on the potential and limitations of LLMs in providing accurate clinical decisions based on real-world medical data. Objective: To evaluate quantitatively whether universal state-of-the-art LLMs (ChatGPT and GPT-4) can predict the incidence risk of myocardial infarction (MI) with logical inference, and to compare various models in order to assess the performance of LLMs comprehensively. Methods: In this retrospective cohort study, 482,310 participants recruited from 2006 to 2010 were initially included from the UK Biobank database and later resampled into a final cohort of 690 participants. For each participant, tabular data on the risk factors of MI were transformed into standardized textual descriptions…
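The Methods describe turning each participant's tabular risk-factor data into a standardized textual description. A minimal sketch of what such a transformation could look like is given below. The field names (`age`, `sex`, `smoking_status`, `systolic_bp`), the sentence templates, and the function `describe_participant` are all hypothetical and chosen for illustration; they are not the template or variable set used in the study.

```python
from typing import Mapping

# Hypothetical fields and phrasing, for illustration only.
# These are NOT the risk factors or wording used by the study's authors.
TEMPLATES = {
    "age": "The participant is {value} years old.",
    "sex": "The participant's sex is {value}.",
    "smoking_status": "Smoking status: {value}.",
    "systolic_bp": "Systolic blood pressure is {value} mmHg.",
}


def describe_participant(record: Mapping[str, object]) -> str:
    """Render one tabular record as a fixed-order textual description.

    Fields missing from the record (or set to None) are skipped, so every
    participant is described with the same sentence templates in the same
    order, regardless of which values happen to be available.
    """
    sentences = []
    for field, template in TEMPLATES.items():
        value = record.get(field)
        if value is None:
            continue
        sentences.append(template.format(value=value))
    return " ".join(sentences)


if __name__ == "__main__":
    example = {"age": 58, "sex": "male", "systolic_bp": 142}
    print(describe_participant(example))
```

Keeping the templates in one fixed-order mapping means every record is phrased identically, which is one plausible reading of "standardized"; the resulting text could then be embedded in a prompt.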
Taxonomy
Topics: Machine Learning in Healthcare
