ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks
Yinghao Zhu, Junyi Gao, Zixiang Wang, Weibin Liao, Xiaochen Zheng, Lifang Liang, Miguel O. Bernabeu, Yasha Wang, Lequan Yu, Chengwei Pan, Ewen M. Harrison, Liantao Ma

TL;DR
This study benchmarks various large language models against traditional methods for non-generative clinical prediction tasks, revealing that modern LLMs often outperform specialized models, especially in data-scarce scenarios and unstructured text analysis.
Contribution
The study systematically benchmarks 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR), and shows that modern LLMs are competitive with specialized models.
Findings
Leading LLMs in zero-shot settings outperform finetuned BERT models on clinical note prediction.
Advanced LLMs excel in zero-shot structured EHR tasks.
Open-source LLMs match or surpass proprietary models.
Abstract
Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility in non-generative clinical prediction, often presumed inferior to specialized models, remains under-evaluated, leading to ongoing debate within the field and potential for misuse, misunderstanding, or over-reliance due to a lack of systematic benchmarking. Our ClinicRealm study addresses this by benchmarking 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR), while also assessing their reasoning, reliability, and fairness. Key findings reveal a significant shift: for clinical note predictions, leading LLMs (e.g., DeepSeek-V3.1-Think, GPT-5) in zero-shot settings now decisively outperform finetuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs (e.g., GPT-5,…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare
Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · GPT · Linear Layer · Linear Warmup With Linear Decay · Dense Connections
