ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks

Yinghao Zhu; Junyi Gao; Zixiang Wang; Weibin Liao; Xiaochen Zheng; Lifang Liang; Miguel O. Bernabeu; Yasha Wang; Lequan Yu; Chengwei Pan; Ewen M. Harrison; Liantao Ma

arXiv:2407.18525·cs.CL·October 7, 2025·2 cites

TL;DR

This study benchmarks various large language models against traditional methods for non-generative clinical prediction tasks, revealing that modern LLMs often outperform specialized models, especially in data-scarce scenarios and unstructured text analysis.

Contribution

The paper provides a comprehensive, systematic benchmark of LLMs against traditional models for clinical prediction, showing that modern LLMs are competitive across a range of settings.

Findings

01. LLMs outperform finetuned BERT in clinical note predictions.
02. Advanced LLMs excel in zero-shot structured EHR tasks.
03. Open-source LLMs match or surpass proprietary models.

Abstract

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility in non-generative clinical prediction, often presumed inferior to specialized models, remains under-evaluated, leading to ongoing debate within the field and potential for misuse, misunderstanding, or over-reliance due to a lack of systematic benchmarking. Our ClinicRealm study addresses this by benchmarking 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR), while also assessing their reasoning, reliability, and fairness. Key findings reveal a significant shift: for clinical note predictions, leading LLMs (e.g., DeepSeek-V3.1-Think, GPT-5) in zero-shot settings now decisively outperform finetuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs (e.g., GPT-5,…

Code & Models

Repositories

yhzhu99/ehr-llm-benchmark
PyTorch · Official

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare

Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · GPT · Linear Layer · Linear Warmup With Linear Decay · Dense Connections