Benchmarking LLMs for Predictive Applications in the Intensive Care Units

Chehak Malhotra; Mehak Gopal; Akshaya Devadiga; Pradeep Singh; Ridam Pal; Ritwik Kashyap; Tavpritesh Sethi

arXiv:2512.20520 · cs.AI · December 24, 2025

TL;DR

This study benchmarks large language models (LLMs) against smaller models for predicting shock in ICU patients. It finds that LLMs are not necessarily superior for clinical prediction tasks and highlights the need for models focused on clinical trajectory prediction.

Contribution

The paper provides a comparative analysis of LLMs and SLMs for ICU shock prediction, highlights the limited advantage of LLMs in this domain, and suggests future directions for model development.

Findings

01. GatorTron Base achieved 80.5% weighted recall.

02. Performance was similar between LLMs and SLMs.

03. LLMs are not inherently better for clinical event prediction.
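
The weighted recall quoted in the first finding can be illustrated with a small sketch. It assumes "weighted recall" means the support-weighted average of per-class recall (the convention used by common ML toolkits); the paper's exact evaluation code is not shown here, and the function name and toy labels below are hypothetical.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Support-weighted average of per-class recall (assumed definition)."""
    support = Counter(y_true)  # number of true examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    n = len(y_true)
    # Each class contributes recall_c * (support_c / n).
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

# Toy labels (illustrative only, not data from the paper): 0 = normal SI, 1 = abnormal SI.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = weighted_recall(y_true, y_pred)
```

Under this definition the majority class contributes most of the score, so on imbalanced data it is worth reading weighted recall alongside per-class recall.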

Abstract

With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application to predictive tasks remains under-researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against smaller models such as BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC-III database were scored for length of stay > 24 hours and shock index (SI) > 0.7, yielding 355 patients with normal SI and 87 with abnormal SI. Both focal and cross-entropy losses were used during finetuning to address class imbalance. Our findings indicate that while GatorTron Base…
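
The abstract names focal loss alongside cross-entropy as a way to handle class imbalance. A minimal sketch of binary focal loss in its standard form, FL = -alpha_t (1 - p_t)^gamma log(p_t), is given below. The `gamma` and `alpha` defaults and the function name are illustrative assumptions, not the settings used in the paper.

```python
import math

def focal_loss(probs, labels, gamma=2.0, alpha=None):
    """Mean binary focal loss.

    probs  : predicted probability of the positive class (label 1)
    labels : ground-truth labels, 0 or 1
    gamma  : focusing parameter; gamma=0 with alpha=None is plain cross-entropy
    alpha  : optional weight for the positive class (negatives get 1 - alpha)
    """
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p      # probability assigned to the true class
        p_t = min(max(p_t, eps), 1.0)       # clamp to avoid log(0)
        if alpha is None:
            alpha_t = 1.0
        else:
            alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma shrinks the loss of well-classified examples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
easy_ce = focal_loss([0.9], [1], gamma=0.0)
easy_fl = focal_loss([0.9], [1], gamma=2.0)
hard_fl = focal_loss([0.1], [1], gamma=2.0)
```

With `gamma=0` and `alpha=None` the function reduces to ordinary cross-entropy, which makes the two losses mentioned in the abstract easy to compare in one code path.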

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Sepsis Diagnosis and Treatment