Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Lavender Y. Jiang; Angelica Chen; Xu Han; Xujin Chris Liu; Radhika Dua; Kevin Eaton; Frederick Wolff; Robert Steele; Jeff Zhang; Anton Alyakin; Qingkai Pan; Yanbing Chen; Karl L. Sangwon; Daniel A. Alber; Jaden Stryker; Jin Vivian Lee; Yindalon Aphinyanaphongs; Kyunghyun Cho; Eric Karl Oermann

arXiv:2511.13703 · cs.CL · November 18, 2025

Open Access

TL;DR

This paper introduces Lang1, a family of specialized clinical language models pretrained on EHR data blended with internet text, and demonstrates that, after supervised finetuning, it significantly outperforms generalist models on hospital operational tasks in real-world settings.

Contribution

The paper presents Lang1, a new domain-specific model trained on clinical data, and introduces ReMedE, a benchmark for evaluating models on hospital operations tasks. Together, these show the importance of in-domain training and finetuning.

Findings

01. Lang1-1B outperforms larger generalist models after finetuning.

02. Supervised finetuning on domain data improves performance significantly.

03. In-domain pretraining enhances transferability to out-of-distribution clinical tasks.

Abstract

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models…

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling