# Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

**Authors:** Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann

PMC · DOI: 10.21203/rs.3.rs-9078142/v1 · 2026-03-19

## TL;DR

This paper argues that healthcare AI models need pretraining on real hospital records, followed by supervised fine-tuning, to perform well on operational tasks such as predicting patient readmission or insurance denial.

## Contribution

The paper introduces Lang1, a family of language models (100M-7B parameters) pretrained on a blend of clinical EHR text and internet data, and evaluates it on ReMedE, a suite of five hospital operations tasks.

## Key findings

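- In zero-shot settings, general-purpose and biomedical models underperform on four of five hospital operations tasks; after fine-tuning, Lang1-1B outperforms fine-tuned generalist models up to 70× larger and zero-shot models up to 671× larger.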
- Joint multi-task fine-tuning yields cross-task transfer, and Lang1-1B transfers effectively to unseen tasks and to an external health system.
- The study concludes that effective healthcare AI requires in-domain pretraining, supervised fine-tuning, and evaluation beyond proxy benchmarks.

## Abstract

Operational decisions governing patient flow, cost, and quality of care demand specialized predictive models, yet most clinical NLP efforts focus on medical knowledge benchmarks. We introduce Lang1, a family of language models (100M-7B parameters) pretrained on 80 billion clinical tokens from NYU Langone Health electronic health records blended with 627 billion internet tokens. We evaluate Lang1 on the REalistic Medical Evaluation (ReMedE), an evaluation suite derived from 668,331 Electronic Health Records (EHR) notes spanning five tasks: readmission, mortality prediction, length of stay, comorbidity coding, and insurance denial. In zero-shot settings, both general-purpose and biomedical models underperform on four of five tasks. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70× larger and zero-shot models up to 671× larger. Joint multi-task finetuning yields cross-task transfer, and Lang1-1B transfers effectively to unseen tasks and an external health system. These results demonstrate that effective healthcare AI requires in-domain pretraining, supervised finetuning, and evaluation beyond proxy benchmarks.
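
To put the abstract's numbers in proportion, the arithmetic can be sketched as below. This is a rough illustration and not the authors' code. It assumes the two token counts are simply summed (the actual sampling mix during pretraining may differ) and that "Lang1-1B" has a nominal 1 billion parameters. The function names are hypothetical.

```python
# Token counts as stated in the abstract, in billions.
CLINICAL_TOKENS_B = 80
INTERNET_TOKENS_B = 627


def clinical_fraction(clinical_b: float, internet_b: float) -> float:
    """Share of clinical tokens if the two corpora are simply summed (assumption)."""
    return clinical_b / (clinical_b + internet_b)


def implied_params_b(base_params_b: float, size_ratio: float) -> float:
    """Parameter count (in billions) of a model `size_ratio` times larger than the base."""
    return base_params_b * size_ratio


total_b = CLINICAL_TOKENS_B + INTERNET_TOKENS_B
fraction = clinical_fraction(CLINICAL_TOKENS_B, INTERNET_TOKENS_B)

# Assumption: Lang1-1B is treated as exactly 1B parameters (nominal size).
finetuned_baseline_b = implied_params_b(1, 70)
zero_shot_baseline_b = implied_params_b(1, 671)

print(f"Total pretraining tokens: {total_b}B")
print(f"Clinical share of the blend: {fraction:.1%}")
print(f"Largest fine-tuned comparison: ~{finetuned_baseline_b}B parameters")
print(f"Largest zero-shot comparison: ~{zero_shot_baseline_b}B parameters")
```

Under these assumptions, clinical text makes up roughly a ninth of the combined corpus, and the comparison models reach about 70B and 671B parameters.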

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13015597/full.md

---
Source: https://tomesphere.com/paper/PMC13015597