Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind; Tri-Thien Nguyen; Jeta Sopa; Mahshad Lotfinia; Sebastian Bickelhaup; Michael Uder; Harald Köstler; Gerhard Wellein; Sven Nebelung; Daniel Truhn; Andreas Maier; Soroosh Tayebi Arasteh

arXiv:2605.04039 · cs.CL · May 6, 2026

TL;DR

This paper introduces SaFE-Scale, a framework with an accompanying benchmark for evaluating how scaling and deployment factors affect the safety and accuracy of clinical large language models. It finds that safety depends on evidence quality and retrieval strategy.

Contribution

The paper presents the SaFE-Scale framework and the RadSaFE-200 benchmark, which together measure safety and accuracy trade-offs in clinical LLMs systematically across deployment conditions.

Findings

01

Clean evidence significantly improves accuracy and safety metrics.

02

Standard and agentic RAG do not fully improve safety, even where they raise accuracy.

03

Max-context prompting and additional compute yield limited safety improvements.

Abstract

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment…
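
The option-level labels described in the abstract suggest that accuracy and safety can be scored as separate quantities over the same set of answers. The following is a minimal sketch of that idea, not the paper's implementation: the `Item` fields, the `score` function, the metric definitions (simple per-question rates), and the toy questions and answers are all assumptions made for illustration and are not taken from RadSaFE-200.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice question with hypothetical option-level labels."""
    correct: str                                   # letter of the correct option
    high_risk: frozenset = frozenset()             # options labeled high-risk errors
    unsafe: frozenset = frozenset()                # options labeled unsafe answers
    contradicts_evidence: frozenset = frozenset()  # options contradicting the evidence


def score(items, answers):
    """Return accuracy and three safety-related rates for one model's answers.

    Each rate is the fraction of questions whose chosen option carries the
    corresponding label (an assumed definition, chosen for simplicity).
    """
    if not items or len(items) != len(answers):
        raise ValueError("items and answers must be non-empty and equal in length")
    n = len(items)
    pairs = list(zip(items, answers))
    return {
        "accuracy": sum(a == it.correct for it, a in pairs) / n,
        "high_risk_rate": sum(a in it.high_risk for it, a in pairs) / n,
        "unsafe_rate": sum(a in it.unsafe for it, a in pairs) / n,
        "contradiction_rate": sum(a in it.contradicts_evidence for it, a in pairs) / n,
    }


# Toy data, invented for this sketch.
ITEMS = [
    Item("A", frozenset("C"), frozenset("CD"), frozenset("C")),
    Item("B", frozenset("D"), frozenset("D"), frozenset("AD")),
    Item("C", frozenset(), frozenset("A"), frozenset()),
    Item("D", frozenset("A"), frozenset("AB"), frozenset("B")),
]

# Two hypothetical models: the first is more accurate but makes one
# high-risk, unsafe error; the second is less accurate but its errors
# fall on options without safety labels.
MODEL_X = ["A", "B", "C", "A"]
MODEL_Y = ["A", "B", "B", "C"]

if __name__ == "__main__":
    print("model_x", score(ITEMS, MODEL_X))
    print("model_y", score(ITEMS, MODEL_Y))
```

In this toy setup the more accurate model also has the higher unsafe-answer rate, which is the kind of divergence that reporting accuracy alone would hide.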
