Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

Sebastian Nowak; Jann-Frederick Laß; Narine Mesropyan; Babak Salam; Nico Piel; Mohammed Bahaaeldin; Wolfgang Block; Alois Martin Sprinkart; Julian Alexander Luetkens; Benjamin Wulff; Alexander Isaak

arXiv:2604.22768 · cs.CY · April 28, 2026

TL;DR

This paper presents a secure, on-premise LLM deployment architecture for radiology that emphasizes regulatory compliance, network isolation, and clinical utility, demonstrated through a pilot with radiologists.

Contribution

It introduces an isolation-first, containerized LLM infrastructure with automated testing, enabling regulatory approval and practical clinical use in a hospital setting.

Findings

01 The system was approved by institutional governance for processing protected health information (PHI).

02 Radiologists rated the system as stable and user-friendly during the pilot.

03 Text-anchored tasks received high utility ratings, whereas open-ended tasks produced more critical errors.

Abstract

Purpose: To design, implement, and evaluate a self-hosted LLM infrastructure for radiology that adheres to the principle of least privilege, and to report on its regulatory requirements, emphasizing technical feasibility, network isolation, and clinical utility. Materials and Methods: The isolation-first, containerized LLM inference stack relies on strict network segmentation, host-enforced egress filtering, and active isolation monitoring to prevent unauthorized external connectivity. An accompanying deployment package provides automated isolation and hardening tests. The system served the open-weights DeepSeek-R1 model via vLLM. In a one-week pilot phase, 22 residents and radiologists were free to use 10 predefined prompt templates whenever they considered them useful in daily work. Afterward, they rated clinical utility and system stability on a 0-10 Likert scale and reported observed…
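
The abstract mentions automated isolation tests that confirm the inference stack cannot reach external hosts. A minimal sketch of what such an egress check could look like is given below. It is not taken from the paper or its repository: the function names `try_connect` and `find_egress_leaks` and the example targets are hypothetical, chosen only for illustration. The idea is to attempt outbound TCP connections from inside the isolated environment and treat any success as a leak.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Target = Tuple[str, int]  # (hostname or IP address, TCP port)


def try_connect(target: Target, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to `target` succeeds.

    DNS failures, refused connections, and timeouts all raise OSError
    subclasses, so each of them counts as "not reachable".
    """
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        return False


def find_egress_leaks(
    targets: Iterable[Target],
    connect: Callable[[Target], bool] = try_connect,
) -> List[Target]:
    """Return every target that was reachable.

    In a correctly isolated environment the result should be empty.
    `connect` is injectable so the logic can be tested without network access.
    """
    return [target for target in targets if connect(target)]
```

Run from inside the container, a call such as `find_egress_leaks([("example.com", 443), ("1.1.1.1", 53)])` would be expected to return an empty list if egress filtering works as intended. The targets here are placeholders, and a check like this covers only outbound TCP, so it would complement rather than replace host-level firewall verification.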

Code & Models

Repositories

ukbonn/ukb-gpt (GitHub)