Large language models for patient education prior to interventional radiology procedures: a comparative study

Bogdan Levita; Semil Eminovic; Willie Magnus Lüdemann; Dirk Schnapauff; Robin Schmidt; Anna-Maria Haack; Andrea Dell’Orco; Jawed Nawabi; Tobias Penzkofer

PMC · DOI: 10.1186/s42155-025-00609-z · October 13, 2025

Open Access

TL;DR

This study compares how well four large language models can answer patient questions about specific interventional radiology procedures, finding that some models perform well enough to potentially aid patient education.

Contribution

The study evaluates LLMs for patient education in interventional radiology and identifies performance differences across models and procedures.

Findings

01

DeepSeek-V3 and ChatGPT-4o outperformed OpenBioLLM-8b and BioMistral-7b in answering questions about interventional radiology procedures.

02

Preparation/Planning was the only category without significant differences across all models and procedures.

03

LLMs such as DeepSeek-V3 and ChatGPT-4o show potential to enhance patient education but cannot yet replace clinical consultations.

Abstract

This study evaluates the ability of four large language models (LLMs) to answer common patient questions preceding transarterial periarticular embolization (TAPE), computed tomography (CT)-guided high-dose-rate (HDR) brachytherapy, and bleomycin electrosclerotherapy (BEST). The goal is to assess their potential to enhance clinical workflows and patient comprehension, along with the associated risks. Thirty-five TAPE-related, 34 CT-HDR brachytherapy-related, and 36 BEST-related questions were presented to ChatGPT-4o, DeepSeek-V3, OpenBioLLM-8b, and BioMistral-7b. The LLM-generated responses were independently assessed by two board-certified radiologists. Accuracy was rated on a 5-point Likert scale. Statistical analysis compared LLM performance across question categories to gauge suitability for patient education. DeepSeek-V3 attained the highest mean scores for BEST [4.49 (± 0.77)] and CT-HDR [4.24 (± 0.81)] and…

Linked entities

Species (1)

Homo sapiens (human · species)

Chemicals (1)

bleomycin

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · Radiology practices and education