# Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study

**Authors:** Shuk-Ching Wong, Edwin Kwan-Yeung Chiu, Kelvin Hei-Yeung Chiu, Anthony Raymond Tam, Pui-Hing Chau, Ming-Hong Choi, Wing-Yan Ng, Monica Oi-Tung Kwok, Benny Yu Chau, Michael Yuey-Zhun Ng, Germaine Kit-Ming Lam, Peter Wai-Ching Wong, Tom Wai-Hin Chung, Siddharth Sridhar, Edmond Siu-Keung Ma, Kwok-Yung Yuen, Vincent Chi-Chung Cheng

PMC · DOI: 10.3390/healthcare13202652 · 2025-10-21

## TL;DR

This study benchmarks three large language models on 30 hospital infection control scenarios and finds that, although GPT-4.1 and DeepSeek V3 give useful advice, all models make critical errors and still need human oversight.

## Contribution

The paper introduces a benchmark study evaluating LLMs in clinical infection control scenarios, highlighting their potential and limitations as decision-support tools.

## Key findings

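- GPT-4.1 and DeepSeek V3 outperformed Gemini 2.5 Pro Exp on the composite quality score for infection prevention and control (IPC) advice (adjusted means 36.77 and 36.25 vs. 33.19, p < 0.001); GPT-4.1 led in evidence quality, usefulness, and relevance.
- Structured prompting significantly improved responses, mainly through better evidence quality (p < 0.001), although Gemini 2.5 Pro Exp failed to generate responses in 50% of scenarios under structured prompts.
- Doctors rated LLM outputs higher than nurses did (38.83 vs. 32.06, p < 0.001), but every model made critical errors in clinical judgment.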

## Abstract

Background: Infection prevention and control (IPC) in hospitals relies heavily on infection control nurses (ICNs) who manage complex consultations to prevent and control infections. This study evaluated large language models (LLMs) as artificial intelligence (AI) tools to support ICNs in IPC decision-making processes. The goal was to enhance the efficiency of IPC practices while maintaining the highest standards of safety and accuracy.

Methods: A cross-sectional benchmarking study at Queen Mary Hospital, Hong Kong assessed three LLMs (GPT-4.1, DeepSeek V3, and Gemini 2.5 Pro Exp) using 30 clinical infection control scenarios. Each model generated clarifying questions to understand the scenarios before providing IPC recommendations through two prompting methods: an open-ended inquiry and a structured template. Sixteen experts, including senior and junior ICNs and physicians, rated these responses on coherence, conciseness, usefulness and relevance, evidence quality, and actionability (1–10 scale). Quantitative and qualitative analyses assessed AI performance, reliability, and clinical applicability.

Results: GPT-4.1 and DeepSeek V3 scored significantly higher on the composite quality scale, with adjusted means (95% CI) of 36.77 (33.98–39.57) and 36.25 (33.45–39.04), respectively, compared with Gemini 2.5 Pro Exp at 33.19 (30.39–35.99) (p < 0.001). GPT-4.1 led in evidence quality, usefulness, and relevance. Gemini 2.5 Pro Exp failed to generate responses in 50% of scenarios under structured prompt conditions. Structured prompting yielded significant improvements, primarily by enhancing evidence quality (p < 0.001). Evaluator background influenced scoring, with doctors rating outputs higher than nurses did (38.83 vs. 32.06, p < 0.001). However, a qualitative review revealed critical deficiencies across all models. For example, DeepSeek V3 based tuberculosis treatment solely on a positive acid-fast bacilli (AFB) smear without considering nontuberculous mycobacteria, and Gemini 2.5 Pro Exp gave an impractical and noncommittal response regarding the de-escalation of precautions for Candida auris. These errors highlight potential safety risks and limited real-world applicability, despite generally positive scores.

Conclusions: While GPT-4.1 and DeepSeek V3 deliver useful IPC advice, they are not yet reliable for autonomous use. Critical errors in clinical judgment and practical applicability highlight that LLMs cannot replace the expertise of ICNs. These technologies should serve as adjunct tools to support, rather than automate, clinical decision-making.
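
The abstract does not state how the composite quality score is built from the five rated criteria. The following is a minimal sketch that assumes an unweighted sum of the five 1–10 ratings (range 5–50), which is at least consistent with reported means in the 30s. The function names, criterion keys, and example ratings are hypothetical and are not study data; the adjusted means in the abstract come from a statistical model that is not reproduced here.

```python
from statistics import mean

# The five criteria named in the abstract; the key spellings are illustrative.
CRITERIA = (
    "coherence",
    "conciseness",
    "usefulness_relevance",
    "evidence_quality",
    "actionability",
)


def composite_score(rating: dict[str, int]) -> int:
    """Sum the five criterion ratings (ASSUMPTION: unweighted sum, range 5-50)."""
    missing = [c for c in CRITERIA if c not in rating]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        if not 1 <= rating[criterion] <= 10:
            raise ValueError(f"{criterion} must be on the 1-10 scale")
    return sum(rating[criterion] for criterion in CRITERIA)


def mean_composite(ratings: list[dict[str, int]]) -> float:
    """Unadjusted mean composite across a set of evaluator ratings."""
    return mean(composite_score(r) for r in ratings)


# Made-up ratings from two hypothetical evaluators for one model response.
example_ratings = [
    {"coherence": 8, "conciseness": 7, "usefulness_relevance": 8,
     "evidence_quality": 6, "actionability": 7},
    {"coherence": 7, "conciseness": 7, "usefulness_relevance": 6,
     "evidence_quality": 5, "actionability": 7},
]

if __name__ == "__main__":
    for r in example_ratings:
        print(composite_score(r))
    print(mean_composite(example_ratings))
```

Under this assumption, a response can reach a composite in the mid-30s with moderate scores on every criterion, so a "generally positive" composite does not rule out the kind of critical clinical error described in the qualitative review.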

## Linked entities

- **Diseases:** tuberculosis (MONDO:0018076)

## Full-text entities

- **Diseases:** Infection (MESH:D007239), tuberculosis (MESH:D014376)
- **Species:** Candidozyma auris (species) [taxon 498019], Mycobacteriales (order) [taxon 85007]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12563182/full.md

---
Source: https://tomesphere.com/paper/PMC12563182