Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study

Shuk-Ching Wong; Edwin Kwan-Yeung Chiu; Kelvin Hei-Yeung Chiu; Anthony Raymond Tam; Pui-Hing Chau; Ming-Hong Choi; Wing-Yan Ng; Monica Oi-Tung Kwok; Benny Yu Chau; Michael Yuey-Zhun Ng; Germaine Kit-Ming Lam; Peter Wai-Ching Wong; Tom Wai-Hin Chung; Siddharth Sridhar; Edmond Siu-Keung Ma; Kwok-Yung Yuen; Vincent Chi-Chung Cheng

PMC · DOI:10.3390/healthcare13202652 · October 21, 2025

Open Access

TL;DR

This study compares how well large language models can help infection control nurses in hospitals, finding that while some models perform well, they still need human oversight.

Contribution

The paper introduces a benchmark study evaluating LLMs in clinical infection control scenarios, highlighting their potential and limitations as decision-support tools.

Findings

01

GPT-4.1 and DeepSeek V3 outperformed Gemini 2.5 Pro Exp in the quality of infection prevention and control (IPC) advice and in evidence-based recommendations.

02

Structured prompting improved LLM responses, especially in evidence quality.

03

Doctors rated LLM outputs higher than nurses did, but all models made critical errors of clinical judgment.

Abstract

Background: Infection prevention and control (IPC) in hospitals relies heavily on infection control nurses (ICNs) who manage complex consultations to prevent and control infections. This study evaluated large language models (LLMs) as artificial intelligence (AI) tools to support ICNs in IPC decision-making processes. Our goal is to enhance the efficiency of IPC practices while maintaining the highest standards of safety and accuracy. Methods: A cross-sectional benchmarking study at Queen Mary Hospital, Hong Kong assessed three LLMs—GPT-4.1, DeepSeek V3, and Gemini 2.5 Pro Exp—using 30 clinical infection control scenarios. Each model generated clarifying questions to understand the scenarios before providing IPC recommendations through two prompting methods: an open-ended inquiry and a structured template. Sixteen experts, including senior and junior ICNs and physicians, rated these…
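The design described in the Methods can be sketched as a simple evaluation grid. This is a minimal illustration, not the authors' code: it assumes every model answered every scenario under both prompting methods (the abstract does not state the crossing explicitly), and the function name `build_grid`, the scenario numbering, and the short labels for the prompting methods are hypothetical.

```python
from itertools import product

# Model names and counts are taken from the abstract.
MODELS = ["GPT-4.1", "DeepSeek V3", "Gemini 2.5 Pro Exp"]
# Hypothetical short labels for the two prompting methods.
PROMPT_METHODS = ["open-ended", "structured"]
N_SCENARIOS = 30


def build_grid(models=MODELS, prompt_methods=PROMPT_METHODS, n_scenarios=N_SCENARIOS):
    """Enumerate one record per (model, scenario, prompting method).

    Assumption: the design is fully crossed, so each model sees each
    scenario once per prompting method.
    """
    scenario_ids = range(1, n_scenarios + 1)  # hypothetical numbering 1..30
    return [
        {"model": m, "scenario": s, "prompt_method": p}
        for m, s, p in product(models, scenario_ids, prompt_methods)
    ]


grid = build_grid()
print(len(grid))  # number of model responses under the fully crossed assumption
```

Under that assumption the grid holds 3 × 30 × 2 = 180 responses, each of which would then be passed to the expert raters.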

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species (2)

Candidozyma auris (species); Mycobacteriales (order)

Chemicals (1)

Gemini

Diseases (2)

tuberculosis; Infection

Figures (3)

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · COVID-19 diagnosis using AI