Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study
Shuk-Ching Wong, Edwin Kwan-Yeung Chiu, Kelvin Hei-Yeung Chiu, Anthony Raymond Tam, Pui-Hing Chau, Ming-Hong Choi, Wing-Yan Ng, Monica Oi-Tung Kwok, Benny Yu Chau, Michael Yuey-Zhun Ng, Germaine Kit-Ming Lam, Peter Wai-Ching Wong, Tom Wai-Hin Chung, Siddharth Sridhar

TL;DR
This study compares how well large language models can support infection control nurses in hospitals, finding that GPT-4.1 and DeepSeek V3 perform best but that all models still need human oversight.
Contribution
The paper introduces a benchmark study evaluating LLMs in clinical infection control scenarios, highlighting their potential and limitations as decision-support tools.
Findings
GPT-4.1 and DeepSeek V3 outperformed Gemini 2.5 Pro Exp in IPC advice quality and evidence-based recommendations.
Structured prompting improved LLM responses, especially in evidence quality.
Doctors rated LLM outputs higher than nurses did, but all models made critical clinical judgment errors.
Abstract
Background: Infection prevention and control (IPC) in hospitals relies heavily on infection control nurses (ICNs) who manage complex consultations to prevent and control infections. This study evaluated large language models (LLMs) as artificial intelligence (AI) tools to support ICNs in IPC decision-making processes. Our goal is to enhance the efficiency of IPC practices while maintaining the highest standards of safety and accuracy.
Methods: A cross-sectional benchmarking study at Queen Mary Hospital, Hong Kong, assessed three LLMs (GPT-4.1, DeepSeek V3, and Gemini 2.5 Pro Exp) using 30 clinical infection control scenarios. Each model generated clarifying questions to understand the scenarios before providing IPC recommendations through two prompting methods: an open-ended inquiry and a structured template. Sixteen experts, including senior and junior ICNs and physicians, rated these…
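To make the scale of the evaluation design concrete, the following minimal sketch enumerates the response grid implied by the Methods text. It assumes every model answered every scenario under both prompting methods, which the abstract suggests but does not state outright. The scenario IDs and prompt-style labels are hypothetical placeholders, not names used by the authors.

```python
from itertools import product

# Model names and counts come from the abstract; the scenario IDs and
# prompt-style labels are hypothetical placeholders for illustration.
MODELS = ["GPT-4.1", "DeepSeek V3", "Gemini 2.5 Pro Exp"]
SCENARIOS = [f"scenario_{i:02d}" for i in range(1, 31)]  # 30 scenarios
PROMPT_STYLES = ["open_ended", "structured_template"]


def build_grid():
    """Return every (model, scenario, prompt_style) combination."""
    return list(product(MODELS, SCENARIOS, PROMPT_STYLES))


grid = build_grid()
# 3 models x 30 scenarios x 2 prompt styles
print(len(grid))
```

Under that assumption the grid holds 180 model responses, each of which would then go to the expert raters.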
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · COVID-19 diagnosis using AI
