# Large Language Models in Infectious Diseases: A Systematic Review

**Authors:** Alon Gorenshtein, Eyal Klang, Jacob J. Smith, Richard Dzeng, Mark C. Poznansky, Girish N. Nadkarni, Mahmud Omar

PMC · DOI: 10.21203/rs.3.rs-8901882/v1 · Research Square · 2026-02-18

## TL;DR

This review finds that large language models (LLMs) can help with infectious disease diagnosis and stewardship but are unreliable for autonomous use due to errors and fabricated content.

## Contribution

The paper provides a systematic review of LLMs in infectious diseases, highlighting safety issues and performance gaps compared to experts.

## Key findings

- LLMs reached 80–100% diagnostic sensitivity for structured infections, but safety issues were reported in 90% of studies, including fabricated content in 32%.
- Retrieval-augmented systems raised specificity from 35% to 75% and reduced hallucinations compared to standard models.
- Only 23% of studies were at low risk of bias and only 32% used real clinical data, which limits the reliability of the evidence.
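For readers less familiar with the two accuracy metrics named above, the following is a minimal sketch of how sensitivity and specificity are computed from a confusion matrix. The counts are hypothetical and chosen only so that the arithmetic reproduces the reported 35% and 75% specificity figures; they are not data from the review.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly infected cases that the model flags (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of truly uninfected cases that the model clears (true negative rate)."""
    return tn / (tn + fp)


# Hypothetical counts for 20 patients without infection (illustration only).
baseline_specificity = specificity(tn=7, fp=13)    # standard model
augmented_specificity = specificity(tn=15, fp=5)   # retrieval-augmented model

# Hypothetical counts for 20 patients with infection (illustration only).
example_sensitivity = sensitivity(tp=18, fn=2)

print(f"baseline specificity:  {baseline_specificity:.0%}")
print(f"augmented specificity: {augmented_specificity:.0%}")
print(f"example sensitivity:   {example_sensitivity:.0%}")
```

In this illustration, higher specificity means fewer false positives: the hypothetical retrieval-augmented model wrongly flags 5 of 20 uninfected patients instead of 13.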

## Abstract

Clinical reasoning in infectious diseases relies on validated evidence. LLMs are being introduced into diagnosis, antimicrobial stewardship, and guideline interpretation before their safety and reliability are established.

This review, registered in PROSPERO (CRD420251155354), evaluated studies using GPT, Claude, Gemini, and retrieval-augmented or agentic systems for infectious disease decision-making. PubMed, CENTRAL, Scopus, and Web of Science were searched from January 2018 to September 2025. Two reviewers screened and extracted data. Risk of bias was assessed with QUADAS-AI.

Thirty-one studies met inclusion criteria. Most were cross-sectional (61%) and vignette-based (68%). Only 32% used real clinical data, and only 23% had low risk of bias. Safety issues were reported in 90% of studies: incomplete responses (61%), unsafe advice (23–32%), and fabricated content (32%). In antimicrobial stewardship, agreement with infectious-disease specialists was approximately 50%. Diagnostic sensitivity for structured infections was 80–100%. Retrieval-augmented systems increased specificity from 35% to 75% and reduced hallucinations. Proprietary models outperformed open-source models but did not reach expert accuracy.
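The percentages above can be translated into approximate study counts. The sketch below back-calculates them under the assumption that each percentage is a share of all 31 included studies, rounded to the nearest whole percent. The resulting counts are estimates derived from this summary, not figures reported by the authors.

```python
TOTAL_STUDIES = 31  # number of included studies stated in the abstract

# Percentages as reported in the abstract; the dictionary keys are illustrative labels.
reported_percentages = {
    "cross-sectional design": 61,
    "vignette-based": 68,
    "real clinical data": 32,
    "low risk of bias": 23,
    "any safety issue": 90,
    "incomplete responses": 61,
    "fabricated content": 32,
}


def approximate_count(percentage: float, total: int = TOTAL_STUDIES) -> int:
    """Nearest whole number of studies implied by a rounded percentage."""
    return round(percentage / 100 * total)


approximate_counts = {
    label: approximate_count(pct) for label, pct in reported_percentages.items()
}

for label, count in approximate_counts.items():
    print(f"{label}: about {count} of {TOTAL_STUDIES} studies")
```

Under that assumption, for example, "32% used real clinical data" corresponds to roughly 10 of the 31 studies.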

LLMs perform well in defined diagnostic tasks but remain unreliable for autonomous clinical use. High error rates, inconsistent reasoning, and fabricated content require expert oversight and external validation before deployment.

Large language models show promise in infectious diseases for narrow diagnostic tasks, surveillance, and stewardship, but most evidence is retrospective and at high risk of bias. Hallucinations and context errors persist. External validation, human oversight, and retrieval-augmented or agentic safeguards are essential.

## Full-text entities

- **Diseases:** Infectious Diseases (MESH:D003141), infections (MESH:D007239), hallucinations (MESH:D006212)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12934913/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12934913/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/PMC12934913/full.md

---
Source: https://tomesphere.com/paper/PMC12934913