# Evaluation of a Retrieval-Augmented Generation Chatbot for Antimicrobial Resistance Research: Comparative Analysis of Large Language Models

**Authors:** Oscar Escudero-Arnanz, Manuel Eduardo Valero-Méndez, Noelia Sánchez-Ramos, Cristina Soguero-Ruíz

PMC · DOI: 10.2196/83206 · JMIR AI · 2026-03-24

## TL;DR

This paper evaluates a retrieval-augmented generation (RAG) chatbot for antimicrobial resistance (AMR) research, comparing five large language models (LLMs) on accuracy, faithfulness, cost, and response time.

## Contribution

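The study describes the development of a RAG chatbot for AMR literature analysis and compares five commercial and open-source LLMs (GPT-4, GPT-4o, GPT-4o-mini, Claude 3.7 Sonnet, and LLaMA 4 Maverick) on accuracy, faithfulness, response time, and cost-efficiency.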

## Key findings

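- GPT-4 achieved the highest correctness score (94.7%) but incurred the highest cost.
- GPT-4o delivered nearly identical accuracy at a 9-fold lower cost, with the fastest response time (3.88 s).
- LLaMA 4 Maverick and GPT-4o-mini offered lower accuracy but substantially reduced operational costs.
- Claude 3.7 Sonnet showed competitive accuracy but the least favorable cost-performance ratio.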

## Abstract

Antimicrobial resistance (AMR) poses a critical global health threat, undermining the efficacy of antibiotics and complicating clinical decision-making. Although scientific literature on AMR is extensive, retrieving and synthesizing relevant evidence remains time-consuming for clinicians and researchers. Recent advances in large language models (LLMs) offer opportunities to enhance access to domain-specific knowledge. However, the diversity of available models, ranging from open-source to commercial, necessitates a systematic comparison of their performance, cost, and scalability in real-world biomedical applications.

This study aims to describe the development of a retrieval-augmented generation (RAG) chatbot for AMR literature analysis and compare multiple commercial and open-source LLMs in terms of accuracy, faithfulness, response time, and cost-efficiency.

A corpus of 164 peer-reviewed AMR-related articles was compiled from Google Scholar and embedded into a ChromaDB vector database using OpenAI’s text-embedding-ada-002. The RAG chatbot was implemented to operate with 5 LLM backbones: GPT-4, GPT-4o, GPT-4o-mini, Claude 3.7 Sonnet, and LLaMA 4 Maverick. For each model, a temperature ablation study was performed to determine optimal performance. Evaluation metrics included correctness (pass rate and score), faithfulness, relevancy, computational cost, and latency, using a synthetic ground truth dataset generated with GPT-4.
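The retrieval step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the toy bag-of-words `embed` function stands in for OpenAI's text-embedding-ada-002, the in-memory list stands in for the ChromaDB collection, and the function names, prompt wording, and example chunks are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (stand-in for a vector DB query)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble a grounded prompt; the wording is hypothetical, not from the paper."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


# Illustrative chunks written for this example; they are not taken from the study corpus.
chunks = [
    "Carbapenem resistance in Klebsiella pneumoniae is often mediated by carbapenemase enzymes.",
    "Vector databases store embeddings for similarity search.",
    "Hand hygiene reduces nosocomial infections in intensive care units.",
]
query = "How does Klebsiella pneumoniae resist carbapenem antibiotics?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real deployment the prompt would be sent to whichever LLM backbone is under test; swapping the backbone leaves the retrieval step unchanged, which is what makes a like-for-like comparison of models possible.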

All models generated scientifically grounded responses when integrated into the RAG framework. GPT-4 achieved the highest correctness score (94.7%) but incurred the highest cost, while GPT-4o delivered nearly identical accuracy at a 9-fold lower cost and the fastest response time (3.88 s). LLaMA 4 Maverick and GPT-4o-mini offered lower accuracy but substantially reduced operational costs. Claude 3.7 Sonnet showed competitive accuracy but the least favorable cost-performance ratio. Qualitative analysis revealed differences in response style, detail, and structure among models.
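Correctness results of this kind are aggregates over per-question judgments. The sketch below shows one way a pass rate and mean score could be computed, and how a temperature ablation might select a setting. It rests on assumptions not stated in the abstract: a 1 to 5 judge scale, a pass threshold of 4.0, and selection by highest mean score. All scores are placeholders, not data from the paper.

```python
from statistics import mean

# Assumption: judge scores lie on a 1-5 scale and a question passes at >= 4.0.
PASS_THRESHOLD = 4.0


def summarize(scores: list[float], threshold: float = PASS_THRESHOLD) -> dict[str, float]:
    """Aggregate per-question judge scores into a pass rate and a mean score."""
    return {
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
        "mean_score": mean(scores),
    }


def best_temperature(results: dict[float, list[float]]) -> float:
    """Pick the temperature with the highest mean score; ties go to the lower temperature."""
    return min(results, key=lambda t: (-mean(results[t]), t))


# Placeholder scores for one hypothetical model; NOT results from the study.
ablation = {
    0.0: [5.0, 4.0, 4.5, 3.0],
    0.5: [5.0, 4.5, 4.5, 3.5],
    1.0: [4.0, 3.0, 4.5, 2.5],
}
chosen = best_temperature(ablation)
summary = summarize(ablation[chosen])
print(chosen, summary)
```

Running the same procedure once per model keeps the comparison fair, since each model is then evaluated at the temperature where it scores best.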

A RAG-based chatbot can effectively support AMR research by delivering accurate, context-grounded, and scalable access to scientific literature. The comparative evaluation highlights trade-offs between performance, cost, and speed, guiding the selection of LLM architectures for clinical and research settings. Future work will focus on integrating language-specific embeddings and specialized domain agents to further enhance accuracy, adaptability, and clinical use.

## Full-text entities

- **Diseases:** PDR (MESH:D000069279), Multilobar pneumonia (MESH:D011014), deaths (MESH:D003643), bacteremia (MESH:D016470), status epilepticus (MESH:D013226), brain injuries (MESH:D001930), Infection (MESH:D007239), Septic shock (MESH:D012772), subdural hematoma (MESH:D006408), hallucinations (MESH:D006212), trauma (MESH:D014947), AMR (MESH:D060467), VAP (MESH:D053717), cystic fibrosis infections (MESH:D003550), sepsis (MESH:D018805), LLMs (MESH:D007806), Coma (MESH:D003128), MDR (MESH:D018088), nosocomial infections (MESH:D003428)
- **Chemicals:** azole (MESH:D001393), 4o (-), carbapenem (MESH:D015780), K (MESH:D011188)
- **Species:** Enterobacterales (order) [taxon 91347], Human immunodeficiency virus 1 (no rank) [taxon 11676], Bacteria Latreille et al. 1825 (Bacteria stick insect, genus) [taxon 629395], Staphylococcus aureus (species) [taxon 1280], Mycobacterium tuberculosis (species) [taxon 1773], Enterococcus faecium (species) [taxon 1352], Homo sapiens (human, species) [taxon 9606], Candidozyma auris (species) [taxon 498019], Acinetobacter baumannii (species) [taxon 470], Enterobacter (genus) [taxon 547], Pseudomonas aeruginosa (species) [taxon 287], Aspergillus fumigatus (species) [taxon 746128], Klebsiella pneumoniae (species) [taxon 573], Viruses (acellular root) [taxon 10239], H1N1 subtype (serotype) [taxon 114727]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13012410/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13012410/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC13012410/full.md

---
Source: https://tomesphere.com/paper/PMC13012410