# Artificial intelligence-driven clinical guideline recommendations in maternal care: How trustworthy are they?

**Authors:** Jairo J. Pérez, Andrés F. Giraldo-Forero, Santiago Rúa, Daniel Betancur, Zuliany Urquina, Pablo Castañeda, Sara Arango-Valencia, Juan Guillermo Barrientos-Gómez, Ever A. Torres-Silva, Andrés Orozco-Duque

PMC · DOI: 10.7705/biomedica.7902 · Biomédica · 2025-12-10

## TL;DR

This study evaluates how well AI models follow maternal care guidelines, finding that large models like GPT-3.5 perform best in providing accurate and relevant clinical answers.

## Contribution

The study introduces a standardized evaluation framework for assessing AI-generated maternal health answers using physician-defined ground truth and retrieval-augmented generation systems.

## Key findings

- GPT-3.5 achieved the highest physician-assessed accuracy (0.90) in maternal care guideline responses.
- Large-scale models such as GPT-3.5 and Claude 3.5 outperformed lighter models such as Llama 8B in answer relevance and faithfulness.
- Rigorous validation is needed before deploying AI in clinical settings to ensure accuracy and reliability.

## Abstract

Medical staff often face difficulties in consulting and applying clinical guidelines in practice. Large language models, especially when combined with retrieval-augmented generation, may help overcome these challenges by producing context-specific outputs with improved adherence to medical guidelines.

This study aimed to assess the performance of commercial large language models in answering maternal health questions within retrieval-augmented generation systems, using both human and automated evaluation metrics.

A controlled experiment was designed to obtain accurate, consistent answers from a retrieval-augmented generation system based on Colombian maternal care guidelines. A physician formulated ten questions and defined the ground-truth answers. Various large language models were tested with a standardized prompt and evaluated through binary answer-concept ranking and through retrieval-augmented generation assessment metrics, judged by two independent large language models.
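The physician-assessed accuracy can be pictured as the share of answers graded correct under a binary scheme. The following is a minimal sketch under that assumption, not the authors' code: the function name `binary_accuracy` and the example grades are hypothetical, and the exact grading rules are in the full paper.

```python
from typing import Sequence


def binary_accuracy(grades: Sequence[bool]) -> float:
    """Return the fraction of answers graded as correct.

    Each entry is a hypothetical physician judgment: True if the model's
    answer matches the ground-truth concept, False otherwise.
    """
    if not grades:
        raise ValueError("at least one graded answer is required")
    return sum(grades) / len(grades)


# Hypothetical grades for a ten-question set (illustrative values only,
# not taken from the paper's data).
example_grades = [True] * 9 + [False]
example_accuracy = binary_accuracy(example_grades)
```

With ten questions, each answer moves the score by 0.10, so differences between models on this metric should be read with the small sample size in mind.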

Generative pre-trained transformer 3.5 (GPT-3.5) achieved the highest physician-assessed accuracy (0.90). Claude 3.5 obtained the top faithfulness score (0.78) under GPT-4.0 evaluation, while Mistral ranked highest (0.84) under Claude 3.5 evaluation. Regarding answer relevance, GPT-3.5 scored highest across both judges (0.94 and 0.86).
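To make the faithfulness figures easier to interpret, the sketch below assumes the usual RAGAS-style definition: the fraction of claims in a generated answer that the judge model finds supported by the retrieved context, averaged over questions. The function names and the verdict values are hypothetical, and the summary does not confirm that the authors computed the score exactly this way.

```python
from statistics import mean
from typing import Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Fraction of an answer's claims supported by the retrieved context.

    Each entry is a hypothetical verdict from a judge model: True if the
    claim can be inferred from the retrieved guideline passages.
    """
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


def mean_faithfulness(per_question: Sequence[Sequence[bool]]) -> float:
    """Average the per-answer faithfulness over all questions."""
    return mean(faithfulness(verdicts) for verdicts in per_question)


# Hypothetical verdicts for three answers (illustrative values only).
verdicts_by_question = [
    [True, True, False, True],  # 3 of 4 claims supported
    [True, True],               # 2 of 2 claims supported
    [True, False],              # 1 of 2 claims supported
]
overall = mean_faithfulness(verdicts_by_question)
```

Because the verdicts come from a judge model, the same answers can score differently under different judges, which is consistent with the top-ranked model changing between the GPT-4.0 and Claude 3.5 evaluations.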

Integrating retrieval-augmented generation into obstetric care has the potential to enhance evidence-based practices and improve patient outcomes. However, rigorous validation of accuracy and context-specific reliability is essential before clinical deployment. The findings of this study indicate that large-scale models (e.g., GPT-3.5, Claude, Llama 70B) consistently outperform lighter models such as Llama 8B.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12931962/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12931962/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC12931962/full.md

---
Source: https://tomesphere.com/paper/PMC12931962