# AI Decision-Making Performance in Maternal–Fetal Medicine: Comparison of ChatGPT-4, Gemini, and Human Specialists in a Cross-Sectional Case-Based Study

**Authors:** Matan Friedman, Amit Slouk, Noa Gonen, Laura Guzy, Yael Ganor Paz, Kira Nahum Sacks, Amihai Rottenstreich, Eran Weiner, Ohad Gluck, Ilia Kleiner

PMC · DOI: 10.3390/jcm15010117 · Journal of Clinical Medicine · 2025-12-24

## TL;DR

This cross-sectional study compares case-management recommendations from ChatGPT-4, Gemini, and human maternal–fetal medicine specialists across 20 hypothetical cases, finding that the AI models perform reasonably in routine, guideline-driven cases but are less reliable in complex scenarios.

## Contribution

The study introduces a blinded cross-sectional evaluation of AI and human performance in maternal–fetal medicine, in which 22 board-certified MFM evaluators scored responses to 20 standardized hypothetical cases on a 10-point Likert scale.

## Key findings

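- ChatGPT-4 showed moderate alignment with specialists (mean score 6.6 ± 2.95; ρ = 0.408; κ = 0.232, p < 0.001), performing well in routine, guideline-driven scenarios but less reliably in complex cases.
- Gemini had a slightly higher mean score (7.0 ± 2.64) but showed effectively no consistent inter-rater agreement (κ = −0.024, p = 0.352).
- ChatGPT-4's median accuracy scores did not differ significantly from clinicians' (Wilcoxon p = 0.18), whereas Gemini's accuracy was significantly lower (p < 0.01); performance varied mainly with case complexity rather than gestational age or clinical domain.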
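
The Gemini finding above, a higher average score alongside no consistent agreement, can look contradictory at first. The following is a minimal sketch of Cohen's kappa for two raters, using invented scores, that shows how both can hold at once. This summary does not describe how the authors computed κ across their evaluators, so the function `cohen_kappa` and the two score lists are illustrative assumptions, not the paper's method or data.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: share of items given the identical label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability of a match if each rater assigned labels
    # independently according to their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical label; kappa is undefined
    return (observed - expected) / (1 - expected)


# Hypothetical 10-point scores from two raters (invented, not study data).
scores_a = [8, 7, 8, 7, 8, 7, 8, 7]
scores_b = [8, 8, 7, 7, 8, 8, 7, 7]

print(mean(scores_a), mean(scores_b))   # 7.5 7.5
print(cohen_kappa(scores_a, scores_b))  # 0.0
```

Both invented raters average 7.5, yet they agree on exactly half the items, which is what chance alone predicts here, so κ is 0. A mean score summarizes how high the ratings are; κ summarizes whether raters give the same rating to the same response.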

## Abstract

**Background/Objectives:** Large Language Models (LLMs), including ChatGPT-4 and Gemini, are increasingly incorporated into clinical care; however, their reliability within maternal–fetal medicine (MFM), a high-risk field in which diagnostic and management errors may affect both the pregnant patient and the fetus, remains uncertain. This study evaluated the alignment of AI-generated case management recommendations with those of MFM specialists, emphasizing accuracy, agreement, and clinical relevance.

**Study Design and Setting:** Cross-sectional study with blinded online evaluation (November–December 2024); evaluators were blinded to responder identity (AI vs. human), and case order and response labels were randomized for each evaluator using a computer-generated sequence to reduce order and identification bias.

**Methods:** Twenty hypothetical MFM cases were constructed, allowing standardized presentation of complex scenarios without patient-identifiable data and enabling consistent comparison of AI-generated and human specialist recommendations. Responses were generated by ChatGPT-4, Gemini, and three MFM specialists, then assessed by 22 blinded board-certified MFM evaluators using a 10-point Likert scale. Agreement was measured with Spearman's rho (ρ) and Cohen's kappa (κ); accuracy differences were assessed with Wilcoxon signed-rank tests.

**Results:** ChatGPT-4 exhibited moderate alignment (mean 6.6 ± 2.95; ρ = 0.408; κ = 0.232, p < 0.001), performing well in routine, guideline-driven scenarios (e.g., term oligohydramnios, well-controlled gestational hypertension, GDMA1). Gemini scored 7.0 ± 2.64 but demonstrated effectively no consistent inter-rater agreement (κ = −0.024, p = 0.352), indicating that although mean scores were slightly higher, evaluators varied widely in how they judged individual Gemini responses. No significant difference was found between ChatGPT-4 and clinicians in median accuracy scores (Wilcoxon p = 0.18), while Gemini showed significantly lower accuracy (p < 0.01). Model performance varied primarily by case complexity: agreement was higher in straightforward, guideline-based scenarios and more variable in complex cases, whereas no consistent pattern was observed by gestational age or specific clinical domain across the 20 cases.

**Conclusions:** AI shows promise in routine MFM decision-making but remains constrained in complex cases, where models sometimes under-prioritize maternal–fetal risk trade-offs or incompletely address alternative management pathways, warranting cautious integration into clinical practice. Generalizability is limited by the small number of simulated cases and the use of hypothetical vignettes rather than real-world clinical encounters.
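
The abstract reports rank-based agreement as Spearman's ρ. As a rough illustration of what that statistic measures, the sketch below computes ρ as the Pearson correlation of the ranks, with tied scores sharing their average rank. The helper names `average_ranks` and `spearman_rho` and the per-case scores are hypothetical; the summary does not say which software or pairing of scores the authors used.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant series has no rank ordering to correlate
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-case scores (invented, not study data).
ai_scores = [6, 8, 5, 9, 7]
specialist_scores = [7, 9, 4, 8, 6]
print(round(spearman_rho(ai_scores, specialist_scores), 3))  # 0.8 for these invented scores
```

Because ρ depends only on rank order, it captures whether two sets of scores rise and fall together across cases, not whether the absolute scores match. That is one reason a study might report it alongside κ.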

## Full-text entities

- **Diseases:** oligohydramnios (MESH:D016104), gestational hypertension (MESH:D046110)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12787038/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12787038/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/PMC12787038/full.md

---
Source: https://tomesphere.com/paper/PMC12787038