# Artificial intelligence in obstetrics and gynecology: Evaluating ChatGPT and Google Gemini in answering patient questions

**Authors:** Madeline West, Amir Alsaidi, Rohail Siddiqi, Fatima Sayyed, Rachael Counts, Lauren Quinto, Nicholas Stansbury

PMC · DOI: 10.1002/ijgo.70622 · International Journal of Gynaecology and Obstetrics · 2025-10-28

## TL;DR

This study compares how well ChatGPT and Google Gemini answer common patient questions in obstetrics and gynecology. Physician raters found both models largely accurate and complete, with ChatGPT performing better overall.

## Contribution

The study evaluates two LLMs for their accuracy and completeness in answering patient questions in obstetrics and gynecology using physician assessments.

## Key findings

- Both ChatGPT and Google Gemini provided largely accurate and complete responses to patient questions.
- ChatGPT demonstrated stronger outcomes overall compared to Google Gemini.
- Patients should confirm online information with physicians due to the limitations of LLMs.

## Abstract

This study evaluated the accuracy and completeness of responses across common obstetrical and gynecologic topics generated by the large language models (LLMs) ChatGPT and Google Gemini, which have become increasingly popular among patients seeking medical information before physician consultations.

Ten topics were identified, five obstetrical (prenatal labs, extended carrier screen, treatments for nausea and vomiting in pregnancy, gestational diabetes, and trial of labor after cesarean section) and five gynecologic (polycystic ovary syndrome, pelvic inflammatory disease, cervical smears, mammograms, and birth control). For each condition, ChatGPT generated five of the most frequently asked patient questions, which were then presented separately to ChatGPT and Google Gemini. Board‐certified Obstetrics and Gynecology physicians evaluated the responses using Likert scales for accuracy (1–6) and completeness (1–3).

Acceptable response criteria were defined as an accuracy score of 5 or greater (“nearly all correct”) and a completeness score of 2 or greater (“adequately complete”). Most responses from both models met these thresholds. Wilcoxon signed‐rank tests demonstrated statistically significant differences in accuracy and completeness between models (P < 0.05). Inter‐rater agreement was measured using intraclass correlation coefficients (ICCs). For obstetrical topics, the ICCs for ChatGPT responses were −0.047 (completeness) and 0.112 (accuracy), whereas those for Google Gemini responses were 0.367 and 0.205, respectively. For gynecologic topics, the ICCs were 0.328 and 0.20 for ChatGPT, compared with 0.151 and −0.08 for Google Gemini.
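The acceptability thresholds and the paired comparison described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The `Rating` class, the function names, and all example scores are hypothetical and chosen only for illustration. The signed‐rank function computes only the rank sums (zero differences dropped, tied magnitudes given average ranks) and does not derive a P value.

```python
from dataclasses import dataclass

# Thresholds as stated in the abstract.
ACCURACY_MIN = 5      # "nearly all correct" on the 1-6 accuracy scale
COMPLETENESS_MIN = 2  # "adequately complete" on the 1-3 completeness scale


@dataclass(frozen=True)
class Rating:
    """One physician rating of one response (hypothetical container)."""
    accuracy: int      # Likert 1-6
    completeness: int  # Likert 1-3


def is_acceptable(rating: Rating) -> bool:
    """A response is acceptable only if it meets both thresholds."""
    return (rating.accuracy >= ACCURACY_MIN
            and rating.completeness >= COMPLETENESS_MIN)


def acceptable_fraction(ratings: list[Rating]) -> float:
    """Share of ratings that meet the acceptable-response criteria."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(is_acceptable(r) for r in ratings) / len(ratings)


def signed_rank_sums(a: list[float], b: list[float]) -> tuple[float, float]:
    """Wilcoxon signed-rank sums (W+, W-) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    # Drop pairs with zero difference, then order by absolute difference.
    diffs = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        # Tied magnitudes share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical accuracy scores for the same five questions (not study data).
model_a_accuracy = [6, 5, 6, 4, 5]
model_b_accuracy = [5, 5, 4, 5, 3]
w_plus, w_minus = signed_rank_sums(model_a_accuracy, model_b_accuracy)
```

In practice the smaller of the two rank sums would be compared against the signed‐rank distribution (for example with a statistics library) to obtain a P value like the one reported in the abstract.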

Both LLMs provided largely accurate and complete responses to patient questions. ChatGPT demonstrated stronger outcomes overall, suggesting potential utility in patient education; however, patients should confirm online information with physicians given the limitations of LLMs.

## Linked entities

- **Diseases:** pelvic inflammatory disease (MONDO:0000922), polycystic ovary syndrome (MONDO:0008487)

## Full-text entities

- **Diseases:** polycystic ovary syndrome (MESH:D011085), gestational diabetes (MESH:D016640), nausea and vomiting (MESH:D020250), pelvic inflammatory disease (MESH:D000292)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12988381/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12988381/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12988381/full.md

---
Source: https://tomesphere.com/paper/PMC12988381