# Comparing Artificial Intelligence and Obstetrics Residents in Answering Standardized Patient Questions Regarding Gestational Diabetes

**Authors:** Azam Faraji, Hossein Faramarzi, Mahsa Razeghi, Nasrin Asadi, Homeira Vafaei, Maryam Kasraeian

PMC · DOI: 10.7759/cureus.94662 · 2025-10-15

## TL;DR

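This study compared three AI chatbots (GPT-3.5, GPT-4o, and DeepSeek V3 0324) with eight residents in answering standardized patient questions about gestational diabetes. All three chatbots scored significantly higher on accuracy, and GPT-4o and DeepSeek V3 0324 also scored significantly higher on completeness.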

## Contribution

Provides preliminary evidence that AI chatbots can outperform residents in answering gestational diabetes questions, suggesting a potential complementary role in medical education and clinical support.

## Key findings

- All three AI models had significantly higher accuracy than residents in answering GDM-related questions (p < 0.001).
- GPT-4o and DeepSeek V3 0324 showed significantly higher completeness scores than residents (p < 0.001); the difference for GPT-3.5 was not statistically significant (p = 0.058).
- DeepSeek V3 0324 achieved the highest mean scores for both accuracy (4.81) and completeness (4.75).

## Abstract

Introduction

This study evaluated the performance of three artificial intelligence (AI) chatbots (GPT-3.5 (OpenAI, San Francisco, USA), GPT-4o (OpenAI, San Francisco, USA), and DeepSeek V3 0324 (DeepSeek AI, Beijing, China)) against eight gynecology residents in answering standardized patient questions on gestational diabetes mellitus (GDM). The aim was to compare the accuracy and completeness of their responses.

Methods

Twenty-four questions were answered by three chatbots (GPT-3.5, GPT-4o, and DeepSeek V3 0324) and eight residents. Two faculty members independently rated the responses for accuracy and completeness using a 5-point scale. Independent-samples t-tests were used for statistical analysis.
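As a rough illustration of the statistical step, the sketch below computes the test statistic for an independent-samples t-test using only the Python standard library. It assumes the pooled-variance (Student) form, which the abstract does not specify; the function name `pooled_t_statistic` and the two score lists are hypothetical and invented purely for illustration. They are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return (t, df) for a Student independent-samples t-test.

    Assumption: equal variances in both groups (pooled variance).
    The abstract does not say whether this or Welch's variant was used.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical 5-point ratings, NOT taken from the study.
chatbot_scores = [5, 5, 4, 5, 4, 5]
resident_scores = [4, 3, 4, 3, 4, 3]

t_stat, dof = pooled_t_statistic(chatbot_scores, resident_scores)
print(f"t = {t_stat:.2f}, df = {dof}")
```

Turning `t_stat` and `dof` into a p-value requires the t distribution, which the standard library does not provide; a statistics package would normally handle that step.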

Results

The mean accuracy scores were 3.64 for residents, 4.67 for GPT-3.5, 4.69 for GPT-4o, and 4.81 for DeepSeek V3 0324. The mean completeness scores were 2.05 for residents, 2.83 for GPT-3.5, 4.00 for GPT-4o, and 4.75 for DeepSeek V3 0324. T-tests showed that all AI models had significantly higher accuracy than residents (p < 0.001). Completeness scores were significantly higher than those of residents for GPT-4o and DeepSeek V3 0324 (p < 0.001), while the difference between GPT-3.5 and residents for completeness was not statistically significant (p = 0.058).
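To make the size of the reported differences easier to see, the following sketch subtracts the residents' mean from each model's mean for both metrics. The mean values are copied from the Results paragraph; the variable and function names (`reported_means`, `gaps_vs_residents`) are illustrative choices, not part of the study.

```python
# Mean scores as reported in the Results paragraph (5-point scale).
reported_means = {
    "accuracy": {
        "Residents": 3.64, "GPT-3.5": 4.67, "GPT-4o": 4.69, "DeepSeek V3 0324": 4.81,
    },
    "completeness": {
        "Residents": 2.05, "GPT-3.5": 2.83, "GPT-4o": 4.00, "DeepSeek V3 0324": 4.75,
    },
}


def gaps_vs_residents(means):
    """Return each model's mean minus the residents' mean, per metric."""
    gaps = {}
    for metric, scores in means.items():
        baseline = scores["Residents"]
        gaps[metric] = {
            name: round(score - baseline, 2)
            for name, score in scores.items()
            if name != "Residents"
        }
    return gaps


gaps = gaps_vs_residents(reported_means)
for metric, by_model in gaps.items():
    for name, gap in by_model.items():
        print(f"{metric:<12} {name:<17} +{gap:.2f}")
```

A gap in means does not by itself establish significance: the GPT-3.5 completeness gap is positive, yet the reported p-value for that comparison is 0.058.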

Conclusion

AI models, particularly DeepSeek V3 0324 and GPT-4o, outperformed gynecology residents in both accuracy and completeness when answering GDM-related questions. These preliminary findings suggest that AI tools may complement medical education and clinical support, but further research is required before broader implementation.

## Linked entities

- **Diseases:** gestational diabetes mellitus (MONDO:0005406)

## Full-text entities

- **Diseases:** GDM (MESH:D016640)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

Two figures with captions are available in the complete paper: https://tomesphere.com/paper/PMC12618048/full.md

---
Source: https://tomesphere.com/paper/PMC12618048