# Comparative evaluation of AI language models in educating patients on women’s sexual health

**Authors:** Yash H. Kadakia, Muhammed A. M. Hammad, Elia Abou Chawareb, Faysal A. Yafi, Olivia H. Chang, Jessica M. Yih

PMC · DOI: 10.1177/17562872251407371 · Therapeutic Advances in Urology · 2026-01-02

## TL;DR

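This study compares how accurately and relevantly three AI language models (ChatGPT, Microsoft Copilot, and DeepSeek) answer common women's sexual health questions against Prosayla, a trusted patient education website. The models matched the website on accuracy, and one of the two expert raters scored them as more relevant.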

## Contribution

The study evaluates AI language models for patient education in women's sexual health, a historically under-researched and stigmatized field in which the role of AI in patient-facing education has yet to be thoroughly explored.

## Key findings

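- The AI models showed accuracy comparable to Prosayla, a trusted patient education source, with no significant overall difference for either rater (p = 0.558 and p = 0.052).
- One of the two expert raters (Physician B) rated all three AI models as significantly more relevant than Prosayla (p < 0.001); the other (Physician A) found no difference (p = 0.771).
- Expert oversight remains necessary to ensure the quality and appropriateness of AI-generated content on sensitive health topics.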

## Abstract

Artificial intelligence (AI) is increasingly used in patient education, especially with the rise in popularity of large language models (LLMs) such as ChatGPT, Microsoft Copilot, and DeepSeek, offering quick, accessible answers to health-related queries. Yet, in female sexual health, a field historically under-researched and stigmatized, AI’s role in patient-facing education has yet to be thoroughly explored.

This study aimed to evaluate the accuracy and relevance of responses from ChatGPT, Copilot, and DeepSeek to common female sexual health questions, comparing them to the Prosayla website and to each other.

Twelve questions were developed based on content from the Prosayla website, covering topics ranging from menopause to sexual dysfunction. Responses were collected from the three LLMs and Prosayla. Two female sexual medicine experts independently rated each response for accuracy and relevance on a six-point Likert scale (0–5), using a double-blind design to minimize bias. One-way ANOVA and Bonferroni post hoc analyses were used to assess statistical significance (p < 0.05).
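The statistical approach named above can be illustrated with a minimal sketch. It computes the one-way ANOVA F statistic for one rater's scores across four sources and applies a Bonferroni correction to a set of pairwise p-values. The ratings and raw p-values below are hypothetical placeholders, not the study's data, and the function names are invented for illustration; the authors' actual software and post hoc procedure are not described in this summary.

```python
from itertools import combinations
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of source -> list of scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(
        len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values()
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# HYPOTHETICAL ratings on the 0-5 scale: 12 questions per source, one rater.
ratings = {
    "ChatGPT":  [5, 4, 5, 4, 5, 5, 4, 4, 5, 4, 5, 4],
    "Copilot":  [4, 4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 4],
    "DeepSeek": [4, 5, 4, 4, 5, 4, 4, 4, 5, 3, 4, 5],
    "Prosayla": [4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4],
}

f_stat, df_b, df_w = one_way_anova_f(ratings)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")

# Four sources give six pairwise comparisons for the post hoc step.
pairs = list(combinations(ratings, 2))
# HYPOTHETICAL raw pairwise p-values, one per pair, in the same order as `pairs`.
raw_p = [0.30, 0.45, 0.004, 0.80, 0.03, 0.02]
for (a, b), p_adj in zip(pairs, bonferroni(raw_p)):
    print(f"{a} vs {b}: adjusted p = {p_adj:.3f}")
```

Converting the F statistic to a p-value requires the F distribution, which the standard library does not provide; a statistics package would normally handle that step along with the pairwise tests.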

No significant differences in accuracy scores were observed across the four sources for Physician A (p = 0.558) or Physician B (p = 0.052), although ChatGPT was rated significantly more accurate than Prosayla in post hoc analysis by Physician B (p = 0.044). Relevance scores differed by rater: Physician A found no differences across sources (p = 0.771), while Physician B rated all three AI models significantly higher in relevance than Prosayla (p < 0.001).

AI models demonstrated accuracy comparable to Prosayla (a trusted patient education source), and one of the two raters scored the models as more relevant. These findings suggest that AI tools may complement traditional educational materials and support patient learning. However, expert oversight remains essential to ensure content quality and appropriateness. Future efforts should develop structured strategies and implementation frameworks to responsibly integrate AI into patient education, particularly in sensitive areas like women’s sexual health.

## Full-text entities

- **Diseases:** sexual dysfunction (MESH:D012735)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12759124/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12759124/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/PMC12759124/full.md

---
Source: https://tomesphere.com/paper/PMC12759124