# Evaluating the quality of ChatGPT-generated medical information on major ophthalmic conditions: A comparative assessment against the EQIP tool and guidelines

**Authors:** Mingfang Hu, Pingping Zou, Teng Li, Yuying Wang

PMC · DOI: 10.1371/journal.pone.0334250 · PLOS One · 2025-10-16

## TL;DR

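This study evaluates the quality and guideline alignment of the medical information ChatGPT-4o generates for the five ophthalmic conditions that pose the greatest global health challenges, using an adapted EQIP instrument and established clinical guidelines as benchmarks.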

## Contribution

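The study applies an adapted version of the Ensuring Quality of Information for Patients (EQIP) instrument to assess the quality of ChatGPT's medical content in ophthalmology, and compares the responses against guideline recommendations.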

## Key findings

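- ChatGPT's responses achieved a median EQIP score of 18 out of 20 (IQR 18-19) across the five conditions.
- The two evaluators agreed strongly on EQIP scoring (Cohen’s kappa 0.926, 95% CI 0.875-0.977).
- ChatGPT's responses aligned with guideline recommendations in 84% of cases (21/25; Cohen’s kappa 0.658, 95% CI 0.317-0.999).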

## Abstract

Background: The use of artificial intelligence for creating medical information is on the rise. Nonetheless, the accuracy and reliability of such information require thorough assessment. As a language model capable of generating text, ChatGPT needs a detailed examination of its effectiveness in the healthcare domain.

Objective: This research sought to evaluate the accuracy of medical information produced by ChatGPT-4o (https://chat.openai.com/chat, accessed Mar. 12, 2025), focusing on the five ophthalmic conditions that pose the greatest global health challenges. The investigation also compared the AI’s answers to recognized clinical guidelines.

Methods: This research employed an adapted version of the Ensuring Quality of Information for Patients (EQIP) instrument to evaluate the quality of ChatGPT’s replies. The guideline recommendations for the five conditions were rephrased as queries and submitted to ChatGPT. Two investigators independently assessed the resulting answers for accuracy and consistency, benchmarking them against established ophthalmology clinical guidelines. Inter-rater consistency was evaluated using Cohen’s kappa.
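
For readers unfamiliar with the statistic, Cohen’s kappa corrects the raw agreement rate between two raters for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement computed from each rater's label frequencies. The following is a minimal sketch of that calculation, not the authors' analysis code; the function name `cohens_kappa` and the example ratings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same non-empty item set.")

    n = len(rater_a)

    # Observed agreement: share of items on which both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the two raters' marginal
    # proportions, then sum over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)

    if p_expected == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # undefined in this case, so it is reported here as perfect agreement.
        return 1.0

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical item-level scores (1 = criterion met, 0 = not met).
# These values are illustrative only and do not come from the study.
example_rater_a = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
example_rater_b = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
example_kappa = cohens_kappa(example_rater_a, example_rater_b)
```

In the hypothetical example the raters agree on 9 of 10 items, yet kappa is noticeably lower than 0.9 because a large share of that agreement would be expected by chance when both raters assign mostly the same label.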

Results: The median EQIP score across the five conditions was 18 (IQR 18-19). The two evaluators showed strong agreement when applying the modified EQIP instrument to ChatGPT’s responses, with a Cohen’s kappa of 0.926 (95% CI 0.875-0.977, P<0.001). ChatGPT’s responses aligned with the guideline recommendations in 84% of cases (21/25), with a Cohen’s kappa of 0.658 (95% CI 0.317-0.999, P<0.001).
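
The reported figures can be sanity-checked with simple arithmetic. The sketch below recomputes the alignment rate from 21/25 and checks that each reported kappa sits at the midpoint of its confidence interval. The helper names `ci_midpoint` and `implied_standard_error` are hypothetical, and the standard-error calculation assumes a symmetric normal-approximation interval (z = 1.96), which this summary does not confirm as the authors' method.

```python
# Figures reported in the Results paragraph.
ALIGNED_RESPONSES = 21
TOTAL_RESPONSES = 25

alignment_rate = ALIGNED_RESPONSES / TOTAL_RESPONSES  # reported as 84%


def ci_midpoint(lower, upper):
    """Midpoint of a confidence interval."""
    return (lower + upper) / 2.0


def implied_standard_error(lower, upper, z=1.96):
    """Standard error implied by a symmetric interval of the form
    estimate +/- z * SE. The normal approximation is an assumption here."""
    return (upper - lower) / (2.0 * z)


# Inter-rater agreement on EQIP scoring: kappa 0.926 (95% CI 0.875-0.977).
eqip_midpoint = ci_midpoint(0.875, 0.977)
eqip_se = implied_standard_error(0.875, 0.977)

# Guideline alignment: kappa 0.658 (95% CI 0.317-0.999).
guideline_midpoint = ci_midpoint(0.317, 0.999)
guideline_se = implied_standard_error(0.317, 0.999)
```

Both midpoints match the reported kappa values, which is consistent with symmetric intervals. The guideline-alignment interval is much wider than the EQIP one, as would be expected for an estimate based on only 25 paired judgments.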

Conclusions: ChatGPT demonstrates robust quality and guideline compliance in producing medical content. Nevertheless, the accuracy of its quantitative data and the comprehensiveness of its coverage need improvement. These findings offer insights for advancing AI-based medical information generation.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12530511/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12530511/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12530511/full.md

---
Source: https://tomesphere.com/paper/PMC12530511