# Evaluating GPT-4 Responses on Scars or Keloids for Patient Education: Large Language Model Evaluation Study

**Authors:** Mingjun Rao, Tang Xiujun, Wang Haoyu

PMC · DOI: 10.2196/78838 · JMIR Medical Informatics · 2026-02-27

## TL;DR

This study evaluates GPT-4's ability to provide accurate and understandable patient education on scars and keloids, finding it reliable but needing improvements in readability and reference accuracy.

## Contribution

The study introduces a systematic evaluation of GPT-4 for patient education on scars and keloids using multiple assessment tools and expert ratings.

## Key findings

- GPT-4 showed high accuracy and reliability in answering questions about scars and keloids.
- Readability was moderate, at roughly a 12th-grade level, so responses need simplification for broader accessibility.
- 11.8% of generated references were hallucinated, indicating a need for improved reference validation.

## Abstract

Scars and keloids impose significant physical and psychological burdens on patients, often leading to functional limitations, cosmetic concerns, and mental health issues such as anxiety or depression. Patients increasingly turn to online platforms for information; however, existing web-based resources on scars and keloids are frequently unreliable, fragmented, or difficult to understand. Large language models such as GPT-4 show promise for delivering medical information, but their accuracy, readability, and potential to generate hallucinated content require validation for patient education applications.

This study aimed to systematically evaluate GPT-4’s performance in providing patient education on scars and keloids, focusing on its accuracy, reliability, readability, and reference quality.

We collected 354 questions from three Reddit communities (r/Keloids, r/SCAR, and r/PlasticSurgery), covering topics including treatment options, pre- and postoperative care, and psychological impacts. Each question was submitted to GPT-4 in an independent session to mimic real-world patient interactions. Responses were evaluated using multiple tools: the Patient Education Materials Assessment Tool-Artificial Intelligence for understandability and actionability, DISCERN-AI for treatment information quality, the Global Quality Scale for overall information quality, and standard readability metrics (Flesch Reading Ease score and Gunning Fog Index). Three plastic surgeons used the Natural Language Assessment Tool for Artificial Intelligence to rate accuracy, safety, and clinical appropriateness, while the Reference Evaluation for Artificial Intelligence tool was used to check references for hallucination, relevance, and source quality. We repeated the same analysis on GPT-4 responses to questions drawn from 3 medical websites.
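The two readability metrics named above have standard published formulas, and a minimal Python sketch can make them concrete. The study does not say which software computed its scores, so everything below is an illustrative assumption: the function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a rough vowel-group heuristic rather than a dictionary-based one, so its values will differ somewhat from those of dedicated tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the study's method):
    count vowel groups and drop a likely silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Gunning Fog Index for `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    # Gunning Fog treats words of three or more syllables as "complex".
    complex_share = sum(1 for s in syllables if s >= 3) / len(words)

    return {
        # Higher score = easier to read.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate years of schooling needed to follow the text.
        "gunning_fog": 0.4 * (words_per_sentence + 100 * complex_share),
    }


if __name__ == "__main__":
    easy = "The cat sat. The dog ran."
    hard = ("Postoperative hypertrophic scarring necessitates "
            "individualized multidisciplinary intervention.")
    print(readability(easy))
    print(readability(hard))
```

Short sentences with short words push the Flesch score up and the Fog index down, which is the direction a simplification toward a lower reading grade would need to move.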

GPT-4 demonstrated high accuracy and reliability. The Patient Education Materials Assessment Tool-Artificial Intelligence showed 75.5% understandability, DISCERN-AI rated responses as “good” (26.3/35), and the Global Quality Scale score was 4.28 of 5. Surgeons’ evaluations averaged 3.94 to 4.43 out of 5 across dimensions (accuracy 3.9, SD 0.7; safety 4.3, SD 0.8; clinical appropriateness 4.4, SD 0.5; actionability 4.1, SD 0.8; and effectiveness 4.1, SD 0.8). Readability analyses indicated moderate complexity (Flesch Reading Ease score: 50.13; Gunning Fog Index: 12.68), corresponding to a 12th-grade reading level. Reference Evaluation for Artificial Intelligence identified 11.8% (383/3250) of references as hallucinated, while 88.2% (2867/3250) were real, with 95.1% (2724/2867) of the real references from authoritative sources (eg, government guidelines and the literature). Results for the questions from medical websites were consistent with those for the Reddit questions.
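The hallucinated and real reference shares follow directly from the reported counts, which a short check can reproduce. This is only an arithmetic sketch: the counts are taken from the paragraph above, while the helper name `pct` and the variable names are hypothetical.

```python
# Counts as reported in the abstract.
total_refs = 3250
hallucinated_refs = 383
real_refs = 2867


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The two categories should partition the full set of references.
assert hallucinated_refs + real_refs == total_refs

hallucinated_pct = pct(hallucinated_refs, total_refs)
real_pct = pct(real_refs, total_refs)

print(f"hallucinated: {hallucinated_pct}%")  # 11.8%
print(f"real: {real_pct}%")                  # 88.2%
```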

GPT-4 has considerable potential as a patient education tool for scars and keloids, offering reliable and accurate information. However, improving readability (to align with sixth- to eighth-grade standards) and reducing reference hallucinations are essential to enhance accessibility and trustworthiness. Future large language model optimizations should prioritize simplifying medical language and strengthening reference validation mechanisms to maximize clinical utility.

## Full-text entities

- **Diseases:** sleep apnea (MESH:D012891), PEMAT (MESH:D005547), depression (MESH:D003866), dyspareunia (MESH:D004414), Keloids (MESH:D007627), dysmenorrhea (MESH:D004412), LLM hallucination (MESH:D006212), Scars (MESH:D002921), LLMs (MESH:D007806), AI (MESH:C538142), anxiety (MESH:D001007), GQS (MESH:C538175), prostate cancer (MESH:D011471)
- **Chemicals:** GPT-4 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12954683/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/PMC12954683/full.md

---
Source: https://tomesphere.com/paper/PMC12954683