# Implementing GPT‐4 Learning Models in Dermatology: An Assessment of Medical Quality and Utility

**Authors:** Aryan Naik, Peter Vien, Thanh‐Nga Tran

PMC · DOI: 10.1111/srt.70331 · Skin Research and Technology · 2026-02-18

## TL;DR

This study compares GPT-4-generated dermatology summaries and treatment recommendations against UpToDate. It finds that the GPT-4 content contains few harmful recommendations but scores "poor" on the DISCERN instrument, below UpToDate's "fair".

## Contribution

The paper provides a novel comparative analysis of GPT-4 dermatology outputs against UpToDate using DISCERN scores and statistical validation.

## Key findings

- GPT-4 models generated dermatological content with 'poor' medical quality compared to UpToDate's 'fair' quality.
- ChatGPT-4 showed significantly higher mean concordance with UpToDate treatment recommendations than Copilot (64.89% vs. 31.38%, p < 0.001).
- LLM-generated content had few harmful recommendations but requires validation for reliable use in dermatology.

## Abstract

Artificial intelligence, including large language models (LLMs) such as GPT‐4, can generate responses to clinical queries using predictive algorithms trained on large online datasets. Current literature lacks a comprehensive assessment of the medical quality and accuracy of dermatologic GPT‐4‐generated outputs.

A standardized query was used to ask GPT‐4 models (Copilot and ChatGPT‐4) to generate summaries and treatment recommendations for 33 dermatologic conditions, which were then compared to the corresponding sections of UpToDate (UTD). DISCERN scores were calculated for each source by two authors (AN and PV). Concordance between GPT‐4‐generated treatments and UTD was evaluated by a certified dermatologist. Word counts and Flesch‐Kincaid reading scores were generated in R, where paired t‐tests and one‐way and weighted ANOVA were also conducted.
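The abstract does not say whether the "Flesch Kincaid reading score" is the Flesch Reading Ease score or the Flesch-Kincaid Grade Level, and the authors computed it in R. As a rough illustration only, the Python sketch below computes a word count and both standard formulas for a piece of text. The syllable counter is a crude heuristic, and the names `count_syllables` and `readability` are hypothetical and do not come from the paper.

```python
import re


def count_syllables(word: str) -> int:
    """Heuristic syllable count: vowel groups, minus a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    # Treat a final 'e' as silent unless the word ends in 'le' (e.g. "table").
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text: str) -> dict:
    """Word count plus Flesch Reading Ease and Flesch-Kincaid Grade Level."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        "words": len(words),
        "sentences": len(sentences),
        "syllables": syllables,
        # Standard Flesch Reading Ease formula (higher = easier to read).
        "reading_ease": 206.835 - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Standard Flesch-Kincaid Grade Level formula (US school grade).
        "grade_level": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word - 15.59,
    }


if __name__ == "__main__":
    sample = "Imiquimod is a topical treatment. It is applied to the skin."
    print(readability(sample))
```

Dedicated readability packages use dictionary-backed or more careful syllable rules, so scores from this sketch will differ somewhat from those produced by the tooling used in the study.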

The DISCERN instrument classified UTD content as being of “fair” medical quality (mean [SD], 3.08 [0.34]), while both ChatGPT‐4 and Copilot produced content of “poor” medical quality (mean [SD], 2.28 [0.22] and 2.31 [0.35], respectively). ChatGPT‐4's treatment recommendations demonstrated 33.5 percentage points greater average concordance with UTD treatment recommendations (mean [SD], 64.89% [29.29%]) than Copilot's (mean [SD], 31.38% [31.08%]; 95% CI for the difference, 22.3%–44.7%; p < 0.001).
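The reported concordance figures can be checked for internal consistency with a few lines of arithmetic. The Python sketch below recomputes the difference in means and the midpoint and half-width of the reported confidence interval. It then backs out an implied standard error under two assumptions that the abstract does not confirm: that the interval comes from a paired t-test, and that all 33 conditions contributed a pair (32 degrees of freedom, critical value about 2.0369). The function name `check_reported_difference` is hypothetical.

```python
import math

# Values reported in the abstract (mean concordance with UTD, in percent).
CHATGPT_MEAN = 64.89
COPILOT_MEAN = 31.38
CI_LOW, CI_HIGH = 22.3, 44.7

# ASSUMPTIONS, not stated in the abstract: paired t-test over n = 33
# conditions, so df = 32 and the two-sided 95% critical value is ~2.0369.
N_PAIRS = 33
T_CRIT_DF32 = 2.0369


def check_reported_difference() -> dict:
    """Recompute the mean difference and the quantities implied by the CI."""
    diff = CHATGPT_MEAN - COPILOT_MEAN       # difference in percentage points
    midpoint = (CI_LOW + CI_HIGH) / 2        # should sit close to diff
    half_width = (CI_HIGH - CI_LOW) / 2      # margin of error
    implied_se = half_width / T_CRIT_DF32    # valid only under the assumptions
    implied_sd = implied_se * math.sqrt(N_PAIRS)  # SD of paired differences
    return {
        "diff": diff,
        "ci_midpoint": midpoint,
        "ci_half_width": half_width,
        "implied_se": implied_se,
        "implied_sd_of_differences": implied_sd,
    }


if __name__ == "__main__":
    for name, value in check_reported_difference().items():
        print(f"{name}: {value:.2f}")
```

The difference in means (about 33.5 percentage points) matches the midpoint of the interval, so the interval is symmetric around the reported effect. The implied standard error (about 5.5) and standard deviation of paired differences (about 31.6) are only as reliable as the two assumptions above.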

Overall, GPT‐4 models produced dermatological content with few harmful recommendations. However, GPT‐4‐generated content performed poorly on the DISCERN instrument, and validation of LLM‐generated responses remains challenging. Results suggest LLM parameters and query structures may be optimizable for dermatologic applications. If implemented alongside the professional judgement of certified dermatologists, future LLMs may serve as time‐saving dermatologic tools, enhancing patient care.

## Full-text entities

- **Diseases:** BCC (MESH:D002280), dermatologic (MESH:D000168), hallucinate (MESH:D006212), LLMs (MESH:D007806), AI (MESH:C538142), cutaneous cancers (MESH:D009369)
- **Chemicals:** imiquimod (MESH:D000077271), FK (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12914478/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC12914478/full.md

---
Source: https://tomesphere.com/paper/PMC12914478