# Detection of Medical Misinformation in Hemangioma Patient Education: Comparative Study of ChatGPT-4o and DeepSeek-R1 Large Language Models

**Authors:** Guoyong Wang, Ye Zhang, Weixin Wang, Yingjie Zhu, Wei Lu, Chaonan Wang, Hui Bi, Xiaonan Yang

PMC · DOI: 10.2196/76372 · 2025-11-18

## TL;DR

This study compares ChatGPT-4o and DeepSeek-R1 on detecting medical misinformation about hemangiomas and finds that DeepSeek-R1 classified items more accurately (0.963 vs 0.910) and received higher expert ratings.

## Contribution

The study provides empirical evidence on the performance of ChatGPT-4o and DeepSeek-R1 in identifying medical rumors related to hemangiomas.

## Key findings

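- DeepSeek-R1 outperformed ChatGPT-4o in accuracy (0.963 vs 0.910), precision (0.978 vs 0.940), recall (0.957 vs 0.894), and F1-score (0.967 vs 0.916) when classifying items as "rumors" or "accurate information."
- Expert ratings favored DeepSeek-R1 on both "rumor" items (mean difference=0.431; P<.001) and "accurate information" items (mean difference=0.264; P=.045), with the larger advantage on rumors.
- The two models showed similar semantic stability across repeated runs, with no statistically significant difference (P=.30).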

## Abstract

This study examines the capability of large language models (LLMs) to detect medical rumors, using hemangioma-related information as an example. It compares the performance of ChatGPT-4o and DeepSeek-R1.

This study aimed to evaluate and compare the accuracy, stability, and expert-rated reliability of 2 LLMs, ChatGPT-4o and DeepSeek-R1, in classifying medical information related to hemangiomas as either “rumors” or “accurate information.”

We collected 82 publicly available texts from social media platforms, medical education websites, international guidelines, and journals. Of the 82 items, 47 (57%) were labeled as “rumors,” and 35 (43%) were labeled as “accurate information.” Three vascular anomaly specialists with extensive clinical experience independently annotated the texts in a double-blinded manner, and disagreements were resolved by arbitration to ensure labeling reliability. Subsequently, these texts were input into ChatGPT-4o and DeepSeek-R1, with each model generating 2 rounds of results under identical instructions. Output stability was assessed using bidirectional encoder representations from transformers (BERT)–based semantic similarity scores. Classification accuracy, precision, recall, and F1-score were calculated to evaluate performance. Additionally, 2 medical experts independently rated the model outputs using a 5-point scale based on clinical guidelines. Statistical analyses included paired t tests, Wilcoxon signed-rank tests, and bootstrap resampling to compute confidence intervals.
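The four classification metrics named above can be computed from expert labels and model outputs as in the following minimal sketch. It assumes that "rumor" is treated as the positive class, which the abstract does not state explicitly; the function name and the toy labels are hypothetical and are not the study's data.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    positive: str = "rumor",  # assumption: rumors are the positive class
) -> Dict[str, float]:
    """Return accuracy, precision, recall, and F1 for a binary labeling task."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example (hypothetical labels, not the 82 study items).
expert_labels = ["rumor"] * 4 + ["accurate"] * 4
model_labels = [
    "rumor", "rumor", "rumor", "accurate",           # one rumor missed
    "accurate", "accurate", "accurate", "rumor",     # one false alarm
]
toy_metrics = classification_metrics(expert_labels, model_labels)
print(toy_metrics)
```

With "rumor" as the positive class, recall measures how many true rumors a model catches, while precision measures how often a flagged item really is a rumor.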

In terms of semantic stability, the similarity distributions for the 2 models largely overlapped, with no statistically significant difference observed (mean difference=−0.003, 95% CI −0.011 to 0.005; P=.30). Regarding classification performance, DeepSeek-R1 achieved higher accuracy (0.963) compared to ChatGPT-4o (0.910), and also performed better in terms of precision (0.978 vs 0.940), recall (0.957 vs 0.894), and F1-score (0.967 vs 0.916). Expert evaluations revealed that DeepSeek-R1 significantly outperformed ChatGPT-4o on both “rumor” items (mean difference=0.431; P<.001; Cohen dz=0.594) and “accurate information” items (mean difference=0.264; P=.045; Cohen dz=0.352), with a particularly pronounced advantage in rumor detection.
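As a rough illustration of the statistics reported above, the sketch below checks that each reported F1-score is the harmonic mean of the reported precision and recall, and shows how a paired effect size (Cohen dz) and a percentile bootstrap confidence interval for a mean paired difference could be computed. The rating vectors are hypothetical toy values, and the percentile method and resample count are assumptions; the abstract does not specify the bootstrap variant used.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def paired_diffs(a: Sequence[float], b: Sequence[float]) -> List[float]:
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    return [x - y for x, y in zip(a, b)]


def cohen_dz(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen dz: mean paired difference divided by the SD of the differences."""
    d = paired_diffs(a, b)
    return statistics.fmean(d) / statistics.stdev(d)


def bootstrap_ci_mean_diff(
    a: Sequence[float],
    b: Sequence[float],
    n_boot: int = 2000,   # assumption: resample count is illustrative
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)
    d = paired_diffs(a, b)
    means = sorted(
        statistics.fmean(rng.choices(d, k=len(d))) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Consistency check on the reported precision and recall values.
f1_deepseek = round(f1_from(0.978, 0.957), 3)
f1_chatgpt = round(f1_from(0.940, 0.894), 3)
print(f1_deepseek, f1_chatgpt)

# Hypothetical 5-point expert ratings for six items (not the study's data).
ratings_a = [5, 4, 5, 4, 5, 3]
ratings_b = [4, 4, 4, 3, 5, 3]
dz = cohen_dz(ratings_a, ratings_b)
ci_low, ci_high = bootstrap_ci_mean_diff(ratings_a, ratings_b)
print(dz, (ci_low, ci_high))
```

Because dz divides by the standard deviation of the paired differences, the reported values (0.594 for rumors and 0.352 for accurate information) express each rating gap relative to how much the per-item differences vary.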

DeepSeek-R1 demonstrated greater accuracy and higher expert-rated reliability than ChatGPT-4o in detecting medical rumors. This study provides empirical support for the application of LLMs in this setting and recommends optimizing accuracy and incorporating real-time verification mechanisms to mitigate the harmful impact of misleading information on patient health.

## Linked entities

- **Diseases:** hemangioma (MONDO:0006500)

## Full-text entities

- **Diseases:** vascular anomaly (MESH:D020785), Hemangioma (MESH:D006391)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12627899/full.md

---
Source: https://tomesphere.com/paper/PMC12627899