# AI misuse of retracted literature: A comparative study of ChatGPT4o, DeepSeek, and Grok 3 in stem cell research

**Authors:** Lan Yao, Tianshu Gu, Xuexin Li, Yan Jiao, Minghui Li, J. Carolyn Graff, Yulan Li, Weikuan Gu

PMC · DOI: 10.1007/s00114-025-02036-5 · Die Naturwissenschaften · 2025-11-03

## TL;DR

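This study compares how three AI models (ChatGPT4o, DeepSeek, and Grok 3) handle information from 93 retracted stem cell articles. ChatGPT4o and Grok 3 retrieved most of the articles and recognized the retraction status of roughly two-thirds of those they retrieved, while DeepSeek found only one retracted article and fabricated articles in 88% of its answers.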

## Contribution

The study introduces a novel comparative analysis of AI models' handling of retracted scientific literature in a specific scientific domain.

## Key findings

- ChatGPT4o retrieved 80% of retracted articles and recognized 62% of them as retracted.
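- Grok 3 retrieved 74% of retracted articles and recognized 67% of them as retracted; when it did not identify the retracted article, it fabricated articles in 15 of 24 cases (63%).
- DeepSeek performed the worst: it found only one retracted article, did not recognize its retraction status, and fabricated articles in 82 of 93 answers (88%).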

## Abstract

DeepSeek and Grok 3 have emerged as strong competitors to established AI models, particularly the widely adopted ChatGPT. Accurate handling of data from retracted scientific articles has proven to be a significant challenge for AI as an assistant in scientific research. It is critical to understand whether and how these three AI models handle information from retracted articles when they answer scientific questions. We collected retracted articles, used AI models to generate questions, and analyzed the answers. The answers were compared and evaluated among the three AI models. Here we show that these three models utilized 84 out of 93 retracted articles in their answers about stem cells. ChatGPT4o retrieved 74 out of 93 (80%) articles and recognized the retraction status for 46 (62%) of them. DeepSeek found only one retracted article and did not recognize its retraction status. Grok 3 retrieved 69 (74%) articles and recognized the retraction status of 46 (67%) of them. In cases where the retracted articles were not identified, ChatGPT4o fabricated articles for its answers 5 times out of 19 (26%), and Grok 3 did so 15 times out of 24 (63%). DeepSeek fabricated articles in various forms in 82 of 93 answers (88%). The answering styles of ChatGPT4o, DeepSeek, and Grok 3 are characterized, respectively, by accurate and straightforward answers, a tangential structure with guesswork, and comprehensive and detailed answers. Analysis with non-retracted articles revealed similar patterns for these models. This study suggests that, while no model is perfect, DeepSeek performed the worst when facing in-depth, real-world scientific challenges. Much improvement is needed before any of these AI models becomes problem-free and valuable for scientists.

The online version contains supplementary material available at 10.1007/s00114-025-02036-5.
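
The percentages in the abstract use different denominators: retrieval is measured against all 93 retracted articles, recognition against the articles each model retrieved, and fabrication against the articles a model did not retrieve (for DeepSeek, against all 93 answers). The following is a minimal sketch that recomputes the reported figures from the counts quoted in the abstract. The names `pct` and `summarize` are illustrative and not from the paper, and the half-up rounding is an assumption chosen because it reproduces the reported 63% for 15 of 24.

```python
import math

TOTAL = 93  # retracted articles in the study, per the abstract

# Counts quoted in the abstract for the two models that retrieved many articles.
counts = {
    "ChatGPT4o": {"retrieved": 74, "recognized": 46, "fabricated": 5},
    "Grok 3": {"retrieved": 69, "recognized": 46, "fabricated": 15},
}


def pct(numerator, denominator):
    """Percentage rounded half-up (assumed convention; Python's round() rounds 62.5 to 62)."""
    return math.floor(100 * numerator / denominator + 0.5)


def summarize(model):
    """Recompute the abstract's percentages for one model, each with its own denominator."""
    c = counts[model]
    missed = TOTAL - c["retrieved"]  # articles the model did not identify
    return {
        "retrieved_pct": pct(c["retrieved"], TOTAL),
        "recognized_pct": pct(c["recognized"], c["retrieved"]),
        "missed": missed,
        "fabricated_pct": pct(c["fabricated"], missed),
    }


# DeepSeek is reported differently: fabrication in 82 of all 93 answers.
deepseek_fabricated_pct = pct(82, TOTAL)

if __name__ == "__main__":
    for name in counts:
        print(name, summarize(name))
    print("DeepSeek fabricated_pct:", deepseek_fabricated_pct)
```

Because the fabrication rates for ChatGPT4o and Grok 3 are conditional on the retracted article not being identified, they are not directly comparable with DeepSeek's 88%, which is a share of all answers.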

## Full-text entities

- **Genes:** MIR34A (microRNA 34a) [NCBI Gene 407040] {aka MIRN34A, miRNA34A, mir-34, mir-34a}
- **Diseases:** GBM (MESH:D005909)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12583397/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12583397/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12583397/full.md

---
Source: https://tomesphere.com/paper/PMC12583397