ChatGPT "contamination": estimating the prevalence of LLMs in the scholarly literature

Andrew Gray

arXiv:2403.16887 · cs.DL · March 26, 2024 · 29 cites

Open Access

TL;DR

This paper estimates that slightly over 1% of scholarly articles published in 2023 (at least 60,000 papers) were written with the assistance of Large Language Models such as ChatGPT, based on the prevalence of keywords that are disproportionately common in LLM-generated text.

Contribution

It introduces a method for estimating LLM usage in the scholarly literature from the prevalence of keywords that are disproportionately common in LLM-generated text, yielding a large-scale quantitative estimate of LLM-assisted publications.

Findings

01

At least 60,000 papers published in 2023 are estimated to be LLM-assisted

02

Several keywords associated with LLM-generated text show a distinctive and disproportionate increase in prevalence in 2023, individually and in combination

03

LLM-assisted papers make up slightly over 1% of all articles published in 2023

Abstract

The use of ChatGPT and similar Large Language Model (LLM) tools in scholarly communication and academic publishing has been widely discussed since they became easily accessible to a general audience in late 2022. This study uses keywords known to be disproportionately present in LLM-generated text to provide an overall estimate for the prevalence of LLM-assisted writing in the scholarly literature. For the publishing year 2023, it is found that several of those keywords show a distinctive and disproportionate increase in their prevalence, individually and in combination. It is estimated that at least 60,000 papers (slightly over 1% of all articles) were LLM-assisted, though this number could be extended and refined by analysis of other characteristics of the papers or by identification of further indicative keywords.
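The keyword-based estimate described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual procedure: it assumes a simple baseline, namely that a keyword's share of papers in the target year would otherwise have matched its average share over earlier years, and counts anything above that as excess. The function name `excess_keyword_papers` and all counts below are hypothetical, invented purely for illustration.

```python
from statistics import mean


def excess_keyword_papers(hits, totals, baseline_years, target_year):
    """Papers in target_year that contain a keyword beyond its baseline share.

    hits[y]   -- number of papers in year y containing the keyword
    totals[y] -- total number of papers published in year y
    Assumption: the baseline share is the mean per-year share over baseline_years.
    """
    baseline_share = mean(hits[y] / totals[y] for y in baseline_years)
    expected_hits = baseline_share * totals[target_year]
    return hits[target_year] - expected_hits


# Hypothetical counts for one keyword (NOT data from the paper).
hits = {2020: 1000, 2021: 1100, 2022: 1200, 2023: 2500}
totals = {2020: 100_000, 2021: 110_000, 2022: 120_000, 2023: 125_000}

excess = excess_keyword_papers(hits, totals, [2020, 2021, 2022], 2023)
share = excess / totals[2023]
print(f"excess papers: {excess:.0f} ({share:.2%} of {2023} output)")
```

An excess computed this way would be a lower bound in spirit, since LLM-assisted papers that avoid the chosen keyword are not counted. That is consistent with the abstract's framing of "at least 60,000 papers" and its note that further indicative keywords could refine the figure.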

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging