ChatGPT "contamination": estimating the prevalence of LLMs in the scholarly literature
Andrew Gray

TL;DR
This paper estimates that at least 60,000 scholarly articles published in 2023 (slightly over 1% of the total) were written with the assistance of Large Language Models such as ChatGPT, based on the prevalence of keywords that are disproportionately common in LLM-generated text.
Contribution
It introduces a method for estimating LLM usage in the scholarly literature by tracking the prevalence of keywords characteristic of LLM-generated text, providing the first large-scale quantitative estimate of LLM-assisted publications.
Findings
At least 60,000 papers in 2023 likely used LLMs
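Several keywords characteristic of LLM-generated text show a distinctive, disproportionate increase in prevalence in 2023, both individually and in combination
LLM-assisted papers account for slightly over 1% of all articles published in 2023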
Abstract
The use of ChatGPT and similar Large Language Model (LLM) tools in scholarly communication and academic publishing has been widely discussed since they became easily accessible to a general audience in late 2022. This study uses keywords known to be disproportionately present in LLM-generated text to provide an overall estimate for the prevalence of LLM-assisted writing in the scholarly literature. For the publishing year 2023, it is found that several of those keywords show a distinctive and disproportionate increase in their prevalence, individually and in combination. It is estimated that at least 60,000 papers (slightly over 1% of all articles) were LLM-assisted, though this number could be extended and refined by analysis of other characteristics of the papers or by identification of further indicative keywords.
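The abstract does not spell out the exact computation, so the following is only a minimal sketch of how a keyword-based excess estimate could work. It assumes the baseline prevalence of a keyword is the mean of its prevalence in earlier years, and it treats any papers above that baseline in the target year as the excess. The function name `excess_papers`, the choice of baseline years, and all of the counts are hypothetical illustrations, not figures or methods taken from the paper.

```python
from statistics import mean


def excess_papers(hits_by_year, totals_by_year, baseline_years, target_year):
    """Estimate how many papers in target_year contain a keyword beyond
    what the baseline prevalence would predict.

    Assumption (not from the paper): the baseline is the mean prevalence
    over baseline_years.
    """
    baseline = mean(hits_by_year[y] / totals_by_year[y] for y in baseline_years)
    expected = baseline * totals_by_year[target_year]
    return hits_by_year[target_year] - expected


# Invented counts for illustration only: total papers per year, and papers
# per year containing one hypothetical indicator keyword.
totals = {2020: 1000, 2021: 1000, 2022: 1000, 2023: 1000}
hits = {2020: 10, 2021: 10, 2022: 10, 2023: 25}

excess = excess_papers(hits, totals, [2020, 2021, 2022], 2023)
share = excess / totals[2023]
print(f"excess papers: {excess:.0f} ({share:.1%} of the 2023 total)")
```

An estimate built this way is a lower bound on keyword-detectable usage only: LLM-assisted papers that avoid the chosen keywords are not counted, which is consistent with the abstract's remark that the figure could be extended by identifying further indicative keywords.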
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging
