Delving into LLM-assisted writing in biomedical publications through excess vocabulary
Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause

TL;DR
This study investigates the widespread influence of large language models like ChatGPT on biomedical research writing by analyzing vocabulary changes in over 15 million abstracts, revealing significant LLM usage and style shifts in recent years.
Contribution
It introduces a large-scale, unbiased method to detect LLM-assisted writing in biomedical literature through vocabulary analysis, quantifying its prevalence across disciplines and regions.
Findings
At least 13.5% of 2024 abstracts were processed with LLMs
LLM influence varies by discipline, country, and journal
The effect of LLMs on abstract vocabulary exceeds that of major world events such as the COVID-19 pandemic
Abstract
Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists use them for their scholarly writing. But how wide-spread is such LLM usage in the academic literature? To answer this question for the field of biomedical research, we present an unbiased, large-scale approach: we study vocabulary changes in over 15 million biomedical abstracts from 2010–2024 indexed by PubMed, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 40% for some subcorpora. We show that LLMs have…
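The excess word idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The abstract does not say how the expected frequency is obtained, so the linear extrapolation from two pre-LLM years used here is an assumption, as are the function names, the toy abstracts, and the frequencies 0.01 and 0.02. The sketch computes the share of abstracts containing a word, projects what that share would have been without LLMs, and reports the gap between the two.

```python
import re
from typing import Iterable


def word_frequency(abstracts: Iterable[str], word: str) -> float:
    """Fraction of abstracts that contain `word` at least once (case-insensitive)."""
    abstracts = list(abstracts)
    if not abstracts:
        return 0.0
    hits = sum(
        1 for text in abstracts
        if word in set(re.findall(r"[a-z]+", text.lower()))
    )
    return hits / len(abstracts)


def expected_frequency(freq_two_years_ago: float, freq_last_year: float,
                       years_ahead: int) -> float:
    """Assumed baseline: extend the pre-LLM linear trend, clipped to [0, 1]."""
    slope = freq_last_year - freq_two_years_ago
    projected = freq_last_year + years_ahead * slope
    return min(1.0, max(0.0, projected))


def excess_gap(observed: float, expected: float) -> float:
    """Observed share of abstracts with the word minus the expected share."""
    return observed - expected


# Toy data (hypothetical): four made-up "2024" abstracts.
toy_2024 = [
    "We delve into the role of gut microbiota.",
    "Results are reported for 120 patients.",
    "This study will delve deeper into mechanisms.",
    "A cohort analysis of statin use.",
]

observed = word_frequency(toy_2024, "delve")
expected = expected_frequency(0.01, 0.02, years_ahead=2)  # hypothetical pre-LLM shares
print(f"observed={observed:.2f} expected={expected:.2f} "
      f"gap={excess_gap(observed, expected):.2f}")
```

Under these assumptions, a positive gap for a single word is a rough lower bound on the share of abstracts affected, since at least that fraction of abstracts contain the word beyond what the earlier trend predicts. How the paper combines words to reach its 13.5% figure is not stated in the abstract above.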
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling
