Delving into LLM-assisted writing in biomedical publications through excess vocabulary

Dmitry Kobak; Rita González-Márquez; Emőke-Ágnes Horvát; Jan Lause

arXiv:2406.07016 · cs.CL · July 4, 2025 · 34 cites

Open Access 1 Repo

TL;DR

This study measures how large language models such as ChatGPT have changed biomedical research writing by analyzing vocabulary changes in over 15 million PubMed abstracts, finding an abrupt rise in certain style words that indicates substantial LLM usage in recent years.

Contribution

It introduces a large-scale, unbiased method to detect LLM-assisted writing in biomedical literature through vocabulary analysis, quantifying its prevalence across disciplines and regions.

Findings

01

At least 13.5% of 2024 abstracts were processed with LLMs

02

LLM influence varies by discipline, country, and journal

03

Impact of LLMs on scientific writing surpasses that of major world events such as the COVID-19 pandemic

Abstract

Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question for the field of biomedical research, we present an unbiased, large-scale approach: we study vocabulary changes in over 15 million biomedical abstracts from 2010–2024 indexed by PubMed, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 40% for some subcorpora. We show that LLMs have…
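The excess word idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the linear extrapolation rule used for the expected frequency, and all numbers below are illustrative assumptions. The sketch compares the observed fraction of abstracts containing a word in 2024 with a counterfactual fraction projected from two pre-LLM years, and reports the gap.

```python
def extrapolate(freq_prev, freq_last, steps=2):
    """Project a word frequency forward from two consecutive pre-LLM years.

    Assumption: a simple linear trend that is never allowed to point
    downward, so the expected frequency is at least the last observed one.
    """
    return freq_last + steps * max(freq_last - freq_prev, 0.0)


def excess_gap(observed, expected):
    """Excess fraction of abstracts using the word beyond the projection."""
    return observed - expected


# Made-up toy data: fraction of abstracts per year that contain each word.
toy_freqs = {
    "delves": {2021: 0.0004, 2022: 0.0005, 2024: 0.0120},
    "patients": {2021: 0.3000, 2022: 0.3010, 2024: 0.3020},
}

gaps = {}
for word, freq in toy_freqs.items():
    expected_2024 = extrapolate(freq[2021], freq[2022])
    gaps[word] = excess_gap(freq[2024], expected_2024)

# Words with a large positive gap are candidate "excess" style words.
for word, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{word}: gap = {gap:+.4f}")
```

Under these assumptions, a single word's positive gap can be read as a rough lower bound on the share of abstracts affected, because each affected abstract adds at most one to that word's document count. The toy content word stays on trend and yields no excess.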

Code & Models

Repositories

berenslab/chatgpt-excess-words
pytorch · Official

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling