Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models
Tom S. Juzek, Zina B. Ward

TL;DR
This paper investigates why large language models overuse certain words, such as "delve," in scientific abstracts. It finds no evidence that model architecture or training data are responsible, and it explores the role of reinforcement learning from human feedback (RLHF).
Contribution
The study develops a formal method to identify lexical overrepresentation in scientific texts and examines potential causes, including RLHF, highlighting the complexity of language change driven by LLMs.
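To make the idea of lexical overrepresentation concrete, here is a minimal Python sketch that compares word frequencies between an earlier and a later set of abstracts. It is not the authors' method: the function names (`per_million`, `overrepresented`), the regex tokenizer, the `min_ratio` threshold, the `smoothing` constant, and the toy abstracts are all assumptions made for illustration.

```python
import re
from collections import Counter


def per_million(texts):
    """Return each word's occurrences per million tokens in `texts`."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (an assumption): lowercase alphabetic runs only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}


def overrepresented(before, after, min_ratio=2.0, smoothing=1.0):
    """List (word, ratio) pairs whose frequency grew by at least `min_ratio`.

    `smoothing` is a hypothetical pseudo-frequency (per million tokens) that
    avoids division by zero for words absent from the earlier corpus.
    """
    freq_before = per_million(before)
    freq_after = per_million(after)
    flagged = []
    for word, f_after in freq_after.items():
        f_before = freq_before.get(word, 0.0)
        ratio = (f_after + smoothing) / (f_before + smoothing)
        if ratio >= min_ratio:
            flagged.append((word, ratio))
    # Largest increase first; ties broken alphabetically for determinism.
    return sorted(flagged, key=lambda pair: (-pair[1], pair[0]))


# Toy data invented for this sketch, not taken from the paper.
abstracts_before = [
    "We look into the structure of proteins",
    "We look into the role of enzymes",
]
abstracts_after = [
    "We delve into the structure of proteins",
    "We delve into the role of enzymes",
]

flagged = overrepresented(abstracts_before, abstracts_after)
print([word for word, _ in flagged])
```

On this toy input only "delve" is flagged, because every other word in the later abstracts occurs at the same rate as before. A real analysis would need far larger corpora and would have to control for topic shifts and other sources of change, which a plain frequency ratio does not do.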
Findings
The method identifies 21 focal words whose increased use in scientific abstracts is likely due to LLM usage.
Model architecture and training data are unlikely causes of lexical overrepresentation.
Reinforcement learning from human feedback may influence overuse, but evidence is inconclusive.
Abstract
Scientific English is currently undergoing rapid change, with words like "delve," "intricate," and "underscore" appearing far more frequently than just a few years ago. It is widely assumed that scientists' use of large language models (LLMs) is responsible for such trends. We develop a formal, transferable method to characterize these linguistic changes. Application of our method yields 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage. We then pose "the puzzle of lexical overrepresentation": WHY are such words overused by LLMs? We fail to find evidence that lexical overrepresentation is caused by model architecture, algorithm choices, or training data. To assess whether reinforcement learning from human feedback (RLHF) contributes to the overuse of focal words, we undertake comparative model testing and conduct an exploratory online…
Taxonomy
Topics: Text Readability and Simplification · Topic Modeling · Artificial Intelligence in Healthcare and Education
