Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models

Tom S. Juzek; Zina B. Ward

arXiv:2412.11385 · cs.CL · December 17, 2024 · 2 cites

TL;DR

This paper investigates why large language models overuse certain words, such as "delve," that have become far more frequent in scientific abstracts. It finds no evidence that model architecture, algorithm choices, or training data are the cause, and it explores whether reinforcement learning from human feedback (RLHF) plays a role.

Contribution

The study develops a formal, transferable method for identifying lexical overrepresentation in scientific abstracts and examines potential causes, including RLHF. Its results highlight the complexity of language change driven by LLMs.

Findings

01

Identified 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage.

02

No evidence was found that model architecture, algorithm choices, or training data cause lexical overrepresentation.

03

Reinforcement learning from human feedback may influence overuse, but evidence is inconclusive.

Abstract

Scientific English is currently undergoing rapid change, with words like "delve," "intricate," and "underscore" appearing far more frequently than just a few years ago. It is widely assumed that scientists' use of large language models (LLMs) is responsible for such trends. We develop a formal, transferable method to characterize these linguistic changes. Application of our method yields 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage. We then pose "the puzzle of lexical overrepresentation": WHY are such words overused by LLMs? We fail to find evidence that lexical overrepresentation is caused by model architecture, algorithm choices, or training data. To assess whether reinforcement learning from human feedback (RLHF) contributes to the overuse of focal words, we undertake comparative model testing and conduct an exploratory online…
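The abstract does not spell out the formal method, so the following is only a minimal sketch of the general idea of detecting words whose frequency has risen between two time periods. It is not the authors' procedure: the function names `per_million` and `frequency_increase`, the smoothing constant, and the toy sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def per_million(texts):
    """Return occurrences per million tokens for each lowercased word."""
    counts = Counter()
    for text in texts:
        # Simple tokenizer (an assumption): runs of ASCII letters only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def frequency_increase(before, after, smoothing=1.0):
    """Rank words by the ratio of their 'after' to 'before' frequency.

    Both frequencies are in occurrences per million tokens. The smoothing
    term (hypothetical choice) avoids division by zero for words that are
    absent from the earlier period.
    """
    opm_before = per_million(before)
    opm_after = per_million(after)
    ratios = {
        word: (freq + smoothing) / (opm_before.get(word, 0.0) + smoothing)
        for word, freq in opm_after.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration; real input would be large sets of
# abstracts from an earlier and a later period.
before = [
    "we study the structure of proteins",
    "we examine the data in detail",
]
after = [
    "we delve into the structure of proteins",
    "we delve into the intricate data",
]

ranked = frequency_increase(before, after)
for word, ratio in ranked[:3]:
    print(f"{word}: {ratio:.1f}x")
```

A ratio well above 1 flags a word as a candidate for closer inspection. On its own it says nothing about whether LLM usage is the cause, which is the question the paper goes on to examine.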

Code & Models

Repositories

tjuzek/delve
Official

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Artificial Intelligence in Healthcare and Education