Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback
Tom S. Juzek, Zina B. Ward

TL;DR
This paper investigates how Learning from Human Feedback (LHF) contributes to lexical overuse in Large Language Models. It finds that LHF can lead models to favor certain words, a form of potential misalignment that may reflect diverging lexical expectations between LHF workers and users.
Contribution
It introduces a simple method to detect LHF-induced lexical preferences and experimentally links LHF to overuse of specific terms in LLMs.
Findings
LHF contributes to lexical overuse in LLMs.
In an experiment emulating the LHF procedure, participants systematically prefer text variants that include certain words.
The lexical expectations of LHF workers and of users may diverge.
Abstract
Large Language Models (LLMs) are known to overuse certain terms like "delve" and "intricate." The exact reasons for these lexical choices, however, have been unclear. Using Meta's Llama model, this study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting the lexical preferences of LLMs that are potentially LHF-induced. Next, we more conclusively link LHF to lexical overuse by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants that include certain words. This lexical overuse can be seen as a sort of misalignment, though our study highlights the potential divergence between the lexical expectations of different populations -- namely LHF workers versus LLM…
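The abstract does not spell out the detection procedure, so the following is only a minimal sketch of one way such a comparison could be set up, not the authors' method. It assumes two hypothetical collections of generated text, one from a base model and one from its LHF-tuned counterpart, and ranks words by a smoothed relative-frequency ratio. The function names, the add-one smoothing, and the `min_count` threshold are all illustrative assumptions.

```python
from collections import Counter
import re


def word_counts(texts):
    """Lowercase each text, tokenize on alphabetic runs, and count words."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def overuse_ratios(tuned_texts, base_texts, min_count=2):
    """Rank words by smoothed relative frequency in tuned vs. base output.

    Hypothetical helper: a ratio well above 1 marks a word that the tuned
    model uses more often than the base model in this sample.
    """
    tuned = word_counts(tuned_texts)
    base = word_counts(base_texts)
    vocab = set(tuned) | set(base)
    # Add-one smoothing so words absent from the base sample do not divide by zero.
    tuned_total = sum(tuned.values()) + len(vocab)
    base_total = sum(base.values()) + len(vocab)
    ratios = {}
    for word, n in tuned.items():
        if n < min_count:
            continue  # skip words too rare to compare meaningfully
        p_tuned = (n + 1) / tuned_total
        p_base = (base[word] + 1) / base_total
        ratios[word] = p_tuned / p_base
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (invented for illustration, not from the paper).
base_texts = [
    "we look into the details of the method",
    "we look into the results",
]
tuned_texts = [
    "we delve into the intricate details of the method",
    "we delve into the results",
]
ranked = overuse_ratios(tuned_texts, base_texts)
```

On this toy input, "delve" ranks first because it appears twice in the tuned sample and never in the base sample. A real analysis would need far larger samples and matched prompts before reading anything into the ratios.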
