Generative Monoculture in Large Language Models
Fan Wu, Emily Black, Varun Chandrasekaran

TL;DR
This paper identifies and analyzes 'generative monoculture' in large language models: a narrowing of output diversity relative to the training data for a given task. The narrowing can improve or harm performance depending on the context, which raises concerns for high-impact applications.
Contribution
The paper introduces the concept of generative monoculture, demonstrates its prevalence in LLMs through experiments, and discusses potential root causes and mitigation strategies.
Findings
Generative monoculture is common in LLMs for tasks like review and code generation.
Simple sampling or prompting changes do not effectively reduce monoculture.
Root causes are likely embedded in the models' alignment processes.
Abstract
We introduce generative monoculture, a behavior observed in large language models (LLMs) characterized by a significant narrowing of model output diversity relative to available training data for a given task: for example, generating only positive book reviews for books with a mixed reception. While in some cases, generative monoculture enhances performance (e.g., LLMs more often produce efficient code), the dangers are exacerbated in others (e.g., LLMs refuse to share diverse opinions). As LLMs are increasingly used in high-impact settings such as education and web search, careful maintenance of LLM output diversity is essential to ensure a variety of facts and perspectives are preserved over time. We experimentally demonstrate the prevalence of generative monoculture through analysis of book review and code generation tasks, and find that simple countermeasures such as altering…
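The narrowing described in the abstract can be made concrete by comparing how spread out a task attribute (here, review sentiment) is in source data versus model output. The following is a minimal sketch under stated assumptions, not the paper's own metric or code: it assumes sentiment labels have already been assigned to each review, uses Shannon entropy as one plausible dispersion measure, and the function name `attribute_entropy` and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels.

    Hypothetical helper for illustration; higher values mean the
    attribute is spread more evenly across its categories.
    """
    counts = Counter(labels)
    total = len(labels)
    # Sum p * log2(1/p) over the observed categories.
    return sum((c / total) * log2(total / c) for c in counts.values())


# Toy, made-up labels: a book with a mixed reception in the source data,
# and model-generated reviews that are uniformly positive.
source_labels = ["positive", "negative", "positive", "negative"]
generated_labels = ["positive", "positive", "positive", "positive"]

source_entropy = attribute_entropy(source_labels)
generated_entropy = attribute_entropy(generated_labels)

# A positive gap indicates the generated outputs are less diverse
# than the source data on this attribute.
narrowing = source_entropy - generated_entropy
print(f"source: {source_entropy:.2f} bits, generated: {generated_entropy:.2f} bits")
```

Entropy is only one choice here. Any statistic that summarizes the spread of an attribute distribution could stand in for it, and the attribute itself would need to be chosen per task (for example, sentiment for reviews).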
Peer Reviews
Decision: ICLR 2025 Poster
This paper highlights a critical issue in LLMs around narrowing of output diversity compared to the training data. The paper addresses an important problem, especially when LLMs are being increasingly applied in diverse fields such as automated product reviews, sentiment analysis, scholarly paper summarization, etc. The paper demonstrates the prevalence of narrowing of output diversity, which they refer to as 'generative monoculture'. They consider book reviews and code solutions as two primary use cases
While the paper tests various methods to mitigate 'monoculture', including temperature adjustment and prompting strategies, the attempted countermeasures showed limited efficacy in mitigating narrowing of output diversity. This warrants more experimentation and ideation. I would also think use cases/tasks other than book reviews and code generation should be investigated to test the generalizability of the method. Dialogue / chat bot as an application may be an important area to test these metho
The main idea of this paper is very interesting, and I am glad the authors have done this exploration. The authors have done a good job of discussing nuances around the merits of diversity, and I appreciate their selection of two complementary domains where the value of having diversity is quite different.
## Primary weakness - incomplete description of methodology
Unfortunately, it is not possible to assess this paper as it was submitted because crucial information required to understand and reproduce the methodology is purported to be in the appendix, but no appendix was included in the submission. Since the paper is incomplete, there is no choice but to give a score of 1 (strong reject). Despite this, I have tried to leave some constructive feedback below for the authors.
## LLMs for attribute
The paper formalizes the idea of “monoculture”. This idea isn’t wildly novel–it’s intuitive and consistent with other similar ideas such as mode collapse–but to my knowledge there isn’t a clean documentation of it and thus the paper has value in being an official cite for this phenomenon. The authors focus on measuring monoculture using task-specific notions of salient attributes (e.g., sentiment in book reviews, algorithms in code) which differs meaningfully from measures that use e.g., vocabul
My primary concern is that the evaluation focuses entirely on automatic metrics. Granted, there are many metrics that the authors use, and they are somewhat diverse. Still, many of the metrics rely on using LLMs themselves (mostly GPT-3.5) to evaluate LLM output. There is something circular (though hard to articulate) about doing this, especially given the premise of the paper itself. That is: if we assume LLMs are not good at generating diverse outputs, might we also worry that they aren’t good
Taxonomy
Topics: Language and cultural evolution
