When LLMs Play the Telephone Game: Cultural Attractors as Conceptual Tools to Evaluate LLMs in Multi-turn Settings
Jérémy Perez, Grgur Kovač, Corentin Léger, Cédric Colas, Gaia Molinaro, Maxime Derex, Pierre-Yves Oudeyer, Clément Moulin-Frier

TL;DR
This paper investigates how iterative interactions among large language models can amplify biases and lead to stable attractor states, revealing the importance of multi-turn dynamics in understanding LLM cultural evolution.
Contribution
It introduces a novel experimental framework using cultural transmission chains to study biases and attractors in LLM interactions over multiple turns.
Findings
Open-ended instructions amplify attraction effects.
Toxicity exhibits stronger attractor tendencies than length.
Different text properties respond variably to iterative biases.
Abstract
As large language models (LLMs) start interacting with each other and generating an increasing amount of text online, it becomes crucial to better understand how information is transformed as it passes from one LLM to the next. While significant research has examined individual LLM behaviors, existing studies have largely overlooked the collective behaviors and information distortions arising from iterated LLM interactions. Small biases, negligible at the single output level, risk being amplified in iterated interactions, potentially leading the content to evolve towards attractor states. In a series of telephone game experiments, we apply a transmission chain design borrowed from the human cultural evolution literature: LLM agents iteratively receive, produce, and transmit texts from the previous to the next agent in the chain. By tracking the evolution of text toxicity, positivity,…
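The transmission chain described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `toy_agent` is a hypothetical stand-in for an LLM call, and word count stands in for the text properties the paper tracks. The toy agent has a small built-in bias toward five-word texts, so chains started from different lengths drift toward the same value, which is the intuition behind an attractor.

```python
from typing import Callable, List


def run_chain(initial_text: str,
              agent: Callable[[str], str],
              n_generations: int) -> List[str]:
    """Pass a text down a chain: each agent receives the previous output."""
    texts = [initial_text]
    for _ in range(n_generations):
        texts.append(agent(texts[-1]))
    return texts


def toy_agent(text: str) -> str:
    """Hypothetical stand-in for an LLM: nudges the text toward 5 words."""
    words = text.split()
    target = 5  # assumed bias, chosen only for illustration
    if len(words) > target:
        return " ".join(words[:-1])        # drop the last word
    if len(words) < target:
        return " ".join(words + ["word"])  # append a filler word
    return text                            # already at the attractor


def word_count(text: str) -> int:
    """Toy text property tracked along the chain."""
    return len(text.split())


# Two chains with different starting lengths converge to the same value.
long_chain = run_chain("a b c d e f g h", toy_agent, n_generations=6)
short_chain = run_chain("a b", toy_agent, n_generations=6)

lengths_long = [word_count(t) for t in long_chain]
lengths_short = [word_count(t) for t in short_chain]
print(lengths_long)   # [8, 7, 6, 5, 5, 5, 5]
print(lengths_short)  # [2, 3, 4, 5, 5, 5, 5]
```

In a real experiment, `toy_agent` would be replaced by a call to a model with a task instruction, and `word_count` by a property scorer; both substitutions are left open here.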
Peer Reviews
Decision: ICLR 2025 Poster
Originality: Multi-turn interactions in LLMs are an interesting and less studied area in LLM evaluations on qualities such as bias. The telephone game is an easy-to-grasp concept that allows readers to conceptualize the experiments easily. Quality: The evaluations are conducted on both open- and closed-source models. The measurement methods of metrics are implemented reasonably. Clarity: Overall the paper is easy to follow. Graphs are colorful and visually appealing. Significance: The motiv
Originality: None Quality: - There is little justification for why "rephrase", "take inspiration from", and "continue" are chosen as the three tasks for the models to iterate on. - I don't think one of the main conclusions that the authors claim is supported by their experiments. The authors use these three tasks to make the claim that tasks that are less constrained have higher attractor strength, but the three conditions differ in many ways, not just the level of constrain
The research is grounded in a strong foundation of social science theory, specifically cultural attraction theory (CAT). The phenomenon they observe is novel and interesting, and the writing is clear.
- The paper does not explicitly provide theories or mechanisms underlying this phenomenon. Why do LLMs exhibit this human-like behavior of transmitting cultural patterns? Do LLMs simply mimic human behavioral patterns present in the training data, or does this behavior emerge due to specific objectives in LLM training? The authors may consider discussing this further or suggesting directions for future research. - Page 17, line 866: The paper relies on a small sample of initial texts (5 abstrac
In my view, the biggest strengths of the paper are its research question and approach. There exists an extremely large body of research investigating the single-turn capabilities of LLMs, but much less in multi-turn settings. For this, the specific aspect of how certain attributes act as 'attractors' in driving the LLMs' responses is extremely relevant, and the authors' methodology for measuring this is well-founded in the literature. The results they show, that certain attributes are (understandabl
The primary weaknesses of the paper are the limited number of models and the fact that (to my understanding) each 'conversation' was conducted by the same 'model' (described in the paper as homogeneous transmission chains), without an ablation of how a heterogeneous transmission chain could shift results. While doing this with a number of agents n > 2 might be excessive, I do think a two-agent heterogeneous system could be reasonable.
Taxonomy
Topics: Language and cultural evolution · Topic Modeling · Opinion Dynamics and Social Influence
