Do as We Do, Not as You Think: the Conformity of Large Language Models
Zhiyuan Weng, Guikun Chen, Wenguan Wang

TL;DR
This paper investigates conformity behaviors in large language model-driven multi-agent systems, introducing a new benchmark to measure conformity and proposing strategies to mitigate its effects for more ethical and reliable collaboration.
Contribution
It introduces BenchForm, a novel benchmark for assessing conformity in LLM multi-agent systems, and explores mitigation strategies like persona enhancement and reflection mechanisms.
Findings
Conformity varies with interaction time and majority size.
Enhanced personas can reduce conformity rates.
Reflection mechanisms improve agent independence.
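As a rough illustration of how a conformity rate like the ones in these findings could be measured, the sketch below counts how often an agent abandons a correct solo answer in favor of a wrong majority answer. This is a hypothetical definition for illustration only, not the metric defined in the paper; the `Trial` record, its field names, and the `majority` and `conformity_rate` helpers are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trial:
    """One question posed to a subject agent (hypothetical record layout)."""
    correct: str             # ground-truth answer
    solo_answer: str         # subject's answer with no peers present
    peer_answers: list[str]  # answers shown by the other agents
    group_answer: str        # subject's answer after seeing the peers


def majority(answers):
    """Return the single most common answer, or None if empty or tied."""
    top = Counter(answers).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    return top[0][0]


def conformity_rate(trials):
    """Share of eligible trials in which the subject switches from a
    correct solo answer to a wrong majority answer (assumed definition).

    A trial is eligible only if the subject was correct alone and the
    peers' majority answer is wrong.
    """
    eligible = 0
    conformed = 0
    for t in trials:
        m = majority(t.peer_answers)
        if t.solo_answer == t.correct and m is not None and m != t.correct:
            eligible += 1
            if t.group_answer == m:
                conformed += 1
    return conformed / eligible if eligible else 0.0


trials = [
    Trial("A", "A", ["B", "B", "B"], "B"),  # eligible, conforms
    Trial("A", "A", ["B", "B", "C"], "A"),  # eligible, stays independent
    Trial("A", "B", ["B", "B"], "B"),       # not eligible: wrong when alone
    Trial("A", "A", ["A", "A"], "A"),       # not eligible: majority is right
]
rate = conformity_rate(trials)
```

Varying the length of `peer_answers` across trials would be one way to probe the majority-size effect under this assumed definition.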
Abstract
Recent advancements in large language models (LLMs) revolutionize the field of intelligent agents, enabling collaborative multi-agent systems capable of tackling complex problems across various domains. However, the potential of conformity within these systems, analogous to phenomena like conformity bias and groupthink in human group dynamics, remains largely unexplored, raising concerns about their collective problem-solving capabilities and possible ethical implications. This paper presents a comprehensive study on conformity in LLM-driven multi-agent systems, focusing on three aspects: the existence of conformity, the factors influencing conformity, and potential mitigation strategies. In particular, we introduce BenchForm, a new conformity-oriented benchmark, featuring reasoning-intensive tasks and five distinct interaction protocols designed to probe LLMs' behavior in collaborative…
Peer Reviews
Decision·ICLR 2025 Oral
* Presentation: The paper is well-written and easy to follow.
* Benchmark: The introduction of BENCHFORM offers a unique framework for studying conformity in LLM-driven agents.
* Experimentation: The empirical studies are thorough, providing valuable findings about possible ways that LLMs can be influenced and bringing attention to the ethical and policy challenges ahead.
* Although I agree that ${CR}^C$ also shows conformity, it can be beneficial if it helps LLMs learn from the forum. I recommend including that aspect in the paper.
* 272: Switching between Qwen2-72B and Qwen2-7B in the results section was confusing. I recommend either including both in all results or sticking to one.
Minor:
* 223: To note that -> Note that
* 389: Can you add statistical significance to the chart?
This is a well-thought-out and well-presented piece of research. Originality: It is clearly located within an existing literature where, as the authors note, ‘Research about the phenomenon of conformity within LLM-driven multi-agent systems remains scarce’ (p. 9). One area of originality for the work is that it investigates long-term interactions. Quality: The way in which the authors draw on sociological research to help develop BenchForm is a good illustration of the importance of bringing ins
Overall, I enjoyed the paper. Areas of weakness are areas where I had specific questions arise. 1. Section 4: investigation into factors that influence conformity. Here, the experiments are conducted with Llama3-70B and Qwen2-72B. Why these two? Could the authors explain why these two were chosen as a way of expanding on the methodology used in the set of experiments reported in section 4. Can the authors explain whether there are characteristics of these two LLMs that would make them particula
This is an important topic, and one which I have not personally encountered previously. The paper provides value in identifying the potential for conformity to arise in multi-agent LLM systems, the problems which might arise from this, and in demonstrating via experimental results that these concerns do actually occur in practice. The BENCHFORM benchmark and the proposed metrics and testing protocols are a useful starting point for research into the conformity of LLMs. I can see other authors
At times it is a bit unclear what the desirable behaviour of the agents actually is. This is touched on in Section 7, where the authors comment on the fact that conformity is a double-edged sword: some degree of conformity may assist in consensus-building, but overly conforming agents essentially become useless as they just reinforce the majority answer. I think it would have been useful to include this discussion earlier in the paper, and explain how the proposed metrics relate to this. The
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
