DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents
Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

TL;DR
DEBATE is a comprehensive benchmark for evaluating how well role-playing large language models simulate realistic opinion dynamics and group behaviors in social interactions, addressing limitations of prior simulations.
Contribution
We introduce DEBATE, a large-scale, publicly available benchmark with real human data for assessing opinion change and group behavior in multi-agent LLM simulations.
Findings
RPLA groups show strong opinion convergence similar to humans in zero-shot settings.
Supervised fine-tuning improves stance alignment and convergence accuracy.
Discrepancies in opinion change and belief updating still exist after training.
Abstract
Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 30,707 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels while also supporting future individual-level analyses. We instantiate "digital twin" RPLAs with seven LLMs and evaluate them in…
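To make the idea of group-level convergence concrete, the following is a minimal sketch of one way it could be quantified from private Likert-scale beliefs: the drop in within-group spread between pre- and post-discussion ratings. This is an illustrative assumption, not the metric defined in the paper; the function names (`group_spread`, `convergence`) and all ratings below are hypothetical.

```python
from statistics import pstdev


def group_spread(beliefs):
    """Population standard deviation of one group's Likert ratings."""
    return pstdev(beliefs)


def convergence(pre, post):
    """Reduction in within-group spread from before to after discussion.

    Positive values mean the group's private beliefs moved closer together.
    This is an assumed, illustrative measure, not the paper's definition.
    """
    return group_spread(pre) - group_spread(post)


# Hypothetical four-person group: ratings before discussion, then after
# discussion for the human group and for a simulated agent group.
pre_ratings = [1, 2, 5, 6]
human_post = [2, 3, 4, 5]
agent_post = [3, 3, 4, 4]

human_conv = convergence(pre_ratings, human_post)
agent_conv = convergence(pre_ratings, agent_post)

# A simulated group that converges more than its human counterpart would
# show a larger value here, which is one way to flag premature convergence.
print(f"human convergence: {human_conv:.3f}")
print(f"agent convergence: {agent_conv:.3f}")
```

In this made-up example the agent group's spread shrinks more than the human group's, which is the kind of gap a group-level comparison is meant to surface.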
Peer Reviews
Decision·Submitted to ICLR 2026
- Important and Novel Problem: The paper tackles a timely and critical question. As LLMs are increasingly used to simulate social interactions, we urgently need rigorous ways to check if these simulations are socially realistic. DEBATE is, to my knowledge, the first large-scale empirical benchmark specifically for multi-agent opinion dynamics.
- Data Collection: The study's design is its biggest strength. Separating 'public speech' from 'private belief' is a major contribution. It allows evalua
- Lack of Deeper Discussion on SFT Failure: The SFT results (Appendix N) are fascinating but underexplored. The paper states that naive SFT fails, but doesn't deeply explore why. Is it because the model learns to imitate an "average" human, losing individual diversity? Is the next-token prediction objective simply wrong for modeling a latent process like belief update? The paper would be stronger if it discussed alternative training objectives (e.g., RL based on realism, or explicit belief-track
The paper addresses a well-motivated and underexplored problem that existing RPLA simulations often display unnatural group behavior, such as premature consensus, and lack a benchmark to measure how human-like their opinion dynamics are.
1. The data collection is the paper’s strongest contribution. The authors conduct tightly controlled multi-party, multi-round human discussions that capture both public messages and private beliefs, yielding over 37K utterances from about 2,800 U.S. participant
1. The dataset is based on controlled four-person discussions with enforced turn-taking. While this setup ensures structured and clean data, it limits the natural flow of interaction and may not reflect opinion evolution in open or large-scale social settings.
2. The three simulation modes—Next Message Prediction, Tweet-guided Simulation, and Full Conversation Simulation—lack clear theoretical separation. Clarifying the motivation and analytical purpose of each mode would make the framework mor
- The dataset does seem to be both novel and relevant for the very timely problem of evaluating LLM agent-based social simulations. Having both internal stances and external utterances, as well as the diverse range of conversation topics and metadata, makes this a promising dataset.
- The analysis RQs are interesting and naturally emerge from the dataset construction - in its current form, I don’t think you need to frame the paper as mainly a dataset paper, as the analyses are also pretty intere
- My main concern is that there is not much quality validation of the human conversations in the dataset for a benchmark paper. It appears that all on-topic utterances that were a part of completed interactions were used for evaluation. More quality validation of the resulting dataset might be useful - I’m concerned that, as crowdworkers were completing a single-episode task with no incentive for honestly reporting preferences, there might be significant numbers of low-quality interactions that
