DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Yun-Shiuan Chuang; Ruixuan Tu; Chengtao Dai; Smit Vasani; You Li; Binwei Yao; Michael Henry Tessler; Sijia Yang; Dhavan Shah; Robert Hawkins; Junjie Hu; Timothy T. Rogers

arXiv:2510.25110·cs.CL·March 24, 2026

TL;DR

DEBATE is a benchmark, built on real human discussion data, for evaluating how well role-playing large language model agents reproduce human opinion dynamics and group behavior. It addresses the lack of empirical benchmarks for checking multi-agent simulations against real human group interactions.

Contribution

We introduce DEBATE, a large-scale, publicly available benchmark with real human data for assessing opinion change and group behavior in multi-agent LLM simulations.

Findings

01

RPLA groups show stronger opinion convergence than human groups in zero-shot settings.

02

Supervised fine-tuning improves stance alignment and convergence accuracy.

03

Discrepancies in opinion change and belief updating still exist after training.

Abstract

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 30,707 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels while also supporting future individual-level analyses. We instantiate "digital twin" RPLAs with seven LLMs and evaluate them in…
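As a rough illustration of what a group-level comparison over private Likert-scale beliefs could look like, the sketch below measures how much a group's ratings tighten between the start and end of a discussion. This is not the benchmark's own metric: the function names (`belief_spread`, `convergence`), the use of standard deviation as the spread measure, and all rating values are assumptions made for illustration only.

```python
from statistics import pstdev


def belief_spread(beliefs):
    """Population standard deviation of one group's Likert ratings."""
    return pstdev(beliefs)


def convergence(pre, post):
    """Reduction in within-group spread from pre- to post-discussion.

    Positive values mean the group's private beliefs moved closer together.
    """
    return belief_spread(pre) - belief_spread(post)


# Hypothetical four-person group rating one topic (made-up values,
# not taken from the DEBATE dataset).
human_pre, human_post = [1, 3, 4, 6], [2, 3, 4, 5]
# Hypothetical simulated agents starting from the same initial beliefs.
agent_pre, agent_post = [1, 3, 4, 6], [3, 3, 4, 4]

human_conv = convergence(human_pre, human_post)
agent_conv = convergence(agent_pre, agent_post)
print(f"human: {human_conv:.3f}, simulated: {agent_conv:.3f}")
```

Under these made-up numbers the simulated group tightens more than the human group, which is the kind of gap a convergence comparison is meant to surface. Other spread measures (range, mean pairwise distance) would serve equally well in this sketch.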

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Important and Novel Problem: The paper tackles a timely and critical question. As LLMs are increasingly used to simulate social interactions, we urgently need rigorous ways to check if these simulations are socially realistic. DEBATE is, to my knowledge, the first large-scale empirical benchmark specifically for multi-agent opinion dynamics.
- Data Collection: The study's design is its biggest strength. Separating 'public speech' from 'private belief' is a major contribution. It allows evalua

Weaknesses

- Lack of Deeper Discussion on SFT Failure: The SFT results (Appendix N) are fascinating but underexplored. The paper states that naive SFT fails, but doesn't deeply explore why. Is it because the model learns to imitate an "average" human, losing individual diversity? Is the next-token prediction objective simply wrong for modeling a latent process like belief update? The paper would be stronger if it discussed alternative training objectives (e.g., RL based on realism, or explicit belief-track

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The paper addresses a well-motivated and underexplored problem: existing RPLA simulations often display unnatural group behavior, such as premature consensus, and lack a benchmark to measure how human-like their opinion dynamics are.
1. The data collection is the paper’s strongest contribution. The authors conduct tightly controlled multi-party, multi-round human discussions that capture both public messages and private beliefs, yielding over 37K utterances from about 2,800 U.S. participant

Weaknesses

1. The dataset is based on controlled four-person discussions with enforced turn-taking. While this setup ensures structured and clean data, it limits the natural flow of interaction and may not reflect opinion evolution in open or large-scale social settings.
2. The three simulation modes (Next Message Prediction, Tweet-guided Simulation, and Full Conversation Simulation) lack clear theoretical separation. Clarifying the motivation and analytical purpose of each mode would make the framework mor

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The dataset does seem to be both novel and relevant for the very timely problem of evaluating LLM agent-based social simulations. Having both internal stances and external utterances, as well as the diverse range of conversation topics and metadata, makes this a promising dataset.
- The analysis RQs are interesting and naturally emerge from the dataset construction - in its current form, I don’t think you need to frame the paper as mainly a dataset paper, as the analyses are also pretty intere

Weaknesses

- My main concern is that there is not much quality validation of the human conversations in the dataset for a benchmark paper. It appears that all on-topic utterances that were a part of completed interactions were used for evaluation. More quality validation of the resulting dataset might be useful - I’m concerned that, as crowdworkers were completing a single-episode task with no incentive for honestly reporting preferences, there might be significant numbers of low-quality interactions that

Code & Models

Datasets

seantw/DEBATE_LLM
dataset · 1.9k dl
