When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
Sandro Andric

TL;DR
This paper shows that stronger reasoning in large language models can make them worse at simulating realistic multi-agent negotiation, and argues that models should be evaluated against the behavioral objective of the simulation rather than on problem-solving strength.
Contribution
The paper identifies a solver-sampler mismatch: stronger reasoning in LLMs can push negotiations toward authority decisions instead of negotiated outcomes, so models should be evaluated on how well they fill their behavioral role rather than on strategic prowess.
Findings
Native reasoning models often default to authority-heavy outcomes.
Increased reasoning does not necessarily improve behavioral sampling in negotiations.
Structural scaffolds are more effective than reasoning enhancements in promoting negotiated outcomes.
Abstract
Behavioral simulation and strategic problem solving are different tasks. Large language models are increasingly explored as agents in policy-facing institutional simulations, but stronger reasoning need not improve behavioral sampling. We study this solver-sampler mismatch in three multi-agent negotiation environments: two trading-limits scenarios with different authority structures and a grid-curtailment case in emergency electricity management. Across two primary model families, the native-reasoning condition, and often the no-reflection condition as well, collapses toward authority-heavy outcomes. The sharpest case is DeepSeek native reasoning in the grid-curtailment transfer: it reaches an action entropy of 1.256 and a concession-arc rate of 0.933, yet still ends in an authority decision in 15 of 15 runs. A direct OpenAI extension shows the same pressure at provider breadth: GPT-5.2 native reasoning ends in authority decisions in…
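As a rough illustration of why high action entropy and a uniform terminal outcome can coexist, the sketch below computes both quantities on toy data. It assumes action entropy means the Shannon entropy (here in nats) of the empirical action distribution; the paper's exact definition and unit are not given in the text above. The action labels, the outcome label, and both function names are hypothetical.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy, in nats, of the empirical action distribution.

    Assumption: this is only one plausible reading of "action entropy";
    the paper may use a different definition or log base.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def authority_rate(outcomes, label="authority_decision"):
    """Fraction of runs whose terminal outcome equals the given label."""
    return sum(o == label for o in outcomes) / len(outcomes)

# Hypothetical toy data: varied in-episode actions, identical terminal outcomes.
actions = ["propose", "concede", "reject", "escalate"] * 5
outcomes = ["authority_decision"] * 15

print(f"action entropy: {action_entropy(actions):.3f} nats")
print(f"authority rate: {authority_rate(outcomes):.2f}")
```

In this toy case the agents' moves are maximally varied across four action types, yet every run ends the same way, so a diversity metric over actions cannot on its own show that negotiated outcomes are being sampled.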
