When Two LLMs Debate, Both Think They'll Win

Pradyumna Shyama Prasad; Minh Nhat Nguyen

arXiv:2505.19184·cs.CL·June 10, 2025

Open Access

TL;DR

This study evaluates large language models in a multi-turn debate setting, revealing systematic overconfidence, mutual overestimation, and misalignment between reasoning and confidence, raising concerns about their self-assessment abilities.

Contribution

The paper introduces a dynamic debate framework for assessing LLMs' confidence calibration, and uses it to uncover significant overconfidence and a misalignment between models' reasoning and their stated confidence.

Findings

01 Models exhibit systematic overconfidence: initial confidence averages 72.9%, against a rational 50% baseline.

02 Confidence tends to escalate rather than decrease as debates progress.

03 Mutual overestimation occurs in over 60% of debates.

Abstract

Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as…
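The zero-sum argument in the abstract can be made concrete with a small sketch. Because exactly one side wins a debate, two well-calibrated win probabilities should sum to about 100, so any combined excess over 100 signals overconfidence on at least one side. The function names, the 50-point threshold, and the sample confidence pairs below are illustrative assumptions, not the paper's code or data.

```python
def overconfidence_excess(conf_a: float, conf_b: float) -> float:
    """Points by which two stated win confidences (0-100) exceed 100 combined.

    In a zero-sum debate only one side can win, so a coherent pair of
    win probabilities sums to 100. Anything above that is excess.
    """
    return max(0.0, conf_a + conf_b - 100.0)


def both_expect_to_win(conf_a: float, conf_b: float, threshold: float = 50.0) -> bool:
    """True when both debaters rate their own chance of winning above `threshold`.

    The 50-point threshold is an assumption made for this sketch.
    """
    return conf_a > threshold and conf_b > threshold


# Hypothetical (confidence_a, confidence_b) pairs, NOT results from the paper.
debates = [(75.0, 70.0), (60.0, 40.0), (80.0, 85.0)]

excesses = [overconfidence_excess(a, b) for a, b in debates]
share_mutual = sum(both_expect_to_win(a, b) for a, b in debates) / len(debates)

print(excesses)       # combined excess per debate
print(share_mutual)   # fraction of debates where both sides expect to win
```

A check like this needs no ground truth about who actually won, which is the appeal of the zero-sum design: incoherence between the two sides' ratings is visible from the ratings alone.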

Taxonomy

Topics: Corporate Governance and Law