Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation

Abir Harrasse; Chaithanya Bandi; Hari Bandi

arXiv:2410.04663 · cs.CL · January 27, 2026

TL;DR

D3 introduces a structured, adversarial multi-agent framework for LLM evaluation that enhances reliability, interpretability, and cost-efficiency through debate protocols and theoretical guarantees.

Contribution

The paper presents a novel, cost-aware evaluation framework, supported by theoretical analysis and state-of-the-art empirical performance, that addresses bias and inconsistency in LLM assessment.

Findings

01 Achieves high agreement with human judgments

02 Reduces positional and verbosity biases

03 Offers a cost-accuracy trade-off with budgeted stopping

Abstract

The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost-aware, adversarial multi-agent framework that orchestrates structured debate among role-specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi-Advocate One-Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single-Advocate Multi-Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii)…
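
The budgeted stopping idea behind SAMRE can be illustrated with a minimal sketch. The abstract does not give the stopping rule, so everything here is an assumption for illustration: the `run_round` callback (standing in for one advocate/judge exchange), the `DebateResult` type, the `tol` threshold, and the choice to declare convergence when the score gap between the two answers changes by less than `tol` between consecutive rounds. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: one debate round takes the round index and returns
# (score for answer A, score for answer B, tokens spent in that round).
RoundFn = Callable[[int], Tuple[float, float, int]]


@dataclass
class DebateResult:
    winner: str          # "A", "B", or "tie"
    gaps: List[float]    # score gap (A minus B) after each round
    tokens_used: int     # total tokens spent across all rounds run
    stop_reason: str     # "converged", "budget", or "max_rounds"


def run_budgeted_debate(
    run_round: RoundFn,
    token_budget: int,
    max_rounds: int = 5,
    tol: float = 0.05,
) -> DebateResult:
    """Run rounds until the gap stabilizes, the budget is spent, or rounds run out."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    gaps: List[float] = []
    tokens_used = 0
    stop_reason = "max_rounds"

    for round_index in range(max_rounds):
        score_a, score_b, cost = run_round(round_index)
        tokens_used += cost
        gaps.append(score_a - score_b)

        # Assumed convergence check: the gap barely moved since the last round.
        if len(gaps) >= 2 and abs(gaps[-1] - gaps[-2]) < tol:
            stop_reason = "converged"
            break

        # Budget is checked after the round, so the final round may overshoot it.
        if tokens_used >= token_budget:
            stop_reason = "budget"
            break

    final_gap = gaps[-1]
    winner = "A" if final_gap > 0 else "B" if final_gap < 0 else "tie"
    return DebateResult(winner, gaps, tokens_used, stop_reason)
```

In use, `run_round` would wrap the actual model calls; for experimentation it can be a stub that replays scripted scores. Checking the budget after each round keeps the loop simple at the cost of a possible overshoot on the last round; a stricter variant would estimate the next round's cost before running it.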

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques