Evaluating the Performance of Large Language Models via Debates

Behrad Moniri; Hamed Hassani; Edgar Dobriban

arXiv:2406.11044 · cs.CL · February 11, 2025

TL;DR

This paper introduces an automated, debate-based benchmarking framework for large language models. It assesses domain knowledge, argumentative reasoning, and inconsistency recognition without human input, and the resulting rankings align closely with popular rankings based on human input.

Contribution

The paper presents a novel debate-based evaluation method for LLMs that is scalable, domain-agnostic, and reduces reliance on human judgment.

Findings

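01 Rankings from the debate framework align closely with popular rankings based on human input.

02 The framework assesses argumentative reasoning and inconsistency recognition, not only domain knowledge.

03 Automated judging by an LLM eliminates the need for costly human crowdsourcing.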

Abstract

Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications, or rely on human input, making them unscalable. To address these issues, we propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as argumentative reasoning and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.
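The abstract's setup (two LLMs debate, a third LLM judges, and the outcomes produce a ranking) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the `Debater` and `Judge` callable interfaces, the fixed number of rounds, playing each pairing with both side assignments, and ranking by raw win count are all hypothetical choices. In practice each callable would wrap a call to a model API.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (not from the paper):
# a debater maps (topic, stance, transcript so far) to its next argument;
# a judge maps (topic, full transcript) to the winning side, "pro" or "con".
Debater = Callable[[str, str, List[str]], str]
Judge = Callable[[str, List[str]], str]


def run_debate(topic: str, pro: Debater, con: Debater,
               judge: Judge, rounds: int = 2) -> str:
    """Alternate pro and con turns, then ask the judge for a verdict."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("PRO: " + pro(topic, "pro", transcript))
        transcript.append("CON: " + con(topic, "con", transcript))
    verdict = judge(topic, transcript)
    if verdict not in ("pro", "con"):
        raise ValueError(f"judge returned unexpected verdict: {verdict!r}")
    return verdict


def rank_models(models: Dict[str, Debater], judge: Judge,
                topics: List[str], rounds: int = 2) -> List[Tuple[str, int]]:
    """Round-robin over all model pairs; rank by total wins (an assumed metric)."""
    wins = {name: 0 for name in models}
    for a, b in combinations(models, 2):
        for topic in topics:
            # Play both side assignments so neither model always argues "pro".
            for pro_name, con_name in ((a, b), (b, a)):
                verdict = run_debate(topic, models[pro_name],
                                     models[con_name], judge, rounds)
                winner = pro_name if verdict == "pro" else con_name
                wins[winner] += 1
    return sorted(wins.items(), key=lambda item: item[1], reverse=True)
```

Passing the debaters and the judge as plain callables keeps the sketch independent of any particular model provider, and swapping the win-count ranking for a rating system would only change `rank_models`.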

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN