Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Justin Zhao; Flor Miriam Plaza-del-Arco; Benjamin Genchel; Amanda Cercas Curry

arXiv:2406.08598 · cs.CL · March 20, 2025

TL;DR

The paper introduces the Language Model Council (LMC), in which a group of LLMs collaboratively create tests, respond to them, and evaluate each other's responses on subjective tasks, yielding rankings that are more robust and better aligned with human judgments than those from a single LLM judge.

Contribution

The paper proposes a multi-LLM council approach to subjective evaluation and demonstrates improved robustness and closer alignment with human judgments compared with single-judge methods.

Findings

01 LMC produces more separable rankings than a single LLM judge.

02 LMC rankings are more consistent with human evaluations.

03 Using multiple LLMs as judges makes the evaluation more robust.

Abstract

As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks - such as those related to emotional intelligence, creative writing, and persuasiveness - may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy…
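The democratic ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each council member casts one pairwise preference per pair of respondents and that a simple majority decides each pair. The function name `council_ranking`, the vote tuple layout, and the judge and model labels are hypothetical.

```python
from collections import Counter, defaultdict


def council_ranking(votes):
    """Rank models by the number of pairwise majority wins.

    `votes` is a list of (judge, model_a, model_b, winner) tuples.
    This layout is an assumption made for illustration only.
    """
    # Tally, for every unordered pair of models, how many judges
    # preferred each side.
    per_pair = defaultdict(Counter)
    for judge, model_a, model_b, winner in votes:
        if model_a == model_b:
            raise ValueError(f"{judge} compared {model_a} with itself")
        if winner not in (model_a, model_b):
            raise ValueError(f"{judge} voted for a model outside the pair")
        per_pair[frozenset((model_a, model_b))][winner] += 1

    wins = Counter()
    models = set()
    for pair, tally in per_pair.items():
        models.update(pair)
        ordered = tally.most_common()
        # A pair yields a win only on a strict majority; an even split
        # awards nothing to either model.
        if len(ordered) == 1 or ordered[0][1] > ordered[1][1]:
            wins[ordered[0][0]] += 1

    # Most wins first; the alphabetical tie-break is only for determinism.
    return sorted(models, key=lambda m: (-wins[m], m))


# Hypothetical council of three judges comparing three respondents.
example_votes = [
    ("judge1", "A", "B", "A"), ("judge2", "A", "B", "A"), ("judge3", "A", "B", "B"),
    ("judge1", "B", "C", "B"), ("judge2", "B", "C", "C"), ("judge3", "B", "C", "B"),
    ("judge1", "A", "C", "A"), ("judge2", "A", "C", "A"), ("judge3", "A", "C", "A"),
]
ranking = council_ranking(example_votes)
```

Because every judge's vote carries equal weight, no single model's preferences can decide a pair on its own, which is the intuition behind using a council rather than one judge. Other aggregation schemes (for example, rating-based ones) would slot into the same place.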

Code & Models

Repositories

llm-council/llm-council

Datasets

llm-council/emotional_application
dataset · 13 dl

Videos

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques

Methods: Focus · Sparse Evolutionary Training · ALIGN