League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Qianhong Guo; Wei Xie; Xiaofang Cai; Enze Wang; Shuoyoucheng Ma; Xiaobing Sun; Tian Xia; Kai Chen; Xiaofeng Wang; Baosheng Wang

arXiv:2507.22359 · cs.AI · April 15, 2026

TL;DR

The paper introduces LOL, a benchmark-free, multi-round mutual evaluation framework for LLMs that addresses evaluation challenges like data contamination and subjectivity, providing more reliable and insightful assessments.

Contribution

It proposes a league-based evaluation paradigm built on four core criteria (dynamic, transparent, objective, and professional), which ranks LLMs without a fixed benchmark and surfaces empirical findings that traditional paradigms struggle to capture.

Findings

01 LOL effectively distinguishes LLM capabilities with high ranking stability.

02 Memorization-based answering behaviors are observed in some models.

03 Higher in-family scores are found in the OpenAI model family.

Abstract

Although large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, reliable evaluation remains a critical challenge due to data contamination, opaque operation, and subjective preferences. To address these issues, we propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. LOL integrates four core criteria (dynamic, transparent, objective, and professional) to mitigate key limitations of existing paradigms. Experiments on eight mainstream LLMs in mathematics and programming demonstrate that LOL can effectively distinguish LLM capabilities while maintaining high internal ranking stability (Top-$k$ consistency $= 70.7\%$). Beyond ranking, LOL reveals empirical findings that are difficult for traditional paradigms to capture. For instance,…
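The abstract reports a Top-$k$ consistency figure but this excerpt does not define it. As a rough illustration only, the sketch below assumes one plausible reading: the average fraction of shared models in the top $k$ positions across every pair of rankings, for example one ranking per evaluation round. The function names and the sample rankings are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k entries that two rankings share (order ignored)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def mean_top_k_consistency(rankings, k):
    """Average top-k overlap over all pairs of rankings.

    Assumed definition for illustration; the paper's metric may differ.
    """
    pairs = list(combinations(rankings, 2))
    if not pairs:
        raise ValueError("need at least two rankings")
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Hypothetical rankings of four placeholder models from three rounds.
example_rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "B", "D"],
]
example_consistency = mean_top_k_consistency(example_rankings, k=2)
```

Under this assumed definition, a value of 1.0 would mean every pair of rounds agrees on which models occupy the top $k$ slots, regardless of their order within those slots.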
