TL;DR
The paper introduces League of LLMs (LOL), a benchmark-free framework in which multiple LLMs evaluate one another over multiple rounds. It targets three known weaknesses of LLM evaluation: data contamination, opaque operation, and subjective preferences.
Contribution
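It proposes a league-based evaluation paradigm built on four core criteria (dynamic, transparent, objective, and professional), which ranks LLMs without a fixed benchmark and surfaces empirical findings that traditional paradigms have difficulty capturing.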
Findings
LOL effectively distinguishes LLM capabilities with high ranking stability.
Memorization-based answering behaviors are observed in some models.
Higher in-family scores are found in the OpenAI model family.
Abstract
Although large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, reliable evaluation remains a critical challenge due to data contamination, opaque operation, and subjective preferences. To address these issues, we propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. LOL integrates four core criteria (dynamic, transparent, objective, and professional) to mitigate key limitations of existing paradigms. Experiments on eight mainstream LLMs in mathematics and programming demonstrate that LOL can effectively distinguish LLM capabilities while maintaining high internal ranking stability (measured by Top-k consistency). Beyond ranking, LOL reveals empirical findings that are difficult for traditional paradigms to capture. For instance,…
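To make the idea of multi-round mutual evaluation concrete, here is a minimal Python sketch. It is not the paper's protocol: the function names `league_ranking` and `top_k_overlap`, the score layout, the choice to ignore self-assigned scores, and the use of plain averaging are all assumptions made for illustration. The abstract also does not define Top-k consistency; the overlap measure below is only one plausible reading.

```python
from statistics import mean


def league_ranking(scores):
    """Rank models by the average score their peers gave them.

    `scores[judge][candidate]` is a list of per-round scores (hypothetical
    layout). Self-assigned scores are ignored, an assumption of this sketch.
    """
    models = sorted(scores)
    avg = {}
    for cand in models:
        peer_scores = [
            s
            for judge in models
            if judge != cand
            for s in scores[judge].get(cand, [])
        ]
        avg[cand] = mean(peer_scores)
    # Highest average first; ties broken alphabetically for determinism.
    return sorted(models, key=lambda m: (-avg[m], m))


def top_k_overlap(rank_a, rank_b, k):
    """Fraction of models shared by the top-k of two rankings (assumed metric)."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


# Toy data: three hypothetical models, two rounds each.
toy_scores = {
    "A": {"B": [6, 7], "C": [4, 5]},
    "B": {"A": [9, 8], "C": [5, 5]},
    "C": {"A": [8, 9], "B": [7, 6], "C": [10, 10]},  # self-scores are dropped
}
ranking = league_ranking(toy_scores)
```

With this toy data, model C's self-assigned scores of 10 have no effect on its position, which is the point of excluding self-evaluation in the sketch. Comparing rankings from different rounds or league subsets with `top_k_overlap` gives a rough stability signal.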
