AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
Dario Loi, Elena Maria Muià, Federico Siciliano, Giovanni Trappolini, Vincenzo Crisà, Peter Kruger, Fabrizio Silvestri

TL;DR
AutoBench introduces an automated, peer-assessment framework for evaluating large language models that dynamically generates tasks and aggregates judgments, providing a scalable and contamination-resistant alternative to static benchmarks.
Contribution
It presents a novel reciprocal peer assessment methodology for LLM evaluation, enabling dynamic task generation and consensus-based ranking, validated through strong correlation with established benchmarks.
Findings
AutoBench results show correlations of 78% with MMLU-Pro and 63% with GPQA.
Multi-judge evaluations outperform single-judge baselines.
The framework is scalable and resistant to test-set contamination.
Abstract
We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L. Unlike static benchmarks that suffer from test-set contamination and limited adaptability, AutoBench dynamically generates novel evaluation tasks while models alternately serve as question generators, contestants, and judges across diverse domains. An iterative weighting mechanism amplifies the influence of consistently reliable evaluators, aggregating peer judgments into consensus-based rankings that reflect collective model agreement. Our experiments demonstrate strong correlations with established benchmarks including MMLU-Pro and GPQA (78% and 63%, respectively), validating this…
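The iterative weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the AutoBench implementation: the abstract does not specify the update rule, so the sketch assumes that a judge's reliability is measured by how closely its scores track the current consensus, and that weights are the normalized inverse of the mean absolute deviation. The function name `consensus_ranking`, the `scores[judge][model]` layout, and the example data are all hypothetical.

```python
def consensus_ranking(scores, iterations=10):
    """Aggregate peer judgments into a consensus ranking.

    scores[j][m] is the grade judge j gave to model m (hypothetical layout).
    The reliability rule below is an assumption made for illustration, not
    the rule used by AutoBench.
    """
    n_judges = len(scores)
    n_models = len(scores[0])
    weights = [1.0 / n_judges] * n_judges  # start with every judge equal
    consensus = [0.0] * n_models

    for _ in range(iterations):
        # Consensus score per model: weighted average over judges.
        consensus = [
            sum(weights[j] * scores[j][m] for j in range(n_judges))
            for m in range(n_models)
        ]
        # Assumed reliability: judges closer to the consensus score higher.
        raw = []
        for j in range(n_judges):
            deviation = sum(
                abs(scores[j][m] - consensus[m]) for m in range(n_models)
            ) / n_models
            raw.append(1.0 / (1.0 + deviation))
        total = sum(raw)
        weights = [r / total for r in raw]  # renormalize so weights sum to 1

    # Model indices ordered from highest to lowest consensus score.
    ranking = sorted(range(n_models), key=lambda m: consensus[m], reverse=True)
    return ranking, consensus, weights


# Made-up grades: judges 0 and 1 largely agree, judge 2 disagrees with both.
example_scores = [
    [5, 3, 1],
    [5, 3, 2],
    [1, 3, 5],
]
ranking, consensus, weights = consensus_ranking(example_scores)
```

In this toy setup the dissenting judge ends up with the smallest weight, so the final ordering follows the two judges who agree. A real system would also need to handle missing grades and judges grading their own answers, which this sketch ignores.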
