MACEval: A Multi-Agent Continual Evaluation Network for Large Models
Zijian Chen, Yuze Sun, Yuan Tian, Wenjun Zhang, Guangtao Zhai

TL;DR
MACEval introduces a dynamic, multi-agent evaluation framework for large models that addresses overfitting, maintenance challenges, and scalability issues of traditional benchmarks, enabling more efficient and longitudinal performance assessment.
Contribution
We propose MACEval, a novel multi-agent system for continual, autonomous evaluation of large models, reducing manual effort and improving adaptability over existing static benchmarks.
Findings
Effective evaluation across 23 large models
Reduces evaluation overhead significantly
Enables longitudinal performance tracking
Abstract
Hundreds of benchmarks dedicated to evaluating large models have been presented over the past few years. However, most of them remain closed-ended and are prone to overfitting due to potential data contamination. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define new metrics to quantify performance longitudinally. MACEval employs an interactive and autonomous evaluation mode, utilizing role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 23 large models demonstrate the effectiveness of MACEval, which also lightens the…
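The evaluation mode described in the abstract (role assignment, in-process data generation, and routing through a cascade) can be illustrated with a minimal sketch. This is not the authors' implementation: the role names `propose`, `make_solver`, and `judge`, the toy addition task, and the stop-on-first-failure routing rule are all assumptions made for illustration, and the "agents" are plain Python functions rather than large models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    prompt: str   # question text generated on the fly
    answer: int   # reference answer known to the proposer
    level: int    # difficulty level of this task


def propose(level: int) -> Task:
    """Hypothetical proposer role: generate a fresh task at a given level.

    The toy task is summing the integers 1..level+1, so higher levels
    produce longer sums.
    """
    numbers = list(range(1, level + 2))
    return Task(prompt=" + ".join(map(str, numbers)),
                answer=sum(numbers),
                level=level)


def make_solver(max_terms: int) -> Callable[[Task], int]:
    """Hypothetical stand-in for a model under test.

    It only reads the first `max_terms` terms of the prompt, so it fails
    once tasks grow longer than that.
    """
    def solve(task: Task) -> int:
        terms = [int(t) for t in task.prompt.split(" + ")]
        return sum(terms[:max_terms])
    return solve


def judge(task: Task, response: int) -> bool:
    """Hypothetical judge role: compare the response to the reference."""
    return response == task.answer


def evaluate(solver: Callable[[Task], int], max_level: int = 10) -> List[bool]:
    """Escalate difficulty level by level and stop at the first failure.

    Stopping on failure is an assumed routing rule, chosen only to keep
    the sketch short.
    """
    results: List[bool] = []
    for level in range(1, max_level + 1):
        task = propose(level)
        passed = judge(task, solver(task))
        results.append(passed)
        if not passed:
            break
    return results


if __name__ == "__main__":
    print(evaluate(make_solver(max_terms=3)))
```

Because tasks are generated inside the loop rather than read from a fixed dataset, the per-level pass/fail record is the kind of trace a longitudinal metric could be computed from; how MACEval actually defines its metrics is not stated in this excerpt.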
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Good motivation
See summary
Automated eval methods are promising and an important avenue of work. This paper proposes a workflow to achieve automated eval results for LLMs.
Missing (very) related work: https://arxiv.org/pdf/2312.14856, https://arxiv.org/abs/2310.17567, https://arxiv.org/pdf/2502.06453, https://alignment.anthropic.com/2025/petri/
- It would be good to reflect on the quantity vs quality argument.
- This is presented as an automated eval method but humans have to manually come up with the tasks (only 9 introduced in the paper). E.g. construct an eval where you can iteratively add more noise to an image. While this is fine (the above linked papers do sim
1. **Motivation and vision.** The authors correctly identify a central problem in modern benchmarking: *saturation*. As top models quickly approach near-perfect scores on traditional datasets, automatic generation of new and harder tasks is a promising direction. The proposed “multi-agent” perspective—where one LLM proposes tasks, another solves them, and a third judges—captures the spirit of adaptive evaluation. Conceptually, this is aligned with current interest in *adversarial*, *cont
This paper is likely problematic, but due to lack of transparency (point 1), it is not easy to pin down the exact problem. I can explain my reasoning based on some circumstantial evidence (points 2-4). My best guess is that this paper makes the same mistake as in *Illusion of Thinking*, i.e., **mistaking contrived complication for difficulty with realistic value**. 1. **Lack of transparency in the generated tasks.** The most severe weakness is the *absence of released or described
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)
