MACEval: A Multi-Agent Continual Evaluation Network for Large Models

Zijian Chen; Yuze Sun; Yuan Tian; Wenjun Zhang; Guangtao Zhai

arXiv:2511.09139·cs.CV·February 2, 2026

Open Access · 3 Reviews

TL;DR

MACEval introduces a dynamic, multi-agent evaluation framework for large models that addresses overfitting, maintenance challenges, and scalability issues of traditional benchmarks, enabling more efficient and longitudinal performance assessment.

Contribution

We propose MACEval, a novel multi-agent system for continual, autonomous evaluation of large models, reducing manual effort and improving adaptability over existing static benchmarks.

Findings

1. Effective evaluation across 23 large models
2. Reduces evaluation overhead significantly
3. Enables longitudinal performance tracking

Abstract

Hundreds of benchmarks dedicated to evaluating large models have been presented over the past few years. However, most of them remain closed-ended and are prone to overfitting due to potential data contamination. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define new metrics to quantify performance longitudinally. MACEval employs an interactive and autonomous evaluation mode, utilizing role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 23 large models demonstrate the effectiveness of MACEval, which also lightens the…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Good motivation

Weaknesses

See summary

Reviewer 02 · Rating 2 · Confidence 4

Strengths

Automated eval methods are promising and an important avenue of work. This paper proposes a workflow to achieve automated eval results for LLMs.

Weaknesses

- Missing (very) related work:
  - https://arxiv.org/pdf/2312.14856
  - https://arxiv.org/abs/2310.17567
  - https://arxiv.org/pdf/2502.06453
  - https://alignment.anthropic.com/2025/petri/
- It would be good to reflect on the quantity vs quality argument.
- This is presented as an automated eval method but humans have to manually come up with the tasks (only 9 introduced in the paper). E.g. construct an eval where you can iteratively add more noise to an image. While this is fine (the above linked papers do sim

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. **Motivation and vision.** The authors correctly identify a central problem in modern benchmarking: *saturation*. As top models quickly approach near-perfect scores on traditional datasets, automatic generation of new and harder tasks is a promising direction. The proposed “multi-agent” perspective—where one LLM proposes tasks, another solves them, and a third judges—captures the spirit of adaptive evaluation. Conceptually, this is aligned with current interest in *adversarial*, *cont

Weaknesses

This paper is likely problematic, but due to lack of transparency (point 1), it is not easy to pin down the exact problem. I can explain my reasoning based on some circumstantial evidence (points 2-4). My best guess is that this paper makes the same mistake as in *Illusion of Thinking*, i.e., **mistaking contrived complication for difficulty with realistic value**.

1. **Lack of transparency in the generated tasks.** The most severe weakness is the *absence of released or described

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)