AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

TL;DR
AutoJudger is an adaptive, agent-driven framework that significantly reduces the cost of benchmarking multimodal large language models by selecting questions according to their difficulty and the model's real-time performance, achieving high accuracy with a small fraction of the data.
Contribution
This work introduces AutoJudger, a novel framework combining IRT and autonomous agents for efficient, adaptive benchmarking of MLLMs, reducing evaluation costs substantially.
Findings
AutoJudger reaches 90% ranking accuracy using only 4% of the data.
It effectively covers diverse and challenging scenarios.
It dramatically reduces evaluation expenses.
Abstract
Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To address this escalating cost, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and…
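To make the IRT-based adaptive loop described in the abstract more concrete, here is a minimal sketch under stated assumptions. It uses the one-parameter (Rasch) IRT model and the classic maximum-Fisher-information selection rule as a stand-in for AutoJudger's evaluation agent; the paper's agent, retrieval mechanism, and memory are not reproduced here. The function names (`p_correct`, `estimate_ability`, `next_item`) and the ability clamp of [-4, 4] are hypothetical choices for illustration.

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability that a model with ability theta
    answers a question of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, theta=0.0, iters=25):
    """Newton-Raphson maximum-likelihood estimate of theta from
    (difficulty, correct) pairs. The estimate is clamped to [-4, 4]
    (an assumed bound) because the MLE diverges when every answer
    is right or every answer is wrong."""
    for _ in range(iters):
        probs = [p_correct(theta, b) for b, _ in responses]
        grad = sum(y - p for (_, y), p in zip(responses, probs))
        info = sum(p * (1.0 - p) for p in probs)
        if info < 1e-9:
            break
        theta = max(-4.0, min(4.0, theta + grad / info))
    return theta

def next_item(theta, difficulties, asked):
    """Index of the not-yet-asked question with the highest Fisher
    information p(1 - p) at the current ability estimate."""
    def information(i):
        p = p_correct(theta, difficulties[i])
        return p * (1.0 - p)
    candidates = [i for i in range(len(difficulties)) if i not in asked]
    return max(candidates, key=information)

# Usage: difficulties are assumed to have been estimated beforehand.
difficulties = [-2.0, -0.5, 0.3, 1.5, 3.0]
theta = estimate_ability([(-1.0, 1), (1.0, 0)])
chosen = next_item(theta, difficulties, asked=set())
```

Under the Rasch model, a question is most informative when its difficulty is closest to the current ability estimate, which is why adaptive selection can rank models with far fewer questions than the full benchmark.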
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Nicely presented paper, easy to follow, and well organized. The problem is interesting; the evaluation cost of MLLMs indeed often gets ignored. 2. The AutoJudger framework seems reasonable to me, and the idea of utilizing an autonomous framework to mine difficult questions for MLLM evaluation is worth promoting, especially since some recent works have pointed out that some benchmarks may suffer from data leakage or lower-quality question issues.
1. The need to trim down benchmark size **needs more solid evidence.** For example, the CO2 emission on a full benchmark, or a more straightforward measure such as GPU rental cost? * I know section 4.4 has already mentioned the cost, but it will be nice to see a number in $. * As mentioned in the summary, the evaluation cost can be trivial compared with the massive cost of training (including post-training/fine-tuning) of a model. 2. It is good to see that the author provides how differ
- The paper proposes a novel, agent-driven framework to address the practical problem of expensive MLLM evaluation costs. The AutoJudger framework is built on a principled and well-suited foundation, adapting Item Response Theory to define the problem difficulty and dynamically estimate the model's ability. - The design and implementation of the framework are solid and reasonable. The core components—real-time ability estimation, semantic-aware retrieval, an agent-based selection module, and a
**The most important issue:** The primary metric, "Ranking Accuracy" (Sec 4.1), seems to be a significant limitation in the evaluation. This ordinal metric only tells if the relative order of models is preserved and does not measure if the cardinal score gaps are retained. A framework that shrinks a 20-point performance gap (on the full benchmark) to a 1-point gap would still achieve 100% ranking accuracy, but users won't trust such a framework even if it's much more efficient and cheap. Also,
- The paper's primary originality lies in the principled reframing of MLLM benchmarking from static subset sampling to a dynamic, agent-driven "interview" process. This is a significant conceptual shift. The creative synthesis of established concepts from different fields—Item Response Theory (IRT) from psychometrics, semantic-aware retrieval, and a dynamic memory module—into a cohesive framework to solve this problem is highly novel and insightful. - The paper is exceptionally well-written and
- The framework's effectiveness is heavily reliant on the capability of the "interviewer" agent (Qwen2.5-VL-7B). This conflates the performance of the framework's mechanics (IRT, memory) with the reasoning power of a specific, strong MLLM. A weaker agent might make suboptimal choices, and the agent's inherent biases could lead to systematically skewed question selections. - The premise of using IRT requires pre-estimating question difficulties, which the authors accomplish by collecting respons
1. The work is comprehensive, supported by extensive experiments validating the effectiveness of the proposed approach. 2. The paper focuses on efficient MLLM benchmarking, which is an important and practical problem for the community.
1. While the proposed framework is interesting, the paper lacks a deeper analysis of the fundamental factors contributing to its effectiveness. Beyond the ablation study, it remains unclear which core design choices are primarily responsible for the improvement. This makes it difficult to disentangle whether the observed benefits stem from `essential ideas` or the `complex agentic workflow`. It would be helpful to include a more prototype-level or simplified implementation as an additional basel
Taxonomy
Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation · Text Readability and Simplification
