AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Wentao Shi; Yu Wang; Yuyang Zhao; Yuxin Chen; Fuli Feng; Xueyuan Hao; Xi Su; Qi Gu; Hui Su; Xunliang Cai; Xiangnan He

arXiv:2604.18240·cs.AI·April 21, 2026



TL;DR

AJ-Bench is a benchmark for evaluating agent-based judges on environment-aware verification tasks across three domains (search, data systems, and graphical user interfaces). On it, Agent-as-a-Judge shows consistent performance gains over LLM-as-a-Judge baselines.

Contribution

The paper introduces AJ-Bench, a new benchmark for systematically assessing Agent-as-a-Judge in diverse, complex environments, and uses it to highlight both the capabilities of agent-based judges and the challenges they still face.

Findings

01. Agent-as-a-Judge outperforms LLM-as-a-Judge baselines.

02. AJ-Bench covers 155 tasks across three domains.

03. Substantial open challenges remain in agent-based verification.

Abstract

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark that systematically evaluates Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over…
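To make the contrast in the abstract concrete, here is a minimal sketch of state verification, assuming a toy key-value "data system". It is not the paper's implementation or the AJ-Bench API: `ToyEnvironment`, `Trajectory`, `text_only_judge`, and `agent_judge` are hypothetical names invented for illustration. The text-only judge sees just the agent's transcript, while the agent-style judge queries the environment for evidence before deciding.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a data-system environment (not from the paper)."""
    rows: dict = field(default_factory=dict)

    def query(self, key):
        # Tool call the judge can use to inspect the actual state.
        return self.rows.get(key)


@dataclass
class Trajectory:
    """What the evaluated agent did and what it claims to have achieved."""
    actions: list
    claimed_outcome: str


def text_only_judge(trajectory):
    # Baseline in the spirit of LLM-as-a-Judge: it sees only the transcript,
    # so here it can do no better than trust the agent's own claim.
    return "success" in trajectory.claimed_outcome.lower()


def agent_judge(trajectory, env, expected):
    # Agent-style judge: gather evidence by querying the environment, then
    # compare the observed state with the expected end state.
    evidence = {key: env.query(key) for key in expected}
    verdict = all(evidence[key] == value for key, value in expected.items())
    return verdict, evidence


# The agent claims success, but the environment was never actually updated.
env = ToyEnvironment(rows={"user_42.email": "old@example.com"})
traj = Trajectory(
    actions=["UPDATE users SET email='new@example.com' WHERE id=42"],
    claimed_outcome="Success: email updated.",
)
expected_state = {"user_42.email": "new@example.com"}

print(text_only_judge(traj))
print(agent_judge(traj, env, expected_state))
```

In this toy case the text-only judge accepts the trajectory, while the agent-style judge rejects it and returns the stale value it observed as evidence. A real judge agent would replace the dictionary lookup with tool calls against a live environment.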

