Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation

Roshita Bhonsle; Rishav Dutta; Sneha Vavilapalli; Harsh Seth; Abubakarr Jaye; Yapei Chang; Mukund Rungta; Emmanuel Aboah Boateng; Sadid Hasan; Ehi Nosakhare; Soundar Srinivasan

arXiv:2508.05508 · cs.AI · August 8, 2025

TL;DR

This paper introduces a modular, human-like evaluation framework for agent task completion that checks step-by-step reasoning as well as final outputs. On GAIA and BigCodeBench, its verdicts align more closely with human judgments than those of a GPT-4-based LLM-as-a-Judge baseline.

Contribution

The paper presents a novel, domain-independent evaluation framework that decomposes tasks into sub-tasks and validates each step, improving alignment with human evaluations.

Findings

01. Achieves 4.76% higher alignment accuracy on GAIA

02. Achieves 10.52% higher alignment accuracy on BigCodeBench

03. Outperforms GPT-4-based LLM-as-a-Judge baseline
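
The findings above are stated in terms of alignment accuracy. As a rough illustration only, the sketch below assumes that alignment accuracy means the fraction of tasks on which a judge's pass/fail verdict matches the human label; this page does not define the metric, and it does not say whether the reported gains are percentage points or relative improvements. The function name and the toy verdicts are hypothetical and are not data from the paper.

```python
from typing import Sequence


def alignment_accuracy(judge_verdicts: Sequence[bool],
                       human_labels: Sequence[bool]) -> float:
    """Fraction of tasks where the judge agrees with the human label.

    Assumed definition for illustration; the paper may define it differently.
    """
    if len(judge_verdicts) != len(human_labels):
        raise ValueError("verdicts and labels must have the same length")
    if not human_labels:
        raise ValueError("at least one task is required")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


# Toy data (made up): human labels plus two judges' verdicts on four tasks.
human = [True, False, False, True]
agent_judge = [True, False, True, True]        # agrees on 3 of 4 tasks
baseline_judge = [True, True, False, False]    # agrees on 2 of 4 tasks

# Gain of one judge over the other, as a difference of two accuracies.
gain = alignment_accuracy(agent_judge, human) - alignment_accuracy(baseline_judge, human)
```

Under this reading, a figure such as "4.76% higher" would be the difference between the two judges' accuracies on the same human-labeled task set.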

Abstract

The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another's task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent's output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We…
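
The decompose, validate, and aggregate flow described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `judge_task`, `split_on_and`, and `keyword_validator` are hypothetical, the decomposer and validator are toy stand-ins for what would presumably be model-driven modules, and the rule that every sub-task must pass is an assumed aggregation rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SubTaskResult:
    description: str
    passed: bool


def judge_task(task: str,
               evidence: str,
               decompose: Callable[[str], List[str]],
               validate: Callable[[str, str], bool]) -> Tuple[bool, List[SubTaskResult]]:
    """Decompose a task, validate each sub-task, and aggregate a verdict."""
    sub_tasks = decompose(task)
    results = [SubTaskResult(s, validate(s, evidence)) for s in sub_tasks]
    # Assumed aggregation rule: the task counts as complete only if
    # at least one sub-task exists and every sub-task is validated.
    verdict = bool(results) and all(r.passed for r in results)
    return verdict, results


def split_on_and(task: str) -> List[str]:
    """Toy decomposer (hypothetical): split the task text on ' and '."""
    return [part.strip() for part in task.split(" and ") if part.strip()]


def keyword_validator(sub_task: str, evidence: str) -> bool:
    """Toy validator (hypothetical): the sub-task's last word must
    appear in the agent's output or reasoning log."""
    keyword = sub_task.split()[-1].lower()
    return keyword in evidence.lower()


task = "find the capital and report the population"
full_log = "Capital: Paris. Population: about 2.1 million."
partial_log = "Capital: Paris."

ok_verdict, ok_results = judge_task(task, full_log, split_on_and, keyword_validator)
bad_verdict, bad_results = judge_task(task, partial_log, split_on_and, keyword_validator)
```

Keeping the decomposer and validator as injected callables mirrors the modular, domain-independent design the abstract describes: either piece can be swapped without touching the aggregation step.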
