AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems
YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan

TL;DR
AEMA is a comprehensive framework designed to evaluate multi-agent LLM systems by providing transparent, stable, and human-aligned assessments that support responsible automation in enterprise scenarios.
Contribution
The paper introduces a process-aware, auditable evaluation framework that offers greater stability, transparency, and automation when evaluating multi-agent LLM systems than traditional single-response methods.
Findings
AEMA improves evaluation stability and human alignment.
It offers traceable records for accountable automation.
It demonstrates effectiveness in realistic enterprise scenarios.
Abstract
Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios…
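The plan, execute, and aggregate loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`StepResult`, `EvaluationRecord`, `plan`, `execute`, `aggregate`, the toy scorers, and the trace layout) is a hypothetical stand-in, and the deterministic scorers take the place of whatever LLM-based evaluators the framework actually uses. The sketch only shows how per-step scores and rationales could be kept as a traceable record that a human reviewer signs off on.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class StepResult:
    step: str        # name of the workflow step that was scored
    score: float     # score in [0, 1]
    rationale: str   # human-readable justification kept for auditing


@dataclass
class EvaluationRecord:
    workflow_id: str
    steps: List[StepResult] = field(default_factory=list)
    overall: float = 0.0
    approved_by_human: bool = False  # stays False until a reviewer signs off


# Hypothetical scorer signature: takes one step's payload, returns a StepResult.
Scorer = Callable[[dict], StepResult]


def score_retrieval(payload: dict) -> StepResult:
    """Toy scorer: fraction of expected documents that were retrieved."""
    expected = set(payload["expected"])
    hits = expected & set(payload["found"])
    score = len(hits) / len(expected) if expected else 0.0
    return StepResult("retrieval", score,
                      f"{len(hits)} of {len(expected)} expected documents retrieved")


def score_tool_call(payload: dict) -> StepResult:
    """Toy scorer: full credit only when the tool call reported success."""
    ok = payload.get("status") == "ok"
    return StepResult("tool_call", 1.0 if ok else 0.0, f"status={payload.get('status')}")


def plan(trace: dict, scorers: Dict[str, Scorer]) -> List[str]:
    """Plan: pick the scorers that apply to steps present in the trace."""
    return [name for name in scorers if name in trace["steps"]]


def execute(trace: dict, planned: List[str], scorers: Dict[str, Scorer]) -> List[StepResult]:
    """Execute: run each planned scorer on its step payload."""
    return [scorers[name](trace["steps"][name]) for name in planned]


def aggregate(workflow_id: str, results: List[StepResult]) -> EvaluationRecord:
    """Aggregate: keep every step result and compute an unweighted mean."""
    overall = mean(r.score for r in results) if results else 0.0
    return EvaluationRecord(workflow_id=workflow_id, steps=results, overall=overall)


def evaluate(trace: dict, scorers: Dict[str, Scorer]) -> EvaluationRecord:
    planned = plan(trace, scorers)
    return aggregate(trace["id"], execute(trace, planned, scorers))


if __name__ == "__main__":
    scorers = {"retrieval": score_retrieval, "tool_call": score_tool_call}
    trace = {
        "id": "wf-1",
        "steps": {
            "retrieval": {"expected": ["a", "b"], "found": ["a"]},
            "tool_call": {"status": "ok"},
        },
    }
    record = evaluate(trace, scorers)
    for step in record.steps:
        print(step.step, step.score, step.rationale)
    print("overall:", record.overall)
```

The unweighted mean is an arbitrary choice made for illustration; a real system would likely weight steps, and would persist the record rather than print it.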
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
