AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

YenTing Lee; Keerthi Koneru; Zahra Moslemi; Sheethal Kumar; Ramesh Radhakrishnan

arXiv:2601.11903·cs.AI·January 21, 2026

Open Access

TL;DR

AEMA (Adaptive Evaluation Multi-Agent) is a framework for evaluating multi-agent LLM systems. It produces transparent, stable, and human-aligned assessments that support responsible automation in enterprise scenarios.

Contribution

The paper introduces a process-aware, auditable evaluation framework for multi-agent LLM systems that improves stability, transparency, and automation relative to single-response scoring methods.

Findings

1. AEMA improves evaluation stability and human alignment compared to a single LLM-as-a-Judge.

2. It produces traceable records that support accountable automation.

3. It is demonstrated on enterprise-style agent workflows simulated from realistic business scenarios.

Abstract

Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios…

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI