Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems
Sushant Mehta

TL;DR
This paper introduces CLEAR, a comprehensive multi-dimensional evaluation framework for enterprise agentic AI systems, addressing limitations of accuracy-only benchmarks by incorporating cost, reliability, security, and compliance metrics.
Contribution
The paper proposes the CLEAR framework, which integrates multiple enterprise-relevant metrics, and shows that it predicts real-world deployment success better than traditional accuracy-focused evaluations.
Findings
Cost-aware evaluation reduces expenses by up to 10.8x compared to accuracy-only methods.
CLEAR correlates strongly (ρ=0.83) with production success, outperforming accuracy-only metrics (ρ=0.41).
Agent performance falls from 60% on a single run to 25% under 8-run consistency, highlighting the need for multidimensional assessment.
Abstract
Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60% (single run) to 25% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose CLEAR (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for…
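The gap the abstract reports between single-run success and 8-run consistency can be illustrated with a minimal sketch. This excerpt does not give the paper's exact definition of k-run consistency, so the code assumes a task counts as consistent only when all of its first k runs succeed. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def single_run_success(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved on the first recorded run."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(runs[0] for runs in results.values()) / len(results)


def k_run_consistency(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved on every one of the first k runs.

    Assumption: "consistent" means all k runs succeed; the paper may
    define its 8-run consistency differently.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    if k < 1 or any(len(runs) < k for runs in results.values()):
        raise ValueError("every task needs at least k recorded runs")
    return sum(all(runs[:k]) for runs in results.values()) / len(results)


# Hypothetical toy data: four tasks, three runs each (True = task solved).
toy_results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [True, True, False],
    "task_d": [False, True, True],
}

first_run = single_run_success(toy_results)
consistent = k_run_consistency(toy_results, k=3)
print(f"single-run success: {first_run:.2f}")
print(f"3-run consistency:  {consistent:.2f}")
```

In the toy data, three of the four tasks succeed on the first run, but only one succeeds on all three, so the consistency score is much lower than the single-run score. The same effect would explain a drop like the one the abstract describes when k is 8.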
Taxonomy
Topics: Ethics and Social Impacts of AI · Software System Performance and Reliability · Multi-Agent Systems and Negotiation
