Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations
JV Roig

TL;DR
This paper introduces KAMI v0.1, an enterprise-focused agentic AI benchmark that evaluates multi-step decision-making and tool use, is designed to resist training data contamination, and offers cost-performance insights for deployment.
Contribution
The paper presents KAMI v0.1, a new benchmark for enterprise-relevant agentic AI evaluation, designed to resist training data contamination and to reflect real-world agentic performance more closely than traditional benchmarks.
Findings
Traditional benchmarks poorly predict practical agentic performance.
Newer models do not always outperform older ones on enterprise tasks.
Cost-performance tradeoffs and reasoning capabilities significantly impact token efficiency.
Abstract
Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as multi-step tool use and decision-making under uncertainty. We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. Through 170,000 LLM test items processing over 5.5 billion tokens across 35 model configurations, we demonstrate that traditional benchmark rankings poorly predict practical agentic performance. Notably, newer generation models like Llama 4 or Qwen 3 do not always outperform their older generation variants on enterprise-relevant tasks, contradicting traditional benchmark trends. We also present insights on cost-performance tradeoffs,…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Reinforcement Learning in Robotics
