Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation
Zairah Mustahsan, Abel Lim, Megna Anand, Saahil Jain, Bryan McCann

TL;DR
This paper proposes the Intraclass Correlation Coefficient (ICC) as a way to measure, and ultimately improve, the reliability of evaluations of large language models in agentic systems, so that variance and measurement noise are accounted for rather than hidden.
Contribution
The paper proposes ICC as a measure of evaluation reliability, decomposes observed variance into between-query and within-query components, and provides guidelines on sample size and reporting practices to support trustworthy benchmarking.
Findings
ICC varies substantially with task structure and model type.
ICC estimates converge within 8-16 trials for structured tasks and with 32 trials for complex reasoning tasks.
Reporting ICC alongside accuracy improves evaluation transparency.
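One way to check the convergence finding on your own data is to recompute ICC on the first k trials of each query and watch the estimate settle. The sketch below is illustrative only: it assumes a one-way random-effects ICC(1,1) estimator, which this summary does not confirm as the paper's choice, and the function names `icc_oneway` and `icc_by_trial_count` and the synthetic pass/fail data are hypothetical.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for an (n_queries, k_trials) array.

    Assumption: this estimator form is chosen for illustration; the paper
    may use a different ICC variant.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    query_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Between-query mean square: spread of per-query means (task difficulty).
    msb = k * np.sum((query_means - grand_mean) ** 2) / (n - 1)
    # Within-query mean square: spread of trials around their own query mean
    # (agent inconsistency).
    msw = np.sum((scores - query_means[:, None]) ** 2) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    return float((msb - msw) / denom) if denom > 0 else float("nan")

def icc_by_trial_count(scores, trial_counts=(2, 4, 8, 16, 32)):
    """ICC estimated from the first k trials of every query, for each k."""
    scores = np.asarray(scores, dtype=float)
    return {k: icc_oneway(scores[:, :k])
            for k in trial_counts if k <= scores.shape[1]}

# Synthetic data (made up): 200 queries, 32 trials each, where every query
# has its own hidden success probability.
rng = np.random.default_rng(0)
p_success = rng.uniform(0.1, 0.9, size=200)
outcomes = (rng.random((200, 32)) < p_success[:, None]).astype(float)

curve = icc_by_trial_count(outcomes)
for k, icc in curve.items():
    print(f"trials={k:2d}  ICC={icc:.3f}")
```

If the printed estimates stop moving between successive trial counts, additional trials are adding little information about reliability for that benchmark.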
Abstract
As large language models become components of larger agentic systems, evaluation reliability becomes critical: unreliable sub-agents introduce brittleness into downstream system behavior. Yet current evaluation practice, reporting a single accuracy number from a single run, obscures the variance underlying these results, making it impossible to distinguish genuine capability improvements from lucky sampling. We propose adopting Intraclass Correlation Coefficient (ICC), a metric from measurement science, to characterize this variance. ICC decomposes observed variance into between-query variance (task difficulty) and within-query variance (agent inconsistency), highlighting whether reported results reflect true capability or measurement noise. We evaluated on GAIA (Levels 1-3, measuring agentic capabilities across varying reasoning complexity) and FRAMES (measuring retrieval and…
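The decomposition described in the abstract can be written compactly. The block below is a sketch under an assumed one-way random-effects model; the symbols $y_{ij}$, $\mu$, $q_i$, and $\varepsilon_{ij}$ are introduced here for illustration and do not come from the source, which does not state which ICC form the authors use.

```latex
% Assumed model: score of trial j on query i
y_{ij} = \mu + q_i + \varepsilon_{ij},
\qquad q_i \sim (0,\ \sigma^2_{\text{between}}),
\qquad \varepsilon_{ij} \sim (0,\ \sigma^2_{\text{within}})

% Share of total variance attributable to differences between queries
\mathrm{ICC} = \frac{\sigma^2_{\text{between}}}
                    {\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}
```

Under this reading, an ICC near 1 means most variance comes from task difficulty (between-query), while an ICC near 0 means most variance comes from agent inconsistency across repeated runs of the same query (within-query).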
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
