Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy
Aman Mehta

TL;DR
This paper examines how behavioral consistency in large language model agents relates to accuracy. It finds that consistency amplifies outcomes, whether correct or incorrect, rather than ensuring correctness, which has implications for how agents are deployed and evaluated.
Contribution
It provides an empirical analysis of consistency and accuracy for three models (Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B) on SWE-bench, highlighting the nuanced role of consistency in agent reliability.
Findings
Higher consistency correlates with higher accuracy across models.
Consistency can amplify both correct and incorrect outcomes.
Most failures are due to consistent wrong interpretations.
Abstract
As LLM-based AI agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks × 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies…
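The abstract reports variance as a coefficient of variation (CV), the standard deviation divided by the mean. The excerpt here does not say which quantity the CV is computed over, so the sketch below assumes it is the number of agent steps per run on a single task, and it uses the sample standard deviation; both choices are assumptions, and the step counts are made-up illustrative values, not data from the paper.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample CV (standard deviation / mean) as a percentage."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


# Hypothetical step counts for one task over 5 runs (illustrative only,
# not taken from the paper).
steps_per_run = [40, 44, 38, 50, 42]
cv = coefficient_of_variation(steps_per_run)
print(f"CV: {cv:.1f}%")
```

A per-model figure could then be obtained by averaging the per-task CV over the 10 tasks, though whether the paper aggregates this way is not stated in the text above.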
