Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Aman Mehta

arXiv:2603.25764·cs.SE·April 6, 2026

TL;DR

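This paper investigates how behavioral consistency in large language model agents relates to accuracy. Across models, more consistent agents are also more accurate, but within a model consistency amplifies whichever interpretation the agent settles on, correct or incorrect, so it does not ensure correctness. This has implications for how agents are deployed and evaluated.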

Contribution

It provides an empirical analysis of consistency and accuracy for three models (Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B) on SWE-bench, showing that consistency tracks accuracy across models but does not guarantee correct behavior within a model.

Findings

01

Higher consistency correlates with higher accuracy across models.

02

Consistency can amplify both correct and incorrect outcomes.

03

Most failures are due to consistent wrong interpretations.

Abstract

As LLM-based AI agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks × 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies…
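The abstract reports variance as a CV (coefficient of variation) but the excerpt here does not say which per-run quantity it is computed over or how it is aggregated. The sketch below assumes, purely for illustration, that the quantity is the number of agent steps per run, that the sample standard deviation is used, and that per-task CVs are averaged across tasks. The function names, task names, and step counts are all hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean, in percent."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


def mean_task_cv(steps_by_task):
    """Average the per-task CV over all tasks (an assumed aggregation)."""
    return mean(coefficient_of_variation(runs) for runs in steps_by_task.values())


# Hypothetical step counts: 2 tasks x 5 runs each (not data from the paper).
steps_by_task = {
    "task_a": [10, 10, 10, 10, 10],  # identical runs, so no variation
    "task_b": [8, 12, 10, 9, 11],    # runs differ in length
}

for task, runs in steps_by_task.items():
    print(f"{task}: CV = {coefficient_of_variation(runs):.1f}%")
print(f"mean CV across tasks = {mean_task_cv(steps_by_task):.1f}%")
```

Under these assumptions, a lower CV means the runs of a task look more alike. It says nothing about whether those runs are correct, which is the distinction the abstract draws between consistency and accuracy.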
