Nonstandard Errors in AI Agents
Ruijiang Gao, Steven Chong Xiao

TL;DR
This paper tests whether AI coding agents given the same data and research question reach the same results. It finds sizable nonstandard errors and systematic differences in methodological choices across agents, with implications for the reliability of AI-driven research.
Contribution
It extends the concept of nonstandard errors, previously documented among human researchers, to AI agents. It shows how methodological divergence affects empirical results and examines how peer review influences convergence.
Findings
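AI agents diverge substantially in analytical choices, notably measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume).
Different model families (Sonnet 4.6 vs. Opus 4.6) show stable, systematic differences in methodological preferences.
AI peer review in the form of written critiques has minimal effect on dispersion.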
Abstract
We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable nonstandard errors (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable "empirical styles," reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated…
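As a rough illustration of what a nonstandard error measures, the sketch below summarizes agent-to-agent dispersion in point estimates for a single hypothesis. It assumes dispersion is reported as the cross-agent sample standard deviation and the interquartile range; the abstract does not say which statistic the authors use. The function name `nonstandard_error` and the estimates are hypothetical and are not taken from the paper.

```python
import statistics


def nonstandard_error(estimates):
    """Summarize cross-agent dispersion of point estimates for one hypothesis.

    Assumption: dispersion is measured by the sample standard deviation and
    the interquartile range. The paper's exact definition may differ.
    """
    if len(estimates) < 2:
        raise ValueError("need at least two agent estimates")
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(estimates, n=4)
    return {
        "sd": statistics.stdev(estimates),
        "iqr": q3 - q1,
    }


# Hypothetical estimates of the same trend coefficient from six agents
# that each made their own analytical choices (made-up numbers).
agent_estimates = [-1.2, -0.8, -1.5, 0.3, -1.0, -0.9]
nse = nonstandard_error(agent_estimates)
print(nse)
```

If every agent reported the same estimate, both statistics would be zero. Any positive value reflects disagreement that comes from analytical choices rather than from sampling noise, since all agents see the same data.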
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods
