Correlated Errors in Large Language Models

Elliot Kim; Avi Garg; Kenny Peng; Nikhil Garg

arXiv:2506.07962 · cs.CL · June 10, 2025

TL;DR

This study empirically examines error correlations among over 350 large language models, revealing significant shared mistakes influenced by architecture and provider, with implications for diversity and robustness in AI systems.

Contribution

The paper provides the first large-scale empirical analysis of error correlation in LLMs, identifying factors that drive model similarity and the potential risks of algorithmic monoculture.

Findings

01

On one leaderboard dataset, models agree 60% of the time when both models err.

02

Shared architecture and provider increase error correlation.

03

Larger, more accurate models have highly correlated errors, even with distinct architectures and providers.

Abstract

Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.
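The headline statistic, how often two models agree when both err, can be illustrated with a minimal sketch. It assumes a multiple-choice setting where each model outputs one option label per question; the function names, model names, and toy data below are hypothetical, and the paper's exact definition and aggregation may differ. As a point of comparison, two models that picked independently and uniformly among the wrong options of a k-option question would agree with probability 1/(k-1) when both err.

```python
from itertools import combinations


def agreement_when_both_err(preds_a, preds_b, gold):
    """Share of questions both models get wrong on which they give the SAME wrong answer.

    Returns None when there is no question that both models miss.
    """
    both_wrong = [
        (a, b)
        for a, b, g in zip(preds_a, preds_b, gold)
        if a != g and b != g
    ]
    if not both_wrong:
        return None
    same = sum(1 for a, b in both_wrong if a == b)
    return same / len(both_wrong)


def mean_pairwise_agreement(predictions, gold):
    """Average the conditional agreement rate over all model pairs (illustrative aggregation)."""
    rates = []
    for m1, m2 in combinations(sorted(predictions), 2):
        rate = agreement_when_both_err(predictions[m1], predictions[m2], gold)
        if rate is not None:  # skip pairs with no shared errors
            rates.append(rate)
    return sum(rates) / len(rates) if rates else None


# Toy data (hypothetical): five questions with answer options A-D.
gold = ["A", "B", "C", "D", "A"]
predictions = {
    "model_x": ["A", "C", "D", "D", "B"],
    "model_y": ["A", "C", "A", "D", "B"],
    "model_z": ["B", "C", "D", "A", "A"],
}

for m1, m2 in combinations(sorted(predictions), 2):
    print(m1, m2, agreement_when_both_err(predictions[m1], predictions[m2], gold))
print("mean over pairs:", mean_pairwise_agreement(predictions, gold))
```

Conditioning on both models being wrong matters here: two accurate models agree on most questions simply because both are right, so raw agreement would conflate accuracy with shared mistakes.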

Videos

Correlated Errors in Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification