Correlated Errors in Large Language Models
Elliot Kim, Avi Garg, Kenny Peng, Nikhil Garg

TL;DR
This study empirically examines error correlations among over 350 large language models, revealing significant shared mistakes influenced by architecture and provider, with implications for diversity and robustness in AI systems.
Contribution
It provides the first large-scale empirical analysis of error correlation in LLMs, highlighting factors influencing model similarity and potential risks of algorithmic monoculture.
Findings
On one leaderboard dataset, models agree 60% of the time when both err.
Shared architecture and provider increase error correlation.
Larger, more accurate models have highly correlated errors, even across distinct architectures and providers.
Abstract
Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.
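To make the headline statistic concrete, here is a minimal sketch of one way to compute an "agreement when both err" rate for a pair of models on a multiple-choice benchmark. It assumes that "agree" means the two models pick the same wrong option; the abstract does not spell out the exact definition, so treat this as an illustration rather than the authors' procedure. The function name and the toy answers are hypothetical.

```python
def agreement_when_both_err(answers_a, answers_b, correct):
    """Among questions that both models get wrong, return the fraction
    on which they give the same (wrong) answer.

    Returns None when there is no question that both models miss.
    Assumption: "agreement" means an identical answer choice.
    """
    both_wrong = [
        (a, b)
        for a, b, c in zip(answers_a, answers_b, correct)
        if a != c and b != c
    ]
    if not both_wrong:
        return None
    return sum(a == b for a, b in both_wrong) / len(both_wrong)


# Toy data invented for illustration only (not from the paper).
correct = ["A", "B", "C", "D", "A", "B"]
model_x = ["A", "C", "C", "A", "B", "B"]
model_y = ["A", "C", "D", "A", "C", "B"]

rate = agreement_when_both_err(model_x, model_y, correct)
print(f"{rate:.2f}")  # prints 0.67
```

In the toy data both models miss three questions and choose the same wrong option on two of them, giving 2/3. With four answer options and errors spread uniformly over the three wrong ones, two independent models would agree on only about one third of jointly missed questions, which is why a rate such as 60% points to correlated errors.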
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification
