ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen

TL;DR
ErrorMap and ErrorAtlas provide a novel framework for understanding the underlying causes of failures in large language models, enabling more targeted debugging and evaluation beyond traditional success metrics.
Contribution
We introduce ErrorMap, a method to identify and analyze the sources of LLM failures, and ErrorAtlas, a comprehensive taxonomy of error types across multiple datasets and models.
Findings
ErrorAtlas reveals common failure patterns like omission and misinterpretation.
ErrorMap can be applied to any model or dataset for failure analysis.
The approach uncovers underexplored error types in LLM research.
Abstract
Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types…
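One way to picture the "failure signature" mentioned in the abstract is as the share of each error category among a model's wrong answers. The sketch below is illustrative only and is not the authors' implementation: the function name `failure_signature`, the sample labels, and the assumption that each failed instance already carries exactly one error label are all hypothetical.

```python
from collections import Counter


def failure_signature(error_labels):
    """Return the share of each error category among failed instances.

    Assumes each failed instance has already been assigned exactly one
    error label (for example by a judge); this is an illustrative
    simplification, not the procedure described in the paper.
    """
    counts = Counter(error_labels)
    total = sum(counts.values())
    if total == 0:
        # No failures recorded, so there is no signature to report.
        return {}
    # Order categories from most to least frequent for readability.
    return {category: count / total for category, count in counts.most_common()}


# Hypothetical labels for one model's wrong answers on one dataset.
labels = ["omission", "misinterpretation", "omission", "formatting", "omission"]
signature = failure_signature(labels)
print(signature)
```

Comparing such distributions across models or datasets would show where failures concentrate, which is the kind of view that plain accuracy scores do not give.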
Peer Reviews
Decision·Submitted to ICLR 2026
- This work approaches LLM evaluation from a new perspective: explaining why models fail rather than merely identifying when they fail. By constructing a systematic error analysis framework, the authors contribute a method that is both theoretically meaningful and practically valuable. It helps model developers identify weaknesses and reveal capability gaps, and it provides a scientific foundation for model improvement and evaluation.
- The core framework, ErrorMap, is conceptually clear and logically
- The paper’s main contribution is a tool-based framework for analyzing LLM errors, whose greatest value lies in broad adoption. However, the authors have not yet released the code, making it impossible to evaluate the system’s real-world performance, stability, and efficiency. In my opinion, this limitation significantly weakens the practical impact and overall value of the work.
- The per-instance diagnostic stage of ErrorMap relies heavily on an LLM-as-judge mechanism. Experimental results sho
1. The paper introduces a clear general pipeline for systematic error analysis.
2. The error atlas spans about 21 datasets and many models, which enables cross-model and cross-benchmark comparisons.
3. There is a concrete taxonomy with readable category names and definitions, which is useful.
The use of LLMs to judge LLMs' failure modes is fundamentally problematic. And while the authors acknowledge this, it does introduce systematic blind spots when the judge model shares failure modes with the evaluated models. The validation is not especially strong. The 92% taxonomy accuracy comes from the same LLM judge, and only 53% similarity occurs across prompt variations, which suggests some instability. If the authors had done some human evaluation and had published agreement scores between the
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods
