ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Shir Ashury-Tahan; Yifan Mai; Elron Bandel; Michal Shmueli-Scheuer; Leshem Choshen

arXiv:2601.15812·cs.AI·February 18, 2026

Open Access · 3 Reviews

TL;DR

ErrorMap and ErrorAtlas provide a novel framework for understanding the underlying causes of failures in large language models, enabling more targeted debugging and evaluation beyond traditional success metrics.

Contribution

We introduce ErrorMap, a method to identify and analyze the sources of LLM failures, and ErrorAtlas, a comprehensive taxonomy of error types across multiple datasets and models.

Findings

01

ErrorAtlas reveals common failure patterns like omission and misinterpretation.

02

ErrorMap can be applied to any model or dataset for failure analysis.

03

The approach uncovers underexplored error types in LLM research.

Abstract

Large language model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types…
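To make the idea of a "failure signature" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each failed instance has already been assigned a single error category (for example by a per-instance diagnosis step), and it represents the signature as the share of each category among a model's failures. The function name `failure_signature` and the sample labels are hypothetical; the category names are borrowed from the examples mentioned on this page.

```python
from collections import Counter


def failure_signature(error_labels):
    """Return the share of each error category among a model's failures.

    `error_labels` is a list with one category string per failed instance.
    This representation is an assumption made for illustration only.
    """
    counts = Counter(error_labels)
    total = sum(counts.values())
    if total == 0:
        # No failures recorded, so there is no signature to report.
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical per-instance labels for one model on one dataset.
labels = ["omission", "misinterpretation", "omission", "calculation", "omission"]
signature = failure_signature(labels)

# Print categories from most to least frequent.
for label, share in sorted(signature.items(), key=lambda item: -item[1]):
    print(f"{label}: {share:.0%}")
```

Comparing such distributions across models or datasets is one simple way to see which error categories dominate where; the paper's actual procedure may differ.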

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- This work approaches LLM evaluation from a new perspective: explaining why models fail rather than merely identifying when they fail. By constructing a systematic error analysis framework, the authors contribute a method that is both theoretically meaningful and practically valuable. It helps model developers identify weaknesses and reveal capability gaps, and it provides a scientific foundation for model improvement and evaluation.
- The core framework, ErrorMap, is conceptually clear and logically

Weaknesses

- The paper’s main contribution is a tool-based framework for analyzing LLM errors, whose greatest value lies in broad adoption. However, the authors have not yet released the code, making it impossible to evaluate the system’s real-world performance, stability, and efficiency. In my opinion, this limitation significantly weakens the practical impact and overall value of the work.
- The per-instance diagnostic stage of ErrorMap relies heavily on an LLM-as-judge mechanism. Experimental results sho

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The paper introduces a clear general pipeline for systematic error analysis.
2. The error atlas spans about 21 datasets and many models, which enables cross-model and cross-benchmark comparisons.
3. There is a concrete taxonomy with readable category names and definitions, which is useful.

Weaknesses

Using LLMs to judge LLMs' failure modes is fundamentally problematic. While the authors acknowledge this, it introduces systematic blind spots when the judge model shares failure modes with the evaluated models. The validation is not very strong. The 92% taxonomy accuracy comes from the same LLM judge, and only 53% similarity occurs across prompt variations, which suggests some instability. If the authors had done some human evaluation and had published agreement scores between the

Reviewer 03 · Rating 2 · Confidence 3

Strengths

N/A

Weaknesses

N/A

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods