How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations
JV Roig

TL;DR
This paper analyzes how large language models fail as autonomous agents in tool-use scenarios, revealing that size alone doesn't ensure robustness and identifying key failure modes through detailed behavioral analysis.
Contribution
It introduces a fine-grained analysis of LLM failures in agentic tasks, highlighting the importance of training and design choices beyond model size for reliable deployment.
Findings
Model size does not predict robustness in agentic tasks.
DeepSeek V3.1's reliability stems primarily from post-training reinforcement learning, not architecture.
Four common failure archetypes identified across models.
Abstract
We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Scientific Computing and Data Management · Artificial Intelligence in Law
