When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems

Donghao Huang; Gauri Malwe; Zhaoxia Wang

arXiv:2601.16280 · cs.AI · January 26, 2026

TL;DR

This paper presents a diagnostic framework using big data analytics to evaluate and improve tool invocation reliability in multi-agent LLM systems, addressing deployment challenges in privacy-sensitive environments.

Contribution

It introduces a comprehensive error taxonomy and systematic evaluation methodology for assessing tool-use reliability across diverse models and hardware configurations.

Findings

01. Tool initialization failures are the main bottleneck for smaller models.

02. Qwen2.5:32b matches GPT-4.1 in performance.

03. Mid-sized models offer a good accuracy-efficiency trade-off.

Abstract

Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation, yet systematic evaluation methodologies for assessing tool-use reliability remain underdeveloped. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems, addressing critical needs for SME-centric deployment in privacy-sensitive environments. Our approach features a 12-category error taxonomy capturing failure modes across tool initialization, parameter handling, execution, and result interpretation. Through systematic evaluation of 1,980 deterministic test instances spanning both open-weight models (Qwen2.5 series, Functionary) and proprietary alternatives (GPT-4, Claude 3.5/3.7) across diverse edge hardware configurations, we identify actionable reliability thresholds for production deployment.…
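
To make the evaluation idea concrete, here is a minimal sketch of how outcomes from deterministic test instances might be tallied by failure stage. It uses only the four stages named in the abstract (tool initialization, parameter handling, execution, result interpretation). The paper's 12 individual categories are not listed on this page, so they are not reproduced. The names `Phase`, `Outcome`, and `summarize`, the model labels, and the sample data are all hypothetical and are not taken from the authors' code or results.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    # The four stages named in the abstract. The 12 finer-grained
    # categories of the paper's taxonomy are intentionally omitted.
    TOOL_INITIALIZATION = "tool initialization"
    PARAMETER_HANDLING = "parameter handling"
    EXECUTION = "execution"
    RESULT_INTERPRETATION = "result interpretation"


@dataclass(frozen=True)
class Outcome:
    """Result of one test instance (hypothetical record layout)."""
    model: str
    failed_phase: Optional[Phase]  # None means the instance passed


def summarize(outcomes):
    """Return per-model success rate and failure counts by phase."""
    by_model = {}
    for outcome in outcomes:
        by_model.setdefault(outcome.model, []).append(outcome)

    summary = {}
    for model, runs in by_model.items():
        passed = sum(1 for r in runs if r.failed_phase is None)
        failures = Counter(
            r.failed_phase for r in runs if r.failed_phase is not None
        )
        summary[model] = {
            "success_rate": passed / len(runs),
            "failures_by_phase": dict(failures),
        }
    return summary


# Made-up sample data, for illustration only.
outcomes = [
    Outcome("small-model", Phase.TOOL_INITIALIZATION),
    Outcome("small-model", None),
    Outcome("small-model", Phase.TOOL_INITIALIZATION),
    Outcome("small-model", Phase.PARAMETER_HANDLING),
    Outcome("large-model", None),
    Outcome("large-model", None),
]
summary = summarize(outcomes)
```

Grouping failures by stage rather than reporting a single pass rate is what lets a breakdown like this show where a given model fails, which is the kind of question the Findings above address.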

Taxonomy

Topics: Software System Performance and Reliability · Big Data and Digital Economy · Adversarial Robustness in Machine Learning