Beyond the Black Box: Interpretability of Agentic AI Tool Use

Hariom Tatsat; Ariye Shater

arXiv:2605.06890·cs.AI·May 22, 2026

TL;DR

This paper presents a mechanistic-interpretability toolkit that uses Sparse Autoencoders (SAEs) and linear probes to diagnose the internal decision-making of AI agents during tool use, with the goal of improving safety and reliability.

Contribution

The paper introduces a framework that reads model states before each action and infers internal states related to tool decisions, giving deeper insight into failures and better internal observability than external methods such as prompts, evaluations, and logs.

Findings

01. The toolkit identifies features associated with tool decisions.

02. Ablating the identified features changes the model's tool-use behavior.

03. Applied to multiple models, the toolkit reveals internal signals that precede tool calls.
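
The ablation finding can be illustrated with a toy sketch of how removing one SAE feature changes a downstream readout. Everything here is an assumption made for illustration: the random tied-weight SAE, the feature index `tool_feature`, and the linear `readout` are hypothetical stand-ins, not the paper's models, features, or results.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64

# Hypothetical SAE weights; a real SAE would be trained on model activations.
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
W_enc = W_dec.T.copy()                                  # tied encoder, for simplicity
b_enc = np.zeros(n_features)

def sae_encode(x):
    """Feature activations: ReLU of an affine map of the hidden state."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def ablate_feature(x, feature_idx):
    """Subtract one feature's decoder contribution from the hidden state."""
    acts = sae_encode(x)
    return x - acts[feature_idx] * W_dec[feature_idx]

# Hypothetical "tool-call" feature and a linear readout aligned with it.
tool_feature = 7
readout = W_dec[tool_feature]

# A hidden state that strongly expresses the feature, plus small noise.
x = 2.0 * W_dec[tool_feature] + 0.1 * rng.normal(size=d_model)

logit_before = float(x @ readout)
logit_after = float(ablate_feature(x, tool_feature) @ readout)
print(f"readout before ablation: {logit_before:.3f}")
print(f"readout after ablation:  {logit_after:.3f}")
```

In this toy setup the readout is aligned with the ablated feature by construction, so the ablation removes the signal entirely. In a real model the effect would be measured on actual tool-call behavior and would generally be partial.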

Abstract

AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequences become visible only after execution. Existing observability methods are mostly external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are especially costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risks. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs) and linear probes. The framework reads model states before each action and infers both whether a tool is needed and how…
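
The idea of reading model states before an action to infer whether a tool is needed can be sketched with a linear probe. The following is a minimal illustration under stated assumptions: the hidden states are synthetic Gaussian vectors, the "tool needed" label is planted along a random direction, and the helper names (`make_synthetic_states`, `train_linear_probe`, `probe_predict`) are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def make_synthetic_states(n=400, d=32, seed=0):
    """Hypothetical stand-in for pre-action hidden states; label 1 = tool needed."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    # Shift each state along one direction according to its label, plus noise.
    states = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, direction)
    return states, labels

def train_linear_probe(states, labels, lr=0.1, steps=500):
    """Fit logistic regression (a linear probe) with plain gradient descent."""
    n, d = states.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(states @ w + b)))
        grad = probs - labels            # gradient of the log loss w.r.t. logits
        w -= lr * (states.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(states, w, b):
    """Predict 1 (tool needed) when the probe's logit is positive."""
    return (states @ w + b > 0).astype(int)

states, labels = make_synthetic_states()
w, b = train_linear_probe(states[:300], labels[:300])      # train split
accuracy = (probe_predict(states[300:], w, b) == labels[300:]).mean()
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real agents, `states` would be activations captured at a chosen layer just before the model emits its next action, and the labels would come from whether a tool call was actually required. High probe accuracy on held-out data would then suggest the decision is linearly readable at that layer.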
