Beyond the Black Box: Interpretability of Agentic AI Tool Use
Hariom Tatsat, Ariye Shater

TL;DR
This paper presents a mechanistic-interpretability toolkit that uses Sparse Autoencoders (SAEs) and linear probes to diagnose how AI agents decide on tool use internally, with the aim of improving safety and reliability.
Contribution
It introduces a framework that reads a model's internal states before each action to infer its tool decisions, giving insight into failures that external observability methods surface only after the fact.
Findings
The toolkit successfully identifies features associated with tool decisions.
Ablation of identified features impacts the model's tool use behavior.
Applied to multiple models, it reveals internal signals preceding tool calls.
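As an illustration of what ablating an SAE feature can mean in practice, here is a minimal sketch on random toy weights. It is not the authors' implementation: the ReLU encoder form, the dimensions, the weight names (`W_enc`, `W_dec`), and the choice to keep the SAE reconstruction error when patching are all assumptions made for the example.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: activation x (d_model,) -> sparse features (n_features,)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: features (n_features,) -> reconstructed activation (d_model,)."""
    return f @ W_dec + b_dec

def ablate_feature(x, idx, W_enc, b_enc, W_dec, b_dec):
    """Zero one SAE feature and return the patched activation.

    The SAE's reconstruction error is added back so that only the
    chosen feature's contribution is removed from x.
    """
    f = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(f, W_dec, b_dec)
    f_ablated = f.copy()
    f_ablated[idx] = 0.0
    return sae_decode(f_ablated, W_dec, b_dec) + error

# Toy setup (hypothetical sizes and random weights, not a trained SAE).
rng = np.random.default_rng(0)
d_model, n_features = 8, 16
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)          # stand-in for a residual-stream activation
f = sae_encode(x, W_enc, b_enc)
idx = int(np.argmax(f))               # pretend this is a "tool decision" feature
x_ablated = ablate_feature(x, idx, W_enc, b_enc, W_dec, b_dec)
```

In a real experiment the patched activation would be written back into the model's forward pass and the change in tool-calling behavior measured; that step depends on the model and hooking library and is omitted here.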
Abstract
AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequence becomes visible only after execution. Existing observability methods are mostly external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are especially costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risk. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs) and linear probes. The framework reads model states before each action and infers both whether a tool is needed and how…
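To make the linear-probe idea concrete, the sketch below trains a logistic-regression probe to predict a binary "tool needed" label from hidden states captured before an action. Everything here is an assumption for illustration: the data is synthetic with a planted signal direction, and the dimensions, learning rate, and function names (`train_probe`, `probe_accuracy`) are hypothetical rather than taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(X, y, lr=0.5, steps=200):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / n   # gradient of mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)        # gradient w.r.t. bias
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples where the probe's thresholded output matches y."""
    return float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1)))

# Synthetic "pre-action hidden states": Gaussian noise plus a planted
# direction whose sign encodes whether a tool call is needed.
rng = np.random.default_rng(0)
n, d = 400, 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(size=(n, d)) + 3.0 * (2.0 * y - 1.0)[:, None] * direction

# Hold out the last 100 examples to check that the probe generalizes.
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

w, b = train_probe(X_train, y_train)
test_acc = probe_accuracy(X_test, y_test, w, b)
```

With real activations, `X` would be the model's hidden states at a chosen layer and token position just before the action, and `y` would come from labeled trajectories; high held-out accuracy would suggest the decision is linearly readable at that point.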
