TL;DR
This paper introduces Tool Attention, a middleware mechanism that reduces per-turn tool-token overhead in LLM agent workflows (by 95% in simulations) by gating tool access and loading schemas lazily, which frees context for the task itself.
Contribution
The paper proposes Tool Attention, a method that generalizes self-attention over tokens to gated attention over tools, combining intent scoring, state-aware gating, and lazy schema loading to make external tool integration cheaper for LLM agents.
Findings
Reduces per-turn tool tokens by 95% in simulations
Increases effective context utilization from 24% to 91%
Supports the view that protocol-level efficiency is a key constraint in scalable agent systems
Abstract
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the MCP Tax or Tools Tax) that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and…
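The three ingredients named above (intent scoring, state-aware gating, lazy schema loading) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract is truncated here, so the names `Tool`, `select_tools`, `threshold`, and `top_k` are hypothetical, and a toy bag-of-words vector stands in for the sentence embeddings behind the ISO score.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable


def embed(text: str) -> Counter:
    # ASSUMPTION: toy bag-of-words vector, a stand-in for a real sentence embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class Tool:
    name: str
    summary: str                                   # short description used for scoring
    load_schema: Callable[[], dict]                # invoked only if the tool passes the gate
    precondition: Callable[[dict], bool] = lambda state: True


def select_tools(intent: str, tools: list, state: dict,
                 threshold: float = 0.2, top_k: int = 2) -> dict:
    """Score tools against the intent, gate on state, then load schemas lazily."""
    query = embed(intent)
    # State-aware gate: tools whose precondition fails are never scored or loaded.
    scored = [(cosine(query, embed(t.summary)), t) for t in tools if t.precondition(state)]
    passed = sorted((p for p in scored if p[0] >= threshold),
                    key=lambda p: p[0], reverse=True)[:top_k]
    # Lazy loading: only the surviving tools pay the schema-token cost.
    return {t.name: t.load_schema() for _, t in passed}


loaded = []  # records which schemas were actually fetched


def make_loader(name: str) -> Callable[[], dict]:
    def loader() -> dict:
        loaded.append(name)
        return {"name": name, "parameters": {}}
    return loader


tools = [
    Tool("read_file", "read a file from disk", make_loader("read_file")),
    Tool("send_email", "send an email message", make_loader("send_email")),
    Tool("delete_file", "delete a file from disk", make_loader("delete_file"),
         precondition=lambda s: s.get("confirmed", False)),
]

# Only read_file is injected: send_email shares no words with the intent,
# and delete_file fails its precondition because state lacks "confirmed".
schemas = select_tools("read the config file from disk", tools, state={})
```

In this sketch the full schema of a tool is fetched only after it clears both the score threshold and its precondition, so tools that are irrelevant to the current turn contribute nothing to the prompt.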
