Optimizing Agentic Language Model Inference via Speculative Tool Calls
Daniel Nichols, Prajwal Singhania, Charles Jekel, Abhinav Bhatele, Harshitha Menon

TL;DR
This paper presents systems optimizations for language model inference that speculate tool calls and keep sequences resident in the inference engine, improving throughput by several hundred tokens per second when hosting LM agents that use external tools.
Contribution
It introduces novel speculation-based optimizations and a tool cache API to reduce inference bottlenecks in tool-using language models.
Findings
Throughput improved by several hundred tokens per second
Theoretical analysis guides optimal speculation configurations
Proposed API facilitates adoption of optimizations
Abstract
Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms to provide insights into speculation…
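To make the idea of speculating tool calls concrete, here is a minimal Python sketch. It is not the authors' implementation or API: the class `SpeculativeToolRunner`, its methods `speculate` and `resolve`, and the toy `word_count` tool are all hypothetical names chosen for illustration. The sketch assumes a predicted tool call can be started in the background while the model is still decoding, and that the result is reused only if the call the model finally emits matches the prediction exactly. It also assumes the tools are free of side effects, since a mispredicted call still runs.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Tuple

# Hypothetical representation of a tool call: (tool name, serialized arguments).
ToolCall = Tuple[str, str]


class SpeculativeToolRunner:
    """Illustrative helper (an assumption, not the paper's API).

    Launches predicted tool calls early and reuses the result when the
    model's actual call matches a prediction.
    """

    def __init__(self, tools: Dict[str, Callable[[str], str]], max_workers: int = 4):
        self._tools = tools
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: Dict[ToolCall, Future] = {}
        self.hits = 0
        self.misses = 0

    def speculate(self, call: ToolCall) -> None:
        """Start a predicted call in the background while decoding continues."""
        if call not in self._pending:
            name, args = call
            self._pending[call] = self._pool.submit(self._tools[name], args)

    def resolve(self, call: ToolCall) -> str:
        """Return the result for the call the model actually emitted."""
        future = self._pending.pop(call, None)
        if future is not None:
            # Correct speculation: the work is already running or finished.
            self.hits += 1
            return future.result()
        # Misprediction or no speculation: run the tool synchronously.
        self.misses += 1
        name, args = call
        return self._tools[name](args)

    def shutdown(self) -> None:
        """Wait for outstanding speculative calls and release the workers."""
        self._pool.shutdown(wait=True)


def word_count(text: str) -> str:
    """Toy side-effect-free tool used only for this example."""
    return str(len(text.split()))


runner = SpeculativeToolRunner({"word_count": word_count})
runner.speculate(("word_count", "a b c"))             # predicted early
result_hit = runner.resolve(("word_count", "a b c"))  # prediction matched
result_miss = runner.resolve(("word_count", "a b"))   # never speculated
runner.shutdown()
```

In this sketch a correct prediction hides tool latency behind decoding, while a wrong one costs only the wasted background work. How aggressively to speculate is the kind of trade-off the abstract says the theoretical analysis addresses.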
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques · Big Data and Digital Economy
