Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference
Anish Biswas, Kanishk Goel, Srivarshinee S, Jayashree Mohan, Alind Khare, Anjaly Parayil, Ramachandran Ramjee, Chetan Bansal

TL;DR
Sutradhara is a co-designed system that optimizes tool-based agentic inference by integrating orchestration with LLM serving, reducing latency and increasing throughput in complex AI workloads.
Contribution
Sutradhara introduces a novel API and three key optimizations that enable cross-layer improvements in tool invocation, caching, and parallelism for agentic LLM applications.
Findings
Up to 77% higher load capacity at the same latency
Median First Token Rendered (FTR) latency reduced by up to 15%
End-to-end latency decreased by up to 11% on A100 GPUs
Abstract
Agentic applications are LLMs that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain together multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased latency in First Token Rendered (FTR) of the final answer. Through analysis of requests at production scale, we reveal three critical challenges: tool calls account for 30-85% of FTR latency, KV cache hit rates collapse despite substantial context reuse across iterations, and sequential orchestration wastes potential intra-request parallelism. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, preventing…
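To make the abstract's latency accounting concrete, the following is a minimal sketch of how the tool-call share of FTR latency could be computed for a single request, and how much overlapping independent tool calls within an iteration might save compared with sequential orchestration. It is not Sutradhara's implementation or API: the `Iteration` type, the function names, and all timings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Iteration:
    """One agent iteration: an LLM call followed by zero or more tool calls.

    Hypothetical model for illustration; all durations are in milliseconds.
    """
    llm_ms: float
    tool_ms: List[float] = field(default_factory=list)


def ftr_sequential(iterations: List[Iteration]) -> float:
    """FTR latency when every tool call runs one after another."""
    return sum(it.llm_ms + sum(it.tool_ms) for it in iterations)


def ftr_parallel_tools(iterations: List[Iteration]) -> float:
    """FTR latency when tool calls within an iteration overlap.

    Assumes the calls in one iteration are independent, so the iteration
    waits only for its slowest tool.
    """
    return sum(it.llm_ms + max(it.tool_ms, default=0.0) for it in iterations)


def tool_share(iterations: List[Iteration]) -> float:
    """Fraction of sequential FTR latency spent inside tool calls."""
    total_tool = sum(sum(it.tool_ms) for it in iterations)
    return total_tool / ftr_sequential(iterations)


# Made-up timings: two tool-calling iterations, then a final iteration whose
# llm_ms stands for the time to the first rendered token of the final answer.
request = [
    Iteration(llm_ms=400, tool_ms=[300, 500]),
    Iteration(llm_ms=300, tool_ms=[600]),
    Iteration(llm_ms=200),
]

if __name__ == "__main__":
    print(f"sequential FTR: {ftr_sequential(request)} ms")
    print(f"tool share:     {tool_share(request):.0%}")
    print(f"parallel FTR:   {ftr_parallel_tools(request)} ms")
```

In this toy request, tools dominate the sequential path, and overlapping the two calls in the first iteration removes the shorter one from the critical path. Real savings depend on whether the tool calls are actually independent, which this sketch simply assumes.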
