Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference

Anish Biswas; Kanishk Goel; Srivarshinee S; Jayashree Mohan; Alind Khare; Anjaly Parayil; Ramachandran Ramjee; Chetan Bansal

arXiv:2601.12967 · cs.DC · April 23, 2026

TL;DR

Sutradhara is a co-designed system that optimizes tool-based agentic inference by integrating orchestration with LLM serving, reducing latency and increasing throughput in complex AI workloads.

Contribution

Sutradhara introduces a new API and three optimizations that enable cross-layer improvements in tool invocation, caching, and parallelism for agentic LLM applications.

Findings

1. Up to 77% higher load capacity at the same latency
2. Median FTR latency reduced by up to 15%
3. End-to-end latency decreased by up to 11% on A100 GPUs

Abstract

Agentic applications are LLMs that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain together multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased latency in First Token Rendered (FTR) of the final answer. Through analysis of requests at production scale, we reveal three critical challenges: tool calls account for 30-85% of FTR latency, KV cache hit rates collapse despite substantial context reuse across iterations, and sequential orchestration wastes potential intra-request parallelism. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, preventing…
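
To make the latency argument concrete, here is a minimal sketch of how FTR latency could be decomposed for a single agentic request. It is a toy arithmetic model, not the paper's system or API: the `Iteration` type, the function names, and every duration in `trace` are hypothetical values chosen for illustration. It assumes the tool calls within one iteration are independent, so running them concurrently costs the longest call rather than the sum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Iteration:
    """One LLM call plus the tool calls it triggers (hypothetical model)."""
    llm_seconds: float         # time spent in the LLM call for this iteration
    tool_seconds: List[float]  # durations of the tool calls it issues


def ftr_latency(iterations: List[Iteration], parallel_tools: bool) -> float:
    """Sum LLM and tool time across iterations.

    With parallel_tools=True, tool calls in the same iteration are assumed
    independent and overlap, so they cost max() instead of sum().
    """
    total = 0.0
    for it in iterations:
        total += it.llm_seconds
        if it.tool_seconds:
            total += max(it.tool_seconds) if parallel_tools else sum(it.tool_seconds)
    return total


def tool_share(iterations: List[Iteration]) -> float:
    """Fraction of sequential FTR latency spent waiting on tools."""
    tool_time = sum(sum(it.tool_seconds) for it in iterations)
    return tool_time / ftr_latency(iterations, parallel_tools=False)


# Illustrative trace (made-up numbers): two tool-using iterations, then the
# final LLM call whose time stands in for the first token of the answer.
trace = [
    Iteration(llm_seconds=1.0, tool_seconds=[2.0, 1.0]),
    Iteration(llm_seconds=1.0, tool_seconds=[1.0]),
    Iteration(llm_seconds=1.0, tool_seconds=[]),
]

sequential = ftr_latency(trace, parallel_tools=False)
overlapped = ftr_latency(trace, parallel_tools=True)
print(f"sequential FTR: {sequential:.1f}s, overlapped FTR: {overlapped:.1f}s")
print(f"tool share of sequential FTR: {tool_share(trace):.0%}")
```

In this toy trace, tools account for 4 of the 7 sequential seconds, and overlapping the two calls in the first iteration saves one second. The model ignores KV cache effects and queueing, which the abstract identifies as separate bottlenecks.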
