Kairos: Low-latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud
Jinyuan Chen, Jiuchen Shi, Quan Chen, Minyi Guo

TL;DR
Kairos is a serving system for multi-agent workflows that share large language models. It reduces end-to-end latency by scheduling and dispatching requests according to each agent's latency and memory demands.
Contribution
Kairos introduces a novel multi-agent orchestration system with a workflow-aware scheduler and memory-aware dispatcher to improve latency and resource utilization in shared LLM environments.
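As a rough illustration of what a priority scheduler in front of a shared LLM might look like, here is a minimal sketch. It is not the paper's algorithm. The policy (serve the request whose agent has the smallest estimated remaining workflow latency first), the class names, and the agent names are all assumptions made for this example.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    # Lower value = served earlier. Hypothetical metric: estimated
    # remaining latency (seconds) until the agent's workflow finishes.
    priority: float
    # Tie-breaker so equal-priority requests keep arrival order.
    seq: int
    agent: str = field(compare=False)


class PriorityScheduler:
    """Illustrative queue for a shared LLM; not the Kairos scheduler."""

    def __init__(self, est_remaining_latency):
        # Assumed input: per-agent latency estimates, e.g. collected
        # by an orchestrator from earlier workflow runs.
        self._est = dict(est_remaining_latency)
        self._heap = []
        self._counter = itertools.count()

    def submit(self, agent):
        item = QueuedRequest(self._est[agent], next(self._counter), agent)
        heapq.heappush(self._heap, item)

    def next_request(self):
        # Return the agent whose request should run next, or None if idle.
        return heapq.heappop(self._heap).agent if self._heap else None


sched = PriorityScheduler({"planner": 3.0, "coder": 2.0, "reviewer": 0.5})
for name in ["planner", "reviewer", "coder", "reviewer"]:
    sched.submit(name)
order = [sched.next_request() for _ in range(4)]
```

With these assumed estimates, the two `reviewer` requests are served first, then `coder`, then `planner`.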
Findings
Reduces end-to-end latency by up to 28.4%.
Improves request scheduling based on latency characteristics.
Avoids GPU overloading through memory-aware dispatching.
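The memory-aware dispatching idea in the last finding can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's dispatcher: the `Instance` class, the per-request memory estimate `demand_gb`, and the "most free memory" placement rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # One LLM serving instance; all values are illustrative.
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    @property
    def free_gb(self):
        return self.capacity_gb - self.used_gb


def dispatch(instances, demand_gb):
    """Place a request on the instance with the most free GPU memory.

    Returns the chosen instance name, or None when no instance can fit
    the request (the caller would then keep it queued instead of
    overloading a GPU).
    """
    candidates = [i for i in instances if i.free_gb >= demand_gb]
    if not candidates:
        return None
    target = max(candidates, key=lambda i: i.free_gb)
    target.used_gb += demand_gb  # reserve memory for the request
    return target.name


instances = [Instance("gpu0", 10.0, 8.0), Instance("gpu1", 10.0, 3.0)]
first = dispatch(instances, 4.0)   # only gpu1 has 4 GB free
second = dispatch(instances, 4.0)  # nothing fits now, so it is held back
third = dispatch(instances, 2.0)   # gpu1 still has the most free memory
```

Returning `None` rather than forcing a placement is the design choice that keeps any single GPU from being overcommitted in this sketch.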
Abstract
Multi-agent applications use large language models (LLMs) to complete intricate tasks through agent collaboration in a workflow. In this setting, requests from different agents usually access the same shared LLM to perform different kinds of tasks, which places excessive load on the shared LLM. However, existing systems serve these multi-agent applications poorly, mainly because their request scheduling ignores inter-agent differences in latency and resource demand. We therefore propose Kairos, a multi-agent orchestration system that optimizes end-to-end latency for multi-agent applications. Kairos consists of a workflow orchestrator, a workflow-aware priority scheduler, and a memory-aware dispatcher. The orchestrator collects agent-specific information for online workflow analysis. The scheduler decides the serving priority of…
Taxonomy
Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
