
TL;DR
This paper introduces an LLM serving system architecture that serves programs instead of prompts, aiming for greater flexibility, efficiency, and extensibility in complex applications than traditional prompt-based systems offer.
Contribution
The paper proposes LLM Inference Programs (LIPs) and a system called Symphony that serves as an operating system for LIPs, improving flexibility and efficiency in LLM serving systems.
Findings
Symphony virtualizes KV cache with a dedicated file system.
Symphony employs a two-level process scheduling scheme for GPU efficiency.
LIPs enable runtime customization and offloading of application logic.
Abstract
Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a…
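The idea of serving a program rather than a prompt can be illustrated with a toy sketch. Everything below is hypothetical and is not Symphony's actual interface: the names `ToyKVFS`, `ToyKernel`, `pred`, `fork`, and `greedy_lip` are assumptions made up for illustration, the "KV cache files" are plain token lists, and the model is a deterministic dummy. The sketch only shows the shape of the idea: a program reuses a cached prefix through a file-like abstraction, asks the server for next-token scores through a call, and applies its own token selection rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyKVFS:
    """Hypothetical stand-in for a KV cache file system: paths map to token lists."""
    files: Dict[str, List[int]] = field(default_factory=dict)

    def create(self, path: str, tokens: List[int]) -> None:
        self.files[path] = list(tokens)

    def fork(self, src: str, dst: str) -> None:
        # A real system could share cache pages; this toy version simply copies.
        self.files[dst] = list(self.files[src])

    def append(self, path: str, token: int) -> None:
        self.files[path].append(token)

    def read(self, path: str) -> List[int]:
        return list(self.files[path])


@dataclass
class ToyKernel:
    """Hypothetical 'operating system' that exposes model computation as a call."""
    fs: ToyKVFS
    model: Callable[[List[int]], List[float]]  # context -> next-token scores

    def pred(self, path: str) -> List[float]:
        # Assumed "system call": score the next token for the context stored at `path`.
        return self.model(self.fs.read(path))


def toy_model(context: List[int]) -> List[float]:
    # Deterministic dummy model over a 4-token vocabulary:
    # it puts all its score on (last token + 1) mod 4.
    vocab = 4
    target = (context[-1] + 1) % vocab if context else 0
    return [1.0 if t == target else 0.0 for t in range(vocab)]


def greedy_lip(kernel: ToyKernel, prefix: str, out: str,
               steps: int, banned: int) -> List[int]:
    """Toy inference program: reuse a cached prefix, decode greedily, never emit `banned`."""
    kernel.fs.fork(prefix, out)              # reuse the cached prefix without mutating it
    generated: List[int] = []
    for _ in range(steps):
        scores = kernel.pred(out)
        scores[banned] = float("-inf")       # user-defined token prediction rule
        token = max(range(len(scores)), key=scores.__getitem__)
        kernel.fs.append(out, token)         # extend this program's own cache file
        generated.append(token)
    return generated


kernel = ToyKernel(fs=ToyKVFS(), model=toy_model)
kernel.fs.create("/shared/system_prompt", [0, 1])
result = greedy_lip(kernel, "/shared/system_prompt", "/proc/1/ctx", steps=3, banned=3)
print(result)
```

The point of the design, under these assumptions, is that the decoding loop and the cache reuse policy live in the program running on the server, so the client does not need a round trip per token to enforce a rule such as the banned token above.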
