Pie: A Programmable Serving System for Emerging LLM Applications
In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong

TL;DR
Pie is a flexible, programmable serving system for large language models that allows custom workflows and optimizations, improving latency and throughput for emerging LLM applications.
Contribution
Pie introduces a novel programmable serving architecture using inferlets and WebAssembly, enabling customizable LLM workflows without system modifications.
Findings
Matches state-of-the-art performance with minimal latency overhead
Significantly improves latency and throughput on agentic workflows
Enables application-specific optimizations for LLM serving
Abstract
Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks…
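The idea of handing the generation loop to a user program can be illustrated with a minimal sketch. Everything below is hypothetical and is not Pie's actual API: `ToyServer`, `alloc_kv`, `forward`, `free_kv`, and `greedy_inferlet` are names invented for illustration, and the "model" is a deterministic toy function. The sketch only shows the shape of the design, in which the server exposes small handlers and the application-side program decides when to call them and when to stop.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyServer:
    """Hypothetical stand-in for a serving system that exposes fine-grained handlers."""

    vocab_size: int = 8
    caches: Dict[int, List[int]] = field(default_factory=dict)
    next_handle: int = 0

    def alloc_kv(self) -> int:
        """Allocate an empty (toy) KV cache and return its handle."""
        handle = self.next_handle
        self.next_handle += 1
        self.caches[handle] = []
        return handle

    def forward(self, kv: int, tokens: List[int]) -> List[float]:
        """Append tokens to the cache and return toy next-token scores."""
        cache = self.caches[kv]
        cache.extend(tokens)
        # Toy "model": the next token is a deterministic function of the cache.
        predicted = (sum(cache) + 1) % self.vocab_size
        return [1.0 if t == predicted else 0.0 for t in range(self.vocab_size)]

    def free_kv(self, kv: int) -> None:
        """Release the cache; the user program decides when this happens."""
        del self.caches[kv]


def greedy_inferlet(server: ToyServer, prompt: List[int],
                    max_new_tokens: int, stop_token: int) -> List[int]:
    """User-side program that owns the generation loop (illustrative only)."""
    kv = server.alloc_kv()
    generated: List[int] = []
    try:
        scores = server.forward(kv, prompt)  # prefill
        for _ in range(max_new_tokens):
            token = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
            if token == stop_token:
                break
            generated.append(token)
            scores = server.forward(kv, [token])  # one decode step
    finally:
        server.free_kv(kv)
    return generated


if __name__ == "__main__":
    print(greedy_inferlet(ToyServer(), [1, 2], max_new_tokens=5, stop_token=0))
```

Because the loop lives in the user program, swapping greedy selection for another sampling rule, or interleaving an external call between decode steps, would be a change to `greedy_inferlet` alone under this sketch's assumptions, with no change to the server class.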
