TL;DR
VibeServe is an agentic system that automatically synthesizes bespoke LLM serving stacks. It stays competitive with vLLM in the standard deployment setting and outperforms existing systems in non-standard scenarios by specializing at generation time.
Contribution
VibeServe is presented as the first agentic loop that generates entire LLM serving stacks end-to-end, producing a serving system tailored to each usage scenario instead of relying on a single general-purpose stack.
Findings
VibeServe remains competitive with vLLM in standard deployment.
In non-standard scenarios, VibeServe outperforms existing systems.
Generation-time specialization can surpass runtime generality in infrastructure design.
Abstract
For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe…
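The outer/inner loop structure described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: every name here (`Candidate`, `inner_loop`, `outer_loop`, `propose`, `implement`, `check`, `benchmark`, `budget`) is hypothetical, the agents are replaced by plain callables, and a single scalar `score` stands in for whatever the target benchmark measures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    """One explored system design and what the inner loop learned about it."""
    design: str
    correct: bool = False
    score: float = 0.0  # hypothetical benchmark metric, higher is better


def inner_loop(design: str,
               implement: Callable[[str], object],
               check: Callable[[object], bool],
               benchmark: Callable[[object], float]) -> Candidate:
    """Implement a candidate, check correctness, then measure performance."""
    artifact = implement(design)
    cand = Candidate(design=design)
    cand.correct = check(artifact)
    if cand.correct:  # only benchmark candidates that pass the correctness check
        cand.score = benchmark(artifact)
    return cand


def outer_loop(propose: Callable[[List[Candidate]], Optional[str]],
               implement: Callable[[str], object],
               check: Callable[[object], bool],
               benchmark: Callable[[object], float],
               budget: int) -> Tuple[Optional[Candidate], List[Candidate]]:
    """Plan the search over designs and track every result in a history."""
    history: List[Candidate] = []
    best: Optional[Candidate] = None
    for _ in range(budget):
        design = propose(history)  # planner sees all prior results
        if design is None:         # planner has nothing left to try
            break
        cand = inner_loop(design, implement, check, benchmark)
        history.append(cand)
        if cand.correct and (best is None or cand.score > best.score):
            best = cand
    return best, history


def demo() -> Tuple[Optional[Candidate], List[Candidate]]:
    """Toy run with stub callables; design names and scores are made up."""
    designs = ["baseline", "fused-kernels", "broken-cache"]
    scores = {"baseline": 100.0, "fused-kernels": 140.0, "broken-cache": 999.0}

    def propose(history: List[Candidate]) -> Optional[str]:
        return designs[len(history)] if len(history) < len(designs) else None

    return outer_loop(
        propose=propose,
        implement=lambda d: d,                 # the "artifact" is just the name
        check=lambda a: a != "broken-cache",   # one design fails correctness
        benchmark=lambda a: scores[a],
        budget=5,
    )
```

In this toy run, `demo()` explores three designs and stops when the planner runs out of proposals. The incorrect design is never benchmarked, so its nominally high score cannot win the search; gating measurement on correctness is the one design choice the sketch is meant to highlight.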
