GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
Fanjiang Ye, Zhangke Li, Xinrui Zhong, Ethan Ma, Russell Chen, Kaijian Wang, Jingwei Zuo, Desen Sun, Ye Cao, Triston Cao, Myungjin Lee, Arvind Krishnamurthy, Yuke Wang

TL;DR
GENSERVE is a system that co-serves heterogeneous diffusion model workloads, such as text-to-image (T2I) and text-to-video (T2V) generation, on shared GPU clusters while meeting latency SLOs.
Contribution
GENSERVE introduces step-level resource adaptation, exploiting the fact that diffusion inference proceeds in discrete, predictable steps and can be preempted at step boundaries.
Findings
GENSERVE improves SLO attainment rate by up to 44%.
It effectively manages heterogeneous diffusion workloads on shared GPU clusters.
The system uses step-boundary preemption and dynamic batching for efficiency.
Abstract
Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three…
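To make the step-boundary idea concrete, here is a minimal single-GPU simulation in Python. It is an illustrative sketch, not GENSERVE's scheduler: the `Job` class, the `simulate` function, the constant per-step latency, and the least-slack-first policy are all assumptions introduced for this example. It only shows why predictable, preemptible steps help: at every step boundary the scheduler can estimate each job's remaining time and switch to the request closest to missing its deadline.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical diffusion request; all fields are illustrative."""
    name: str
    total_steps: int   # number of denoising steps
    step_time: float   # assumed constant seconds per step
    arrival: float     # arrival time in seconds
    deadline: float    # absolute SLO deadline in seconds
    done_steps: int = 0

    def slack(self, now: float) -> float:
        # Predictability: remaining time = remaining steps * per-step latency.
        remaining = (self.total_steps - self.done_steps) * self.step_time
        return self.deadline - now - remaining


def simulate(jobs):
    """Run jobs on one simulated GPU, re-deciding at every step boundary.

    Returns (trace of job names, one per executed step; finish times).
    """
    now, trace, finish = 0.0, [], {}
    pending = list(jobs)
    while pending:
        ready = [j for j in pending if j.arrival <= now]
        if not ready:
            # GPU is idle: jump to the next arrival.
            now = min(j.arrival for j in pending)
            continue
        # Assumed policy: run one step of the job with the least slack.
        job = min(ready, key=lambda j: j.slack(now))
        job.done_steps += 1
        now += job.step_time
        trace.append(job.name)
        if job.done_steps == job.total_steps:
            finish[job.name] = now
            pending.remove(job)
    return trace, finish


if __name__ == "__main__":
    jobs = [
        Job("video", total_steps=4, step_time=1.0, arrival=0.0, deadline=10.0),
        Job("image", total_steps=2, step_time=0.5, arrival=1.5, deadline=3.0),
    ]
    trace, finish = simulate(jobs)
    print(trace)   # ['video', 'video', 'image', 'image', 'video', 'video']
    print(finish)  # {'image': 3.0, 'video': 5.0}
```

In this toy run the long video job is paused after its second step so the short image job can finish at 3.0 s, exactly on its deadline, and the video still completes at 5.0 s, well within its own. Running the video job to completion first would have delayed the image job to 5.0 s and violated its SLO.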
