Approximate Caching for Efficiently Serving Diffusion Models

Shubham Agarwal; Subrata Mitra; Sarthak Chakraborty; Srikrishna Karanam; Koyel Mukherjee; Shiv Saini

arXiv:2312.04429·cs.CV·December 8, 2023·1 cites

TL;DR

This paper introduces approximate-caching, which reduces resource use and latency in diffusion-model-based text-to-image generation by reusing intermediate states from earlier generations. The accompanying Nirvana system reports a 19.8% reduction in end-to-end latency and 19% cost savings on average.

Contribution

The paper proposes a novel approximate-caching technique with a new cache management policy, improving efficiency and reducing costs in production diffusion model serving.

Findings

01. 19.8% reduction in end-to-end latency

02. 19% cost savings on average

03. Effective reuse of intermediate states in real workloads

Abstract

Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images that adhere to text prompts. However, production-grade diffusion model serving is a resource-intensive task that not only requires expensive high-end GPUs but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that reduces the number of iterative denoising steps needed to generate an image from a prompt by reusing intermediate noise states created during a prior image generation for a similar prompt. Based on this idea, we present an end-to-end text-to-image system, Nirvana, that uses approximate-caching with a novel cache management policy, Least Computationally Beneficial and Frequently Used (LCBFU), to provide % GPU compute savings, 19.8% end-to-end latency reduction and 19% dollar savings, on average,…
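
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the Nirvana implementation: the class and function names, the similarity thresholds, the checkpoint steps, and the eviction score are all assumptions made for illustration. The sketch assumes prompts are compared by cosine similarity of their embeddings, that a closer match allows more denoising steps to be skipped, and that eviction removes the entry that has saved the fewest steps so far, which is only one plausible reading of the policy name LCBFU.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Entry:
    embedding: list       # prompt embedding of the cached generation
    states: dict          # K -> state after K denoising steps
    saved_steps: int = 0  # total steps this entry has saved (hits times K)


class ApproxCache:
    def __init__(self, capacity, thresholds):
        # thresholds: (min_similarity, steps_to_skip) pairs; hypothetical values.
        self.capacity = capacity
        self.thresholds = sorted(thresholds, reverse=True)
        self.entries = []

    def lookup(self, embedding):
        """Return (K, state) from the most similar entry, or (0, None) on a miss."""
        best, best_sim = None, -1.0
        for entry in self.entries:
            sim = cosine(embedding, entry.embedding)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is None:
            return 0, None
        # Try the strictest threshold first so the largest safe K is chosen.
        for min_sim, k in self.thresholds:
            if best_sim >= min_sim and k in best.states:
                best.saved_steps += k
                return k, best.states[k]
        return 0, None

    def insert(self, embedding, states):
        if len(self.entries) >= self.capacity:
            # Assumed eviction rule: drop the entry that has saved the least compute.
            victim = min(self.entries, key=lambda e: e.saved_steps)
            self.entries.remove(victim)
        self.entries.append(Entry(list(embedding), dict(states)))


def generate(embedding, cache, total_steps, denoise_step, init_noise, checkpoints):
    """Run denoising, resuming from a cached intermediate state when one matches."""
    k, state = cache.lookup(embedding)
    x = state if state is not None else init_noise()
    saved = {}
    for t in range(k, total_steps):
        x = denoise_step(x, t, embedding)
        if k == 0 and (t + 1) in checkpoints:
            saved[t + 1] = x  # keep the state reached after t + 1 steps
    if k == 0:
        cache.insert(embedding, saved)
    return x, total_steps - k  # final state and number of steps actually run
```

In this sketch a cache miss runs all `total_steps` iterations and stores the states at the chosen checkpoints, while a hit starts from the cached state at step K and runs only the remaining `total_steps - K` iterations. A real system would use a vector index rather than a linear scan and would store latent tensors rather than opaque Python objects.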

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Recommender Systems and Techniques · Caching and Content Delivery

Methods: Diffusion