SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Yanyu Li; Huan Wang; Qing Jin; Ju Hu; Pavlo Chemerys; Yun Fu; Yanzhi Wang; Sergey Tulyakov; Jian Ren

arXiv:2306.00980 · cs.CV · October 17, 2023 · 35 citations

TL;DR

SnapFusion is a text-to-image diffusion model efficient enough to generate high-quality images on mobile devices in under two seconds, so generation can run on-device instead of on cloud GPUs, which lowers cost and keeps user data private.

Contribution

The paper proposes an efficient UNet, obtained by identifying and removing redundancy in the original Stable Diffusion UNet, together with an improved step-distillation procedure that reduces the number of denoising steps. Combined, they enable fast, high-quality image generation on mobile devices.
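The step-distillation idea can be illustrated with a toy example. The sketch below is a generic progressive-distillation loop (a student learns to reproduce two teacher sampling steps with a single step, and the procedure is repeated to halve the step count), not SnapFusion's exact procedure. The "denoiser" is a hypothetical linear velocity field dx/dt = -rate·x sampled with Euler steps, chosen only because the correctly distilled student can be checked exactly; `TEACHER_RATE`, `distill_round` and `progressive_distill` are illustrative names.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): the "denoiser" is a linear
# velocity field v(x) = -rate * x, integrated over unit time with Euler steps.
TEACHER_RATE = 1.5


def euler_sample(rate, x0, n_steps):
    """Run n_steps Euler steps of dx/dt = -rate * x over unit time."""
    h = 1.0 / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * (-rate * x)
    return x


def distill_round(teacher_rate, teacher_steps, rng, lr=0.5, iters=200):
    """One distillation round: fit a student so that ONE step of size 2h
    matches TWO teacher steps of size h, halving the number of steps."""
    h = 1.0 / teacher_steps
    student_rate = teacher_rate  # initialize the student from the teacher
    for _ in range(iters):
        x = rng.standard_normal(256)                  # batch of noisy inputs
        target = x * (1.0 - h * teacher_rate) ** 2    # two teacher steps
        pred = x * (1.0 - 2.0 * h * student_rate)     # one student step
        # Gradient of the mean squared error with respect to student_rate.
        grad = np.mean(2.0 * (pred - target) * (-2.0 * h * x))
        student_rate -= lr * grad
    return student_rate


def progressive_distill(rate, steps, rng):
    """Repeatedly halve the step count (8 -> 4 -> 2 -> 1 for steps=8);
    each round's student becomes the next round's teacher."""
    while steps > 1:
        rate = distill_round(rate, steps, rng)
        steps //= 2
    return rate


rng = np.random.default_rng(0)
one_step_rate = progressive_distill(TEACHER_RATE, 8, rng)
print("teacher, 8 steps:", euler_sample(TEACHER_RATE, 1.0, 8))
print("teacher, 1 step :", euler_sample(TEACHER_RATE, 1.0, 1))
print("student, 1 step :", euler_sample(one_step_rate, 1.0, 1))
```

Running the teacher with one step gives a very different result from its 8-step output, while the distilled one-step student matches the 8-step teacher. In this linear toy the student only has to learn a smaller rate; a real UNet student is trained on the same "match two teacher steps" target, but over images, timesteps and text conditioning.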

Findings

01

Generates an image in under 2 seconds on mobile devices

02

With 8 denoising steps, achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps

03

Maintains high image quality with improved FID and CLIP scores
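The two metrics named in the findings can be sketched as follows, using common conventions rather than the paper's evaluation code. FID is the Fréchet distance between Gaussians fitted to feature embeddings of real and generated images (normally Inception-v3 features), and CLIP score is a scaled cosine similarity between CLIP image and text embeddings. The sketch assumes the embeddings were already extracted; the function names and the random stand-in features are hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """FID-style distance between two feature sets (rows are samples):
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((S_a S_b)^(1/2)) is the sum of square roots of the eigenvalues of
    # S_a S_b; they are real and non-negative for covariance matrices, so only
    # round-off (tiny imaginary parts or negatives) needs to be discarded.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def clip_score(image_emb, text_emb):
    """Mean over paired rows of 100 * max(cos(image, text), 0)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cos = np.sum(img * txt, axis=1)
    return float(np.mean(100.0 * np.clip(cos, 0.0, None)))


rng = np.random.default_rng(0)
real = rng.standard_normal((1000, 8))          # stand-in "real" features
fake = rng.standard_normal((1000, 8)) + 0.5    # stand-in "generated" features
print("FID-style distance:", frechet_distance(real, fake))

image_emb = rng.standard_normal((4, 16))                    # stand-in image embeddings
text_emb = image_emb + 0.3 * rng.standard_normal((4, 16))   # roughly aligned text
print("CLIP-style score:", clip_score(image_emb, text_emb))
```

Lower FID means the generated feature distribution is closer to the real one; higher CLIP score means images agree better with their prompts. Common FID implementations compute the matrix square root with `scipy.linalg.sqrtm`; the eigenvalue form above only needs NumPy.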

Abstract

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the…
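The abstract's point that tens of denoising iterations make inference slow comes down to simple arithmetic: the text encoder and the image decoder run once, but the UNet runs at every denoising step, so its cost is multiplied by the step count. The sketch below uses purely hypothetical component latencies (not measurements from the paper) to show how a fixed 2-second budget constrains the per-call UNet latency. `evals_per_step` is an assumed parameter, for example 2 if classifier-free guidance runs separate conditional and unconditional passes.

```python
def pipeline_latency_ms(n_steps, unet_ms, text_encoder_ms, decoder_ms, evals_per_step=1):
    """End-to-end latency: encode the prompt once, run the UNet
    n_steps * evals_per_step times, then decode the latent once."""
    return text_encoder_ms + n_steps * evals_per_step * unet_ms + decoder_ms


def max_unet_ms(budget_ms, n_steps, text_encoder_ms, decoder_ms, evals_per_step=1):
    """Largest per-call UNet latency that still fits within budget_ms."""
    remaining = budget_ms - text_encoder_ms - decoder_ms
    return remaining / (n_steps * evals_per_step)


# Hypothetical component latencies in milliseconds, for illustration only.
TEXT_MS, DECODER_MS, BUDGET_MS = 10.0, 100.0, 2000.0

for steps in (50, 8):
    limit = max_unet_ms(BUDGET_MS, steps, TEXT_MS, DECODER_MS)
    print(f"{steps:>2} steps: each UNet call must take <= {limit:.1f} ms")
```

Under these assumed numbers, going from 50 to 8 steps relaxes the per-call UNet budget by a factor of 50/8 = 6.25, which illustrates why reducing the step count and making each UNet call cheaper are complementary.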

Videos

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods

Methods: Diffusion · Contrastive Language-Image Pre-training