SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Dongting Hu; Jierun Chen; Xijie Huang; Huseyin Coskun; Arpit Sahni; Aarush Gupta; Anujraaj Goyal; Dishani Lahiri; Rajesh Singh; Yerlan Idelbayev; Junli Cao; Yanyu Li; Kwang-Ting Cheng; S.-H. Gary Chan; Mingming Gong; Sergey Tulyakov; Anil Kag; Yanwu Xu; Jian Ren

arXiv:2412.09619·cs.CV·December 13, 2024

Open Access

TL;DR

SnapGen introduces a highly efficient, small, and fast text-to-image diffusion model capable of generating high-resolution images on mobile devices, surpassing larger models in quality and speed.

Contribution

The paper combines an efficient network architecture, cross-architecture knowledge distillation from a much larger model, and adversarial guidance for few-step sampling to enable high-quality, high-resolution image generation on mobile devices with a very small model.

Findings

01

Generates 1024x1024 images on mobile in 1.4 seconds.

02

Achieves an FID of 2.06 on ImageNet-1K with only 372M parameters.

03

Outperforms much larger models such as SDXL and IF-XL in generation quality despite its far smaller size.

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable a few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model SnapGen,…
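The multi-level distillation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's actual objective: the function names, the use of mean squared error at every level, the linear projection that bridges mismatched student and teacher feature widths, and the weights `w_task`, `w_out`, and `w_feat` are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def project(features: Vector, weight: Sequence[Vector]) -> List[float]:
    """Linear map from the student's feature width to the teacher's width.

    Hypothetical helper: in a cross-architecture setup the two networks
    need not share feature dimensions, so some mapping is required.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def multi_level_distill_loss(
    student_out: Vector,
    teacher_out: Vector,
    target: Vector,
    student_feats: Sequence[Vector],
    teacher_feats: Sequence[Vector],
    projections: Sequence[Sequence[Vector]],
    w_task: float = 1.0,  # assumed weight, not from the paper
    w_out: float = 1.0,   # assumed weight, not from the paper
    w_feat: float = 1.0,  # assumed weight, not from the paper
) -> float:
    """Combine a task loss with output-level and feature-level distillation."""
    if not student_feats:
        raise ValueError("at least one feature level is required")
    # Level 0: ordinary training loss against the ground-truth target.
    task = mse(student_out, target)
    # Level 1: match the teacher's final prediction.
    out_kd = mse(student_out, teacher_out)
    # Level 2: match intermediate features after projecting the student's.
    feat_kd = sum(
        mse(project(s, p), t)
        for s, t, p in zip(student_feats, teacher_feats, projections)
    ) / len(student_feats)
    return w_task * task + w_out * out_kd + w_feat * feat_kd


loss = multi_level_distill_loss(
    student_out=[1.0, 2.0],
    teacher_out=[1.0, 4.0],
    target=[1.0, 2.0],
    student_feats=[[1.0, 1.0]],
    teacher_feats=[[2.0, 0.0, 2.0]],
    projections=[[[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]],
)
```

In this toy call the student already matches the target and, after projection, the teacher's features, so the only non-zero term is the output-level distillation term.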

Taxonomy

Topics: Embedded Systems Design Techniques · Advanced Malware Detection Techniques · Parallel Computing and Optimization Techniques

Methods: Diffusion · Knowledge Distillation