Hardware-Friendly Diffusion Models with Fixed-Size Reusable Structures for On-Device Image Generation
Sanchar Palit, Sathya Veera Reddy Dendi, Mallikarjuna Talluri, Raj Narayana Gadde

TL;DR
This paper introduces a hardware-efficient diffusion model architecture built from fixed-size, reusable blocks. It eliminates positional embeddings and demonstrates strong performance on resource-limited devices such as mobile phones.
Contribution
It presents a novel fixed-size, token-free diffusion model architecture optimized for hardware deployment, addressing limitations of existing Transformer-based and U-Net-based models.
Findings
Achieved a state-of-the-art FID score of 1.6 on CelebA.
Demonstrated consistent performance across unconditional and conditional tasks.
Model is highly suitable for mobile and resource-constrained devices.
Abstract
Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges while realizing them on-device. Vision Transformers require positional embedding to maintain correspondence between the tokens processed by the transformer, although they offer the advantage of using fixed-size, reusable repetitive blocks following tokenization. The U-Net architecture lacks these attributes, as it utilizes variable-sized intermediate blocks for down-convolution and up-convolution in the noise estimation backbone for the diffusion process. To address these issues, we propose an architecture that utilizes a fixed-size, reusable transformer block as a core structure, making it more suitable for hardware implementation. Our architecture is characterized by low complexity, token-free design, absence of…
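The hardware argument in the abstract can be illustrated with simple shape arithmetic. The sketch below is not the paper's architecture. It assumes a generic U-Net that halves resolution and doubles channels at each level, and compares it with a stack of blocks whose input and output shapes are identical. The function names and the 64x64x32 example sizes are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: the shapes and the halve/double rule are
# assumptions about a generic U-Net, not values taken from the paper.

def unet_activation_shapes(height, width, channels, depth):
    """Return (H, W, C) activation shapes along a generic U-Net.

    Going down, resolution halves and channel count doubles at each level;
    going up, the same shapes are visited in reverse order.
    """
    down = [(height >> d, width >> d, channels << d) for d in range(depth + 1)]
    up = down[-2::-1]  # mirror of the encoder path, excluding the bottleneck
    return down + up


def fixed_block_activation_shapes(height, width, channels, num_blocks):
    """Return activation shapes for a stack of shape-preserving blocks."""
    return [(height, width, channels)] * (num_blocks + 1)


def distinct_buffer_sizes(shapes):
    """Return the sorted set of element counts a runtime would need to handle."""
    return sorted({h * w * c for h, w, c in shapes})


unet_sizes = distinct_buffer_sizes(unet_activation_shapes(64, 64, 32, depth=3))
fixed_sizes = distinct_buffer_sizes(fixed_block_activation_shapes(64, 64, 32, num_blocks=6))

print("U-Net buffer sizes:", unet_sizes)
print("Fixed-block buffer sizes:", fixed_sizes)
```

Under these assumptions the U-Net path needs several differently sized intermediate buffers, while the fixed-size stack needs only one. A single buffer size and a single compiled block can then be reused across the whole network, which is the kind of property the abstract describes as favorable for hardware.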
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
Methods: Diffusion · Concatenated Skip Connection · Max Pooling · Convolution · U-Net
