Hardware-Friendly Diffusion Models with Fixed-Size Reusable Structures for On-Device Image Generation

Sanchar Palit; Sathya Veera Reddy Dendi; Mallikarjuna Talluri; Raj Narayana Gadde

arXiv:2411.06119·cs.CV·September 5, 2025

Open Access

TL;DR

This paper introduces a hardware-efficient diffusion model architecture built from fixed-size, reusable blocks. The design eliminates positional embeddings and performs strongly on resource-limited devices such as mobile phones.

Contribution

It presents a novel fixed-size, token-free diffusion model architecture optimized for hardware deployment, addressing limitations of existing Transformer and U-Net based models.

Findings

01

Achieved a state-of-the-art FID score of 1.6 on CelebA.

02

Demonstrated consistent performance across unconditional and conditional tasks.

03

Model is highly suitable for mobile and resource-constrained devices.

Abstract

Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges when realized on-device. Vision Transformers require positional embeddings to maintain correspondence between the tokens processed by the transformer, although they offer the advantage of using fixed-size, reusable repetitive blocks following tokenization. The U-Net architecture lacks these attributes, as it utilizes variable-sized intermediate blocks for down-convolution and up-convolution in the noise estimation backbone for the diffusion process. To address these issues, we propose an architecture that utilizes a fixed-size, reusable transformer block as a core structure, making it more suitable for hardware implementation. Our architecture is characterized by low complexity, token-free design, absence of…
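
The contrast the abstract draws between variable-sized U-Net stages and a fixed-size reusable block can be illustrated with simple shape bookkeeping. The sketch below is not the authors' architecture: the function names, the 64x64x32 input, and the "halve resolution, double channels" U-Net convention are all assumptions chosen for illustration. It only counts how many distinct activation shapes a hardware buffer would have to accommodate in each case.

```python
def unet_stage_shapes(height, width, channels, depth):
    """Activation shapes along a U-Net-style encoder/decoder path.

    Assumption (common convention, not taken from the paper): each
    down step halves height and width and doubles the channel count,
    and the decoder mirrors the encoder.
    """
    shapes = [(height, width, channels)]
    for _ in range(depth):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    # Mirror the encoder shapes (excluding the bottleneck) for the decoder.
    shapes += shapes[-2::-1]
    return shapes


def fixed_block_shapes(height, width, channels, num_blocks):
    """Activation shapes when one fixed-size block is applied repeatedly.

    The block maps an input to an output of the same shape, so every
    intermediate activation has identical dimensions.
    """
    return [(height, width, channels)] * (num_blocks + 1)


# Hypothetical 64x64 feature map with 32 channels.
unet = unet_stage_shapes(64, 64, 32, depth=3)
fixed = fixed_block_shapes(64, 64, 32, num_blocks=6)

print("U-Net distinct shapes:      ", sorted(set(unet)))
print("Fixed-block distinct shapes:", sorted(set(fixed)))
```

Under these assumptions the U-Net path needs buffers for several different shapes, while the repeated block needs only one, which is the property the abstract describes as making the design more suitable for hardware implementation.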

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques

Methods: Diffusion · Concatenated Skip Connection · Max Pooling · Convolution · U-Net