ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

Kunpeng Du; Haizhen Xie; Sen Lu; Lei Yu; Binglei Bao; Huaao Tang; Chuntao Liu; Hao Wu; Yang Zhao; Zhicai Huang; Heyuan Gao; Zhijun Tu; Jie Hu; Xinghao Chen

arXiv:2605.15684 · cs.CV · May 18, 2026

TL;DR

ElasticDiT is a flexible diffusion transformer architecture that dynamically balances image quality and computational efficiency on mobile devices through elastic design and sparse attention.

Contribution

The paper introduces ElasticDiT, a single model that dynamically adjusts its spatial compression ratio and DiT block depth, reducing inference latency and memory footprint while maintaining image quality.

Findings

01

ElasticDiT covers a wide fidelity-latency trade-off range within a single model.

02

The flex lite variant reaches an HPS of 32.87, surpassing Flux.

03

Shift Sparse Block Attention (SSBA) achieves 84.16% average sparsity, reducing inference cost.
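To give a feel for what 84.16% average sparsity could mean in practice, the sketch below counts query/key block pairs in a generic block-sparse attention layout. It does not reproduce SSBA's shift pattern, which this page does not describe. The token count, block size, and helper names are illustrative assumptions, and the speedup is an idealized upper bound on attention-score compute only, not a measured latency.

```python
# Illustrative arithmetic only: a generic block-sparse attention budget.
# The function names, token count, and block size are assumptions made
# for this sketch; they are not taken from the paper.

def dense_block_pairs(num_tokens: int, block_size: int) -> int:
    """Number of (query block, key block) pairs in dense attention."""
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    return num_blocks * num_blocks


def kept_block_pairs(num_tokens: int, block_size: int, sparsity: float) -> int:
    """Block pairs still computed when a fraction `sparsity` is skipped."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    total = dense_block_pairs(num_tokens, block_size)
    return round(total * (1.0 - sparsity))


def ideal_attention_speedup(sparsity: float) -> float:
    """Upper bound on attention speedup if skipped blocks cost nothing."""
    return 1.0 / (1.0 - sparsity)


if __name__ == "__main__":
    tokens, block = 4096, 64  # hypothetical sequence length and block size
    print(dense_block_pairs(tokens, block))
    print(kept_block_pairs(tokens, block, 0.8416))
    print(ideal_attention_speedup(0.8416))
```

Real speedups are typically smaller than this bound, because attention is only one part of a DiT block and sparse kernels carry their own indexing overhead.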

Abstract

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT…
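The two knobs named in the abstract, spatial compression ratio and DiT block depth, both act on compute in a way that is easy to sketch. The example below is a rough cost model, not the paper's method: `ElasticConfig`, the hidden size, and the per-block estimate (a common approximation of linear-layer cost plus quadratic attention cost for a transformer block with a 4x MLP) are all assumptions made for illustration.

```python
# A rough cost model under stated assumptions; not ElasticDiT's actual
# architecture. `ElasticConfig` and `hidden` are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class ElasticConfig:
    compression: int  # total pixel-to-token spatial downsampling factor (assumed)
    depth: int        # number of DiT blocks executed (assumed)


def num_tokens(height: int, width: int, compression: int) -> int:
    """Token count after downsampling each spatial side by `compression`."""
    if height % compression or width % compression:
        raise ValueError("image size must be divisible by the compression factor")
    return (height // compression) * (width // compression)


def relative_cost(cfg: ElasticConfig, height: int, width: int, hidden: int = 1024) -> int:
    """Approximate multiply-accumulates for the whole transformer stack."""
    n = num_tokens(height, width, cfg.compression)
    linear = 12 * n * hidden ** 2    # QKV, output projection, 4x MLP
    attention = 2 * n ** 2 * hidden  # score matrix and weighted sum
    return cfg.depth * (linear + attention)


if __name__ == "__main__":
    full = ElasticConfig(compression=16, depth=24)  # hypothetical "quality" setting
    lite = ElasticConfig(compression=32, depth=12)  # hypothetical "fast" setting
    ratio = relative_cost(full, 1024, 1024) / relative_cost(lite, 1024, 1024)
    print(f"full/lite cost ratio: {ratio:.1f}x")
```

Doubling the compression factor cuts the token count by four and the attention term by sixteen, while depth scales cost linearly, which is why combining the two can span a wide fidelity-latency range within one model.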
