Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds
Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov

TL;DR
This paper introduces a series of optimizations including model compression, pruning, and distillation techniques to enable real-time video generation on mobile devices using diffusion transformers.
Contribution
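The authors propose three optimizations (a highly compressed VAE, KD-guided sensitivity-aware tri-level pruning, and adversarial step distillation) that accelerate diffusion transformer-based video generation enough to make it feasible on resource-constrained mobile platforms.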
Findings
Achieves approximately 15 FPS on iPhone 16 Pro Max
Reduces model size while maintaining quality
Cuts inference steps to four for efficiency
Abstract
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined,…
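To make the idea of sensitivity-aware pruning more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy stack of residual blocks and scores each block by how much the model output changes when that block is skipped, then keeps the most sensitive ones. The helper names (`make_block`, `block_sensitivities`, `prune_least_sensitive`), the toy blocks, and the calibration inputs are all hypothetical, and only one pruning level (whole blocks) is illustrated rather than the paper's tri-level scheme.

```python
import math

def make_block(scale):
    """Toy residual block x -> x + scale * tanh(x); a hypothetical stand-in for a DiT block."""
    def block(x):
        return [v + scale * math.tanh(v) for v in x]
    return block

def run(blocks, x, skip=None):
    """Run the blocks in order, optionally skipping the block at index `skip`."""
    for i, block in enumerate(blocks):
        if i != skip:
            x = block(x)
    return x

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def block_sensitivities(blocks, calib_inputs):
    """Sensitivity of block i: mean output error versus the full model when block i is skipped."""
    scores = []
    for i in range(len(blocks)):
        err = sum(mse(run(blocks, x), run(blocks, x, skip=i)) for x in calib_inputs)
        scores.append(err / len(calib_inputs))
    return scores

def prune_least_sensitive(blocks, calib_inputs, keep):
    """Keep the `keep` most sensitive blocks, preserving their original order."""
    scores = block_sensitivities(blocks, calib_inputs)
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [blocks[i] for i in kept], kept, scores

# Assumed toy setup: blocks with scale 0.0 and 0.001 contribute almost nothing.
scales = [0.5, 0.0, 0.3, 0.001]
blocks = [make_block(s) for s in scales]
calib = [[0.1, -0.4, 0.8], [1.0, 0.2, -0.6]]
pruned, kept, scores = prune_least_sensitive(blocks, calib, keep=2)
print(kept)
```

In a KD-guided setup, the pruned network would typically then be fine-tuned to match the outputs of the unpruned teacher; that step is omitted from this sketch.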
Peer Reviews
Decision · Submitted to ICLR 2026
S1. This paper tackles a practical problem of on-device DiT video generation and provides good deployment results. S2. The proposed pruning with a KD-guided framework yields substantial speedups with moderate quality drop. S3. Demonstrating real-time performance on mobile hardware is a meaningful empirical result.
W1. The novelty is somewhat limited, as the proposed approach mainly combines existing compression, pruning, and distillation techniques, rather than introducing new algorithmic ideas. W2. The contribution is engineering-driven, focusing on system and deployment optimizations, with relatively limited new ML insights or principles that generalize beyond this specific application. W3. The evaluation is not fully convincing, as it lacks comparisons with recent efficient video diffusion and on-dev
1. The authors propose an innovative model acceleration method that addresses the deployment challenges of DiT on mobile devices. By combining VAE compression, pruning, distillation, and other techniques, they achieve real-time video generation. 2. The experiments are thorough, with strong supporting arguments, and the writing is clear and easy to understand.
1. While the model performance is improved after three layers of pruning and distillation, the distillation process requires significant computational power, which makes training more challenging. 2. The framework was tested on the iPhone 16 Pro Max, but its performance may depend on the specific hardware architecture and optimization strategies. The differences in memory bandwidth and computational power across various edge devices could affect the model’s performance, especially on older device
1) The paper is well written and motivated; it is easy to read and understand. 2) One of the pioneering works in the niche application: one of the first video diffusion transformer models running on-device with decent quality. 3) Comprehensive evaluation, both automated quality metrics and user studies are conducted, and good results are reported. 4) Somewhat novel design of adversarial step distillation setup adapted to the new architecture (DiTs). 5) Interesting discussions about quality-vs-ef
1) My main concern with this work is limited novelty: while the engineering contributions are solid and well-executed, the paper primarily combines existing techniques, compression via VAE, pruning, distillation, and operator-level optimization, without introducing fundamentally new ideas or theoretical insights. The novelty lies more in integration than in conceptual advancement. 2) The pruning and distillation strategies, although tailored for DiTs, follow well-established paradigms. The tri-l
Taxonomy
Topics: Video Coding and Compression Technologies · Advanced Data Compression Techniques · Image and Video Quality Assessment
