Mobile Video Diffusion
Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian

TL;DR
This paper presents MobileVD, a mobile-optimized video diffusion model derived from Stable Video Diffusion (SVD). It cuts computational cost by a reported 523x at a slight cost in quality and generates the latents for a short clip in 1.7 seconds on a phone.
Contribution
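Introduces MobileVD, described as the first mobile-optimized video diffusion model. It combines reduced frame resolution, multi-scale temporal representations, two novel pruning schemes (one for channels, one for temporal blocks), and adversarial finetuning that reduces denoising to a single step.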
Findings
MobileVD is reported as 523x more efficient than the SVD baseline (1817.2 vs. 4.34 TFLOPs).
Generates latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro.
Incurs a slight quality drop: FVD rises from 149 to 171 (lower is better).
Abstract
Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schema to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined as MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at…
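To make "generating latents for a 14x512x256 px clip" concrete, the sketch below computes the size of the latent tensor that such a clip would correspond to. The abstract does not give these details, so everything in it is an assumption: an 8x spatial downsampling factor and 4 latent channels (typical of the Stable Diffusion family of autoencoders), and reading 512x256 as width x height. The function name `latent_shape` is hypothetical.

```python
def latent_shape(frames, width, height, downsample=8, latent_channels=4):
    """Return (frames, channels, latent_height, latent_width) for a clip.

    Assumptions (not stated in the abstract): the autoencoder downsamples
    each spatial dimension by `downsample` and emits `latent_channels`
    channels per frame.
    """
    if width % downsample or height % downsample:
        raise ValueError("width and height must be divisible by the downsampling factor")
    return (frames, latent_channels, height // downsample, width // downsample)


def num_elements(shape):
    """Total number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# Assumed reading of the clip size: 14 frames, 512 px wide, 256 px high.
shape = latent_shape(frames=14, width=512, height=256)
pixels = 14 * 3 * 256 * 512  # RGB values in the decoded clip
print("latent shape:", shape)
print("latent values:", num_elements(shape))
print("pixel values:", pixels)
```

Under these assumptions the denoiser works on a tensor far smaller than the decoded clip, which is part of why reducing the frame resolution lowers cost so directly. Decoding those latents to pixels is a separate step that the 1.7 second figure, as worded in the abstract, does not appear to include.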
Taxonomy
Topics: Advanced Data Compression Techniques
Methods: Pruning · Diffusion · Contrastive Language-Image Pre-training
