Neodragon: Mobile Video Generation using Diffusion Transformer
Animesh Karnewar, Denis Korzhenkov, Ioannis Lelekas, Adil Karjauv, Noor Fathima, Hanwen Xiong, Vancheeswaran Vaidyanathan, Will Zeng, Rafael Esteves, Tushar Singhal, Fatih Porikli, Mohsen Ghafoorian, Amirhossein Habibian

TL;DR
Neodragon is a mobile-optimized text-to-video system that generates high-quality 2-second videos efficiently on low-resource hardware through innovative model compression and acceleration techniques.
Contribution
The paper introduces novel distillation and pruning methods to optimize transformer-based video generation models for mobile devices, generating a 2-second clip in 6.7 s on a Qualcomm Hexagon NPU.
Findings
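Generates 2-second videos (49 frames at 24 fps, 640x1024) in 6.7 s on a Qualcomm Hexagon NPU, a generation throughput of about 7 FPS
Reduces model size and runtime significantly, for example by replacing the 4.762B-parameter T5xxl text encoder with a 0.2B DT5 (DistilT5)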
Maintains high video quality with optimized models
Abstract
We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at the 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Differing from existing transformer-based offline text-to-video generation models, Neodragon is the first to have been specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach allowing us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the generation pipeline. (3) Pruning of MMDiT blocks…
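The headline numbers in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the function names are illustrative and do not come from the paper. Note that the 7 FPS figure is generation throughput (frames produced per second of compute), not the 24 fps playback rate.

```python
# Figures quoted in the abstract; the helper names are illustrative only.
NUM_FRAMES = 49            # frames per generated clip
PLAYBACK_FPS = 24          # playback frame rate of the clip
GENERATION_SECONDS = 6.7   # reported on-device generation time
T5_XXL_PARAMS_B = 4.762    # original text encoder, billions of parameters
DT5_PARAMS_B = 0.2         # distilled text encoder, billions of parameters


def generation_throughput(num_frames: int, seconds: float) -> float:
    """Frames produced per second of wall-clock generation time."""
    return num_frames / seconds


def slowdown_vs_playback(num_frames: int, playback_fps: float, seconds: float) -> float:
    """How many times longer generation takes than playing the clip back."""
    clip_seconds = num_frames / playback_fps
    return seconds / clip_seconds


def size_reduction(original: float, compressed: float) -> float:
    """Ratio of original to compressed parameter count."""
    return original / compressed


if __name__ == "__main__":
    fps = generation_throughput(NUM_FRAMES, GENERATION_SECONDS)
    slow = slowdown_vs_playback(NUM_FRAMES, PLAYBACK_FPS, GENERATION_SECONDS)
    ratio = size_reduction(T5_XXL_PARAMS_B, DT5_PARAMS_B)
    print(f"generation throughput: {fps:.2f} frames/s")
    print(f"generation time vs. playback time: {slow:.2f}x")
    print(f"text-encoder parameter reduction: {ratio:.1f}x")
```

By this arithmetic, 49 frames in 6.7 s is roughly 7.3 frames per second, so generating the clip takes a little over three times as long as playing it back, and the text encoder shrinks by a factor of about 24.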
Peer Reviews
Decision: ICLR 2026 Poster
- The paper presents a clear, step-by-step framework for on-device video generation with competitive VBench scores.
- Text-only distillation for the text-encoder is well-explained and practically motivated.
- The step-distillation for pyramidal model adaptation is novel and empirically effective under the 4-4-4 configuration (Tab. 3), with a qualitative result illustrating the tradeoff.
- The paper does not include a user study or human evaluation of the generated videos, which would strengthen the perceptual quality claims.
- The hardware/deployment setup is under-specified: key details such as peak memory, per-module latency, data types/quantization, and runtime environment are missing.
- The claim that the visual and textual importance of a block are uncorrelated is interesting, but currently under-supported; additional evidence/analysis would make this argument more convin
1. The paper successfully transforms a large, server-side Diffusion Transformer into a highly efficient, deployable mobile solution.
2. The paper is well-structured, featuring a data-driven abstract and logical presentation of the methodology.
3. The paper provides a finding that the Cosine Distance loss is indispensable for stabilizing the text encoder distillation process. This highlights the crucial role of preserving the directional coherence of text embeddings for the downstream attention
1. The final performance claim (VBench 81.61) is achieved by a hybrid pipeline that uses external models (SSD-IB for high-quality first-frame initialization and QuickSRNet for super-resolution) to compensate for artifacts caused by aggressive step distillation in the core model.
2. The key assumptions underpinning the core compression strategies, such as the "universality of compressed video latent space" and the concept of "similarly shallow semantic demands" for large language models, are supp
1. The motivation is clear and meaningful, as on-device video generation models are important for phones and laptops.
2. The experiments are thorough, providing a detailed exploration of various compression techniques.
3. The paper adapts the various components of the video generation model to the edge device, which makes it a very systematic project.
4. The proposed model performs well and achieves advanced generation quality on the edge device.
1. Some key metrics for end-to-end deployment, such as memory consumption and latency, are reported only in text and are hard to follow; tables or charts would make them clearer.
2. The paper should more clearly justify the baseline choices discussed in the appendix, such as why Pyramidal Flow was chosen.
3. More efficiency metrics should be reported, such as memory consumption and latency
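One of the reviews above singles out the Cosine Distance loss as indispensable for stabilizing text-encoder distillation. The following is a minimal sketch of what such a term could look like, not the paper's actual objective: the pairing with a mean-squared-error term, the weights, and all function names are assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cos(u, v): zero when the embeddings point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_squared_error(u: Vector, v: Vector) -> float:
    """Average squared difference between corresponding components."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(
    student_tokens: Sequence[Vector],
    teacher_tokens: Sequence[Vector],
    cos_weight: float = 1.0,  # assumed weighting, not taken from the paper
    mse_weight: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """Average per-token loss between student and teacher text embeddings."""
    total = 0.0
    for s, t in zip(student_tokens, teacher_tokens):
        total += cos_weight * cosine_distance(s, t)
        total += mse_weight * mean_squared_error(s, t)
    return total / len(teacher_tokens)


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 2.0]]
    student = [[0.5, 0.0], [0.0, 1.0]]  # same directions, half the magnitude
    print(distillation_loss(student, teacher, mse_weight=0.0))
    print(distillation_loss(student, teacher))
```

The cosine term is insensitive to embedding magnitude and penalizes only directional mismatch, which matches the review's point about preserving the directional coherence of the text embeddings. In the usage example the student matches the teacher's directions exactly, so only the magnitude-sensitive term contributes.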
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Embedded Systems Design Techniques · Human Motion and Animation
