Neodragon: Mobile Video Generation using Diffusion Transformer

Animesh Karnewar; Denis Korzhenkov; Ioannis Lelekas; Adil Karjauv; Noor Fathima; Hanwen Xiong; Vancheeswaran Vaidyanathan; Will Zeng; Rafael Esteves; Tushar Singhal; Fatih Porikli; Mohsen Ghafoorian; Amirhossein Habibian

arXiv:2511.06055·cs.CV·November 11, 2025

Open Access · 3 Reviews

TL;DR

Neodragon is a mobile-optimized text-to-video system that generates high-quality 2-second videos efficiently on low-resource hardware through innovative model compression and acceleration techniques.

Contribution

The paper introduces distillation and pruning methods that adapt a transformer-based video generation model to mobile hardware, generating a 2-second clip in 6.7 s (about 7 FPS) on a Qualcomm Hexagon NPU.

Findings

01

Generates 2-second videos at 7 FPS on mobile hardware

02

Reduces model size and runtime, e.g. by replacing the 4.762B T5xxl text encoder with a 0.2B DistilT5

03

Maintains high video quality with optimized models

Abstract

We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at the 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Differing from existing transformer-based offline text-to-video generation models, Neodragon is the first to have been specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach allowing us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the generation pipeline. (3) Pruning of MMDiT blocks…
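The headline figures in the abstract can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract (variable names are illustrative).
frames = 49                 # frames per generated clip
playback_fps = 24           # playback frame rate
generation_seconds = 6.7    # reported on-device generation time
t5_params_b = 4.762         # original T5xxl text encoder, billions of parameters
dt5_params_b = 0.2          # distilled DT5 text encoder, billions of parameters

clip_seconds = frames / playback_fps                  # duration of the clip when played back
generation_fps = frames / generation_seconds          # frames produced per second of compute
slowdown_vs_playback = generation_seconds / clip_seconds
encoder_shrink = t5_params_b / dt5_params_b           # text-encoder size reduction factor

print(f"clip length:       {clip_seconds:.2f} s")
print(f"generation rate:   {generation_fps:.1f} FPS")
print(f"compute/playback:  {slowdown_vs_playback:.2f}x")
print(f"encoder reduction: {encoder_shrink:.1f}x")
```

Under these numbers the "7 FPS" figure is a generation throughput: producing the clip takes roughly three times as long as playing it back at 24 fps.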

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The paper presents a clear, step-by-step framework for on-device video generation with competitive VBench scores.
- Text-only distillation for the text-encoder is well-explained and practically motivated.
- The step-distillation for pyramidal model adaptation is novel and empirically effective under 4-4-4 (Tab. 3), with qualitative results illustrating the trade-off.

Weaknesses

- The paper does not include a user study or human evaluation of the generated videos, which would strengthen the perceptual quality claims.
- The hardware/deployment setup is under-specified: key details such as peak memory, per-module latency, data types/quantization, and runtime environment are missing.
- The claim that the visual and textual importance of a block are uncorrelated is interesting, but currently under-supported; additional evidence/analysis would make this argument more convin

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper successfully transforms a large, server-side Diffusion Transformer into a highly efficient, deployable mobile solution.
2. The paper is well-structured, featuring a data-driven abstract and logical presentation of the methodology.
3. The paper provides a finding that the Cosine Distance loss is indispensable for stabilizing the text encoder distillation process. This highlights the crucial role of preserving the directional coherence of text embeddings for the downstream attention
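The reviewer's point about directional coherence can be illustrated with a minimal sketch. The paper's actual loss is not reproduced on this page, so everything here is an assumption for illustration: the function names, the combination of mean squared error with cosine distance, and the hypothetical `cos_weight` parameter.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; zero when the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student, teacher, cos_weight=1.0):
    """Hypothetical combined loss: magnitude term plus weighted direction term."""
    return mse(student, teacher) + cos_weight * cosine_distance(student, teacher)

teacher = [1.0, 0.0]   # toy teacher embedding
scaled = [2.0, 0.0]    # same direction, wrong magnitude
rotated = [0.0, 1.0]   # same magnitude, wrong direction

print(cosine_distance(scaled, teacher), mse(scaled, teacher))
print(cosine_distance(rotated, teacher), mse(rotated, teacher))
print(distillation_loss(rotated, teacher))
```

In this toy case the cosine term ignores the rescaled vector entirely and penalizes only the rotated one, which is the sense in which it isolates the direction of an embedding from its magnitude.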

Weaknesses

1. The final performance claim (VBench 81.61) is achieved by a hybrid pipeline that uses external models (SSD-IB for high-quality first-frame initialization and QuickSRNet for super-resolution) to compensate for artifacts caused by aggressive step distillation in the core model.
2. The key assumptions underpinning the core compression strategies, such as the "universality of compressed video latent space" and the concept of "similarly shallow semantic demands" for large language models, are supp

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The motivation is clear and meaningful, as on-device video generation models are important for phones and laptops.
2. The experiments are thorough, providing a detailed exploration of various compression techniques.
3. The paper adapts the various components of the video generation model to the edge device, making it a very systematic project.
4. The proposed model performs well and achieves advanced generation quality on the edge device.

Weaknesses

1. Some key data are not presented intuitively: end-to-end deployment metrics such as memory consumption and latency are only explained in text. Tables or charts would make them clearer.
2. The paper should explain some of the baseline choices made in the appendix more clearly, such as why Pyramidal Flow was chosen.
3. More efficiency metrics should be reported, such as memory consumption and latency

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Embedded Systems Design Techniques · Human Motion and Animation