MoViE: Mobile Diffusion for Video Editing
Adil Karjauv, Noor Fathima, Ioannis Lelekas, Fatih Porikli, Amir Ghodrati, and Amirhossein Habibian

TL;DR
This paper presents a series of optimizations for diffusion-based video editing that enable editing at 12 frames per second on mobile devices while maintaining high quality.
Contribution
It introduces architectural optimizations, a lightweight autoencoder, and a novel adversarial distillation scheme to achieve fast, high-quality mobile video editing.
Findings
Achieves 12 fps video editing on mobile devices.
Maintains high editing quality while reducing the number of sampling steps to one.
Introduces a new adversarial distillation scheme that preserves the controllability of the editing process.
Abstract
Recent progress in diffusion-based video editing has shown remarkable potential for practical applications. However, these methods remain prohibitively expensive and challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that render mobile video editing feasible. Building upon the existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. Subsequently, we extend classifier-free guidance distillation to multiple modalities, resulting in a threefold on-device speedup. Finally, we reduce the number of sampling steps to one by introducing a novel adversarial distillation scheme which preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices, while maintaining high quality. Our results are available at…
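The threefold speedup attributed to guidance distillation is consistent with how classifier-free guidance over two conditions is commonly computed: three denoiser evaluations per sampling step, which a distilled model that takes the guidance scales as inputs can replace with one. The sketch below illustrates that combination, assuming the InstructPix2Pix-style formulation; this page does not state the exact formula the paper uses. The names `toy_denoiser` and `multimodal_cfg`, the scale values, and the tensor shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def multimodal_cfg(eps_uncond, eps_image, eps_full, image_scale, text_scale):
    """Combine three noise predictions using separate image and text scales.

    Assumed (InstructPix2Pix-style) formulation, not confirmed by the source:
    eps = eps_uncond
          + image_scale * (eps_image - eps_uncond)
          + text_scale  * (eps_full  - eps_image)
    """
    return (
        eps_uncond
        + image_scale * (eps_image - eps_uncond)
        + text_scale * (eps_full - eps_image)
    )


def toy_denoiser(latent, image_cond=None, text_cond=None):
    # Hypothetical linear stand-in for a UNet; a real model is far more complex.
    out = 0.1 * latent
    if image_cond is not None:
        out = out + 0.5 * image_cond
    if text_cond is not None:
        out = out + 0.25 * text_cond
    return out


rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
image_cond = rng.standard_normal((4, 8, 8))
text_cond = rng.standard_normal((4, 8, 8))

# Three forward passes per step: unconditional, image-only, image plus text.
eps_uncond = toy_denoiser(latent)
eps_image = toy_denoiser(latent, image_cond=image_cond)
eps_full = toy_denoiser(latent, image_cond=image_cond, text_cond=text_cond)

guided = multimodal_cfg(
    eps_uncond, eps_image, eps_full, image_scale=1.5, text_scale=7.5
)
```

With both scales set to 1 the combination reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one. A distilled student would be trained to output `guided` directly from the latent, both conditions, and the two scales in a single pass.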
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. This work proposes a comprehensive and end-to-end workflow that enables on-device video editing, which is well motivated and an important research direction.
2. The proposed Multimodal CFG distillation extends single-modality distillation to text+image with explicit scale inputs, and is simple to adopt in other editing pipelines.
3. The paper is clearly written and easy to follow. The related work section is comprehensive, and the method is elaborated clearly in detailed workflows.
1. This work is more engineering-oriented. Most investigated components are well-established, including efficient VAE, CFG distillation, and adversarial step distillation.
2. One big concern is the actual editing quality. The edited videos (e.g., color/style change, adding objects, altering weather, etc.) look semantically correct in the sense that the edit direction roughly follows the prompt (e.g., “make it snow,” “turn day to night”). But the texture detail, lighting continuity, and tempora
Mobile-Pix2Pix achieves substantial efficiency gains without noticeable perceptual degradation by pruning high-resolution attention layers and replacing the standard VAE with the Tiny Autoencoder for Stable Diffusion (TAESD), a lightweight deterministic model trained using adversarial and reconstruction objectives. Multimodal Guidance Distillation extends classifier-free guidance to jointly handle text and image modalities. The guidance scales are embedded into the UNet’s ResNet blocks, allowin
I view the multimodal guidance distillation as a direct extension of Meng et al. [1]. The paper would benefit from a deeper discussion of any non-trivial technical challenges or unique insights encountered when generalizing this approach from a single text modality to both text and image modalities. Such clarification would help distinguish the contribution beyond an incremental adaptation. Competing models such as TokenFlow are evaluated at 50 diffusion steps. Considering that this paper’s mai
**Significant efficiency gains**: The proposed methods deliver remarkable acceleration (up to 12 FPS on mobile), achieving over 10× speed-up compared to prior diffusion-based video editing approaches while maintaining controllability. The work meaningfully advances the feasibility of deploying diffusion-based video editing models on edge devices. **Clear and well-presented writing**: The paper is clearly written, logically structured, and easy to follow. Figures and tables are informative and e
**Sacrificed editing quality and limited trade-off analysis between efficiency and quality**: While the acceleration results are impressive, the editing quality is noticeably sacrificed. As shown in Fig. 6 and Fig. 9, although the outputs generally align with the target prompts, the background consistency and preservation of non-edited attributes are suboptimal. The paper would benefit from a clearer analysis of how editing quality degrades with increasing speed. A figure similar to Fig. 5, expl
Taxonomy
Topics: Multimedia Communication and Technology
