MoViE: Mobile Diffusion for Video Editing

Adil Karjauv, Noor Fathima, Ioannis Lelekas, Fatih Porikli, Amir Ghodrati, and Amirhossein Habibian

arXiv:2412.06578 · cs.CV · December 10, 2024

TL;DR

This paper presents a series of optimizations for diffusion-based video editing that enable editing at 12 frames per second on mobile devices while maintaining high quality.

Contribution

It introduces architectural optimizations, a lightweight autoencoder, and a novel adversarial distillation scheme to achieve fast, high-quality mobile video editing.

Findings

01. Achieves video editing at 12 frames per second on mobile devices.

02. Maintains high editing quality while reducing sampling to a single step.

03. Introduces an adversarial distillation scheme that preserves the controllability of the editing process.

Abstract

Recent progress in diffusion-based video editing has shown remarkable potential for practical applications. However, these methods remain prohibitively expensive and challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that render mobile video editing feasible. Building upon the existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. Subsequently, we extend classifier-free guidance distillation to multiple modalities, resulting in a threefold on-device speedup. Finally, we reduce the number of sampling steps to one by introducing a novel adversarial distillation scheme which preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices, while maintaining high quality. Our results are available at…
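As a rough illustration of why distilling multimodal classifier-free guidance can yield the threefold speedup mentioned above, the sketch below combines three denoiser passes (unconditional, image-only, image plus text) into one guided estimate. It assumes the InstructPix2Pix-style guidance formula, which the abstract does not spell out; `multimodal_cfg`, `toy_denoiser`, and the scale names `s_image` and `s_text` are hypothetical names chosen for this example, and the toy denoiser is a stand-in rather than a real network.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical denoiser signature (an assumption for this sketch):
# (noisy latent, image condition or None, text condition or None) -> noise estimate
Denoiser = Callable[[Vector, Optional[Vector], Optional[str]], Vector]


def multimodal_cfg(
    denoise: Denoiser,
    z: Vector,
    image_cond: Vector,
    text_cond: str,
    s_image: float,
    s_text: float,
) -> Vector:
    """Combine three denoiser passes with separate image and text guidance scales."""
    e_uncond = denoise(z, None, None)            # no conditioning
    e_image = denoise(z, image_cond, None)       # image conditioning only
    e_full = denoise(z, image_cond, text_cond)   # image and text conditioning
    return [
        u + s_image * (i - u) + s_text * (f - i)
        for u, i, f in zip(e_uncond, e_image, e_full)
    ]


def toy_denoiser(z: Vector, image_cond: Optional[Vector], text_cond: Optional[str]) -> Vector:
    """Toy stand-in for a network: each present condition adds a constant offset."""
    shift = (1.0 if image_cond is not None else 0.0) + (2.0 if text_cond is not None else 0.0)
    return [v + shift for v in z]


if __name__ == "__main__":
    guided = multimodal_cfg(toy_denoiser, [0.0, 1.0], [0.5], "make it snow", 1.5, 7.5)
    print(guided)
```

Under this formulation every sampling step costs three network evaluations. A distilled student that takes the two scales as extra inputs and predicts the guided estimate directly would need only one evaluation per step, which is consistent with the reported threefold speedup.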

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This work proposes a comprehensive and end-to-end workflow that enables on-device video editing, which is well motivated and an important research direction.
2. The proposed Multimodal CFG distillation extends single-modality distillation to text+image with explicit scale inputs, and is simple to adopt in other editing pipelines.
3. The paper is clearly written and easy to follow. The related work section is comprehensive, and the method is elaborated clearly in detailed workflows.

Weaknesses

1. This work is more engineering-oriented. Most investigated components are well-established, including efficient VAE, CFG distillation, and adversarial step distillation.
2. One big concern is the actual editing quality. The edited videos (e.g., color/style change, adding objects, altering weather, etc.) look semantically correct in the sense that the edit direction roughly follows the prompt (e.g., “make it snow,” “turn day to night”). But the texture detail, lighting continuity, and tempora

Reviewer 02 · Rating 4 · Confidence 5

Strengths

Mobile-Pix2Pix achieves substantial efficiency gains without noticeable perceptual degradation by pruning high-resolution attention layers and replacing the standard VAE with the Tiny Autoencoder for Stable Diffusion (TAESD), a lightweight deterministic model trained using adversarial and reconstruction objectives.

Multimodal Guidance Distillation extends classifier-free guidance to jointly handle text and image modalities. The guidance scales are embedded into the UNet’s ResNet blocks, allowin

Weaknesses

I view the multimodal guidance distillation as a direct extension of Meng et al. [1]. The paper would benefit from a deeper discussion of any non-trivial technical challenges or unique insights encountered when generalizing this approach from a single text modality to both text and image modalities. Such clarification would help distinguish the contribution beyond an incremental adaptation.

Competing models such as TokenFlow are evaluated at 50 diffusion steps. Considering that this paper’s mai

Reviewer 03 · Rating 4 · Confidence 3

Strengths

**Significant efficiency gains**: The proposed methods deliver remarkable acceleration (up to 12 FPS on mobile), achieving over 10× speed-up compared to prior diffusion-based video editing approaches while maintaining controllability. The work meaningfully advances the feasibility of deploying diffusion-based video editing models on edge devices.

**Clear and well-presented writing**: The paper is clearly written, logically structured, and easy to follow. Figures and tables are informative and e

Weaknesses

**Sacrificed editing quality and limited trade-off analysis between efficiency and quality**: While the acceleration results are impressive, the editing quality is noticeably sacrificed. As shown in Fig. 6 and Fig. 9, although the outputs generally align with the target prompts, the background consistency and preservation of non-edited attributes are suboptimal. The paper would benefit from a clearer analysis of how editing quality degrades with increasing speed. A figure similar to Fig. 5, expl

Taxonomy

Topics: Multimedia Communication and Technology