MoCA-Video: Motion-Aware Concept Alignment for Consistent Video Editing
Tong Zhang, Juan C Leon Alcazar, Victor Escorcia, Bernard Ghanem

TL;DR
MoCA-Video is a training-free framework that enables consistent and controllable semantic video editing by manipulating diffusion models in the latent space, ensuring temporal stability and high-quality results.
Contribution
It introduces a novel training-free approach for semantic video editing using diffusion models with class-agnostic segmentation and momentum-based correction for temporal coherence.
Findings
Outperforms existing training-free and trained methods in semantic mixing.
Achieves high temporal stability and semantic alignment without retraining.
Demonstrates effective control over semantic shifts in video editing.
Abstract
We present MoCA-Video, a training-free framework for semantic mixing in videos. Operating in the latent space of a frozen video diffusion model, MoCA-Video uses class-agnostic segmentation with a diagonal denoising scheduler to localize and track the target object across frames. To ensure temporal stability under semantic shifts, we introduce a momentum-based correction that approximates novel hybrid distributions beyond the training data distribution, alongside a lightweight gamma residual module that smooths out visual artifacts. We evaluate the model's performance using SSIM, LPIPS, and a proposed metric, \metricnameabbr, which quantifies semantic alignment between the reference and the output. Extensive evaluation demonstrates that our model consistently outperforms both training-free and trained baselines, achieving superior semantic mixing and temporal coherence without retraining. Results establish that…
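As a rough illustration of one of the metrics named in the abstract, the sketch below computes a simplified SSIM between two frames and averages it over a clip. It is not the paper's evaluation code: standard SSIM is computed over local windows, whereas this version treats each frame as a single global window, and the names `global_ssim` and `mean_ssim_over_frames`, the flat-list frame representation, and the constants `k1 = 0.01`, `k2 = 0.03` (the commonly used defaults) are assumptions made for the example.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 1.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM between two frames given as flat pixel lists.

    Simplification (assumption): the whole frame is one window, so this is
    only an approximation of the usual locally windowed SSIM.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    n = len(x)
    # Stabilizing constants from the standard SSIM definition.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_ssim_over_frames(video_a: Sequence[Sequence[float]],
                          video_b: Sequence[Sequence[float]]) -> float:
    """Average the per-frame global SSIM over two equally long clips."""
    if len(video_a) != len(video_b) or len(video_a) == 0:
        raise ValueError("clips must be non-empty and of equal length")
    scores = [global_ssim(a, b) for a, b in zip(video_a, video_b)]
    return sum(scores) / len(scores)
```

Identical frames score 1, and the score drops as structure diverges. For real evaluation a windowed implementation from an established image-processing library would be the safer choice.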
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Human Motion and Animation
