MoCA-Video: Motion-Aware Concept Alignment for Consistent Video Editing

Tong Zhang; Juan C Leon Alcazar; Victor Escorcia; Bernard Ghanem

arXiv:2506.01004·cs.CV·December 15, 2025

TL;DR

MoCA-Video is a training-free framework that enables consistent and controllable semantic video editing by manipulating diffusion models in the latent space, ensuring temporal stability and high-quality results.

Contribution

MoCA-Video introduces a training-free approach to semantic video editing with diffusion models, combining class-agnostic segmentation to localize the target object with momentum-based correction to maintain temporal coherence.

Findings

01. Outperforms existing training-free and trained methods in semantic mixing.

02. Achieves high temporal stability and semantic alignment without retraining.

03. Demonstrates effective control over semantic shifts in video editing.

Abstract

We present MoCA-Video, a training-free framework for semantic mixing in videos. Operating in the latent space of a frozen video diffusion model, MoCA-Video uses class-agnostic segmentation with a diagonal denoising scheduler to localize and track the target object across frames. To ensure temporal stability under semantic shifts, we introduce a momentum-based correction that approximates novel hybrid distributions beyond the training data distribution, alongside a light gamma residual module that smooths out visual artifacts. We evaluate the model's performance using SSIM, LPIPS, and a proposed metric that quantifies semantic alignment between reference and output. Extensive evaluation demonstrates that our model consistently outperforms both training-free and trained baselines, achieving superior semantic mixing and temporal coherence without retraining. Results establish that…
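
The abstract names SSIM as one of the evaluation metrics but does not say how it is applied. As a rough illustration only, the sketch below computes a single-window (global) SSIM between two frames and averages it over consecutive frame pairs as a simple temporal-consistency proxy. Both choices are assumptions made here, not the paper's protocol: standard SSIM uses a sliding local window, and the function names `global_ssim` and `mean_consecutive_ssim` are hypothetical. Frames are represented as flattened lists of pixel intensities in [0, data_range].

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 1.0) -> float:
    """Single-window SSIM over two flattened frames.

    Simplification: the usual SSIM averages this quantity over local
    sliding windows; here the whole frame is treated as one window.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_consecutive_ssim(frames: Sequence[Sequence[float]],
                          data_range: float = 1.0) -> float:
    """Average global SSIM between each frame and the next one.

    Assumption: used here as a crude temporal-stability proxy; the
    source does not state that SSIM is computed this way.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    scores = [global_ssim(frames[t], frames[t + 1], data_range)
              for t in range(len(frames) - 1)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    still = [0.1, 0.5, 0.9, 0.3]
    flicker = [0.9, 0.5, 0.1, 0.7]
    print(mean_consecutive_ssim([still, still, still]))
    print(mean_consecutive_ssim([still, flicker, still]))
```

A perfectly static clip scores close to 1 under this proxy, while frame-to-frame flicker lowers the score. A production evaluation would more likely rely on an established windowed SSIM implementation rather than this global variant.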

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Human Motion and Animation