VALA: Learning Latent Anchors for Training-Free and Temporally Consistent
Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao

TL;DR
VALA introduces a variational alignment module that adaptively selects key frames and compresses their features into semantic anchors, enhancing temporal consistency and efficiency in training-free video editing with diffusion models.
Contribution
The paper proposes a variational framework with a contrastive learning objective that learns meaningful latent anchors for consistent, training-free video editing.
Findings
Achieves state-of-the-art inversion fidelity and editing quality.
Improves temporal consistency in video editing.
Offers enhanced efficiency over prior methods.
Abstract
Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose VALA (Variational Alignment for Latent Anchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA uses a variational framework with a contrastive learning objective. It can therefore transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence.…
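The abstract describes two steps: assigning frames to anchors, and training that assignment with a contrastive objective. The sketch below illustrates the general idea in NumPy and is not the authors' implementation. The function names (`compress_to_anchors`, `info_nce`), the cosine-similarity soft assignment, the temperature value, and the use of an InfoNCE-style loss are all assumptions made for illustration. The variational part (sampling assignments and a KL term) is left out and replaced by a deterministic softmax.

```python
import numpy as np

def _unit(x, eps=1e-8):
    # Row-wise L2 normalization; eps guards against zero-norm rows.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def _log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def compress_to_anchors(frame_latents, anchor_queries, temperature=0.1):
    """Soft-assign T frame latents (T, D) to K hypothetical anchor queries (K, D).

    Returns the pooled anchors (K, D) and the assignment matrix (T, K),
    where each row of the assignment matrix sums to 1.
    """
    logits = _unit(frame_latents) @ _unit(anchor_queries).T / temperature
    assign = np.exp(_log_softmax(logits, axis=1))            # (T, K)
    # Normalize per anchor so each anchor is a weighted mean of frames.
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    anchors = weights.T @ frame_latents                      # (K, D)
    return anchors, assign

def info_nce(frame_latents, anchors, assign, temperature=0.1):
    """InfoNCE-style loss (an assumption, not taken from the paper).

    Each frame's positive is its highest-weight anchor; the remaining
    anchors act as negatives.
    """
    logits = _unit(frame_latents) @ _unit(anchors).T / temperature
    log_prob = _log_softmax(logits, axis=1)
    positives = assign.argmax(axis=1)
    rows = np.arange(frame_latents.shape[0])
    return float(-log_prob[rows, positives].mean())

# Usage with random stand-ins for per-frame latents.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))    # 8 frames, 16-dim latents
queries = rng.normal(size=(3, 16))   # 3 anchors
anchors, assign = compress_to_anchors(frames, queries)
loss = info_nce(frames, anchors, assign)
```

In this sketch the number of anchors is fixed in advance, so the 8 frame latents are compressed into 3 anchor vectors. How VALA itself chooses key frames and sizes the anchor set is not specified in the excerpt above.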
