VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

Jing Gu; Yuwei Fang; Ivan Skorokhodov; Peter Wonka; Xinya Du; Sergey Tulyakov; Xin Eric Wang

arXiv:2406.12831 · cs.CV · March 28, 2025

TL;DR

VIA is a unified framework that enables consistent and precise global and local editing of long videos by adapting pre-trained models for spatiotemporal coherence and control.

Contribution

The paper introduces VIA, a spatiotemporal video adaptation framework that improves the consistency of long video editing and the precision of local control through test-time editing adaptation and recursive attention strategies.

Findings

01. Produces more faithful and coherent video edits

02. Achieves consistent long video editing in minutes

03. Outperforms baseline methods in accuracy and control

Abstract

Video editing serves as a fundamental pillar of digital media, spanning applications in entertainment, education, and professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal Video Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, we design a test-time editing adaptation method that adapts a pre-trained image editing model to improve consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency…
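The abstract's mention of masked latent variables for local control can be illustrated with a generic masked blend. This is a minimal sketch of that general idea only, not the paper's method: the function name `blend_masked_latents`, the latent shape, and the hard binary mask are all assumptions made for illustration.

```python
import numpy as np


def blend_masked_latents(source_latent, edited_latent, mask):
    """Hypothetical helper: keep edited values inside the mask, source values outside.

    source_latent, edited_latent: arrays of shape (channels, height, width).
    mask: array broadcastable to that shape, with values in [0, 1].
    """
    mask = np.clip(mask.astype(source_latent.dtype), 0.0, 1.0)
    # Convex combination per element: mask=1 takes the edit, mask=0 keeps the source.
    return mask * edited_latent + (1.0 - mask) * source_latent


# Toy latents (assumed shapes): the "edit" is all ones, the source is all zeros.
source = np.zeros((4, 8, 8), dtype=np.float32)
edited = np.ones((4, 8, 8), dtype=np.float32)

# Restrict the edit to a 4x4 region; the single mask channel broadcasts over all 4 channels.
mask = np.zeros((1, 8, 8), dtype=np.float32)
mask[:, 2:6, 2:6] = 1.0

blended = blend_masked_latents(source, edited, mask)
```

In this toy setup only the masked 4x4 region takes the edited values, while every location outside it keeps the source latent unchanged, which is the sense in which a mask gives local control.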


Taxonomy

Topics: Video Analysis and Summarization · Advanced Vision and Imaging

Methods: Softmax · Attention Is All You Need