VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

Xiangpeng Yang; Linchao Zhu; Hehe Fan; Yi Yang

arXiv:2502.17258·cs.CV·February 25, 2025

TL;DR

VideoGrain is a zero-shot method that modulates space-time (cross- and self-) attention in diffusion models to enable multi-grained video editing. It targets two problems: semantic misalignment in text-to-region control and feature coupling within the diffusion model.

Contribution

The paper proposes an attention modulation technique for multi-grained (class-level, instance-level, and part-level) video editing within diffusion models, aimed at improving text-to-region control and feature separation.

Findings

01. Achieves state-of-the-art results in multi-grained video editing

02. Effectively improves text-to-region control accuracy

03. Reduces feature interference for clearer editing

Abstract

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing…
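The cross-attention idea in the abstract (raise each local prompt's attention inside its own region, suppress it elsewhere) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `modulated_cross_attention`, the constant additive bias `strength`, and the toy shapes are all assumptions made for illustration, and the self-attention side of the method is not covered.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modulated_cross_attention(q, k, region_masks, token_groups, strength=5.0):
    """Hypothetical sketch of region-aware cross-attention modulation.

    q:            (num_pixels, d) queries from the latent features of one frame
    k:            (num_tokens, d) keys from the text tokens
    region_masks: (num_regions, num_pixels) boolean masks, one per local prompt
    token_groups: one list of token indices per region (its local prompt)
    strength:     assumed constant bias; the paper's actual scaling may differ
    """
    num_tokens = k.shape[0]
    logits = q @ k.T / np.sqrt(q.shape[-1])      # (num_pixels, num_tokens)
    bias = np.zeros_like(logits)
    for mask, tokens in zip(region_masks, token_groups):
        sign = np.where(mask, 1.0, -1.0)         # +1 inside the region, -1 outside
        indicator = np.zeros(num_tokens)
        indicator[list(tokens)] = 1.0            # tokens of this local prompt
        # Amplify (pixel in region, prompt token) pairs, suppress the rest.
        bias += strength * np.outer(sign, indicator)
    return softmax(logits + bias, axis=-1)

# Toy example: 6 pixels split into two regions, 5 text tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
k = rng.normal(size=(5, 8))
region_masks = np.array([[1, 1, 1, 0, 0, 0],
                         [0, 0, 0, 1, 1, 1]], dtype=bool)
token_groups = [[1, 2], [3, 4]]                  # token 0 belongs to no region

plain = modulated_cross_attention(q, k, region_masks, token_groups, strength=0.0)
modulated = modulated_cross_attention(q, k, region_masks, token_groups, strength=5.0)
```

Comparing `plain` and `modulated` row by row shows pixels in each region shifting attention mass toward their own prompt's tokens, which is the text-to-region behaviour the abstract describes.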

Code & Models

Datasets

XiangpengYang/VideoGrain-dataset
dataset · 71 downloads

Videos

VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Diffusion