VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang

TL;DR
VideoGrain introduces a zero-shot method that modulates space-time attention in diffusion models to enable precise multi-grained video editing, addressing two challenges: semantic misalignment in text-to-region control and feature coupling.
Contribution
The paper proposes a zero-shot technique that modulates cross- and self-attention in diffusion models for class-level, instance-level, and part-level video editing, strengthening text-to-region control and feature separation.
Findings
Achieves state-of-the-art results in multi-grained video editing
Effectively improves text-to-region control accuracy
Reduces feature interference for clearer editing
Abstract
Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing…
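The cross-attention idea in the abstract (raise each local prompt's attention inside its own region, lower it elsewhere) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `modulated_cross_attention`, the arguments `region_masks` and `token_region`, and the additive-bias scheme with a fixed `strength` are all assumptions chosen for illustration, and the paper may modulate attention differently.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def modulated_cross_attention(q, k, region_masks, token_region, strength=5.0):
    """Illustrative cross-attention with region-aware logit biasing.

    q:            (num_pixels, d) pixel queries
    k:            (num_tokens, d) text-token keys
    region_masks: (num_regions, num_pixels) boolean masks (hypothetical input)
    token_region: length num_tokens; region index per token, -1 if unassigned
    strength:     assumed bias magnitude, not a value from the paper
    """
    region_masks = np.asarray(region_masks, dtype=bool)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # (num_pixels, num_tokens)

    bias = np.zeros_like(logits)
    for t, r in enumerate(token_region):
        if r < 0:
            continue  # tokens without a region are left unmodulated
        inside = region_masks[r]
        bias[inside, t] = strength    # amplify attention inside the region
        bias[~inside, t] = -strength  # suppress attention outside it

    # Each pixel's attention over tokens still sums to 1.
    return softmax(logits + bias, axis=-1)
```

With two regions and one token assigned to each, pixels in the first region place nearly all of their attention on the first token, and likewise for the second. Setting `strength=0.0` recovers ordinary scaled dot-product attention.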
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need · Diffusion
