MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

Haoyu Zheng; Wenqiao Zhang; Zheqi Lv; Yu Zhong; Yang Dai; Jianxiang An; Yongliang Shen; Juncheng Li; Dongping Zhang; Siliang Tang; Yueting Zhuang

arXiv:2412.19978·cs.CV·December 31, 2024

Open Access

TL;DR

MAKIMA is a tuning-free framework for multi-attribute open-domain video editing that leverages mask-guided attention modulation and feature propagation to improve editing precision, consistency, and efficiency without additional fine-tuning.

Contribution

The paper introduces MAKIMA, a novel tuning-free multi-attribute video editing method that uses mask-guided attention modulation and feature propagation based on pretrained text-to-image models.

Findings

01. Outperforms existing methods in editing accuracy

02. Achieves superior temporal consistency in videos

03. Maintains computational efficiency during editing

Abstract

Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task, specifically in multi-attribute editing (MAE) in video. Contemporary video editing approaches either require extensive fine-tuning or rely on additional networks (such as ControlNet) for modeling multi-object appearances, yet they remain in their infancy, offering only coarse-grained MAE solutions. In this paper, we present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing. Our approach preserves video structure and appearance information by incorporating attention maps and features from the inversion process during denoising. To facilitate precise editing of multiple attributes, we…
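The summary above names mask-guided attention modulation but does not give the rule itself. As a rough illustration only, the sketch below shows one generic way a region mask can bias attention: logits for query/key pairs in the same mask region are raised and cross-region pairs are lowered before the softmax. The function name `modulate_attention`, the integer region labels, and the `strength` parameter are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_attention(logits, query_region, key_region, strength=2.0):
    """Bias attention so each query favors keys in its own mask region.

    logits[i][j] is the raw score of query i against key j.
    query_region / key_region are hypothetical integer region labels per
    token (for example, one label per edited attribute).
    strength is a hypothetical bias: added for same-region pairs,
    subtracted for cross-region pairs. This is not the paper's formula.
    """
    weights = []
    for i, row in enumerate(logits):
        biased = [
            score + (strength if query_region[i] == key_region[j] else -strength)
            for j, score in enumerate(row)
        ]
        weights.append(softmax(biased))
    return weights


# Two queries, four keys, uniform raw scores so only the mask bias matters.
logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
query_region = [0, 1]
key_region = [0, 0, 1, 1]
weights = modulate_attention(logits, query_region, key_region)
```

With uniform raw scores, each query ends up placing most of its attention on the keys that share its region label, which is the kind of attribute isolation a per-attribute mask is meant to encourage.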

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · Masked autoencoder · Focus