Cut-and-Paste: Subject-Driven Video Editing with Attention Control
Zhichao Zuo, Zhao Zhang, Yan Luo, Yang Zhao, Haijun Zhang, Yi Yang, Meng Wang

TL;DR
This paper introduces a subject-driven video editing framework called Cut-and-Paste that uses text prompts and reference images to achieve precise, fine-grained semantic edits with improved control and consistency.
Contribution
The paper proposes a novel reference image-guided video editing method that extends attention control from images to videos, enabling more accurate and consistent semantic edits.
Findings
Outperforms prior methods in quantitative evaluations.
Achieves better background preservation and spatio-temporal consistency.
Enables precise object editing with less cumbersome prompts.
Abstract
This paper presents a novel framework termed Cut-and-Paste for real-world semantic video editing under the guidance of a text prompt and an additional reference image. While text-driven video editing has demonstrated remarkable ability to generate highly diverse videos following given text prompts, fine-grained semantic edits are hard to control with a plain textual prompt alone in terms of object details and edited region, and cumbersome long text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control of both edited regions and background preservation, and fine-grained semantic generation. We achieve this goal by introducing a reference image as supplementary input to the text-driven video editing, which avoids racking your brain to come up with a cumbersome text prompt describing the detailed appearance of the object.…
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
