Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts
Saemee Choi, Sohyun Jeong, Hyojin Jang, Jaegul Choo, Jinhee Kim

TL;DR
VINO is a novel zero-shot, training-free video editing method that uses structured noise maps and controllable prompts to produce coherent edits guided by image and text inputs.
Contribution
It introduces $\rho$-start sampling, dilated dual masking, and zero image guidance for effective, training-free video editing conditioned on image and text prompts.
Findings
Achieves faithful incorporation of reference images into video edits.
Outperforms state-of-the-art baselines in quality and coherence.
Operates without test-time or instance-specific training.
Abstract
We propose VINO, the first zero-shot, training-free video editing method conditioned on both image and text. Our approach introduces $\rho$-start sampling and dilated dual masking to construct structured noise maps that enable coherent and accurate edits. To further enhance visual fidelity, we present zero image guidance, a controllable negative prompt strategy. Extensive experiments demonstrate that VINO faithfully incorporates the reference image into video edits, achieving strong performance compared to state-of-the-art baselines, all without any test-time or instance-specific training.
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
