Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts

Saemee Choi; Sohyun Jeong; Hyojin Jang; Jaegul Choo; Jinhee Kim

arXiv:2506.12520·cs.CV·December 23, 2025

TL;DR

VINO is a novel zero-shot, training-free video editing method that uses structured noise maps and controllable prompts to produce coherent edits guided by image and text inputs.

Contribution

It introduces $\rho$-start sampling, dilated dual masking, and zero image guidance for effective, training-free video editing conditioned on image and text prompts.

Findings

1. Achieves faithful incorporation of reference images into video edits.
2. Outperforms state-of-the-art baselines in quality and coherence.
3. Operates without test-time or instance-specific training.

Abstract

We propose VINO, the first zero-shot, training-free video editing method conditioned on both image and text. Our approach introduces $\rho$-start sampling and dilated dual masking to construct structured noise maps that enable coherent and accurate edits. To further enhance visual fidelity, we present zero image guidance, a controllable negative prompt strategy. Extensive experiments demonstrate that VINO faithfully incorporates the reference image into video edits, achieving strong performance compared to state-of-the-art baselines, all without any test-time or instance-specific training.

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis