Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

Akio Hayakawa; Masato Ishii; Takashi Shibuya; Yuki Mitsufuji

arXiv:2506.20995·cs.CV·October 8, 2025

Open Access·3 Reviews

TL;DR

This paper introduces a step-by-step video-to-audio synthesis method that incrementally generates missing sound events, using negative audio guidance to discourage duplication of sounds that already exist. It improves controllability and audio quality without requiring costly multi-reference video-audio datasets.

Contribution

It presents a novel negative guidance approach for video-to-audio synthesis that enables finer control and better sound separation, trained on standard audiovisual datasets.

Findings

01 Enhanced sound separability at each generation step

02 Improved overall audio quality over baselines

03 Effective training with single-reference datasets

Abstract

We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall…
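
The negatively guided step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a classifier-free-guidance-style combination of three velocity predictions inside a flow-matching sampler, and it assumes that the mix of previously generated tracks serves as the negative audio condition. The names `guided_velocity`, `euler_sample`, `step_by_step`, and the weights `w_video` and `w_neg` are hypothetical, and the toy velocity functions in the usage note stand in for a real V2A model.

```python
import numpy as np

def guided_velocity(v_uncond, v_video, v_neg_audio, w_video=4.0, w_neg=1.5):
    """Combine velocity predictions in a CFG-like way (assumed form).

    Moves toward the video-conditioned prediction and away from the
    prediction conditioned on already-existing audio.
    """
    return (v_uncond
            + w_video * (v_video - v_uncond)
            - w_neg * (v_neg_audio - v_uncond))

def euler_sample(velocity_fn, x0, n_steps=25):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

def step_by_step(make_velocity_fn, n_tracks, shape, seed=0):
    """Generate tracks one at a time.

    make_velocity_fn(existing_mix) is a hypothetical factory returning a
    velocity function; existing_mix is None at the first step, otherwise
    the sum of all tracks generated so far (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    tracks = []
    for _ in range(n_tracks):
        existing_mix = np.sum(tracks, axis=0) if tracks else None
        x0 = rng.standard_normal(shape)  # fresh noise for every step
        tracks.append(euler_sample(make_velocity_fn(existing_mix), x0))
    return tracks, np.sum(tracks, axis=0)
```

A toy usage, with a stand-in for the model: `make_fn = lambda mix: (lambda x, t: guided_velocity(np.zeros_like(x), -x, np.zeros_like(x) if mix is None else mix - x))`, then `tracks, mix = step_by_step(make_fn, n_tracks=3, shape=(16,))`. Setting `w_neg=0` reduces `guided_velocity` to ordinary classifier-free guidance on the video condition alone.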

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 3

Strengths

1. This paper is well-written and it is easy to follow. 2. The step-by-step generation process guided by negative audio is a novel and intuitive approach to audio synthesis, offering more control over the final output. 3. The method's ability to train without requiring complex, multi-reference datasets makes it more practical and accessible for researchers. 4. The paper provides a comprehensive evaluation, including both objective metrics and a subjective user study, to validate the effectiv

Weaknesses

1. The motivation for generating multiple audio tracks step by step is not well-verified. The authors should compare such a step-by-step method with single-step inference followed by postprocessing (audio track decomposition). 2. The dataset used for evaluation ("Multi-Caps VGGSound") was created using a vision-language model without access to the original audio, which could introduce a discrepancy between the text captions and the actual sound events. The authors should provide d

Reviewer 02·Rating 2·Confidence 4

Strengths

Unlike conventional V2A models that generate an entire audio track in a single pass, this paper introduces an incremental refinement framework inspired by real-world Foley workflows. The authors provide a strong theoretical foundation for the effectiveness of Negative Audio Guidance (NAG) by leveraging well-established techniques such as flow matching and classifier-free guidance, and empirically validate it through experiments. The proposed framework consistently outperforms state-of-the-art

Weaknesses

1. While the proposed NAG framework introduces a promising way to suppress previously generated audio, it is unclear whether the model can effectively disentangle composited audio to understand which events have already been generated and should be suppressed in subsequent steps. 2. The experimental evaluation relies solely on the VGGSound-based constructed dataset (Multi-Caps VGGSound), which may limit the generalizability of the proposed method. Although the authors proposed a method to const

Reviewer 03·Rating 4·Confidence 4

Strengths

1. First to explicitly tackle incremental, Foley-style video-to-audio generation via negative conditioning. 2. The method is compatible with existing flow matching diffusion frameworks. 3. The authors propose a new dataset (Multi-Caps VGGSound).

Weaknesses

1. The number of generation steps is fixed to 5 for all 8-second clips, regardless of actual sound density. Adaptive step selection or early stopping could make the framework more efficient and realistic. 2. The gain over the MMAudio baseline is marginal. Since MMAudio already produces high-quality, semantically aligned sounds, it remains unclear why a multi-step process is needed beyond marginal improvements in separability. More compelling examples (e.g., complex or dense soundscapes) would st

Taxonomy

Topics·Music Technology and Sound Studies