Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

TL;DR
This paper introduces a step-by-step video-to-audio synthesis method that generates missing sound events incrementally, using negative audio guidance to discourage duplicating sounds that already exist. It improves controllability and audio quality without requiring costly multi-reference video-audio datasets.
Contribution
It presents a novel negative guidance approach for video-to-audio synthesis that enables finer control and better sound separation, trained on standard audiovisual datasets.
Findings
Enhanced sound separability at each generation step
Improved overall audio quality over baselines
Effective training with single-reference datasets
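The single-reference training result rests on pairing audio from adjacent segments of the same video. A minimal sketch of that pairing step follows, assuming fixed-length, non-overlapping segments indexed by sample position. The function name `adjacent_segment_pairs` and the segmentation scheme are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_sample, end_sample), end exclusive


def adjacent_segment_pairs(num_samples: int, segment_len: int) -> List[Tuple[Segment, Segment]]:
    """Pair each fixed-length audio segment with the one that follows it.

    Hypothetical helper: the paper only states that pairs come from
    adjacent segments of the same video; the fixed-length,
    non-overlapping split used here is an assumption.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    n_segments = num_samples // segment_len  # drop any trailing partial segment
    bounds = [(i * segment_len, (i + 1) * segment_len) for i in range(n_segments)]
    # zip(bounds, bounds[1:]) yields (segment_i, segment_{i+1}) for every neighbor pair
    return list(zip(bounds, bounds[1:]))


pairs = adjacent_segment_pairs(num_samples=10, segment_len=3)
print(pairs)  # [((0, 3), (3, 6)), ((3, 6), (6, 9))]
```

Each pair could then serve as one training example, with one segment as the target audio and its neighbor as the audio condition. Which segment plays which role is not specified in the text above.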
Abstract
We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall…
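One way to picture a negatively guided generation step is as an extension of classifier-free guidance: the sampler is pushed toward the video-conditioned prediction and away from a prediction conditioned on the audio that already exists. The sketch below is only an illustration under that assumption. The combination rule, the function name `negatively_guided_velocity`, and the weights `w_video` and `w_neg` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def negatively_guided_velocity(
    v_uncond: Sequence[float],
    v_video: Sequence[float],
    v_neg_audio: Sequence[float],
    w_video: float = 4.5,  # hypothetical default, not from the paper
    w_neg: float = 1.5,    # hypothetical default, not from the paper
) -> List[float]:
    """Combine three velocity estimates into one guided estimate.

    Assumed rule (illustrative only):
        v = v_uncond + w_video * (v_video - v_uncond)
                     - w_neg   * (v_neg_audio - v_uncond)
    The second term is ordinary classifier-free guidance toward the video
    condition; the third term steers away from the existing audio.
    """
    if not (len(v_uncond) == len(v_video) == len(v_neg_audio)):
        raise ValueError("all velocity estimates must have the same length")
    return [
        u + w_video * (c - u) - w_neg * (n - u)
        for u, c, n in zip(v_uncond, v_video, v_neg_audio)
    ]


v = negatively_guided_velocity([0.0, 0.0], [1.0, -1.0], [0.5, 0.5], w_video=2.0, w_neg=1.0)
print(v)  # [1.5, -2.5]
```

Setting `w_neg` to 0 reduces this sketch to standard classifier-free guidance on the video condition, which makes the effect of the negative term easy to isolate.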
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper is well-written and it is easy to follow. 2. The step-by-step generation process guided by negative audio is a novel and intuitive approach to audio synthesis, offering more control over the final output. 3. The method's ability to train without requiring complex, multi-reference datasets makes it more practical and accessible for researchers. 4. The paper provides a comprehensive evaluation, including both objective metrics and a subjective user study, to validate the effectiv
1. The motivation for generating multiple audio tracks step by step is not well verified. The authors should compare this step-by-step method with single-step inference followed by postprocessing (audio track decomposition). 2. The dataset used for evaluation ("Multi-Caps VGGSound") was created using a vision-language model without access to the original audio, which could introduce a discrepancy between the text captions and the actual sound events. The authors should provide d
Unlike conventional V2A models that generate an entire audio track in a single pass, this paper introduces an incremental refinement framework inspired by real-world Foley workflows. The authors provide a strong theoretical foundation for the effectiveness of Negative Audio Guidance (NAG) by leveraging well-established techniques such as flow matching and classifier-free guidance, and empirically validate it through experiments. The proposed framework consistently outperforms state-of-the-art
1. While the proposed NAG framework introduces a promising way to suppress previously generated audio, it is unclear whether the model can effectively disentangle composited audio to understand which events have already been generated and should be suppressed in subsequent steps. 2. The experimental evaluation relies solely on the VGGSound-based constructed dataset (Multi-Caps VGGSound), which may limit the generalizability of the proposed method. Although the authors proposed a method to const
1. First to explicitly tackle incremental, Foley-style video-to-audio generation via negative conditioning. 2. The method is compatible with existing flow matching diffusion frameworks. 3. The authors propose a new dataset (Multi-Caps VGGSound).
1. The number of generation steps is fixed to 5 for all 8-second clips, regardless of actual sound density. Adaptive step selection or early stopping could make the framework more efficient and realistic. 2. The gain over the MMAudio baseline is marginal. Since MMAudio already produces high-quality, semantically aligned sounds, it remains unclear why a multi-step process is needed beyond marginal improvements in separability. More compelling examples (e.g., complex or dense soundscapes) would st
Taxonomy
Topics: Music Technology and Sound Studies
