Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Qi Yang; Binjie Mao; Zili Wang; Xing Nie; Pengfei Gao; Ying Guo; Cheng Zhen; Pengfei Yan; Shiming Xiang

arXiv:2409.06135·cs.SD·September 12, 2024

Open Access

TL;DR

This paper introduces Draw an Audio, a controllable video-to-audio (V2A) synthesis model that takes drawn masks and loudness signals as input instructions and adds dedicated modules to improve audio-visual synchronization and content consistency; the authors report state-of-the-art results.

Contribution

The paper proposes a new V2A model with Mask-Attention and Time-Loudness Modules, enabling multi-instruction control and improved synchronization.

Findings

01. Achieves state-of-the-art performance on large-scale V2A benchmarks.

02. Effectively maintains content consistency between video and generated audio.

03. Enhances synchronization of loudness and temporal properties in synthesized audio.

Abstract

Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges encompass maintaining the content consistency between the input video and the generated audio, as well as the alignment of temporal and loudness properties within the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and target video, we introduce the Mask-Attention Module (MAM), which employs masked video instruction to enable the model to focus on regions of interest.…
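The two instruction types named in the abstract, a drawn mask and a loudness signal, can be pictured with a small sketch. Everything below is an illustrative assumption rather than the paper's method: the abstract does not say how masks or loudness signals are represented, so the sketch uses a binary mask multiplied into grayscale frames and a root-mean-square (RMS) level per window as a stand-in loudness curve. The function names `apply_mask` and `loudness_curve` are hypothetical.

```python
import math


def apply_mask(frames, mask):
    """Zero out pixels outside the drawn region in every frame.

    frames: list of 2D grayscale frames (lists of rows of floats).
    mask:   2D list of 0/1 values with the same height and width.
    Assumption: a binary mask multiplied into the pixels; the paper's
    actual masked-video instruction may be built differently.
    """
    return [
        [[px * m for px, m in zip(row, mask_row)]
         for row, mask_row in zip(frame, mask)]
        for frame in frames
    ]


def loudness_curve(samples, window):
    """RMS level of each non-overlapping window of audio samples.

    Assumption: RMS is used here only as a simple proxy for loudness.
    """
    curve = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        curve.append(math.sqrt(sum(s * s for s in chunk) / window))
    return curve


# Toy inputs: two 2x2 frames, a mask keeping the left column,
# and four audio samples split into two windows.
frames = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1, 0], [1, 0]]
masked = apply_mask(frames, mask)
curve = loudness_curve([0.0, 0.0, 3.0, 4.0], window=2)
```

In this toy setup, `masked` keeps only the left column of each frame, and `curve` holds one level per window, so a silent window maps to zero and a louder window to a larger value. A real system would presumably derive both signals at the video frame rate and feed them to the model as conditioning, but that step is outside what the abstract describes.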

Taxonomy

Topics: Music Technology and Sound Studies

Methods: Focus