Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance

Yaoyun Zhang; Xuenan Xu; Mengyue Wu

arXiv:2412.18157·cs.SD·December 25, 2024

Open Access

TL;DR

Smooth-Foley is a novel video-to-audio generation model that uses semantic guidance from text to improve audio-video alignment and produce continuous, high-quality Foley sounds, especially in challenging scenarios with moving visuals.

Contribution

It introduces a semantic-guided generative model with specialized adapters to enhance temporal and semantic alignment in video-to-audio synthesis.

Findings

01 Outperforms existing models in continuous-sound scenarios

02 Achieves higher quality and closer adherence to physical laws

03 Improves semantic and temporal alignment in audio generation

Abstract

The video-to-audio (V2A) generation task has drawn attention in the multimedia field because of its practical value for producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate which sound events occur and when they occur. Recent studies on synthesizing immersive, synchronized audio face challenges on videos with moving visual presence: the temporal condition is not accurate enough, and a low-resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model that takes semantic guidance from the textual label throughout generation to enhance both semantic and temporal alignment of the audio. Two adapters are trained to leverage pre-trained text-to-audio generation models. A frame adapter integrates high-resolution frame-wise video features while a temporal adapter integrates temporal…
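
To make the adapter idea concrete, here is a minimal sketch of one common way an adapter can inject frame-wise features into a frozen generator: gated cross-attention added as a residual. This is a generic illustration and not the architecture of Smooth-Foley, which the abstract does not specify. The names `cross_attend`, `adapter_step`, and `gate`, the tensor sizes, and the omission of learned projection matrices are all assumptions made for brevity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key/value rows."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def adapter_step(hidden, frame_feats, gate):
    """Hypothetical adapter: add gated attention over frame features to the
    frozen model's hidden states. With gate == 0 the frozen model is unchanged."""
    attended = cross_attend(hidden, frame_feats, frame_feats)
    return [[h + gate * a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(hidden, attended)]


random.seed(0)
dim = 4  # assumed feature width, shared by both streams so no projection is needed
hidden = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]       # 6 audio latent steps
frame_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]  # 3 video frames

unchanged = adapter_step(hidden, frame_feats, gate=0.0)
conditioned = adapter_step(hidden, frame_feats, gate=0.5)
```

Starting the gate at zero is a common choice when attaching adapters to a pre-trained model, because training then begins from the original model's behavior and the new conditioning is introduced gradually.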

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing

Methods: Softmax · Attention Is All You Need · Adapter