Video-to-Audio Generation with Fine-grained Temporal Semantics

Yuchen Hu; Yu Gu; Chenxing Li; Rilin Chen; Dong Yu

arXiv:2409.14709·eess.AS·September 24, 2024

Open Access

TL;DR

This paper introduces a novel video-to-audio generation framework that leverages fine-grained semantic information from video frames to improve temporal alignment and audio quality, utilizing latent diffusion models and grounding segmentation techniques.

Contribution

The paper proposes enhancing video-to-audio generation with frame-level semantics extracted by Grounding SAM, aiming at better temporal synchronization and audio quality.

Findings

01 Improved temporal alignment in video-to-audio generation.

02 Enhanced audio quality demonstrated through objective and subjective metrics.

03 Effective use of grounding segmentation for semantic extraction.

Abstract

With recent advances in AIGC, video generation has gained a surge of research interest in both academia and industry (e.g., Sora). However, it remains a challenge to produce temporally aligned audio that is synchronized with the generated video, given the complicated semantic information the video contains. In this work, inspired by the recent success of text-to-audio (TTA) generation, we first investigate a video-to-audio (VTA) generation framework based on the latent diffusion model (LDM). Similar to the latest pioneering explorations in VTA, our preliminary results also show the great potential of LDM in the VTA task, but it still suffers from sub-optimal temporal alignment. To this end, we propose to enhance the temporal alignment of VTA with frame-level semantic information. With the recently popular grounding segment anything model (Grounding SAM), we can extract the fine-grained semantics in…

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization

Methods: Diffusion · Latent Diffusion Model