Video-to-Audio Generation with Fine-grained Temporal Semantics
Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu

TL;DR
This paper introduces a video-to-audio generation framework, built on a latent diffusion model, that uses fine-grained frame-level semantic information extracted with Grounding SAM to improve temporal alignment and audio quality.
Contribution
It proposes enhancing video-to-audio generation with frame-level semantic extraction using Grounding SAM to achieve better temporal synchronization and audio quality.
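To make the idea of frame-level temporal information concrete, here is a minimal sketch. It is not the authors' pipeline. It assumes a hypothetical list of per-frame booleans saying whether a sounding object was detected in each frame (for example, derived from a segmentation model such as Grounding SAM), and converts them into time segments that could serve as a temporal cue. The function name `frames_to_segments` and the `fps` parameter are illustrative assumptions, not names from the paper.

```python
from typing import List, Tuple


def frames_to_segments(present: List[bool], fps: float) -> List[Tuple[float, float]]:
    """Convert per-frame presence flags into (start_sec, end_sec) segments.

    `present[i]` is a hypothetical flag: True if the target object was
    detected in frame i. This is an illustrative assumption, not the
    paper's actual conditioning format.
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open segment
    for i, flag in enumerate(present):
        if flag and start is None:
            start = i  # a segment opens at this frame
        elif not flag and start is not None:
            segments.append((start / fps, i / fps))  # segment closes here
            start = None
    if start is not None:
        # the object is still present in the last frame
        segments.append((start / fps, len(present) / fps))
    return segments


if __name__ == "__main__":
    flags = [False, True, True, False, False, True]
    print(frames_to_segments(flags, fps=2.0))  # [(0.5, 1.5), (2.5, 3.0)]
```

Segments like these are only one possible way to represent when an object is visible; how the paper actually feeds frame-level semantics into the diffusion model is not specified in this summary.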
Findings
Improved temporal alignment in video-to-audio generation.
Enhanced audio quality demonstrated through objective and subjective metrics.
Effective use of grounding segmentation for semantic extraction.
Abstract
With recent advances in AIGC, video generation has gained a surge of research interest in both academia and industry (e.g., Sora). However, it remains a challenge to produce temporally aligned audio that synchronizes with the generated video, given the complicated semantic information the video contains. In this work, inspired by the recent success of text-to-audio (TTA) generation, we first investigate a video-to-audio (VTA) generation framework based on the latent diffusion model (LDM). Similar to the latest pioneering explorations in VTA, our preliminary results also show the great potential of LDM in the VTA task, but it still suffers from sub-optimal temporal alignment. To this end, we propose to enhance the temporal alignment of VTA with frame-level semantic information. With the recently popular grounding segment anything model (Grounding SAM), we can extract the fine-grained semantics in…
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization
Methods: Diffusion · Latent Diffusion Model
