STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Yong Ren; Chenxing Li; Manjie Xu; Wei Liang; Yu Gu; Rilin Chen; Dong Yu

arXiv:2409.08601 · cs.SD · March 25, 2025

TL;DR

STA-V2A introduces a novel video-to-audio generation method that leverages semantic and temporal alignment to produce more coherent and synchronized audio from videos, surpassing existing models in quality and consistency.

Contribution

The paper presents a new approach combining local temporal and global semantic video features with cross-modal guidance and introduces a novel metric for audio-temporal alignment.

Findings

01  Outperforms existing video-to-audio models in audio quality and in semantic and temporal alignment

02  Onset prediction and attentive pooling modules reduce information redundancy in the extracted video features

03  Results are validated by both subjective and objective evaluations

Abstract

Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal…
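The abstract names an attentive pooling module for global semantic feature extraction but does not describe its architecture here. As a rough illustration of the general idea only, the sketch below collapses per-frame feature vectors into one global vector using softmax attention weights from a single query vector. The function names, the single-query design, the scaling factor, and the toy dimensions are all assumptions for illustration, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(frames, query):
    """Pool T frame vectors (each of size D) into one D-dim vector.

    `query` stands in for a learned parameter (hypothetical here);
    frames that score higher against it get more weight.
    """
    dim = len(query)
    # Scaled dot-product score between each frame and the query.
    scores = [
        sum(f * q for f, q in zip(frame, query)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    # Weighted sum of frames gives the global feature.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Toy usage: 6 frames with 4-dim features (random placeholders, not real data).
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
query = [random.gauss(0.0, 1.0) for _ in range(4)]
pooled, weights = attentive_pool(frames, query)
```

With an all-zero query every frame receives the same weight, so the result reduces to plain mean pooling; a non-trivial query lets the pooled vector emphasize informative frames and down-weight redundant ones.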

Code & Models

Repositories

y-ren16/stav2a
jax

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies