VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
Saksham Singh Kushwaha, Yapeng Tian

TL;DR
VinTAGe introduces a novel flow-based transformer model for holistic audio generation from video and text, effectively capturing both onscreen and offscreen sounds with semantic alignment and temporal synchronization.
Contribution
The paper presents VinTAGe, a joint text- and video-conditioned audio generation model that reduces modality bias, and introduces a new benchmark dataset, VinTAGe-Bench, for holistic audio synthesis.
Findings
VinTAGe outperforms previous models on the VinTAGe-Bench dataset.
Joint text and visual interaction improves audio generation quality.
VinTAGe achieves state-of-the-art results on the VGGSound benchmark.
Abstract
Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, neither T2A nor V2A methods can generate holistic sounds (both onscreen and offscreen). This is because T2A cannot generate sounds that align with onscreen objects, while V2A cannot generate semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with both the text and the video. Previous approaches for joint text and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two…
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
