Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, Jing Liu

TL;DR
This paper introduces the Sounding Video Generator (SVG), a unified framework that generates realistic videos with synchronized audio from text prompts, using a hybrid contrastive learning method and a cross-modal attention module.
Contribution
The work presents SVG-VQGAN and a new dataset, AudioSetCap, which together enable text-guided generation of videos with matching visual frames and audio, improving consistency and realism.
Findings
Outperforms existing text-to-video generation methods.
Achieves better audio-visual synchronization.
Demonstrates effectiveness on Kinetics and VAS datasets.
Abstract
As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based…
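As a rough illustration of the inter-modal part of the contrastive objective described above, the following is a minimal sketch of a symmetric InfoNCE loss between paired visual and audio embeddings. It is a generic formulation, not the paper's exact hybrid loss, which this excerpt does not specify; the function name `info_nce`, the temperature value, and the embedding shapes are all assumptions made for the example.

```python
import numpy as np

def info_nce(visual, audio, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    visual, audio: arrays of shape (batch, dim); row i of each is a matched pair.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; the target for row i is column i.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the visual-to-audio and audio-to-visual directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Toy data: nearly identical pairs versus unrelated pairs.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))
aligned_loss = info_nce(visual, visual + 0.01 * rng.normal(size=(8, 16)))
random_loss = info_nce(visual, rng.normal(size=(8, 16)))
```

In this toy setup, matched pairs that are close in embedding space yield a lower loss than unrelated pairs, which is the behavior a contrastive objective of this kind is meant to encourage.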
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Contrastive Learning
