From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Kun Su, Xiulong Liu, Eli Shlizerman

TL;DR
This paper introduces VAB, a unified model that learns representations and generates audio from visual data within latent spaces, enabling efficient cross-modal tasks and high-quality audio generation from videos.
Contribution
VAB is the first unified framework combining audio-visual representation learning and vision-to-audio generation using latent space modeling.
Findings
VAB achieves high-quality audio generation from videos.
The model performs well in audio-visual retrieval tasks.
VAB demonstrates effective semantic feature learning across modalities.
Abstract
Video encompasses both visual and auditory data, creating a perceptually rich experience where these two modalities complement each other. As such, videos are a valuable type of media for the investigation of the interplay between audio and visual elements. Previous studies of audio-visual modalities primarily focused on either audio-visual representation learning or generative modeling of a modality conditioned on the other, creating a disconnect between these two branches. A unified framework that learns representation and generates modalities has not been developed yet. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. The key approach of VAB is that rather than working with raw video frames and audio data, VAB performs representation learning and…
Taxonomy
Topics: Music Technology and Sound Studies · Multisensory perception and integration · Music and Audio Processing
