From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

Kun Su; Xiulong Liu; Eli Shlizerman

arXiv:2409.19132·cs.MM·October 1, 2024

TL;DR

This paper introduces VAB, a unified model that learns representations and generates audio from visual data within latent spaces, enabling efficient cross-modal tasks and high-quality audio generation from videos.

Contribution

VAB is the first unified framework combining audio-visual representation learning and vision-to-audio generation using latent space modeling.

Findings

01. VAB achieves high-quality audio generation from videos.

02. The model performs well in audio-visual retrieval tasks.

03. VAB demonstrates effective semantic feature learning across modalities.

Abstract

Video encompasses both visual and auditory data, creating a perceptually rich experience where these two modalities complement each other. As such, videos are a valuable type of media for the investigation of the interplay between audio and visual elements. Previous studies of audio-visual modalities primarily focused on either audio-visual representation learning or generative modeling of a modality conditioned on the other, creating a disconnect between these two branches. A unified framework that learns representation and generates modalities has not been developed yet. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. The key approach of VAB is that rather than working with raw video frames and audio data, VAB performs representation learning and…

Code & Models

Repositories

DragonLiu1995/Vision-to-Audio-and-Beyond (official)

Taxonomy

Topics: Music Technology and Sound Studies · Multisensory perception and integration · Music and Audio Processing