V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Heng Wang; Jianbo Ma; Santiago Pascual; Richard Cartwright; Weidong Cai

arXiv:2308.09300·cs.CV·December 15, 2023

TL;DR

This paper introduces V2A-Mapper, a lightweight method that leverages foundation models to generate semantically-relevant audio from visual inputs, significantly reducing training complexity while improving quality and relevance.

Contribution

The paper proposes a simple mapper that bridges the latent spaces of visual and audio foundation models, enabling high-fidelity, visually aligned sound generation with minimal training.
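To make the idea of bridging two embedding spaces concrete, here is a minimal sketch of a mapper. It is an illustration only and not the paper's method: it assumes a plain ridge-regression linear map, uses random toy vectors in place of real CLIP and CLAP embeddings, and the names `fit_linear_mapper` and `map_to_audio_space` as well as the 16-dimensional toy size are hypothetical.

```python
import numpy as np


def fit_linear_mapper(visual_emb, audio_emb, ridge=1e-3):
    """Fit W minimizing ||visual_emb @ W - audio_emb||^2 + ridge * ||W||^2.

    Assumption: a closed-form linear map stands in for whatever mapper
    a real system would train between the two embedding spaces.
    """
    dim = visual_emb.shape[1]
    gram = visual_emb.T @ visual_emb + ridge * np.eye(dim)
    return np.linalg.solve(gram, visual_emb.T @ audio_emb)


def map_to_audio_space(visual_emb, weights):
    """Project visual embeddings into the audio space and L2-normalize."""
    mapped = visual_emb @ weights
    norms = np.linalg.norm(mapped, axis=-1, keepdims=True)
    return mapped / np.clip(norms, 1e-12, None)


# Toy paired data (hypothetical): real inputs would be embeddings of
# video frames and of the matching audio clips.
rng = np.random.default_rng(0)
visual_emb = rng.normal(size=(200, 16))
hidden_map = rng.normal(size=(16, 16))
audio_emb = visual_emb @ hidden_map + 0.01 * rng.normal(size=(200, 16))

weights = fit_linear_mapper(visual_emb, audio_emb)
predicted = map_to_audio_space(visual_emb, weights)

# Mean cosine similarity between mapped and target audio embeddings.
target = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
mean_cosine = float(np.mean(np.sum(predicted * target, axis=-1)))
print(f"mean cosine similarity: {mean_cosine:.4f}")
```

In a full pipeline of this kind, the mapped embedding would then condition a pretrained audio generator, so only the small mapper needs training while the foundation models stay frozen.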

Findings

01

Outperforms state-of-the-art methods in fidelity and relevance metrics.

02

Requires 86% fewer parameters than previous approaches.

03

Achieves a 53% improvement in fidelity and a 19% improvement in relevance.

Abstract

Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the…

Code & Models

Repositories

heng-hw/V2A-Mapper
Official

Videos

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Contrastive Language-Image Pre-training