Gotta Hear Them All: Towards Sound Source Aware Audio Generation

Wei Guo; Heng Wang; Jianbo Ma; Weidong Cai

arXiv:2411.15447·cs.MM·August 13, 2025

TL;DR

This paper introduces SS2A, a novel sound source-aware audio generator that improves audio synthesis by locally perceiving sound sources and disambiguating their semantics, leading to more immersive and controllable audio generation.

Contribution

The paper proposes a new sound source-aware framework, together with a curated dataset and novel metrics, and advances the state of the art in image-to-audio and video-to-audio synthesis.

Findings

01. Achieves state-of-the-art results in image-to-audio tasks.

02. Demonstrates intuitive control over audio synthesis via visual and textual inputs.

03. Performs competitively in video-to-audio tasks with simple temporal aggregation.

Abstract

Audio synthesis has broad applications in multimedia. Recent advancements have made it possible to generate relevant audio from inputs describing an audio scene, such as images or texts. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods rely solely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware Audio (SS2A) generator. SS2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS…
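The abstract names two steps that a small example may make more concrete: contrastive learning over paired source embeddings, and attentive mixing of per-source semantics into one representation. The sketch below is only an illustration under assumptions. It uses a generic InfoNCE loss and single-query scaled dot-product attention pooling; the exact loss and mixer used by SS2A are not given in this summary. The function names `info_nce` and `attentive_mix`, the temperature value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not necessarily the paper's loss).

    anchors[i] is meant to match positives[i]; every other positives[j]
    acts as a negative for anchors[i].
    """
    losses = []
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        probs = softmax(logits)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


def attentive_mix(query, sources):
    """Pool per-source embeddings with single-query scaled dot-product attention.

    Returns the mixed vector and the attention weight given to each source.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, s) / scale for s in sources])
    dim = len(sources[0])
    mixed = [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dim)]
    return mixed, weights


# Toy embeddings standing in for two sound sources in two modalities.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = info_nce(visual, audio)
shuffled_loss = info_nce(visual, list(reversed(audio)))

mixed, weights = attentive_mix([1.0, 0.0], audio)
```

With correctly paired embeddings the contrastive loss is far lower than with shuffled pairs, which is the signal a contrastively learned manifold relies on to keep sources distinct. The attention weights sum to one, so the mixed vector is a convex combination of the source embeddings, tilted toward the source most similar to the query.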

Code & Models

Repositories

wguo86/ssv2a (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing