SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound

Rishit Dagli, Shivesh Prakash, Robert Wu, and Houman Khosravani

arXiv:2406.06612 · cs.CV · July 8, 2025 · 1 citation

Open Access · 1 Repo · 1 Model

TL;DR

SEE-2-SOUND is a zero-shot framework that generates high-quality spatial audio aligned with visual content, enhancing immersive multimedia experiences by integrating spatial cues into generated audio.

Contribution

The paper introduces a zero-shot method that decomposes a visual scene into regions of interest and generates spatial audio for them without prior training on specific datasets.

Findings

01 Effective spatial audio generation for videos and images

02 Supports high-quality, dynamic internet media

03 Works across various visual content types

Abstract

Generating combined visual and auditory sensory experiences is critical for the consumption of immersive content. Recent advances in neural generative models have enabled the creation of high-resolution content across multiple modalities such as images, text, speech, and videos. Despite these successes, there remains a significant gap in the generation of high-quality spatial audio that complements generated visual content. Furthermore, current audio generation models excel in either generating natural audio or speech or music but fall short in integrating spatial audio cues necessary for immersive experiences. In this work, we introduce SEE-2-SOUND, a zero-shot approach that decomposes the task into (1) identifying visual regions of interest; (2) locating these elements in 3D space; (3) generating mono-audio for each; and (4) integrating them into spatial audio. Using our framework, we…
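Step (4) of the decomposition above, combining per-region mono audio into spatial audio, can be illustrated with a small sketch. This is not the authors' implementation: the page does not describe their spatialization, so the code below substitutes constant-power stereo panning and inverse-distance attenuation as stand-ins. The names `Source`, `pan_gains`, and `mix_to_stereo`, and the coordinate convention (+x to the listener's right, +z straight ahead), are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Source:
    """Hypothetical container: one mono clip plus its estimated 3D position."""
    samples: List[float]  # mono audio for one visual region (step 3)
    x: float              # +x is to the listener's right (assumed convention)
    y: float              # +y is up
    z: float              # +z is straight ahead (depth)


def pan_gains(x: float, z: float) -> Tuple[float, float]:
    """Constant-power (left, right) gains from the source's azimuth."""
    azimuth = math.atan2(x, z)  # 0 = straight ahead, +pi/2 = hard right
    # Sources behind the listener are folded onto the nearest side.
    azimuth = max(-math.pi / 2, min(math.pi / 2, azimuth))
    theta = (azimuth + math.pi / 2) / 2  # map to [0, pi/2]
    return math.cos(theta), math.sin(theta)


def mix_to_stereo(sources: List[Source]) -> Tuple[List[float], List[float]]:
    """Sum all sources into left/right channels with 1/distance attenuation."""
    if not sources:
        return [], []
    n = max(len(s.samples) for s in sources)
    left, right = [0.0] * n, [0.0] * n
    for s in sources:
        # Clamp distance at 1 so very close sources are not amplified.
        dist = max(1.0, math.sqrt(s.x ** 2 + s.y ** 2 + s.z ** 2))
        gain_l, gain_r = pan_gains(s.x, s.z)
        for i, v in enumerate(s.samples):
            left[i] += v * gain_l / dist
            right[i] += v * gain_r / dist
    return left, right
```

A source directly ahead receives equal gain in both channels, while a source to the right is weighted toward the right channel and attenuated with distance. A full spatial audio system would add cues this sketch omits, such as interaural time differences and room acoustics.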

Code & Models

Repositories

see2sound/see2sound
jax · Official

Models

🤗 rishitdagli/see-2-sound
model · 3 dl · ♡ 7

Taxonomy

Topics: Music Technology and Sound Studies · Acoustic Wave Phenomena Research