TL;DR
SonoWorld is a novel framework that generates a 3D audio-visual scene from a single image, combining scene outpainting, 3D scene construction, sound placement, and spatial audio rendering.
Contribution
It introduces the first pipeline to create immersive 3D audio-visual scenes from a single image, integrating spatial audio aligned with scene geometry and semantics.
Findings
SonoWorld outperforms baselines on a newly curated real-world dataset.
A controlled user study confirms improved immersion and spatial accuracy.
The framework also enables one-shot acoustic learning and audio-visual spatial source separation.
Abstract
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website:…
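To make the "renders ambisonics for point ... sources" step concrete, here is a minimal sketch of standard first-order ambisonic encoding for a single point source. It is not the paper's renderer. The function names (`foa_gains`, `encode_point_source`), the coordinate convention (x forward, y left, z up), and the simple 1/r distance attenuation are assumptions made for illustration. The channel layout follows the common AmbiX convention (ACN order W, Y, Z, X with SN3D normalization).

```python
import math


def foa_gains(listener, source):
    """Per-channel gains (W, Y, Z, X) for a point source seen from the listener.

    Assumed convention: x forward, y left, z up, positions in metres.
    """
    dx, dy, dz = (s - l for s, l in zip(source, listener))
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.atan2(dy, dx)                     # counter-clockwise from front
    elevation = math.atan2(dz, math.hypot(dx, dy))   # upward from the horizontal plane
    # Assumption: inverse-distance attenuation, clamped inside 1 m to avoid blow-up.
    atten = 1.0 / max(dist, 1.0)
    w = 1.0
    y = math.sin(azimuth) * math.cos(elevation)
    z = math.sin(elevation)
    x = math.cos(azimuth) * math.cos(elevation)
    return [atten * g for g in (w, y, z, x)]


def encode_point_source(mono, listener, source):
    """Encode a mono signal (list of floats) into four first-order channels."""
    gains = foa_gains(listener, source)
    return [[g * sample for sample in mono] for g in gains]


# Usage: a source 2 m to the listener's left.
channels = encode_point_source([1.0, -1.0], (0.0, 0.0, 0.0), (0.0, 2.0, 0.0))
```

Moving the listener and recomputing the gains gives a simple form of free-viewpoint audio for point sources. Areal and ambient sources, which the abstract also mentions, would need a different treatment that this sketch does not cover.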
