SonoWorld: From One Image to a 3D Audio-Visual Scene

Derong Jin; Xiyi Chen; Ming C. Lin; Ruohan Gao

arXiv:2603.28757 · cs.CV · March 31, 2026

TL;DR

SonoWorld is a novel framework that generates a 3D audio-visual scene from a single image, combining scene outpainting, 3D scene construction, sound placement, and spatial audio rendering.

Contribution

It introduces the first pipeline to create immersive 3D audio-visual scenes from a single image, integrating spatial audio aligned with scene geometry and semantics.

Findings

01 Outperforms baselines in quantitative evaluations on a newly curated real-world dataset

02 A controlled user study confirms improved immersion and spatial accuracy

03 Enables applications such as one-shot acoustic learning and audio-visual spatial source separation

Abstract

Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website:…
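To make the "renders ambisonics for point sources" step more concrete, here is a minimal sketch of textbook first-order ambisonic encoding for a single point source. It is not the SonoWorld implementation: the function name `encode_foa`, the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), the axis convention (x forward, y left, z up), and the simple 1/r distance gain are all assumptions chosen for illustration.

```python
import math

def encode_foa(sample, source_pos, listener_pos=(0.0, 0.0, 0.0), min_dist=1.0):
    """Encode one mono sample into first-order ambisonics.

    Hypothetical helper, not taken from the paper. Assumes AmbiX layout
    (ACN order W, Y, Z, X; SN3D normalization) and a right-handed frame
    with x forward, y left, z up.
    """
    # Direction from the listener to the sound anchor.
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dz = source_pos[2] - listener_pos[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)

    # Azimuth is measured counter-clockwise from the forward axis,
    # elevation upward from the horizontal plane.
    azimuth = math.atan2(dy, dx)
    elevation = math.atan2(dz, math.hypot(dx, dy))

    # Assumed inverse-distance attenuation, clamped near the listener.
    s = sample / max(dist, min_dist)

    w = s                                            # omnidirectional
    y = s * math.sin(azimuth) * math.cos(elevation)  # left-right
    z = s * math.sin(elevation)                      # up-down
    x = s * math.cos(azimuth) * math.cos(elevation)  # front-back
    return (w, y, z, x)

# A source two units straight ahead contributes only to W and X.
w, y, z, x = encode_foa(1.0, (2.0, 0.0, 0.0))
```

Because the encoding depends only on the listener-to-source direction and distance, moving the listener through the scene changes the channel gains, which is the basic mechanism behind free-viewpoint spatial audio. Under the same assumptions, an areal source could be approximated by summing the encodings of several sampled points, and an ambient source by feeding only the W channel.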

Code & Models

Repositories

https://humathe.github.io/sonoworld
