Physics-Aware Novel-View Acoustic Synthesis with Vision-Language Priors and 3D Acoustic Environment Modeling

Congyi Fan; Jian Guan; Youtian Lin; Dongli Xu; Tong Ye; Qiaoxi Zhu; Pengming Feng; Wenwu Wang

arXiv:2601.19712 · cs.SD · January 28, 2026

TL;DR

Phys-NVAS introduces a physics-aware framework for novel-view acoustic synthesis that combines 3D environment modeling with vision-language semantic priors, significantly improving spatial audio realism and physical consistency.

Contribution

This work is the first to integrate physical environment modeling with vision-language priors for novel-view acoustic synthesis (NVAS), so that synthesized sound reflects both room geometry and semantic cues such as object layout and materials.

Findings

01 Improved realism in binaural audio generation.

02 Enhanced physical consistency of synthesized spatial sounds.

03 Effective integration of geometry and semantic cues.

Abstract

Spatial audio is essential for immersive experiences, yet novel-view acoustic synthesis (NVAS) remains challenging due to complex physical phenomena such as reflection, diffraction, and material absorption. Existing methods based on single-view or panoramic inputs improve spatial fidelity but fail to capture global geometry and semantic cues such as object layout and material properties. To address this, we propose Phys-NVAS, the first physics-aware NVAS framework that integrates spatial geometry modeling with vision-language semantic priors. A global 3D acoustic environment is reconstructed from multi-view images and depth maps to estimate room size and shape, enhancing spatial awareness of sound propagation. Meanwhile, a vision-language model extracts physics-aware priors of objects, layouts, and materials, capturing absorption and reflection beyond geometry. An acoustic feature…
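To illustrate why room size and material absorption both matter for sound propagation, here is a minimal sketch that is not the Phys-NVAS method. It assumes a set of 3D points already expressed in a common world frame (for example, back-projected from depth maps), takes their axis-aligned bounding box as the room extent, and applies Sabine's classical formula for reverberation time, RT60 = 0.161 V / Σ S_i α_i. The names `room_extent`, `sabine_rt60`, and `Surface`, and the absorption coefficients in `ABSORPTION`, are hypothetical placeholders chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical, illustrative absorption coefficients (dimensionless, 0..1).
# Real values depend on frequency and the specific material.
ABSORPTION = {"concrete": 0.02, "carpet": 0.30}


@dataclass
class Surface:
    area_m2: float
    material: str


def room_extent(points):
    """Axis-aligned bounding-box size (dx, dy, dz) of 3D points in metres."""
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def sabine_rt60(volume_m3, surfaces):
    """Sabine's formula: RT60 = 0.161 * V / sum(S_i * alpha_i), in seconds."""
    total_absorption = sum(s.area_m2 * ABSORPTION[s.material] for s in surfaces)
    return 0.161 * volume_m3 / total_absorption


# Assumed input: a few world-frame points spanning a 5 m x 4 m x 3 m room.
points = [(0.0, 0.0, 0.0), (5.0, 4.0, 3.0), (2.0, 1.0, 1.5)]
dx, dy, dz = room_extent(points)
volume = dx * dy * dz

floor_area = dx * dy
wall_area = 2 * (dx * dz + dy * dz)

# Same geometry, two material assignments: bare concrete vs. carpeted floor.
bare = [
    Surface(floor_area, "concrete"),  # floor
    Surface(floor_area, "concrete"),  # ceiling
    Surface(wall_area, "concrete"),   # walls
]
carpeted = [
    Surface(floor_area, "carpet"),    # floor
    Surface(floor_area, "concrete"),  # ceiling
    Surface(wall_area, "concrete"),   # walls
]

rt60_bare = sabine_rt60(volume, bare)
rt60_carpeted = sabine_rt60(volume, carpeted)
print(f"bare: {rt60_bare:.2f} s, carpeted: {rt60_carpeted:.2f} s")
```

With identical geometry, changing only the floor material shortens the estimated reverberation time considerably, which is the kind of effect that geometry alone cannot capture and that material priors are meant to supply.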

Taxonomy

Topics: Hearing Loss and Rehabilitation · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis