The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
Yuhuan You, Lai Wei, Xihong Wu, Tianshu Qu

TL;DR
This paper introduces The World is Not Mono (TWNM), a framework that gives large audio-language models explicit spatial evidence, so they can reason about where sound events occur and how they are arranged, not only what is present in a clip.
Contribution
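The paper formalizes audio scene analysis (ASA) as a three-level problem (atomic perception, relational integration, and cognitive reasoning) and proposes a framework that combines physically grounded First-Order Ambisonics (FOA) simulation, slot-regularized spatial representations, and curriculum training.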
Findings
TWNM achieves 70.8% overall accuracy on the ASA benchmark.
The model binds spatial attributes to the correct auditory objects and reasons at the scene level.
The paper provides diagnostic references with explicit spatial labels.
Abstract
Large audio-language models have made rapid progress in recognizing what is present in an audio clip, but spatial audio-language understanding still lacks a clear task interface. A model must also decide where sound events occur, which semantic and spatial attributes belong to the same auditory object, how multiple objects are arranged, and whether a scene-level answer is physically plausible. We formalize this capability as audio scene analysis (ASA), a three-level problem spanning atomic perception, relational integration, and cognitive reasoning. We propose The World is Not Mono (TWNM), a framework that equips audio-language models with explicit spatial evidence. TWNM uses physically grounded First-Order Ambisonics (FOA) simulation for controllable supervision, learns slot-regularized spatial representations from multichannel audio, fuses them with semantic audio features, and trains…
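To make the FOA representation mentioned in the abstract concrete, here is a minimal sketch of how a mono source can be encoded into four FOA channels and how its direction can be recovered from them. This is not the paper's simulator: it assumes an anechoic plane-wave source, the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), and hypothetical function names `encode_foa` and `estimate_direction` chosen for illustration.

```python
import numpy as np


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as a plane wave in FOA.

    Assumption: AmbiX convention (ACN order W, Y, Z, X; SN3D gains).
    Azimuth is counter-clockwise from the front, elevation is upward.
    Returns an array of shape (4, num_samples).
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])
    return gains[:, None] * np.asarray(mono, dtype=float)[None, :]


def estimate_direction(foa):
    """Estimate (azimuth, elevation) in degrees from a single-source FOA clip.

    Uses the time-averaged product of W with each directional channel,
    which is proportional to the source direction for a single plane wave.
    """
    w, y, z, x = foa
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation


# Usage: a 440 Hz tone placed 30 degrees to the left and 20 degrees up.
t = np.arange(1600) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
foa_clip = encode_foa(tone, azimuth_deg=30.0, elevation_deg=20.0)
az_est, el_est = estimate_direction(foa_clip)
```

In this idealized single-source case the estimate matches the encoded direction. With reverberation or several overlapping sources, this simple average no longer isolates one object, which is the kind of setting where learned spatial representations become relevant.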
