The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

Yuhuan You; Lai Wei; Xihong Wu; Tianshu Qu

arXiv:2601.02954·cs.SD·May 12, 2026

TL;DR

This paper introduces TWNM, a framework that enhances large audio-language models with explicit spatial understanding, enabling advanced scene analysis and reasoning in complex auditory environments.

Contribution

The paper formalizes audio scene analysis (ASA) as a three-level problem spanning atomic perception, relational integration, and cognitive reasoning, and proposes a framework that combines physically grounded spatial representations with curriculum training.

Findings

01. Achieved 70.8% overall accuracy on the ASA benchmark.

02. Demonstrated effective spatial attribute binding and scene reasoning.

03. Provided diagnostic references with explicit spatial labels.

Abstract

Large audio-language models have made rapid progress in recognizing what is present in an audio clip, but spatial audio-language understanding still lacks a clear task interface. A model must also decide where sound events occur, which semantic and spatial attributes belong to the same auditory object, how multiple objects are arranged, and whether a scene-level answer is physically plausible. We formalize this capability as audio scene analysis (ASA), a three-level problem spanning atomic perception, relational integration, and cognitive reasoning. We propose The World is Not Mono (TWNM), a framework that equips audio-language models with explicit spatial evidence. TWNM uses physically grounded First-Order Ambisonics (FOA) simulation for controllable supervision, learns slot-regularized spatial representations from multichannel audio, fuses them with semantic audio features, and trains…
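To make the First-Order Ambisonics (FOA) representation mentioned in the abstract concrete, the following is a minimal sketch of encoding a mono signal as a single free-field plane wave into four FOA channels, then recovering its direction from the time-averaged pseudo-intensity vector. It is not the paper's simulation or model pipeline. The channel order (W, Y, Z, X), the SN3D gains, the counterclockwise azimuth convention, and the function names `encode_foa` and `estimate_direction` are all assumptions chosen for illustration, and room acoustics are ignored.

```python
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as a plane wave in FOA.

    Assumed convention (not taken from the paper): AmbiX-style channel
    order W, Y, Z, X with SN3D gains; azimuth is measured
    counterclockwise from the front, elevation upward from the horizon.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left-right
        math.sin(el),                 # Z: up-down
        math.cos(az) * math.cos(el),  # X: front-back
    )
    return [[g * s for s in mono] for g in gains]


def estimate_direction(foa):
    """Estimate (azimuth, elevation) in degrees from the time-averaged
    pseudo-intensity vector, i.e. the correlation of W with X, Y, Z."""
    w, y, z, x = foa
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


# Hypothetical test signal: 10 ms of a 440 Hz tone sampled at 16 kHz.
mono = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
foa = encode_foa(mono, azimuth_deg=60.0, elevation_deg=20.0)
az, el = estimate_direction(foa)
print(f"estimated azimuth={az:.2f} deg, elevation={el:.2f} deg")
```

With a single anechoic source the estimate matches the encoding angles up to floating-point error. With several overlapping sources or reverberation, this simple estimator no longer isolates one direction, which is the kind of setting the abstract's learned spatial representations are meant to address.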

Code & Models

Datasets

shuitata618/twnm-benchmark-foa
dataset · 371 downloads