SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models

S Sakshi; Vaibhavi Lokegaonkar; Neil Zhang; Ramani Duraiswami; Sreyan Ghosh; Dinesh Manocha; Lie Lu

arXiv:2511.06606 · eess.AS · November 17, 2025

Open Access

TL;DR

SPUR is a lightweight framework that enhances large audio-language models with spatial perception capabilities, enabling better understanding of acoustic scenes by integrating spatial cues through minimal architectural modifications.

Contribution

We introduce SPUR, a plug-in method that equips LALMs with spatial awareness using an FOA encoder and a new spatial QA dataset, improving spatial reasoning without extensive retraining.

Findings

01 Enhanced spatial QA performance on SPUR-Set

02 Improved multi-speaker attribution accuracy

03 Preserved general audio understanding after fine-tuning

Abstract

Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audio, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap…
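
To make the (W, X, Y, Z) channels mentioned in the abstract concrete, the following is a minimal sketch of how a single mono source can be panned into first-order Ambisonics and how its direction can be read back from the channels. It is not SPUR's learned encoder. It uses the classical intensity-vector estimate and assumes an ideal anechoic point source, SN3D-style gains with W equal to the source signal, and azimuth measured counter-clockwise from the front. The function names `encode_foa`, `rotate_yaw`, and `estimate_direction` are hypothetical and chosen only for illustration.

```python
import math


def encode_foa(signal, azimuth_deg, elevation_deg):
    """Pan a mono signal into FOA channels (W, X, Y, Z).

    Assumption: SN3D-style gains, so W carries the signal unscaled.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gx = math.cos(az) * math.cos(el)  # front-back component
    gy = math.sin(az) * math.cos(el)  # left-right component
    gz = math.sin(el)                 # up-down component
    w = list(signal)
    x = [gx * s for s in signal]
    y = [gy * s for s in signal]
    z = [gz * s for s in signal]
    return w, x, y, z


def rotate_yaw(w, x, y, z, angle_deg):
    """Rotate the sound field about the vertical axis by angle_deg."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    x_rot = [c * xi - s * yi for xi, yi in zip(x, y)]
    y_rot = [s * xi + c * yi for xi, yi in zip(x, y)]
    # W (omnidirectional) and Z (vertical) are unchanged by a yaw rotation.
    return list(w), x_rot, y_rot, list(z)


def estimate_direction(w, x, y, z):
    """Estimate (azimuth, elevation) in degrees from the intensity vector.

    The time-summed products W*X, W*Y, W*Z point toward a single
    dominant source under the idealized assumptions stated above.
    """
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


if __name__ == "__main__":
    tone = [math.sin(0.1 * n) for n in range(200)]
    foa = encode_foa(tone, azimuth_deg=60.0, elevation_deg=20.0)
    print(estimate_direction(*foa))
    print(estimate_direction(*rotate_yaw(*foa, angle_deg=30.0)))
```

In this sketch a yaw rotation of the sound field shifts the estimated azimuth by the same angle and leaves elevation untouched, which is one simple sense in which FOA features can be rotation-aware. Real recordings with reverberation and overlapping sources break the single-source assumption, which is presumably part of why a learned encoder is used instead.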

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Phonetics and Phonology Research