Can Large Language Models Understand Spatial Audio?

Changli Tang; Wenyi Yu; Guangzhi Sun; Xianzhao Chen; Tian Tan; Wei Li; Jun Zhang; Lu Lu; Zejun Ma; Yuxuan Wang; Chao Zhang

arXiv:2406.07914·cs.SD·June 17, 2024

Open Access

TL;DR

This paper shows that large language models can be adapted to understand spatial information in multichannel audio, improving sound source localization and far-field speech recognition and enabling text-prompted extraction of speech from a specified direction.

Contribution

The paper introduces methods for LLMs to interpret spatial audio cues, significantly improving localization accuracy and enabling new audio-based interaction capabilities.

Findings

01

Achieved a 2.70° MAE in sound source localization on the Spatial LibriSpeech dataset, compared with about 6.60° for the prior benchmark.

02

Enhanced far-field speech recognition accuracy using spatial cues.

03

Enabled localisation-informed speech extraction (LSE), using text prompts to focus on sounds from a specified direction, even with overlapping speech.

Abstract

This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study 3 spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and localisation-informed speech extraction (LSE), achieving notable progress in each task. For SSL, our approach achieves an MAE of $2.70^{\circ}$ on the Spatial LibriSpeech dataset, substantially surpassing the prior benchmark of about $6.60^{\circ}$. Moreover, our model can employ spatial cues to improve FSR accuracy and execute LSE by selectively attending to sounds originating from a specified direction via text prompts, even amidst overlapping speech. These findings highlight the potential of…
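To make the localization figure concrete, the following is a minimal sketch of how a mean absolute error (MAE) in degrees could be computed for direction estimates. It assumes each estimate is a single azimuth angle and that errors wrap around at 360°; the abstract does not state the paper's exact metric definition, so the function names `angular_error_deg` and `mean_angular_error_deg` and the wraparound handling are illustrative assumptions, not the authors' evaluation code.

```python
def angular_error_deg(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180).

    Assumption: angles are azimuths on a circle, so 359 and 1 are 2 degrees apart.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_error_deg(preds_deg, trues_deg):
    """Mean absolute angular error over paired predictions and references."""
    if len(preds_deg) != len(trues_deg) or len(preds_deg) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [angular_error_deg(p, t) for p, t in zip(preds_deg, trues_deg)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical predictions and references, in degrees (not data from the paper).
    predictions = [10.0, 350.0]
    references = [12.0, 5.0]
    print(mean_angular_error_deg(predictions, references))
```

Under this reading, an MAE of 2.70° means that predicted directions deviate from the reference directions by 2.70° on average over the evaluation set.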

Taxonomy

Topics: Music and Audio Processing · Computational and Text Analysis Methods