SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing
Jinbo Hu, Yin Cao, Ming Wu, Zhenbo Luo, Jun Yang

TL;DR
SALM is a multi-modal framework that aligns spatial audio with natural language through structured embeddings and contrastive learning, so that acoustic scenes can be understood, interpreted, and edited.
Contribution
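SALM introduces structured audio embeddings, produced by a dual-branch audio encoder that separates the semantic and spatial components of a sound, and aligns them with a text encoder through multi-modal contrastive learning to support spatial audio understanding and editing.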
Findings
Effective cross-modal alignment of spatial audio and language
Zero-shot direction classification capability
Flexible spatial audio editing using text embeddings
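To illustrate the zero-shot direction classification finding, here is a minimal sketch of how classification in a shared audio-text embedding space typically works: the spatial audio embedding is compared against text embeddings of direction prompts, and the most similar prompt wins. The function names, the four direction labels, and the hand-made 3-dimensional vectors are all assumptions for illustration; they are not taken from the paper, and real embeddings would come from the trained encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_direction(audio_emb, prompt_embs):
    """Return the direction label whose text embedding is closest to audio_emb.

    prompt_embs maps a label (hypothetical, e.g. "left") to the text
    embedding of a prompt describing that direction.
    """
    scores = {label: cosine(audio_emb, emb) for label, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for encoder outputs (assumed values, not real model embeddings).
prompt_embs = {
    "front": [1.0, 0.0, 0.0],
    "left": [0.0, 1.0, 0.0],
    "right": [0.0, -1.0, 0.0],
    "back": [-1.0, 0.0, 0.0],
}
audio_emb = [0.2, 0.9, 0.1]  # pretend spatial embedding of a clip arriving from the left

label, scores = zero_shot_direction(audio_emb, prompt_embs)
print(label)
```

No direction-specific classifier is trained in this scheme; adding a new direction class only requires embedding a new text prompt.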
Abstract
Spatial audio understanding is essential for accurately perceiving and interpreting acoustic environments. However, existing audio-language models exhibit limitations in processing spatial audio and perceiving spatial acoustic scenes. To address this gap, we propose the Spatial Audio Language Model (SALM), a novel framework that bridges spatial audio and language through multi-modal contrastive learning. SALM integrates a text encoder with a dual-branch audio encoder that decomposes spatial sound into semantic and spatial components via structured audio embeddings. Key features of SALM include seamless alignment between spatial audio and natural language, both separate and joint extraction of spatial and semantic representations, zero-shot direction classification, and flexible support for spatial audio editing. Experimental results demonstrate that SALM effectively captures and aligns…
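The abstract does not spell out the training objective beyond "multi-modal contrastive learning". As a rough sketch, the following shows a symmetric CLIP-style InfoNCE loss, which is one common way to align paired audio and text embeddings; whether SALM uses exactly this form is an assumption. The function names, the temperature value of 0.07, and the toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    Pair i (audio_embs[i], text_embs[i]) is the positive; every other
    combination in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows index audio clips, columns index captions.
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch (assumed values): matched pairs should give a lower loss
# than the same embeddings paired in the wrong order.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = contrastive_loss(audio, text_matched)
loss_swapped = contrastive_loss(audio, text_swapped)
print(loss_matched < loss_swapped)
```

Minimizing a loss of this kind pulls each audio embedding toward its own caption and away from the other captions in the batch, which is the property the zero-shot and editing use cases rely on.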
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
