SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing

Jinbo Hu; Yin Cao; Ming Wu; Zhenbo Luo; Jun Yang

arXiv:2507.16724·cs.SD·September 19, 2025

Open Access

TL;DR

SALM is a multi-modal framework that aligns spatial audio with natural language through structured audio embeddings and contrastive learning, supporting both understanding and editing of acoustic scenes.

Contribution

SALM pairs a text encoder with a dual-branch audio encoder whose structured embeddings separate the semantic and spatial components of a sound, and trains both with multi-modal contrastive learning.

Findings

01 Effective cross-modal alignment of spatial audio and language

02 Zero-shot direction classification capability

03 Supports flexible spatial audio editing using text embeddings
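
The zero-shot direction classification finding can be illustrated with a minimal sketch. It assumes the spatial branch of the audio encoder and the text encoder map into one shared embedding space, so a direction is chosen by comparing the spatial audio embedding against text embeddings of direction prompts. The function names, the prompt labels, and the toy 3-dimensional vectors are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_direction(spatial_embedding, prompt_embeddings):
    """Pick the direction prompt whose text embedding is closest to the audio.

    spatial_embedding: hypothetical output of the spatial audio branch.
    prompt_embeddings: hypothetical text embeddings, keyed by direction label.
    """
    scores = {
        label: cosine_similarity(spatial_embedding, emb)
        for label, emb in prompt_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy stand-ins for real encoder outputs (illustrative values only).
prompt_embeddings = {
    "front": [1.0, 0.0, 0.0],
    "left": [0.0, 1.0, 0.0],
    "right": [0.0, -1.0, 0.0],
    "back": [-1.0, 0.0, 0.0],
}
audio_spatial = [0.2, 0.9, 0.1]

label, scores = zero_shot_direction(audio_spatial, prompt_embeddings)
print(label)  # prints "left"
```

No direction-specific classifier is trained in this sketch: the label set is defined entirely by the text prompts, which is what makes the classification zero-shot.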

Abstract

Spatial audio understanding is essential for accurately perceiving and interpreting acoustic environments. However, existing audio-language models exhibit limitations in processing spatial audio and perceiving spatial acoustic scenes. To address this gap, we propose the Spatial Audio Language Model (SALM), a novel framework that bridges spatial audio and language through multi-modal contrastive learning. SALM integrates a text encoder with a dual-branch audio encoder that decomposes spatial sound into semantic and spatial components via structured audio embeddings. Key features of SALM include seamless alignment between spatial audio and natural language, both separate and joint extraction of spatial and semantic representations, zero-shot direction classification, and flexible support for spatial audio editing. Experimental results demonstrate that SALM effectively captures and aligns…
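
The abstract names multi-modal contrastive learning but this excerpt does not give the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style loss of the kind common in audio-language and image-language pretraining. The choice of loss, the `temperature` value, and all function names are assumptions rather than details confirmed by the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    Row i of audio_embs and row i of text_embs are treated as a matched pair;
    every other row in the batch acts as a negative.
    """
    audio = [normalize(a) for a in audio_embs]
    text = [normalize(t) for t in text_embs]
    n = len(audio)

    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [[dot(a, t) / temperature for t in text] for a in audio]
    logits_t = [list(col) for col in zip(*logits)]

    audio_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_audio = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)


# Matched pairs give a loss near zero; mismatched pairs give a large loss.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # prints True
```

Minimizing a loss of this shape pulls each audio embedding toward its paired caption and away from the other captions in the batch, which is the alignment property the abstract describes.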

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies