U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding

Ziqian Wang; Xianjun Xia; Xinfa Zhu; Lei Xie

arXiv:2505.13880 · eess.AS · May 28, 2025

TL;DR

U-SAM is a unified audio language model that effectively integrates speech, audio, and music understanding using specialized encoders, a large language model, and novel loss functions to improve cross-modal alignment and generalization.

Contribution

The paper introduces U-SAM, a novel audio language model that combines domain-specific encoders, a Mixture of Experts projector, and a Semantic-Aware Contrastive Loss for improved multi-domain audio understanding.

Findings

01. U-SAM outperforms existing models on multiple benchmarks.
02. It demonstrates strong generalization to unseen audio tasks.
03. The model achieves better cross-modal alignment through its loss design.

Abstract

The text generation paradigm for audio tasks has opened new possibilities for unified audio understanding. However, existing models face significant challenges in achieving a comprehensive understanding across diverse audio types, such as speech, general audio events, and music. Furthermore, their exclusive reliance on cross-entropy loss for alignment often falls short, as it treats all tokens equally and fails to account for redundant audio features, leading to weaker cross-modal alignment. To deal with the above challenges, this paper introduces U-SAM, an advanced audio language model that integrates specialized encoders for speech, audio, and music with a pre-trained large language model (LLM). U-SAM employs a Mixture of Experts (MoE) projector for task-aware feature fusion, dynamically routing and integrating the domain-specific encoder outputs. Additionally, U-SAM incorporates a…
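The MoE projector described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the soft (softmax) gating, the concatenation of the three encoder outputs, the linear experts, and every name here (`MoEProjector`, `router`, `experts`, the toy dimensions) are assumptions chosen for illustration. It uses only the Python standard library so the routing arithmetic stays visible.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a (rows x len(x)) weight matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoEProjector:
    """Hypothetical soft-gated MoE projector (an assumption, not U-SAM's code)."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # The router scores each expert from the fused encoder features.
        self.router = random_matrix(num_experts, in_dim, rng)
        # Each expert is a single linear map into the LLM embedding space.
        self.experts = [random_matrix(out_dim, in_dim, rng)
                        for _ in range(num_experts)]

    def __call__(self, speech, audio, music):
        # Assumed fusion: concatenate the three domain-specific features.
        x = speech + audio + music
        gate = softmax(linear(self.router, x))
        outputs = [linear(w, x) for w in self.experts]
        # Weighted sum of expert outputs, one value per output dimension.
        fused = [sum(g * out[i] for g, out in zip(gate, outputs))
                 for i in range(len(outputs[0]))]
        return fused, gate


# Toy usage: 2-dim features per encoder, projected to a 4-dim "LLM" space.
projector = MoEProjector(in_dim=6, out_dim=4, num_experts=3)
fused, gate = projector([0.2, 0.1], [0.0, 0.5], [0.3, -0.1])
```

The gate weights sum to one, so the fused vector is a convex combination of the expert outputs. A sparse top-k router would be an equally plausible reading of "dynamically routing"; the abstract as quoted here does not say which variant U-SAM uses.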

Code & Models

Repositories

honee-w/u-sam (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis