U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding
Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie

TL;DR
U-SAM is a unified audio language model that effectively integrates speech, audio, and music understanding using specialized encoders, a large language model, and novel loss functions to improve cross-modal alignment and generalization.
Contribution
The paper introduces U-SAM, a novel audio language model that combines domain-specific encoders, a Mixture of Experts projector, and a Semantic-Aware Contrastive Loss for improved multi-domain audio understanding.
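For orientation, the sketch below shows a generic contrastive (InfoNCE-style) objective between paired audio and text embeddings. It is not the paper's Semantic-Aware Contrastive Loss, whose exact form is not given in this summary. The function names, the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Generic audio-to-text InfoNCE loss (assumed form, not U-SAM's loss).

    Row i of audio_embs is treated as the positive pair of row i of
    text_embs; every other text row in the batch acts as a negative.
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio_embs[i], text_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two audio clips and their text descriptions (made-up values).
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(audio, text_aligned)
loss_swapped = info_nce(audio, text_swapped)
```

In this toy batch the loss is small when each audio embedding sits closest to its own text embedding and large when the pairs are swapped, which is the behavior a contrastive alignment term is meant to encourage.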
Findings
U-SAM outperforms existing models on multiple benchmarks.
It demonstrates strong generalization to unseen audio tasks.
The model achieves better cross-modal alignment through its loss design.
Abstract
The text generation paradigm for audio tasks has opened new possibilities for unified audio understanding. However, existing models face significant challenges in achieving a comprehensive understanding across diverse audio types, such as speech, general audio events, and music. Furthermore, their exclusive reliance on cross-entropy loss for alignment often falls short, as it treats all tokens equally and fails to account for redundant audio features, leading to weaker cross-modal alignment. To deal with the above challenges, this paper introduces U-SAM, an advanced audio language model that integrates specialized encoders for speech, audio, and music with a pre-trained large language model (LLM). U-SAM employs a Mixture of Experts (MoE) projector for task-aware feature fusion, dynamically routing and integrating the domain-specific encoder outputs. Additionally, U-SAM incorporates a…
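The abstract's description of the MoE projector can be illustrated with a minimal sketch, assuming dense soft gating over the concatenated encoder outputs for a single frame. The class name `MoEProjector`, the dimensions, the random initialization, and the concatenation step are hypothetical choices made for illustration; the paper's actual routing scheme may differ.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoEProjector:
    """Hypothetical soft-gated mixture of linear experts (illustrative only)."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [random_matrix(out_dim, in_dim, rng)
                        for _ in range(num_experts)]
        self.gate = random_matrix(num_experts, in_dim, rng)

    def __call__(self, encoder_features):
        # Assumption: the domain-specific encoder outputs are concatenated.
        x = [v for feats in encoder_features for v in feats]
        # The gate turns the input into one mixing weight per expert.
        gate_weights = softmax(linear(self.gate, x))
        expert_outputs = [linear(w, x) for w in self.experts]
        # Fused feature = gate-weighted sum of the expert projections.
        out_dim = len(expert_outputs[0])
        fused = [sum(g * out[i] for g, out in zip(gate_weights, expert_outputs))
                 for i in range(out_dim)]
        return fused, gate_weights


# Made-up single-frame features from speech, audio, and music encoders.
speech = [0.2, -0.1, 0.4, 0.0]
audio = [0.5, 0.3, -0.2, 0.1]
music = [-0.3, 0.0, 0.1, 0.6]

projector = MoEProjector(in_dim=12, out_dim=8, num_experts=3)
fused, gate_weights = projector([speech, audio, music])
```

Because the gate weights depend on the input, different inputs produce different mixtures of the experts, which is the sense in which the fusion is dynamic. A sparse top-k gate is a common alternative to the dense softmax gate used here.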
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
