MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free

Yishu Lei; Shuwei He; Jing Hu; Dan Zhang; Xianlong Luo; Danxiang Zhu; Shikun Feng; Rui Liu; Jingzhou He; Yu Sun; Hua Wu; Haifeng Wang

arXiv:2601.02967 · cs.SD · January 9, 2026

TL;DR

This paper introduces the MoE-Adapter, a sparse Mixture-of-Experts architecture for large audio language models that disentangles heterogeneous acoustic attributes, reduces gradient conflicts, and improves performance on audio tasks.

Contribution

It proposes a novel MoE-Adapter with dynamic gating to decouple acoustic features, addressing gradient conflicts in multimodal audio models.
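The notion of gradient conflict can be made concrete with a small sketch. The listing does not say how the authors measure conflict; the check below uses a common convention, treating two task gradients on shared parameters as conflicting when their cosine similarity is negative. The function names, the toy losses, and the "speech" and "music" labels are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def gradient_cosine(g_a, g_b):
    """Cosine similarity between two gradients of the same shared parameters."""
    g_a = np.asarray(g_a, dtype=float).ravel()
    g_b = np.asarray(g_b, dtype=float).ravel()
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    if denom == 0.0:
        return 0.0  # a zero gradient cannot conflict with anything
    return float(g_a @ g_b / denom)


def is_conflicting(g_a, g_b):
    """Assumed convention: negative cosine similarity means conflict."""
    return gradient_cosine(g_a, g_b) < 0.0


# Toy case: one shared weight w, two hypothetical attribute losses
# (w - 1)^2 and (w + 1)^2, both evaluated at w = 0.
w = np.array([0.0])
grad_speech = 2.0 * (w - 1.0)  # d/dw (w - 1)^2 = -2 at w = 0
grad_music = 2.0 * (w + 1.0)   # d/dw (w + 1)^2 = +2 at w = 0

cosine = gradient_cosine(grad_speech, grad_music)  # -1.0
conflict = is_conflicting(grad_speech, grad_music)  # True
```

In this toy case the two losses want to move the single shared weight in opposite directions, which is the situation a dense, parameter-shared adapter is said to face when one set of weights must serve several acoustic attributes.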

Findings

01. Outperforms dense baselines on audio semantic tasks
02. Reduces gradient conflicts in training
03. Maintains comparable computational costs

Abstract

Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, it is well-known that acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing research is limited to a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization, as parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global…
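The routing idea described in the abstract, a gate that sends each audio token to a few specialized experts while a shared expert processes every token, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the class name `ToyMoEAdapter`, the linear experts, the softmax gate, the top-k selection with renormalized weights, and all dimensions are hypothetical choices made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyMoEAdapter:
    """Hypothetical sparse adapter: top-k routed experts plus one shared expert."""

    def __init__(self, d_in, d_out, n_routed=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Assumption: gate and experts are plain linear maps.
        self.w_gate = rng.normal(0.0, 0.02, (d_in, n_routed))
        self.w_routed = rng.normal(0.0, 0.02, (n_routed, d_in, d_out))
        self.w_shared = rng.normal(0.0, 0.02, (d_in, d_out))

    def route(self, tokens):
        """Return the top-k expert indices and renormalized gate weights per token."""
        probs = softmax(tokens @ self.w_gate)                   # (T, n_routed)
        top_idx = np.argsort(-probs, axis=-1)[:, : self.top_k]  # (T, top_k)
        top_p = np.take_along_axis(probs, top_idx, axis=-1)
        top_p = top_p / top_p.sum(axis=-1, keepdims=True)
        return top_idx, top_p

    def __call__(self, tokens):
        top_idx, top_p = self.route(tokens)
        # The shared expert sees every token.
        out = tokens @ self.w_shared
        # Each routed expert only processes the tokens that selected it.
        for e in range(self.w_routed.shape[0]):
            tok, slot = np.nonzero(top_idx == e)
            if tok.size:
                weight = top_p[tok, slot][:, None]
                out[tok] += weight * (tokens[tok] @ self.w_routed[e])
        return out


# Usage: 5 audio tokens of width 8 projected to width 16.
adapter = ToyMoEAdapter(d_in=8, d_out=16)
audio_tokens = np.random.default_rng(1).normal(size=(5, 8))
expert_idx, gate_w = adapter.route(audio_tokens)
projected = adapter(audio_tokens)
```

Because each token activates only `top_k` of the routed experts, per-token compute stays close to that of a dense adapter of similar width, and the parameters of any one routed expert receive gradients only from the tokens routed to it.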

Code & Models

Models

🤗 cslys1999/Eureka-Audio-Instruct (model · 193 downloads · 6 likes)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling