Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders

Weiqiao Shan; Yuang Li; Yuhao Zhang; Yingfeng Luo; Chen Xu; Xiaofeng Zhao; Long Meng; Yunfei Lu; Min Zhang; Hao Yang; Tong Xiao; Jingbo Zhu

arXiv:2502.15178·eess.AS·September 22, 2025

TL;DR

This paper introduces Prompt-aware Mixture (PaM), a method that connects multiple audio encoders to a large language model and uses the task prompt to select task-specific audio features. A single Speech LLM trained this way improves on single-encoder models for ASR, Speaker Number Verification, and audio captioning.

Contribution

The paper proposes a prompt-aware mixture approach that uses multiple audio encoders and prompt-dependent experts to extract task-specific features, outperforming single-encoder Speech LLMs as well as concatenation and averaging fusion baselines.

Findings

01. With PaM, a single Speech LLM surpasses the best results of all single-encoder Speech LLMs across multiple tasks.

02. PaM outperforms concatenation and averaging baselines for feature fusion.

03. The approach improves performance on ASR, Speaker Number Verification, and audio captioning (AC).

Abstract

Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, Speaker Number Verification, and AC tasks. PaM also outperforms…
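
To make the idea in the abstract concrete, the following is a minimal sketch of prompt-conditioned feature mixing, shown next to the averaging baseline. It is not the authors' implementation: the softmax gate, the names `gate_weights`, `mix_features`, and `average_features`, the toy prompt embeddings, and the gate matrix are all assumptions introduced here for illustration. The source only states that different experts extract different features based on the prompt.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(prompt_embedding, gate_matrix):
    """Map a prompt embedding to one weight per encoder.

    gate_matrix has one row per encoder (hypothetical learned parameters).
    """
    logits = [
        sum(w * p for w, p in zip(row, prompt_embedding))
        for row in gate_matrix
    ]
    return softmax(logits)


def mix_features(encoder_features, weights):
    """Weighted sum of per-encoder features.

    encoder_features: one [frames][dim] matrix per encoder, all the same shape.
    """
    n_frames = len(encoder_features[0])
    dim = len(encoder_features[0][0])
    mixed = [[0.0] * dim for _ in range(n_frames)]
    for feats, weight in zip(encoder_features, weights):
        for t in range(n_frames):
            for d in range(dim):
                mixed[t][d] += weight * feats[t][d]
    return mixed


def average_features(encoder_features):
    """Averaging baseline: every encoder gets the same weight for every prompt."""
    n = len(encoder_features)
    return mix_features(encoder_features, [1.0 / n] * n)


# Toy inputs (assumed): two encoders, two frames, two feature dimensions.
semantic_feats = [[1.0, 0.0], [1.0, 0.0]]   # stands in for a semantic encoder
acoustic_feats = [[0.0, 1.0], [0.0, 1.0]]   # stands in for an acoustic encoder
features = [semantic_feats, acoustic_feats]

gate_matrix = [[2.0, -2.0], [-2.0, 2.0]]    # hypothetical gate parameters
asr_prompt = [1.0, 0.0]                     # hypothetical embedding of an ASR prompt
ac_prompt = [0.0, 1.0]                      # hypothetical embedding of an AC prompt

asr_mix = mix_features(features, gate_weights(asr_prompt, gate_matrix))
ac_mix = mix_features(features, gate_weights(ac_prompt, gate_matrix))
baseline = average_features(features)
```

In this toy setup the ASR prompt puts most of the weight on the semantic encoder and the AC prompt puts most of it on the acoustic encoder, while the averaging baseline returns the same blend for every prompt. That contrast is the intuition behind task-specific features; the actual PaM architecture may differ substantially from a single softmax gate.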

Taxonomy

Methods: Adapter