AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
Ruchao Fan, Bo Ren, Yuxuan Hu, Rui Zhao, Shujie Liu, and Jinyu Li

TL;DR
AlignFormer introduces a novel neural adapter to better match speech and text modalities in speech-LLMs, enabling improved zero-shot instruction-following and task performance by reducing sequence length mismatch.
Contribution
The paper proposes AlignFormer, a new neural adapter with CTC and dynamic-window QFormer layers, to effectively align speech and text modalities for speech-LLMs, especially in zero-shot tasks.
Findings
AlignFormer achieves a near-100% instruction-following rate with audio-first training.
Audio-first training outperforms instruction-first training in instruction-following capability.
Speech-LLM with AlignFormer can perform zero-shot speech translation and question answering.
Abstract
Integrating speech into LLMs (speech-LLM) has gained increasing attention recently. The mainstream solution is to connect a well-trained speech encoder and LLM with a neural adapter. However, the length mismatch between the speech and text sequences is not well handled, leading to imperfect modality matching between speech and text. In this work, we propose a novel neural adapter, AlignFormer, to reduce the length gap between the two modalities. AlignFormer consists of CTC and dynamic-window QFormer layers, where the CTC alignment provides the dynamic window information for QFormer. The LLM backbone is frozen in training to preserve its text capability, especially the instruction following capability. When training with ASR data only, the proposed AlignFormer unlocks the instruction following capability for speech-LLM and the model can perform zero-shot speech translation (ST) and…
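To make the role of the CTC alignment concrete, here is a minimal sketch of how frame-level CTC predictions could be turned into per-token frame windows. It is an illustration under stated assumptions, not the paper's implementation: the function name `ctc_windows`, the blank id of 0, and the rule for assigning blank frames to a window are all hypothetical choices made for this example.

```python
from typing import List, Tuple

BLANK = 0  # assumed blank id; the paper's vocabulary layout is not given here


def ctc_windows(frame_ids: List[int], blank: int = BLANK) -> List[Tuple[int, int, int]]:
    """Return (token_id, start_frame, end_frame_exclusive) for each emitted token.

    Assumed windowing rule (hypothetical): a token starts at a non-blank frame
    whose label differs from the previous frame's label. Its window runs until
    the next token starts, so trailing blanks are absorbed, and any leading
    blanks are attached to the first token.
    """
    starts = []
    prev = blank
    for t, label in enumerate(frame_ids):
        # Standard CTC collapse: repeated labels merge, blanks separate repeats.
        if label != blank and label != prev:
            starts.append((label, t))
        prev = label

    windows = []
    for i, (label, start) in enumerate(starts):
        begin = 0 if i == 0 else start
        end = starts[i + 1][1] if i + 1 < len(starts) else len(frame_ids)
        windows.append((label, begin, end))
    return windows


# Eight frames of greedy (argmax) CTC output collapse to three tokens.
frames = [0, 5, 5, 0, 5, 7, 7, 0]
for token, begin, end in ctc_windows(frames):
    print(f"token {token}: frames [{begin}, {end})")
```

In this sketch, each window would be the span that one QFormer query attends to, so the adapter emits one embedding per token rather than one per frame. That is how the speech sequence length moves toward the text sequence length.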
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Softmax · Attention Is All You Need
