Connecting Speech Encoder and Large Language Model for ASR
Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

TL;DR
This study compares connector structures for integrating speech encoders with large language models in ASR, showing that Q-Former connectors give consistent and considerable word error rate reductions over the other connectors and generalize well to out-of-domain data.
Contribution
The paper presents a comparative analysis of connector structures and proposes a novel segment-level Q-Former that improves ASR on long speech inputs.
Findings
Q-Formers outperform other connectors in WER reduction.
Q-Formers generalize well to out-of-domain datasets.
Segment-level Q-Former improves recognition of long speech segments.
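The findings above are stated in terms of word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and a relative reduction compares two WER values. The function names `word_error_rate` and `relative_wer_reduction` are illustrative choices, not taken from the paper, and the sketch applies no text normalization, which real evaluations typically do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.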
Abstract
The impressive capability and versatility of large language models (LLMs) have aroused increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder with an LLM. This paper presents a comparative study of three commonly used structures as connectors, including fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series as well as LLMs from the Vicuna model series with different model sizes were studied. Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets, where the LLMs with Q-Formers demonstrated consistent and considerable word error rate (WER) reductions over LLMs with other connector structures. Q-Former-based LLMs can generalise well to out-of-domain datasets, where 12% relative WER…
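To make the connector idea more concrete, the following is a minimal NumPy sketch of a Q-Former-style connector reduced to a single cross-attention step: a fixed set of trainable queries attends over the speech encoder outputs and produces a fixed number of vectors in the LLM embedding dimension. It is a simplification under stated assumptions, not the paper's implementation. A real Q-Former stacks several transformer layers with self-attention and feed-forward blocks, the weights here are random rather than trained, and all class, function, and parameter names are hypothetical. The segment-level variant is likewise an assumption about how such a scheme could work: the encoder output is split into fixed-length segments, the same queries are applied to each, and the results are concatenated.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class QueryCrossAttentionConnector:
    """Single-head cross-attention with trainable queries (illustrative only)."""

    def __init__(self, enc_dim: int, llm_dim: int, num_queries: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(enc_dim)
        # Randomly initialised stand-ins for parameters that would be trained.
        self.queries = rng.normal(0.0, scale, (num_queries, enc_dim))
        self.w_q = rng.normal(0.0, scale, (enc_dim, enc_dim))
        self.w_k = rng.normal(0.0, scale, (enc_dim, enc_dim))
        self.w_v = rng.normal(0.0, scale, (enc_dim, enc_dim))
        self.w_out = rng.normal(0.0, scale, (enc_dim, llm_dim))

    def __call__(self, enc_out: np.ndarray) -> np.ndarray:
        # enc_out: (num_frames, enc_dim) speech encoder outputs.
        q = self.queries @ self.w_q                       # (num_queries, enc_dim)
        k = enc_out @ self.w_k                            # (num_frames, enc_dim)
        v = enc_out @ self.w_v                            # (num_frames, enc_dim)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (num_queries, num_frames)
        return (attn @ v) @ self.w_out                    # (num_queries, llm_dim)


def segment_level_connect(connector, enc_out: np.ndarray, segment_len: int) -> np.ndarray:
    """Apply the same connector to each fixed-length segment and concatenate."""
    pieces = [
        connector(enc_out[start:start + segment_len])
        for start in range(0, len(enc_out), segment_len)
    ]
    return np.concatenate(pieces, axis=0)


if __name__ == "__main__":
    connector = QueryCrossAttentionConnector(enc_dim=8, llm_dim=16, num_queries=4)
    frames = np.random.default_rng(1).normal(size=(50, 8))
    print(connector(frames).shape)                          # whole-utterance output
    print(segment_level_connect(connector, frames, 20).shape)  # per-segment output
```

The design point the sketch illustrates is that the whole-utterance connector always emits `num_queries` vectors regardless of input length, whereas the segment-level variant emits `num_queries` vectors per segment, so its output length grows with the duration of the speech.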
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
