LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors
Rao Ma, Tongzhou Chen, Kartik Audhkhasi, Bhuvana Ramabhadran

TL;DR
LegoSLM introduces a novel method to connect speech encoders with large language models using CTC posteriors, enhancing speech recognition and translation performance with modularity and domain adaptation capabilities.
Contribution
The paper presents LegoSLM, a new paradigm that bridges speech encoders and LLMs via CTC posteriors, enabling improved performance and flexibility in spoken language processing tasks.
Findings
Achieves a 49% word error rate reduction (WERR) over the baseline on 8 MLS test sets.
Demonstrates modularity: the speech encoder and the LLM can be switched zero-shot.
Supports domain adaptation through decode-time temperature control.
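The decode-time temperature control in the last finding can be illustrated with a small sketch. This summary does not say where the temperature is applied, so the code assumes it divides the CTC logits before the softmax; `softmax_with_temperature` and the example logits are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by a temperature.

    Assumption: the temperature rescales CTC logits before normalisation.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for one frame over a 3-token vocabulary.
frame_logits = np.array([2.0, 1.0, 0.1])

sharp = softmax_with_temperature(frame_logits, temperature=0.5)
plain = softmax_with_temperature(frame_logits, temperature=1.0)
flat = softmax_with_temperature(frame_logits, temperature=2.0)
```

Under this assumption, a temperature below 1 sharpens the posterior so the speech encoder's top token dominates, while a temperature above 1 flattens it and leaves more of the decision to the LLM.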
Abstract
Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To combine the two kinds of model effectively, continuous speech prompts and ASR error correction have been adopted. However, these methods are prone to suboptimal performance or are inflexible. In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. The speech encoder is trained to generate Connectionist Temporal Classification (CTC) posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings by computing a weighted sum of the LLM input embeddings. These embeddings are concatenated with text embeddings in the LLM input space. Using the…
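The reconstruction step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the shapes, the random stand-in data, the prompt token ids, and the function name `pseudo_audio_embeddings` are all assumptions, and the sketch sidesteps how the CTC blank symbol is mapped onto the LLM vocabulary.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pseudo_audio_embeddings(ctc_logits, llm_embeddings):
    """Posterior-weighted sum of LLM input embeddings, one vector per frame.

    ctc_logits:     (T, V) per-frame scores over the LLM vocabulary
    llm_embeddings: (V, D) LLM input embedding table
    returns:        (T, D) pseudo-audio embeddings
    """
    posteriors = softmax(ctc_logits, axis=-1)  # (T, V), each row sums to 1
    return posteriors @ llm_embeddings         # (T, D)

# Hypothetical sizes: 5 frames, 8 vocabulary entries, 4-dim embeddings.
rng = np.random.default_rng(0)
T, V, D = 5, 8, 4
ctc_logits = rng.normal(size=(T, V))
llm_embeddings = rng.normal(size=(V, D))

audio = pseudo_audio_embeddings(ctc_logits, llm_embeddings)

# Text embeddings for a hypothetical prompt made of token ids 1, 2, 3.
text = llm_embeddings[[1, 2, 3]]

# Concatenate along the sequence axis to form the LLM input.
llm_input = np.concatenate([audio, text], axis=0)  # shape (T + 3, D)
```

Because each frame's output is a convex combination of rows of the embedding table, a posterior that puts all its mass on one token reproduces that token's embedding exactly.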
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Softmax
