LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors

Rao Ma; Tongzhou Chen; Kartik Audhkhasi; Bhuvana Ramabhadran

arXiv:2505.11352 · cs.CL · May 19, 2025

Open Access

TL;DR

LegoSLM introduces a novel method to connect speech encoders with large language models using CTC posteriors, enhancing speech recognition and translation performance with modularity and domain adaptation capabilities.

Contribution

The paper presents LegoSLM, a new paradigm that bridges speech encoders and LLMs via CTC posteriors, enabling improved performance and flexibility in spoken language processing tasks.

Findings

01. Achieves a 49% word error rate reduction (WERR) over the baseline on 8 MLS test sets.

02. Demonstrates modularity, allowing zero-shot model switching.

03. Enables effective domain adaptation through decode-time temperature control.
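
To make the temperature finding concrete, here is a minimal sketch of a temperature-scaled softmax. The page says only that temperature is controlled at decode time. Applying it to the speech encoder's per-frame logits is an assumption made for illustration, and the function name `softmax_with_temperature` is hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by `temperature`.

    Assumption: the temperature acts on the encoder's per-frame logits.
    Values below 1 sharpen the distribution; values above 1 flatten it.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy logits for one frame over a 3-token vocabulary (made-up values).
frame_logits = np.array([2.0, 1.0, 0.0])
sharp = softmax_with_temperature(frame_logits, temperature=0.5)
flat = softmax_with_temperature(frame_logits, temperature=2.0)
```

In this sketch a sharper posterior puts more weight on the encoder's top token, while a flatter one spreads the weight across the vocabulary.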

Abstract

Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To effectively combine both models for better performance, continuous speech prompts and ASR error correction have been adopted. However, these methods are prone to suboptimal performance or are inflexible. In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. The speech encoder is trained to generate Connectionist Temporal Classification (CTC) posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings by computing a weighted sum of the LLM input embeddings. These embeddings are concatenated with text embeddings in the LLM input space. Using the…
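
The embedding reconstruction described in the abstract can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the shapes, the random toy data, and the name `pseudo_audio_embeddings` are all hypothetical, and handling of the CTC blank symbol is left out.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pseudo_audio_embeddings(ctc_logits, llm_embedding_table):
    """Weighted sum of LLM input embeddings, one vector per speech frame.

    ctc_logits:          (T, V) encoder outputs over the LLM vocabulary
    llm_embedding_table: (V, D) LLM input embedding matrix
    returns:             (T, D) pseudo-audio embeddings
    """
    posteriors = softmax(ctc_logits, axis=-1)  # (T, V), each row sums to 1
    return posteriors @ llm_embedding_table    # (T, D)

# Toy sizes (assumed): 5 frames, vocabulary of 8 tokens, embedding size 4.
rng = np.random.default_rng(0)
T, V, D = 5, 8, 4
ctc_logits = rng.normal(size=(T, V))
embedding_table = rng.normal(size=(V, D))

audio_emb = pseudo_audio_embeddings(ctc_logits, embedding_table)

# Text prompt embeddings are ordinary lookups in the same table.
text_emb = embedding_table[[1, 2, 3]]

# Concatenate along the sequence axis to form the LLM input.
llm_input = np.concatenate([audio_emb, text_emb], axis=0)
```

Because each row of the posterior matrix sums to one, every pseudo-audio embedding is a convex combination of rows of the embedding table; a one-hot posterior reproduces the corresponding token embedding exactly.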

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Softmax