SSR: Alignment-Aware Modality Connector for Speech Language Models
Weiting Tan, Hirofumi Inaguma, Ning Dong, Paden Tomasello, Xutai Ma

TL;DR
This paper introduces SSR-Connector, a novel method for fusing speech with pre-trained language models that improves speech understanding accuracy and mitigates catastrophic forgetting through segmentation, compression, and a two-stage training process.
Contribution
The paper presents SSR-Connector, a new alignment-aware modality connector that enhances speech-text fusion by leveraging alignments and a specialized training pipeline.
Findings
+10 accuracy on StoryCloze
+20 accuracy on Speech-MMLU
Better speech understanding and preservation of pre-trained text abilities
Abstract
Fusing speech into a pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality. We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion. Leveraging speech-text alignments, our approach segments and compresses speech features to match the granularity of text embeddings. Additionally, we introduce a two-stage training pipeline that includes distillation and fine-tuning phases to mitigate catastrophic forgetting. SSR-Connector outperforms existing mechanisms for speech-text modality fusion, consistently achieving better speech understanding (e.g., +10 accuracy on StoryCloze and +20 on Speech-MMLU) while preserving pre-trained text ability.
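The segment-and-compress idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: mean pooling stands in for whatever compression the connector actually learns, mean squared error stands in for the distillation objective, and the names `compress_by_alignment`, `distillation_loss`, and `boundaries` are hypothetical. It assumes an aligner has already supplied, for each text token, the index where its speech segment ends.

```python
import numpy as np


def compress_by_alignment(frames, boundaries):
    """Pool frame-level speech features into one vector per aligned segment.

    frames:     (T, D) array of speech encoder outputs (assumed input).
    boundaries: increasing exclusive end indices, one per text token,
                with the last one equal to T (assumed to come from an aligner).
    Returns an (S, D) array, where S == len(boundaries).
    """
    frames = np.asarray(frames, dtype=np.float64)
    pooled = []
    start = 0
    for end in boundaries:
        if not start < end <= len(frames):
            raise ValueError(f"invalid segment [{start}, {end})")
        # Mean pooling is an illustrative stand-in for learned compression.
        pooled.append(frames[start:end].mean(axis=0))
        start = end
    return np.stack(pooled)


def distillation_loss(speech_repr, text_emb):
    """Mean squared error between compressed speech vectors and text embeddings.

    MSE is an assumption here; the source only says a distillation phase exists.
    """
    speech_repr = np.asarray(speech_repr, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    return float(np.mean((speech_repr - text_emb) ** 2))


if __name__ == "__main__":
    # Six frames of 2-dim features, aligned to three text tokens.
    frames = np.arange(12).reshape(6, 2)
    segments = compress_by_alignment(frames, boundaries=[2, 3, 6])
    print(segments.shape)  # one vector per text token: (3, 2)
    print(distillation_loss(segments, np.zeros((3, 2))))
```

After compression the speech sequence has the same length as the text token sequence, which is what lets a distillation stage compare the two representations position by position before any fine-tuning of the language model.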
Peer Reviews
Decision: Submitted to ICLR 2025
The proposed method has some connections with the previous methods in speech downsampling and model distillation but is novel enough. The evaluation is very thorough.
Although the paper acknowledges the three categories of connector designs in Figure 1, it does not compare enough against the design in Figure 1(a). A single-stage or 2-stage baseline using the much simpler 1(a) design would help motivate the more complex design proposed in this paper. Some experimental comparisons and explanations could be improved (see questions).
The proposed method outperforms existing speech fusion mechanisms in terms of speech understanding.
The cost-performance ratio of the proposed method is lower when compared to a cascaded ASR+LLM system. Please refer to the questions section regarding my queries.
1. The paper is well-structured and systematically describes each component of the SSR-Connector. Additionally, it clearly explains the methodology used to prevent catastrophic forgetting and offers comparisons with prior models.
2. The SSR-Connector introduces a novel "alignment-aware" approach to speech-text modality fusion.
3. The paper provides a comprehensive analysis of the proposed SSR-Connector across a variety of benchmarks, including StoryCloze, sWUGGY, and Speech-MMLU.
1. One significant aspect of SpeechLMs is their ability to capture information in speech beyond just the content. This has been analyzed in many related studies; for instance, in SALMONN [1] and Qwen2-Audio [2], paralinguistic information, such as emotion, is treated as an important evaluation aspect, and SpiritLM also examines capabilities related to understanding paralinguistic information. In addition, as the authors mention in the introduction, cascade systems are unable to learn paralinguis
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
