SSR: Alignment-Aware Modality Connector for Speech Language Models

Weiting Tan; Hirofumi Inaguma; Ning Dong; Paden Tomasello; Xutai Ma

arXiv:2410.00168·cs.CL·May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces SSR-Connector, a novel method for fusing speech with pre-trained language models that improves speech understanding accuracy and mitigates catastrophic forgetting through segmentation, compression, and a two-stage training process.

Contribution

The paper presents SSR-Connector, a new alignment-aware modality connector that enhances speech-text fusion by leveraging alignments and a specialized training pipeline.

Findings

01 · +10 accuracy on StoryCloze

02 · +20 accuracy on Speech-MMLU

03 · Better speech understanding and preservation of pre-trained text abilities

Abstract

Fusing speech into a pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality. We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion. Leveraging speech-text alignments, our approach segments and compresses speech features to match the granularity of text embeddings. Additionally, we introduce a two-stage training pipeline, consisting of a distillation phase and a fine-tuning phase, to mitigate catastrophic forgetting. SSR-Connector outperforms existing mechanisms for speech-text modality fusion, consistently achieving better speech understanding (e.g., +10 accuracy on StoryCloze and +20 on Speech-MMLU) while preserving pre-trained text ability.
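The segment-and-compress idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how segments are compressed or which distillation objective is used, so mean pooling and a mean-squared-error loss are assumptions chosen for simplicity. The function names `segment_and_compress` and `mse_distillation_loss`, the boundary format, and the toy vectors are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def segment_and_compress(frames: Sequence[Vector],
                         boundaries: Sequence[int]) -> List[Vector]:
    """Compress frame-level speech features to one vector per aligned segment.

    `boundaries` holds the exclusive end index of each segment, in strictly
    increasing order, with the last entry equal to len(frames). In practice
    these would come from a speech-text aligner (assumption). Mean pooling is
    an illustrative choice, not necessarily the paper's compressor.
    """
    if not boundaries or boundaries[-1] != len(frames):
        raise ValueError("last boundary must equal the number of frames")
    pooled: List[Vector] = []
    start = 0
    for end in boundaries:
        if end <= start:
            raise ValueError("boundaries must be strictly increasing")
        segment = frames[start:end]
        dim = len(segment[0])
        # Average each feature dimension over the frames in this segment.
        pooled.append([sum(f[d] for f in segment) / len(segment)
                       for d in range(dim)])
        start = end
    return pooled


def mse_distillation_loss(speech_vecs: Sequence[Vector],
                          text_embs: Sequence[Vector]) -> float:
    """Mean squared error between compressed speech vectors and text embeddings.

    Stands in for a distillation objective (assumption): it pulls each
    compressed speech segment toward the embedding of its aligned text token.
    """
    if len(speech_vecs) != len(text_embs):
        raise ValueError("need one text embedding per speech segment")
    total, count = 0.0, 0
    for s, t in zip(speech_vecs, text_embs):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count


# Toy example: 5 speech frames aligned to 2 text tokens (frames 0-1 and 2-4).
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
boundaries = [2, 5]
compressed = segment_and_compress(frames, boundaries)

# Hypothetical text embeddings for the two aligned tokens.
text_embeddings = [[2.0, 0.0], [0.0, 4.0]]
loss = mse_distillation_loss(compressed, text_embeddings)
```

After compression the speech sequence has the same length as the text sequence (two vectors for two tokens), which is what makes a position-wise comparison against text embeddings possible in a distillation stage.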

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The proposed method has some connections with the previous methods in speech downsampling and model distillation but is novel enough. The evaluation is very thorough.

Weaknesses

Although the paper acknowledges the three categories of connector designs in Figure 1, it does not compare enough to Figure 1(a). Compared with the proposed 2-stage method, a single-stage or 2-stage baseline with a much simpler 1(a) design can motivate the more complex design in this paper. Some experiment comparisons and explanations can be improved (see questions).

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The proposed method outperforms existing speech fusion mechanisms in terms of speech understanding.

Weaknesses

The cost-performance ratio of the proposed method is lower when compared to a cascaded ASR+LLM system. Please refer to the questions section regarding my queries.

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The paper is well-structured and systematically describes each component of the SSR-Connector. Additionally, it clearly explains the methodology used to prevent catastrophic forgetting and offers comparisons with prior models.
2. The SSR-Connector introduces a novel "alignment-aware" approach to speech-text modality fusion.
3. The paper provides a comprehensive analysis of the proposed SSR-Connector across a variety of benchmarks, including StoryCloze, sWUGGY, and Speech-MMLU.

Weaknesses

1. One significant aspect of SpeechLMs is their ability to capture information in speech beyond just the content. This has been analyzed in many related studies; for instance, in SALMONN [1] and Qwen2-Audio [2], paralinguistic information, such as emotion, is treated as an important evaluation aspect, and SpiritLM also examines capabilities related to understanding paralinguistic information. In addition, as the authors mention in the introduction, cascade systems are unable to learn paralinguis

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling