Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs

Ming-Hao Hsu; Xueyao Zhang; Xiaohai Tian; Jun Zhang; Zhizheng Wu

arXiv:2603.01502 · cs.CL · March 3, 2026

Open Access

TL;DR

This paper analyzes, layer by layer, how speech and text representations evolve inside end-to-end speech-language models in order to explain the persistent modality gap. It finds that speech representations exhibit a broad cross-layer alignment band and that simple statistical calibration can be harmful when applied at the input layer.

Contribution

The paper provides a layer-wise analysis of speech and text representations in four open-weight end-to-end speech-language models, showing that the alignment patterns are structurally stable across analysis configurations and highlighting the limitations of simple statistical calibration.

Findings

01. Speech representations show a broad cross-layer alignment band.

02. Alignment patterns are stable across different analysis configurations.

03. Simple statistical calibration can be harmful when applied at the input layer.

Abstract

Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is…
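
The cross-layer CKA analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes linear CKA (the abstract does not say which CKA variant the authors use) and assumes each layer's activations have already been reduced to one vector per aligned speech-text token, so that row i of a speech matrix and row i of a text matrix refer to the same token. The function names `linear_cka` and `cross_layer_cka` are hypothetical, and the code is written in plain Python so it runs without dependencies.

```python
import math


def center_columns(rows):
    """Subtract the per-feature mean so every column sums to zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def gram(rows):
    """Gram matrix of inner products between all pairs of rows (tokens)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def frobenius_inner(k, l):
    """Sum of elementwise products of two equally sized matrices."""
    return sum(a * b for rk, rl in zip(k, l) for a, b in zip(rk, rl))


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows.

    x and y are lists of rows (one row per aligned token); their feature
    dimensions may differ. Assumption: linear kernel, not necessarily the
    variant used in the paper. Raises ZeroDivisionError if either input
    has no variance across tokens.
    """
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    return frobenius_inner(k, l) / math.sqrt(
        frobenius_inner(k, k) * frobenius_inner(l, l)
    )


def cross_layer_cka(speech_layers, text_layers):
    """Grid of CKA scores: entry [i][j] compares speech layer i with text layer j."""
    return [[linear_cka(s, t) for t in text_layers] for s in speech_layers]
```

In a grid like this, a "broad cross-layer alignment band" would show up as a given speech layer scoring highly against a range of text layers instead of against a single matching layer. Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, which is why it is a common choice for comparing layers with different widths or bases.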

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems