Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs
Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, Zhizheng Wu

TL;DR
This paper investigates the layer-wise dynamics of end-to-end speech-language models to understand the persistent modality gap. It finds that speech representations align with text over a broad band of layers, and that simple statistical calibration can be harmful when applied at the input layer.
Contribution
The paper provides a layer-wise analysis of speech and text representations in speech-language models, highlighting the structural stability of alignment patterns and the limitations of simple statistical calibration.
Findings
Speech representations show a broad cross-layer alignment band.
Alignment patterns are stable across different analysis configurations.
Simple statistical calibration can be harmful when applied at the input layer.
Abstract
Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is…
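The abstract names cross-layer CKA with speech-text token alignment but does not spell out the computation. The following is a minimal sketch, not the authors' implementation. It assumes linear CKA (the abstract does not say which CKA variant is used), assumes speech frames are mean-pooled into one vector per text token, and uses hypothetical names throughout (`pool_frames_to_tokens`, `linear_cka`, `cross_layer_cka`, and the `spans` alignment format).

```python
import numpy as np


def pool_frames_to_tokens(frames, spans):
    """Mean-pool speech frames into one vector per text token.

    frames: (n_frames, dim) hidden states for one utterance.
    spans:  hypothetical alignment format, a list of half-open
            (start, end) frame index ranges, one per text token.
    """
    return np.stack([frames[start:end].mean(axis=0) for start, end in spans])


def linear_cka(x, y):
    """Linear CKA between two (n_samples, dim) activation matrices.

    Rows must correspond to the same samples (here: the same tokens).
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cross_layer_cka(speech_layers, text_layers):
    """Return M with M[i, j] = CKA(speech layer i, text layer j)."""
    m = np.zeros((len(speech_layers), len(text_layers)))
    for i, s in enumerate(speech_layers):
        for j, t in enumerate(text_layers):
            m[i, j] = linear_cka(s, t)
    return m


if __name__ == "__main__":
    # Toy data only: random activations, 3 speech frames per token.
    rng = np.random.default_rng(0)
    n_tokens, dim, n_layers = 32, 16, 4
    spans = [(3 * t, 3 * t + 3) for t in range(n_tokens)]
    speech = [
        pool_frames_to_tokens(rng.normal(size=(3 * n_tokens, dim)), spans)
        for _ in range(n_layers)
    ]
    text = [rng.normal(size=(n_tokens, dim)) for _ in range(n_layers)]
    print(cross_layer_cka(speech, text).round(2))
```

Under this reading, a broad cross-layer alignment band would show up as high values spread over many (i, j) cells of the matrix rather than concentrated in a narrow stripe. The toy data above is random, so it illustrates only the shapes involved, not the paper's results.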
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
