Generation-Step-Aware Framework for Cross-Modal Representation and Control in Multilingual Speech-Text Models
Toshiki Nakai, Varsha Suresh, and Vera Demberg

TL;DR
This paper introduces a framework for analyzing cross-modal language processing in multilingual speech-text models. It shows that whether a neural mechanism is shared across modalities or modality-specific depends on both the decoding step and the neuron's function.
Contribution
The paper provides a generation-step-aware method for identifying language-representation and language-control neurons and comparing them across modalities in speech-text models.
Findings
Cross-modal language alignment is strongest at the first decoding step.
Shared language-representation neurons weaken as generation proceeds.
Language-control neurons transfer causally from speech to text, especially at later steps.
Abstract
Multilingual speech-text models rely on cross-modal language alignment to transfer knowledge between speech and text, but it remains unclear whether this reflects shared computation for the same language or modality-specific processing. We introduce a generation-step-aware framework for evaluating cross-modal computation that (i) identifies language-selective neurons for each modality at different decoding steps, (ii) decomposes them into language-representation and language-control roles, and (iii) enables cross-modal comparison via overlap measures and causal intervention, including cross-modal steering of output language. Applying our framework to SeamlessM4T v2, we find that cross-modal language alignment is strongest at the first decoding step, where language-representation neurons are shared across modalities, but weakens as generation proceeds, indicating a shift toward…
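The abstract mentions comparing neuron sets across modalities "via overlap measures" at different decoding steps. As a rough illustration only, the sketch below computes a per-step Jaccard overlap between speech-side and text-side neuron sets. The choice of Jaccard, the function names, and the toy neuron indices are all assumptions made for illustration; the abstract does not say which overlap measure the authors use, and the numbers are not results from the paper.

```python
from typing import Dict, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def overlap_by_step(
    speech: Dict[int, Set[int]],
    text: Dict[int, Set[int]],
) -> Dict[int, float]:
    """Compare neuron sets step by step, using only steps present in both modalities."""
    shared_steps = sorted(speech.keys() & text.keys())
    return {step: jaccard(speech[step], text[step]) for step in shared_steps}


# Hypothetical toy data: decoding step -> indices of language-selective neurons.
# These indices are invented for the example and carry no empirical meaning.
speech_neurons = {0: {1, 2, 3, 4}, 1: {1, 2, 5, 6}, 2: {7, 8, 9}}
text_neurons = {0: {1, 2, 3, 4}, 1: {1, 2, 3, 4}, 2: {1, 2, 3}}

overlaps = overlap_by_step(speech_neurons, text_neurons)
for step, score in overlaps.items():
    print(f"step {step}: overlap = {score:.3f}")
```

Keeping the step as the dictionary key makes the comparison generation-step-aware by construction: each step gets its own overlap score rather than one score pooled over the whole output sequence.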
