Generation-Step-Aware Framework for Cross-Modal Representation and Control in Multilingual Speech-Text Models

Toshiki Nakai, Varsha Suresh, and Vera Demberg

arXiv:2601.17387 · cs.CL · April 3, 2026


TL;DR

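This paper introduces a generation-step-aware framework for analyzing cross-modal language processing in multilingual speech-text models. Applied to SeamlessM4T v2, it shows that whether a neural mechanism is shared between speech and text or is modality-specific depends on both the decoding step and the neuron's function (language representation versus language control).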

Contribution

A generation-step-aware method that identifies language-selective neurons for each modality, separates them into language-representation and language-control roles, and compares them across modalities through overlap measures and causal intervention.

Findings

01. Cross-modal language alignment is strongest at the first decoding step.

02. The sharing of language-representation neurons across modalities weakens as generation proceeds.

03. Language-control neurons transfer causally from speech to text, especially at later decoding steps.

Abstract

Multilingual speech-text models rely on cross-modal language alignment to transfer knowledge between speech and text, but it remains unclear whether this reflects shared computation for the same language or modality-specific processing. We introduce a generation-step-aware framework for evaluating cross-modal computation that (i) identifies language-selective neurons for each modality at different decoding steps, (ii) decomposes them into language-representation and language-control roles, and (iii) enables cross-modal comparison via overlap measures and causal intervention, including cross-modal steering of output language. Applying our framework to SeamlessM4T v2, we find that cross-modal language alignment is strongest at the first decoding step, where language-representation neurons are shared across modalities, but weakens as generation proceeds, indicating a shift toward…
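As a rough illustration of step (iii), the sketch below compares sets of language-selective neurons across modalities at each decoding step. The abstract does not say which overlap measure the authors use, so the Jaccard index here is an assumption, as are the function names `jaccard` and `overlap_by_step` and the toy neuron indices. It is not the paper's implementation.

```python
from typing import Dict, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap |a & b| / |a | b|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_by_step(
    speech: Dict[int, Set[int]], text: Dict[int, Set[int]]
) -> Dict[int, float]:
    """Overlap of speech- and text-selected neuron sets at each shared decoding step.

    Keys are decoding-step indices; values are sets of neuron indices.
    How the neurons are selected is outside the scope of this sketch.
    """
    shared_steps = sorted(speech.keys() & text.keys())
    return {step: jaccard(speech[step], text[step]) for step in shared_steps}


# Hypothetical neuron indices for one language, made up for illustration only.
speech_neurons = {0: {1, 2, 3, 4}, 1: {1, 2, 7, 8}}
text_neurons = {0: {1, 2, 3, 5}, 1: {1, 9, 10, 11}}

overlaps = overlap_by_step(speech_neurons, text_neurons)
for step, value in overlaps.items():
    print(f"step {step}: overlap = {value:.3f}")
```

In this toy data the overlap drops from step 0 to step 1, which mirrors the qualitative pattern the abstract reports for language-representation neurons. The numbers themselves carry no empirical meaning.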
