Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?
Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, Richard Johansson

TL;DR
This paper investigates whether the factual recall mechanisms of multimodal speech language models are the same for text and speech inputs, and finds that they carry over only partially from text to speech.
Contribution
The paper applies Causal Mediation Analysis, a technique previously used on text-only models, to a multimodal speech language model (SpiritLM) in order to compare factual recall mechanisms across the text and speech modalities.
Findings
Initial results with SpiritLM show discrepancies between text-to-text and speech-to-text factual recall.
Emergent mechanisms for factual recall are only partially transferred from text to speech.
Provides insights for improving speech-enabled AI systems.
Abstract
In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate the mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of…
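To make the technique named in the abstract concrete, here is a minimal sketch of the usual causal mediation (causal tracing) recipe: run the model on a clean prompt, run it again with the subject positions corrupted by noise, then restore one clean hidden state at a time and record how much of the original output probability comes back. The model below is a tiny random toy network, not SpiritLM, and every name in it (`make_model`, `run`, `indirect_effects`) is a hypothetical helper invented for illustration. The paper's actual setup may differ in how it corrupts inputs and which states it restores.

```python
import math
import random


def make_model(n_layers, dim, seed=0):
    """Random weight matrices for a toy layered model (hypothetical stand-in)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_layers)]


def run(model, inputs, patch=None):
    """Forward pass; `patch` maps (layer, position) to a replacement hidden state.

    Returns (trace, prob): per-layer hidden states and a scalar output probability.
    """
    patch = patch or {}
    dim = len(inputs[0])
    hidden = [list(v) for v in inputs]
    trace = []
    for layer, weights in enumerate(model):
        new_hidden = []
        for t in range(len(hidden)):
            # Causal mixing: position t only sees positions 0..t.
            ctx = [sum(hidden[s][d] for s in range(t + 1)) / (t + 1)
                   for d in range(dim)]
            pre = [sum(w * x for w, x in zip(row, ctx)) for row in weights]
            new_hidden.append([math.tanh(x) for x in pre])
        # Overwrite selected hidden states with the supplied (clean) values.
        for (p_layer, p_pos), vec in patch.items():
            if p_layer == layer:
                new_hidden[p_pos] = list(vec)
        hidden = new_hidden
        trace.append([list(h) for h in hidden])
    logit = sum(hidden[-1])  # read the "answer" off the final position
    return trace, 1.0 / (1.0 + math.exp(-logit))


def indirect_effects(model, clean_inputs, subject_positions, noise=1.0, seed=1):
    """Indirect effect of each (layer, position) state on the output probability."""
    rng = random.Random(seed)
    clean_trace, p_clean = run(model, clean_inputs)

    # Corrupted run: add Gaussian noise to the subject positions only.
    corrupted = [list(v) for v in clean_inputs]
    for t in subject_positions:
        corrupted[t] = [x + rng.gauss(0.0, noise) for x in corrupted[t]]
    _, p_corrupted = run(model, corrupted)

    # Restored runs: corrupted input, one clean hidden state patched back in.
    effects = {}
    for layer in range(len(model)):
        for t in range(len(clean_inputs)):
            _, p_restored = run(model, corrupted,
                                patch={(layer, t): clean_trace[layer][t]})
            effects[(layer, t)] = p_restored - p_corrupted
    return p_clean, p_corrupted, effects


model = make_model(n_layers=3, dim=4)
rng = random.Random(42)
prompt = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
p_clean, p_corrupted, effects = indirect_effects(model, prompt, [1, 2])
best = max(effects, key=effects.get)
print(p_clean, p_corrupted, best)
```

In this toy setup, restoring the final layer at the last position recovers the clean probability exactly, while positions before the corrupted subject show zero effect because the noise never reaches them. Comparing such effect maps between a text prompt and a speech prompt is the kind of cross-modality comparison the abstract describes.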
