Causal Tracing of Audio-Text Fusion in Large Audio Language Models
Wei-Chih Chen, Chien-yu Huang, Hung-yi Lee

TL;DR
This paper uses causal tracing to analyze how large audio language models integrate acoustic and textual information internally, revealing different fusion strategies and key information bottlenecks during audio comprehension.
Contribution
It introduces a causal tracing approach to dissect internal information flow in LALMs, uncovering distinct fusion mechanisms and the role of specific tokens in audio-text integration.
Findings
Fusion strategies differ across models, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen.
The final sequence token acts as an informational bottleneck where the model retrieves relevant information from the audio.
An attention-like query mechanism at intermediate token positions triggers the model to pull task-relevant audio context.
Abstract
Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an informational bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a…
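To make the idea of measuring the causal effect of an individual hidden state more concrete, here is a minimal sketch of causal tracing on a toy model. It is not the authors' implementation. The scalar hidden states, the `layer` mixing rule, the Gaussian-noise corruption of the audio positions, and all names (`run`, `causal_trace`, `weights`) are assumptions made purely for illustration. The sketch runs a clean pass, runs a corrupted pass, and then restores one clean hidden state at a time in the corrupted pass to see how much of the output is recovered.

```python
import random

def layer(h, w):
    """Toy causal layer: each token adds w times the mean of tokens at or before it."""
    return [h[t] + w * sum(h[: t + 1]) / (t + 1) for t in range(len(h))]

def run(embeds, weights, patch=None):
    """Forward pass over scalar hidden states.

    states[0] holds the input embeddings and states[l + 1] the output of layer l.
    patch = (layer_idx, token_idx, value) overwrites one entry of states[layer_idx]
    before the remaining layers are applied. The output is read from the last
    token of the last layer.
    """
    h = list(embeds)
    states = [h]
    for l in range(len(weights) + 1):
        if patch is not None and patch[0] == l:
            h = list(h)
            h[patch[1]] = patch[2]
            states[l] = h
        if l < len(weights):
            h = layer(h, weights[l])
            states.append(h)
    return states, states[-1][-1]

def causal_trace(clean, corrupted, weights):
    """Return effects[l][t]: the output change from restoring clean state (l, t)."""
    clean_states, clean_out = run(clean, weights)
    _, corrupt_out = run(corrupted, weights)
    effects = []
    for l in range(len(weights) + 1):
        row = []
        for t in range(len(clean)):
            _, patched_out = run(corrupted, weights, patch=(l, t, clean_states[l][t]))
            row.append(patched_out - corrupt_out)
        effects.append(row)
    return effects, clean_out, corrupt_out

# Hypothetical input: positions 0-1 stand in for audio tokens, 2-3 for text tokens.
rng = random.Random(0)
clean = [1.0, 2.0, 0.5, 0.3]
audio_positions = [0, 1]
corrupted = [x + rng.gauss(0.0, 3.0) if i in audio_positions else x
             for i, x in enumerate(clean)]
weights = [0.5, 0.5, 0.5]  # three toy layers

effects, clean_out, corrupt_out = causal_trace(clean, corrupted, weights)
for l, row in enumerate(effects):
    print(l, [round(e, 3) for e in row])
```

Each row of the printed grid corresponds to a layer and each column to a token position. In this toy setup, restoring an audio position helps only in early rows, while restoring the final token helps more in later rows, because the output is read from that token. This is only a loose analogue of the layer-wise and token-wise analyses described in the abstract.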
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
