TL;DR
This paper investigates how audio and visual information are internally represented in AVLLMs, revealing the role of sink tokens and proposing a simple method to improve their cross-modal reasoning.
Contribution
It uncovers the internal encoding patterns of cross-modal information in AVLLMs and introduces a training-free mitigation technique based on these insights.
Findings
AVLLMs encode integrated audio-visual information mainly in sink tokens.
A subset of sink tokens, termed cross-modal sink tokens, stores the cross-modal information.
The proposed training-free method strengthens AVLLMs' reliance on integrated cross-modal information.
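To make the notion of a sink token concrete, here is a minimal Python sketch. It assumes the common working definition of an attention sink, namely a token that receives a disproportionate share of attention from the other tokens. This summary does not state the paper's exact criterion, so the function name `find_sink_tokens`, the `ratio` threshold, and the toy attention map are all hypothetical and for illustration only.

```python
def find_sink_tokens(attn, ratio=3.0):
    """Flag key tokens that receive far more attention than a uniform share.

    attn  : list of rows, one per query token; each row sums to 1 over key tokens.
    ratio : hypothetical threshold, as a multiple of the uniform baseline 1/n_keys.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    # Mean attention each key token receives, averaged over all queries.
    received = [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]
    baseline = 1.0 / n_keys
    return [k for k, r in enumerate(received) if r > ratio * baseline]


# Toy 4x4 attention map (not real model output): every query puts
# most of its weight on token 0.
attn = [
    [0.85, 0.05, 0.05, 0.05],
    [0.80, 0.10, 0.05, 0.05],
    [0.90, 0.04, 0.03, 0.03],
    [0.85, 0.05, 0.05, 0.05],
]
sinks = find_sink_tokens(attn)
print(sinks)
```

In a real AVLLM the attention map would come from a specific layer and head, and the threshold would need tuning. The sketch only illustrates the idea of locating tokens that absorb most of the attention mass.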
Abstract
Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens…
