Probing Cross-modal Information Hubs in Audio-Visual LLMs

Jihoo Jung; Chaeyoung Jung; Ji-Hoon Kim; Joon Son Chung

arXiv:2605.10815 · cs.AI · May 13, 2026

TL;DR

This paper investigates how audio and visual information are internally represented in AVLLMs, revealing the role of sink tokens and proposing a simple method to improve their cross-modal reasoning.

Contribution

It uncovers the internal encoding patterns of cross-modal information in AVLLMs and introduces a training-free mitigation technique based on these insights.

Findings

01  AVLLMs encode integrated audio-visual information mainly in sink tokens.

02  A subset of sink tokens, termed cross-modal sink tokens, stores cross-modal information.

03  The proposed method strengthens AVLLMs' reliance on integrated cross-modal information.
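
For readers unfamiliar with the term, the sketch below illustrates one common heuristic for spotting sink tokens: flag the key positions that receive far more attention, averaged over queries, than a uniform distribution would give them. This is an illustration under stated assumptions and not the paper's procedure; the function name `find_sink_tokens`, the `ratio` threshold, and the toy attention matrix are all hypothetical, and the toy matrix ignores causal masking for simplicity.

```python
def find_sink_tokens(attention, ratio=3.0):
    """Return indices of key tokens that receive disproportionate attention.

    attention: square list of lists; attention[q][k] is the weight that
        query token q places on key token k (each row sums to 1).
    ratio: hypothetical threshold; a token is flagged when its mean
        received attention exceeds ratio * (1 / n).
    """
    n = len(attention)
    # Mean attention each key position receives, averaged over all queries.
    received = [sum(row[k] for row in attention) / n for k in range(n)]
    uniform = 1.0 / n
    return [k for k, r in enumerate(received) if r > ratio * uniform]


# Toy example (made-up numbers): every query puts most of its weight on token 0.
toy_attention = [
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
]
sinks = find_sink_tokens(toy_attention, ratio=2.0)
print(sinks)
```

In a real model, the attention matrix would come from a specific layer and head (or an average over them), and the threshold would need tuning; the choice of averaging scheme here is an assumption made only to keep the example small.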

Abstract

Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens…

Code & Models

Repositories

kaistmm/crossmodal-hub (GitHub)
