EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs

Chao Gong; Depeng Wang; Zhipeng Wei; Ya Guo; Huijia Zhu; Jingjing Chen

arXiv:2512.10324 · cs.CV · December 12, 2025

TL;DR

EchoingPixels introduces a cross-modal token reduction framework for audio-visual large language models, enabling adaptive, joint audio-visual token selection that maintains performance while significantly reducing computational costs.

Contribution

The paper proposes the Cross-Modal Semantic Sieve (CS2), a module for adaptive token reduction across the audio and visual modalities. It addresses two limitations of prior work: unimodal methods that cannot use cross-modal cues, and static per-modality token budgets.
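To make the contrast between a static per-modality budget and a joint budget concrete, here is a minimal sketch. It is an illustration only, not the CS2 module: the relevance scores are made-up numbers, and the helper names `static_budget` and `joint_budget` are hypothetical. How the paper actually scores tokens is not shown in the text above.

```python
from typing import List, Tuple

# Hypothetical token record: (modality, index, relevance score).
Token = Tuple[str, int, float]


def static_budget(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Keep the same fraction of tokens within each modality independently."""
    kept: List[Token] = []
    for modality in ("audio", "video"):
        group = sorted(
            (t for t in tokens if t[0] == modality),
            key=lambda t: t[2],
            reverse=True,
        )
        k = max(1, round(len(group) * keep_ratio))
        kept.extend(group[:k])
    return kept


def joint_budget(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Rank all tokens in one pool, so the split between modalities adapts."""
    k = max(1, round(len(tokens) * keep_ratio))
    return sorted(tokens, key=lambda t: t[2], reverse=True)[:k]


# Made-up scores for a clip whose audio happens to be more informative.
audio = [("audio", i, s) for i, s in enumerate([0.9, 0.8, 0.7, 0.1])]
video = [("video", i, s) for i, s in enumerate([0.6, 0.2, 0.1, 0.05])]
tokens = audio + video

static_kept = static_budget(tokens, keep_ratio=0.5)
joint_kept = joint_budget(tokens, keep_ratio=0.5)

print("static:", sorted(t[:2] for t in static_kept))
print("joint: ", sorted(t[:2] for t in joint_kept))
```

With these toy scores, both strategies keep four tokens in total. The static budget is forced to keep two per modality, while the joint budget keeps three audio tokens and one video token, because the pooled ranking lets the more informative modality claim a larger share.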

Findings

1. Achieves comparable performance with only 5-20% of original tokens.

2. Provides 2-3x speedup and memory savings.

3. Effectively preserves temporal relationships with Sync-RoPE.

Abstract

Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Speech and Audio Processing