EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen

TL;DR
EchoingPixels introduces a cross-modal token reduction framework for audio-visual large language models, enabling adaptive, joint audio-visual token selection that maintains performance while significantly reducing computational costs.
Contribution
The paper proposes the Cross-Modal Semantic Sieve (CS2), a module for adaptive token reduction across the audio and visual modalities. It targets two limitations of prior work: unimodal reduction methods and static per-modality token budgets.
Findings
Achieves comparable performance while keeping only 5-20% of the original tokens.
Provides 2-3x speedup and memory savings.
Effectively preserves temporal relationships with Sync-RoPE.
Abstract
Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and…
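The contrast the abstract draws between static per-modality budgets and selection over a joint audio-visual stream can be illustrated with a small sketch. This is not the paper's CS2 module: the relevance scores below are made-up numbers standing in for whatever a co-attention module would produce, and the names `static_budget_select`, `joint_budget_select`, and `demo_tokens` are hypothetical. The sketch only shows why a single shared budget lets the audio/video split adapt to the content.

```python
from typing import List, Tuple

# (modality, time step, relevance score). Scores are hypothetical placeholders,
# not the output of the paper's CS2 module.
Token = Tuple[str, int, float]


def static_budget_select(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Keep the same fraction of tokens from each modality independently."""
    kept: List[Token] = []
    for modality in ("audio", "video"):
        group = [t for t in tokens if t[0] == modality]
        k = max(1, round(len(group) * keep_ratio))
        kept.extend(sorted(group, key=lambda t: t[2], reverse=True)[:k])
    # Restore temporal order so downstream position handling sees a coherent stream.
    return sorted(kept, key=lambda t: (t[1], t[0]))


def joint_budget_select(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Rank all tokens in one pool, so the per-modality split follows the scores."""
    k = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(tokens, key=lambda t: t[2], reverse=True)[:k]
    return sorted(kept, key=lambda t: (t[1], t[0]))


# A toy clip in which the audio happens to carry most of the information.
demo_tokens: List[Token] = [
    ("audio", 0, 0.90), ("audio", 1, 0.80), ("audio", 2, 0.70), ("audio", 3, 0.10),
    ("video", 0, 0.60), ("video", 1, 0.20), ("video", 2, 0.30), ("video", 3, 0.05),
]

if __name__ == "__main__":
    for name, select in [("static", static_budget_select), ("joint", joint_budget_select)]:
        kept = select(demo_tokens, keep_ratio=0.5)
        n_audio = sum(1 for t in kept if t[0] == "audio")
        print(f"{name}: {n_audio} audio, {len(kept) - n_audio} video tokens kept")
```

With the same overall keep ratio, the static variant always splits the budget evenly between modalities, while the joint variant shifts the budget toward whichever modality scores higher in the given clip. Both variants keep each token's original time step, which any temporal position scheme would need in order to place the surviving tokens correctly.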
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Speech and Audio Processing
