TL;DR
OmniZip is a training-free, audio-guided token compression framework that accelerates omnimodal large language models by dynamically pruning video tokens based on salient audio cues, achieving a 3.42X inference speedup and a 1.4X reduction in memory usage.
Contribution
It introduces a training-free method for joint audio-visual token compression that increases inference speed and reduces memory usage without performance loss.
Findings
Achieves 3.42X inference speedup
Reduces memory usage by 1.4X
Maintains model performance without additional training
Abstract
Omnimodal large language models (OmniLLMs) have recently attracted growing research attention for unified audio-video understanding. However, the high computational cost of processing long joint audio-video token sequences remains a key bottleneck. Existing token compression methods do not address the emerging need to compress multimodal tokens jointly. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token compression framework that optimizes multimodal token representation and accelerates model inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density. This score dynamically guides video token pruning and preserves cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens…
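The audio-guided steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-token saliency scores, the definition of the retention score as the fraction of a group's audio tokens kept as salient, and the rule that scales each group's video budget by its retention relative to the mean are all assumptions made here for illustration.

```python
# Hypothetical sketch of audio-guided video token budgeting.
# All names and formulas below are illustrative assumptions, not OmniZip's code.

def select_salient_audio(audio_scores, keep_ratio):
    """Return sorted indices of the top `keep_ratio` fraction of audio tokens."""
    n_keep = max(1, round(len(audio_scores) * keep_ratio))
    ranked = sorted(range(len(audio_scores)),
                    key=lambda i: audio_scores[i], reverse=True)
    return sorted(ranked[:n_keep])

def audio_retention_scores(salient_idx, tokens_per_group, num_groups):
    """Assumed definition: share of each time group's audio tokens kept as salient."""
    counts = [0] * num_groups
    for i in salient_idx:
        counts[i // tokens_per_group] += 1
    return [c / tokens_per_group for c in counts]

def video_keep_budget(retention, video_tokens_per_group, base_keep_ratio):
    """Assumed rule: scale each group's keep ratio by retention relative to the mean."""
    mean_r = sum(retention) / len(retention)
    budgets = []
    for r in retention:
        scale = r / mean_r if mean_r > 0 else 1.0
        ratio = min(1.0, base_keep_ratio * scale)
        budgets.append(max(1, round(video_tokens_per_group * ratio)))
    return budgets

# Toy input: 8 audio tokens split into 2 time groups of 4 tokens each.
audio_scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.6, 0.1, 0.05]
salient = select_salient_audio(audio_scores, keep_ratio=0.5)
retention = audio_retention_scores(salient, tokens_per_group=4, num_groups=2)
budgets = video_keep_budget(retention, video_tokens_per_group=8, base_keep_ratio=0.5)
print(salient, retention, budgets)
```

In this toy input the first time group holds most of the salient audio tokens, so it receives a larger video token budget than the second. How the kept video tokens are then chosen and merged within each time window is not recoverable from the truncated abstract and is left out of the sketch.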
