OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance
Yeo Jeong Park, Hyemi Jang, Minseo Choi, Jongsun Lee, Jooyoung Choi, Yongkweon Jeon

TL;DR
OmniDrop is a layer-wise, query-guided token pruning method for omni-modal LLMs that reduces latency and memory usage while maintaining high performance in audiovisual understanding.
Contribution
It introduces a training-free, layer-wise token pruning framework that operates within the decoder layers, guided by text queries and a temporal diversity score.
Findings
Outperforms baselines by up to 3.58 points on benchmarks.
Reduces prefill latency by up to 40%.
Decreases memory usage by up to 14.7%.
Abstract
Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance…
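To make the idea of progressive, query-guided pruning concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The linear keep-ratio schedule, the dot-product relevance score, and all function names (`keep_ratio_schedule`, `query_scores`, `prune`, `layerwise_prune`) are assumptions made for illustration. The temporal diversity score is left out because the text above does not define it. In a real decoder the scores would be recomputed from each layer's hidden states; here the token vectors are static.

```python
import math

def keep_ratio_schedule(num_layers, start_layer, final_ratio):
    """Fraction of the original tokens kept at each layer.

    Hypothetical schedule: keep everything before start_layer, then
    decrease linearly to final_ratio at the last layer.
    """
    ratios = []
    span = max(num_layers - 1 - start_layer, 1)
    for layer in range(num_layers):
        if layer < start_layer:
            ratios.append(1.0)
        else:
            t = (layer - start_layer) / span
            ratios.append(1.0 + t * (final_ratio - 1.0))
    return ratios

def query_scores(tokens, query):
    """Dot product of each token vector with the query vector.

    Assumed stand-in for query guidance; the source does not specify the score.
    """
    return [sum(a * b for a, b in zip(tok, query)) for tok in tokens]

def prune(indices, scores, num_keep):
    """Keep the num_keep highest-scoring tokens, preserving original order."""
    ranked = sorted(range(len(indices)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:num_keep])
    return [indices[i] for i in kept_positions]

def layerwise_prune(tokens, query, ratios):
    """Simulate progressive pruning; return surviving token indices per layer."""
    total = len(tokens)
    alive = list(range(total))
    history = []
    for ratio in ratios:
        num_keep = max(1, math.ceil(ratio * total))
        if num_keep < len(alive):
            scores = query_scores([tokens[i] for i in alive], query)
            alive = prune(alive, scores, num_keep)
        history.append(list(alive))
    return history

# Toy data: 8 two-dimensional "audiovisual token" vectors and one query vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9],
          [0.5, 0.5], [0.0, 0.2], [0.8, 0.0], [0.2, 0.1]]
query = [1.0, 0.0]

ratios = keep_ratio_schedule(num_layers=4, start_layer=1, final_ratio=0.25)
history = layerwise_prune(tokens, query, ratios)
print([len(h) for h in history])  # [8, 8, 5, 2]
print(history[-1])                # [0, 2]
```

In this toy, the early layers see every token and only the later layers drop the ones least aligned with the query, which mirrors the early-fusion-then-prune behavior described in the abstract.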
