OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

Yeo Jeong Park; Hyemi Jang; Minseo Choi; Jongsun Lee; Jooyoung Choi; Yongkweon Jeon

arXiv:2605.14458 · cs.AI · May 15, 2026

TL;DR

OmniDrop is a layer-wise, query-guided token pruning method for omni-modal LLMs that reduces latency and memory usage while maintaining high performance in audiovisual understanding.

Contribution

It introduces a training-free, layer-wise token pruning framework that operates within the decoder layers, guided by text queries and a temporal diversity score.

Findings

01. Outperforms baselines by up to 3.58 points on benchmarks.

02. Reduces prefill latency by up to 40%.

03. Decreases memory usage by up to 14.7%.

Abstract

Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance…
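The general idea of progressive, query-guided pruning inside the decoder can be illustrated with a small NumPy sketch. This is not the authors' implementation: the cosine-similarity relevance score, the `schedule` of per-layer keep ratios, and the names `query_relevance`, `prune_step`, `layerwise_prune`, and `layer_fn` are all assumptions made for illustration. The temporal diversity score mentioned in the Contribution is omitted because its definition is not given in the text above.

```python
import numpy as np


def query_relevance(av_tokens, query_tokens):
    """Score each audiovisual token by its max cosine similarity to any query token.

    Assumption: cosine similarity stands in for whatever query-guidance
    signal the paper actually uses.
    """
    av = av_tokens / (np.linalg.norm(av_tokens, axis=1, keepdims=True) + 1e-8)
    q = query_tokens / (np.linalg.norm(query_tokens, axis=1, keepdims=True) + 1e-8)
    return (av @ q.T).max(axis=1)


def prune_step(av_tokens, positions, query_tokens, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens, preserving temporal order."""
    n_keep = max(1, int(round(keep_ratio * len(av_tokens))))
    scores = query_relevance(av_tokens, query_tokens)
    top = np.argsort(-scores, kind="stable")[:n_keep]
    keep = np.sort(top)  # restore the original (temporal) ordering
    return av_tokens[keep], positions[keep]


def layerwise_prune(av_tokens, query_tokens, num_layers, schedule, layer_fn=None):
    """Walk through decoder layers, pruning only at the layers listed in `schedule`.

    `schedule` maps a layer index to the keep ratio applied after that layer
    (hypothetical format). `layer_fn(layer, tokens)` is a placeholder for a
    real decoder layer; when it is None the hidden states pass through unchanged.
    """
    positions = np.arange(len(av_tokens))
    counts = []
    for layer in range(num_layers):
        if layer_fn is not None:
            av_tokens = layer_fn(layer, av_tokens)
        if layer in schedule:
            av_tokens, positions = prune_step(
                av_tokens, positions, query_tokens, schedule[layer]
            )
        counts.append(len(av_tokens))
    return positions, counts


# Toy usage: 16 audiovisual tokens, 3 query tokens, 4 layers.
# Layer 0 sees every token; pruning starts at layer 1 and tightens at layer 3.
rng = np.random.default_rng(0)
av = rng.normal(size=(16, 8))
query = rng.normal(size=(3, 8))
positions, counts = layerwise_prune(av, query, num_layers=4, schedule={1: 0.75, 3: 0.5})
# counts == [16, 12, 12, 6]
```

Leaving the earliest layers unpruned mirrors the abstract's point that early layers should retain enough tokens for omni-modal fusion, while deeper layers can drop tokens more aggressively.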
