OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
Morunliu Yang, Ruotao Xu, Le Li, Yue Wang, Jianxin Zhang, Juntao Li, Yihang Lou, Siwei Feng, Peifeng Li

TL;DR
OmniSelect introduces a training-free, modality-adaptive token pruning framework for OmniLLMs, dynamically selecting compression strategies based on cross-modal relevance to improve efficiency without sacrificing performance.
Contribution
It presents a novel, training-free, dynamic token pruning method that adapts to modality importance in multimodal inputs, enhancing efficiency in OmniLLMs.
Findings
Achieves significant token reduction while maintaining performance.
Effectively models modality preferences for dynamic token pruning.
No additional training required for the pruning framework.
Abstract
Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into one of three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained…
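The regime selection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the decision rule, so the margin threshold, the keep ratios, and the function names `choose_regime`, `keep_ratios`, and `top_k_indices` are all hypothetical. The relevance scores are assumed to be precomputed by some audio-text and video-text similarity model (AudioCLIP in the paper) and are passed in as plain floats.

```python
from typing import List, Tuple


def choose_regime(audio_rel: float, video_rel: float, margin: float = 0.1) -> str:
    """Pick a pruning regime from two query-relevance scores.

    Hypothetical rule: one modality dominates only if its relevance
    exceeds the other's by more than `margin`; otherwise prune uniformly.
    """
    if audio_rel - video_rel > margin:
        return "audio_centric"
    if video_rel - audio_rel > margin:
        return "video_centric"
    return "uniform"


def keep_ratios(regime: str, base_keep: float = 0.5, skew: float = 0.3) -> Tuple[float, float]:
    """Return (audio_keep, video_keep) fractions of tokens to retain.

    Assumed policy: the dominant modality keeps more tokens, the other fewer.
    """
    high = min(1.0, base_keep + skew)
    low = max(0.0, base_keep - skew)
    if regime == "audio_centric":
        return high, low
    if regime == "video_centric":
        return low, high
    return base_keep, base_keep


def top_k_indices(scores: List[float], keep: float) -> List[int]:
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    if not scores:
        return []
    k = max(1, round(len(scores) * keep))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


if __name__ == "__main__":
    regime = choose_regime(audio_rel=0.8, video_rel=0.3)
    audio_keep, video_keep = keep_ratios(regime)
    # Per-token scores below are made-up illustration values.
    print(regime, top_k_indices([0.1, 0.9, 0.4, 0.7], keep=0.5))
```

A margin-based rule is only one plausible way to map two relevance scores onto three regimes; the actual criterion and the fine-grained per-token scoring are cut off in the truncated abstract and may differ.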
