OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Yue Ding; Yiyan Ji; Jungang Li; Xuyang Liu; Xinlong Chen; Junfei Wu; Bozhou Li; Bohan Zeng; Yang Shi; Yushuo Guan; Yuanxing Zhang; Jiaheng Liu; Qiang Liu; Pengfei Wan; Liang Wang

arXiv:2602.04804·cs.CL·May 14, 2026

TL;DR

OmniSIFT is a token compression framework for Omni-modal LLMs that reduces computational load by selectively pruning video and audio tokens. It maintains strong performance while adding only 4.85M parameters to Qwen2.5-Omni-7B.

Contribution

The paper introduces a two-stage, modality-asymmetric token compression method, optimized end-to-end via a differentiable straight-through estimator, that improves efficiency while preserving or enhancing model performance.

Findings

01 OmniSIFT adds only 4.85M parameters to Qwen2.5-Omni-7B.

02 Retaining only 25% of the original tokens, it outperforms all compression baselines.

03 It maintains lower latency than training-free baselines.

Abstract

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five…
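
The two-stage data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`prune_video`, `select_audio`), the cosine-similarity scoring, and the per-frame `keep_ratio` budget are all assumptions made for illustration. The sketch covers only inter-frame redundancy, not intra-frame structure, and it uses fixed heuristics, whereas the paper's modules are learned and trained end-to-end with a straight-through estimator.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_video(frames, keep_ratio):
    """Stage 1 (hypothetical heuristic): per frame, keep the tokens that changed most.

    frames: list of frames, each a list of token vectors at fixed spatial positions.
    A token's score is 1 - cosine similarity to the same position in the previous
    frame. The first frame has no predecessor, so all its tokens score equally.
    Returns a list of (frame_index, token_index) pairs.
    """
    kept = []
    for t, frame in enumerate(frames):
        if t == 0:
            scores = [1.0] * len(frame)
        else:
            scores = [1.0 - cosine(tok, prev) for tok, prev in zip(frame, frames[t - 1])]
        k = max(1, round(keep_ratio * len(frame)))
        kept.extend((t, i) for i in top_k_indices(scores, k))
    return kept


def select_audio(audio_tokens, video_tokens, keep_ratio):
    """Stage 2 (hypothetical heuristic): keep audio tokens most similar to kept video tokens."""
    scores = [max(cosine(a, v) for v in video_tokens) for a in audio_tokens]
    k = max(1, round(keep_ratio * len(audio_tokens)))
    return top_k_indices(scores, k)


# Toy input: 2 frames of 2 tokens each, and 4 audio tokens, all 2-dimensional.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0
    [[1.0, 0.0], [1.0, 0.0]],  # frame 1: position 0 unchanged, position 1 changed
]
audio = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]

kept_video = prune_video(frames, keep_ratio=0.5)
kept_vectors = [frames[t][i] for t, i in kept_video]
kept_audio = select_audio(audio, kept_vectors, keep_ratio=0.5)
print(kept_video, kept_audio)
```

The ordering matters in this sketch: audio tokens are scored against the video tokens that survive stage 1, which is one way to read "vision-guided" and "modality-asymmetric". The actual scoring functions and budgets used by OmniSIFT are not given in the text above.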

Code & Models

Models

🤗
dingyue1011/OmniSIFT-7B
