Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Zijie Xin; Jie Yang; Ruixiang Zhao; Tianyi Wang; Fengyun Rao; Jing Lyu; Xirong Li

arXiv:2605.20035 · cs.CV · May 20, 2026

TL;DR

This paper introduces SEATS, a stage-adaptive, training-free token selection method that significantly reduces computational costs in omni-modal LLMs by dynamically pruning tokens across layers while maintaining high performance.

Contribution

The paper proposes a novel, layer-wise, stage-adaptive token selection approach that effectively reduces inference costs in omni-modal LLMs without additional training.

Findings

01. Achieves a 9.3x FLOPs reduction and a 4.8x speedup.

02. Retains 96.3% of the original performance while keeping only 10% of tokens.

03. Prunes non-textual tokens across layers based on cross-modal dependency analysis.

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS,…
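The abstract is truncated before it describes how SEATS actually selects tokens, so the following is only a minimal sketch of the general idea of pruning non-textual tokens progressively at several layers, not the authors' algorithm. The function names `select_tokens` and `prune_in_stages`, the per-stage keep ratios, and the assumption that an importance score per token is already available at each stage are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def select_tokens(scores: Sequence[float],
                  is_text: Sequence[bool],
                  keep_ratio: float) -> List[int]:
    """Keep every text token plus the top-scoring share of non-text tokens.

    Hypothetical helper: `scores` is assumed to be given (one importance
    value per token); how SEATS computes importance is not stated in the
    truncated abstract.
    """
    non_text = [i for i, t in enumerate(is_text) if not t]
    # Number of non-text tokens to keep (at least one if any exist).
    k = max(1, math.ceil(keep_ratio * len(non_text))) if non_text else 0
    # Rank non-text tokens by descending score, ties broken by position.
    top = sorted(non_text, key=lambda i: (-scores[i], i))[:k]
    keep = set(top) | {i for i, t in enumerate(is_text) if t}
    return sorted(keep)  # preserve the original token order


def prune_in_stages(stage_scores: Sequence[Sequence[float]],
                    is_text: Sequence[bool],
                    keep_ratios: Sequence[float]) -> List[List[int]]:
    """Apply selection at successive stages (for example, chosen layers).

    Each stage prunes only among tokens that survived earlier stages, so the
    kept set shrinks with depth. Scores are indexed by original position.
    Returns the surviving original indices after each stage.
    """
    alive = list(range(len(is_text)))
    history = []
    for scores, ratio in zip(stage_scores, keep_ratios):
        local_scores = [scores[i] for i in alive]
        local_text = [is_text[i] for i in alive]
        kept_local = select_tokens(local_scores, local_text, ratio)
        alive = [alive[j] for j in kept_local]
        history.append(alive)
    return history


# Toy example: four non-text tokens followed by two text tokens.
is_text = [False, False, False, False, True, True]
stage_scores = [
    [0.1, 0.5, 0.05, 0.3, 0.0, 0.0],  # assumed scores at an early stage
    [0.2, 0.1, 0.9, 0.4, 0.0, 0.0],   # assumed scores at a later stage
]
history = prune_in_stages(stage_scores, is_text, keep_ratios=[0.5, 0.5])
print(history)
```

In this toy run the text tokens always survive, while the non-text tokens are halved at each stage. A fixed pre-LLM pruning scheme of the kind the abstract contrasts against would correspond to a single stage applied before the first layer.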

Code & Models

Repositories

xxayt/SEATS (GitHub)
