TL;DR
This paper introduces SEATS, a stage-adaptive, training-free token selection method that significantly reduces computational costs in omni-modal LLMs by dynamically pruning tokens across layers while maintaining high performance.
Contribution
The paper proposes a novel, layer-wise, stage-adaptive token selection approach that effectively reduces inference costs in omni-modal LLMs without additional training.
Findings
Achieves a 9.3x FLOPs reduction and a 4.8x speedup.
Retains 96.3% of original performance while keeping only 10% of tokens.
Effectively prunes non-textual tokens across layers based on cross-modal dependency analysis.
Abstract
Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS,…
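The abstract is cut off before it describes how SEATS selects tokens, so the following is only a minimal sketch of the general idea of stage-wise, training-free pruning of non-textual tokens, not the paper's algorithm. It assumes a hypothetical per-token importance score at each stage (for example, attention received from text tokens) and a hypothetical keep-ratio schedule. The names `select_tokens`, `prune_across_stages`, and `keep_ratios` are illustrative and do not come from the paper.

```python
import math


def select_tokens(scores, is_text, keep_ratio):
    """Return sorted indices of tokens to keep at one stage.

    scores     : per-token importance (hypothetical scoring, not from the paper)
    is_text    : per-token flag; text tokens are always kept
    keep_ratio : fraction of the non-textual tokens to keep, in (0, 1]
    """
    non_text = [i for i, t in enumerate(is_text) if not t]
    k = max(1, math.ceil(keep_ratio * len(non_text))) if non_text else 0
    # Rank non-textual tokens by score, highest first; ties broken by position.
    ranked = sorted(non_text, key=lambda i: (-scores[i], i))
    kept = set(ranked[:k]) | {i for i, t in enumerate(is_text) if t}
    return sorted(kept)


def prune_across_stages(scores_per_stage, is_text, keep_ratios):
    """Apply selection stage by stage.

    Each ratio applies to the non-textual tokens that survived the previous
    stage. Returned indices always refer to the original sequence.
    """
    alive = list(range(len(is_text)))
    history = []
    for scores, ratio in zip(scores_per_stage, keep_ratios):
        local_scores = [scores[i] for i in alive]
        local_is_text = [is_text[i] for i in alive]
        kept_local = select_tokens(local_scores, local_is_text, ratio)
        alive = [alive[j] for j in kept_local]
        history.append(alive)
    return history


# Toy example: four non-textual tokens followed by two text tokens.
is_text = [False, False, False, False, True, True]
scores_per_stage = [
    [0.9, 0.1, 0.5, 0.3, 0.0, 0.0],  # early stage
    [0.2, 0.0, 0.8, 0.0, 0.0, 0.0],  # later stage
]
history = prune_across_stages(scores_per_stage, is_text, keep_ratios=[0.5, 0.5])
```

In this toy run, each stage keeps half of the surviving non-textual tokens, so the set of kept tokens shrinks with depth while the text tokens are never dropped. A real implementation would operate on hidden states and attention tensors inside the model rather than on Python lists.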
