Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
Chaeyoung Jung, Kyeongha Rho, Joon Son Chung

TL;DR
This paper introduces ContextGuard, an inference-time token pruning framework for Omni-LLMs that preserves broad audio-visual context while reducing computational cost, without fine-tuning.
Contribution
The paper reframes token reduction as context preservation: it predicts coarse visual semantics from audio and uses that prediction to selectively prune video tokens. The resulting method outperforms prior pruning methods across multiple benchmarks.
Findings
Prunes 55% of tokens while maintaining full performance on most benchmarks.
Outperforms prior pruning methods in token reduction and accuracy.
Requires no downstream fine-tuning, using only a lightweight predictor.
Abstract
Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics…
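The abstract is cut off before it states the exact pruning criterion, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a video token counts as redundant when the coarse semantics predicted from audio closely match the token's own coarse semantics, and it measures that match with cosine similarity. The function name `prune_video_tokens`, both input arrays, and the similarity measure are hypothetical choices made for illustration. The default ratio of 0.55 mirrors the 55% figure reported in the findings.

```python
import numpy as np

def prune_video_tokens(video_coarse, audio_predicted, prune_ratio=0.55):
    """Return the sorted indices of video tokens to keep.

    video_coarse:    (N, D) coarse semantic vector per video token (hypothetical input).
    audio_predicted: (N, D) coarse visual semantics predicted from audio (hypothetical input).
    prune_ratio:     fraction of video tokens to drop.
    """
    eps = 1e-8  # guards against division by zero for all-zero vectors
    v = video_coarse / (np.linalg.norm(video_coarse, axis=1, keepdims=True) + eps)
    a = audio_predicted / (np.linalg.norm(audio_predicted, axis=1, keepdims=True) + eps)

    # Assumed redundancy score: cosine similarity between what the audio
    # predicts and what the video token actually contains.
    redundancy = np.sum(v * a, axis=1)

    n_tokens = len(redundancy)
    n_keep = n_tokens - int(round(prune_ratio * n_tokens))

    # Keep the tokens that audio predicts worst (lowest redundancy),
    # then restore their original temporal order.
    keep = np.argsort(redundancy, kind="stable")[:n_keep]
    return np.sort(keep)
```

Under this assumption, tokens that audio already explains are dropped first, and tokens carrying information that audio cannot predict are retained, which matches the context-preserving framing in the abstract. A real system would also need the lightweight predictor that produces `audio_predicted`, which is not sketched here.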
