OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
Yuchen Deng, Zidang Cai, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han

TL;DR
OmniRefine is a training-free, two-stage framework that improves audio-visual token compression in Omni-LLMs, balancing efficiency and performance by preserving cross-modal correspondence.
Contribution
It introduces a novel, training-free method for cross-modally aligned token compression, enhancing inference efficiency without sacrificing accuracy.
Findings
Achieves 46.7% accuracy at 44% token retention on WorldSense
Outperforms strong baselines in efficiency-performance trade-off
Maintains stable performance under lower compression ratios
Abstract
Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by the high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative…
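The first stage is described only as combining frame-audio similarity with dynamic programming, so the following is a generic illustration of that idea rather than the paper's algorithm. It assumes a square similarity matrix `sim` over T shared time steps and a fixed number of chunks, and it scores each candidate chunk by the average frame-audio similarity inside it. The function name `refine_boundaries`, the scoring rule, and the toy matrix are all hypothetical.

```python
from typing import List


def refine_boundaries(sim: List[List[float]], num_chunks: int) -> List[int]:
    """Split T time steps into num_chunks contiguous chunks.

    sim[i][j] is an assumed similarity between video frame i and audio step j.
    Returns the chunk boundaries, starting at 0 and ending at T.
    """
    T = len(sim)
    if not 1 <= num_chunks <= T:
        raise ValueError("num_chunks must be between 1 and the number of steps")

    # 2D prefix sums so any square block sum is available in O(1).
    P = [[0.0] * (T + 1) for _ in range(T + 1)]
    for i in range(T):
        for j in range(T):
            P[i + 1][j + 1] = sim[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]

    def score(a: int, b: int) -> float:
        # Sum of frame-audio similarity inside chunk [a, b), divided by its length.
        block = P[b][b] - P[a][b] - P[b][a] + P[a][a]
        return block / (b - a)

    NEG = float("-inf")
    # best[k][b]: best total score for covering steps [0, b) with k chunks.
    best = [[NEG] * (T + 1) for _ in range(num_chunks + 1)]
    back = [[0] * (T + 1) for _ in range(num_chunks + 1)]
    best[0][0] = 0.0
    for k in range(1, num_chunks + 1):
        for b in range(k, T + 1):
            for a in range(k - 1, b):
                if best[k - 1][a] == NEG:
                    continue
                cand = best[k - 1][a] + score(a, b)
                if cand > best[k][b]:
                    best[k][b] = cand
                    back[k][b] = a

    # Walk the back-pointers from the end to recover the boundaries.
    bounds = [T]
    b = T
    for k in range(num_chunks, 0, -1):
        b = back[k][b]
        bounds.append(b)
    return bounds[::-1]


# Toy input: steps 0-1 and steps 2-5 form two cross-modally coherent groups.
example_sim = [[1.0 if (i < 2) == (j < 2) else 0.0 for j in range(6)]
               for i in range(6)]
```

On the toy matrix, `refine_boundaries(example_sim, 2)` places the single interior boundary at step 2, where the two coherent groups meet. The triple loop costs O(K·T²), which is acceptable for a sketch but says nothing about how the paper handles long streams.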
