OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

Yuchen Deng; Zidang Cai; Hai-Tao Zheng; Jie Wang; Feidiao Yang; Yuxing Han

arXiv:2605.12056 · cs.AI · May 13, 2026

TL;DR

OmniRefine is a training-free, two-stage framework that improves audio-visual token compression in Omni-LLMs, balancing efficiency and performance by preserving cross-modal correspondence.

Contribution

OmniRefine introduces a training-free method for cross-modally aligned token compression that improves inference efficiency while largely preserving accuracy.

Findings

01  Achieves 46.7% accuracy at 44% token retention on WorldSense

02  Outperforms strong baselines in efficiency-performance trade-off

03  Maintains stable performance under lower compression ratios

Abstract

Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative…
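
The abstract describes the first stage only in outline: chunk boundaries are refined using frame-audio similarity and dynamic programming. The sketch below is one possible reading of that idea, not the paper's algorithm. It assumes frame and audio embeddings already share a time grid and an embedding dimension, and it uses a hypothetical objective: split the per-step cosine similarity sequence into contiguous chunks so that similarity is as uniform as possible within each chunk. The function names, the cost function, and the fixed `num_chunks` parameter are all illustrative assumptions.

```python
import numpy as np


def cosine_similarity_per_step(frames, audio):
    """Row-wise cosine similarity between two (T, D) embedding arrays."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    return np.sum(f * a, axis=1)


def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations of sim[i:j] from its mean (hypothetical cost)."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def refine_boundaries(sim, num_chunks):
    """Split sim into num_chunks contiguous chunks with minimal total cost.

    Returns a list of half-open (start, end) index pairs.
    This is classic 1-D segmentation by dynamic programming, used here
    as a stand-in for the boundary refinement named in the abstract.
    """
    T = len(sim)
    prefix = np.concatenate([[0.0], np.cumsum(sim)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(sim ** 2)])

    # best[k, j]: minimal cost of covering sim[:j] with exactly k chunks.
    best = np.full((num_chunks + 1, T + 1), np.inf)
    back = np.zeros((num_chunks + 1, T + 1), dtype=int)
    best[0, 0] = 0.0

    for k in range(1, num_chunks + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + segment_cost(prefix, prefix_sq, i, j)
                if c < best[k, j]:
                    best[k, j] = c
                    back[k, j] = i

    # Walk the back-pointers from the end to recover the boundaries.
    bounds = []
    j = T
    for k in range(num_chunks, 0, -1):
        i = int(back[k, j])
        bounds.append((i, j))
        j = i
    return bounds[::-1]


if __name__ == "__main__":
    # Toy similarity sequence with three visibly distinct regimes.
    sim = np.array([0.9, 0.9, 0.9, 0.1, 0.1, 0.5, 0.5, 0.5])
    print(refine_boundaries(sim, 3))  # [(0, 3), (3, 5), (5, 8)]
```

The triple loop runs in O(K·T²) time, which is acceptable for the short sequences of a sketch. How the actual method defines its units, cost, and number of chunks is not stated in the truncated abstract.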
