VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
Kuanwei Lin, Wenhao Zhang, Ge Li

TL;DR
VideoRouter is a query-adaptive dual-routing framework that selectively compresses long videos, cutting the visual-token count while preserving the evidence a query depends on.
Contribution
The paper proposes a query-aware dual-router system built on InternVL, together with new datasets for supervision, to improve both efficiency and accuracy in long-video processing.
Findings
Achieves up to 67.9% token reduction compared to the baseline.
Improves performance on the VideoMME, MLVU, and LongVideoBench benchmarks.
Effectively balances compression and evidence preservation.
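To make the headline figure concrete, the short sketch below converts a fractional token reduction into a remaining token count. The baseline of 10,000 visual tokens is a hypothetical number chosen for illustration, not a figure from the paper, and `tokens_after_reduction` is an illustrative helper name.

```python
def tokens_after_reduction(baseline_tokens: int, reduction: float) -> int:
    """Return the token count left after removing a fraction of the baseline."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return round(baseline_tokens * (1.0 - reduction))


# Hypothetical baseline of 10,000 visual tokens, reduced by the reported
# upper bound of 67.9%.
remaining = tokens_after_reduction(10_000, 0.679)
print(remaining)
```

Because the reported figure is an upper bound ("up to"), the reduction on a given video or query may be smaller.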
Abstract
Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train…
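The budgeted allocation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the policy names (`"coverage"`, `"adaptive"`), the function `allocate_tokens`, the per-frame floor and cap, and the proportional split are all assumptions made for illustration. The relevance scores stand in for what the Image Router would produce, and the policy argument stands in for the Semantic Router's prediction.

```python
from typing import List, Sequence


def allocate_tokens(
    relevance: Sequence[float],
    budget: int,
    policy: str,
    min_tokens: int = 16,   # hypothetical floor for heavily compressed frames
    max_tokens: int = 256,  # hypothetical cap for a full-detail frame
) -> List[int]:
    """Split a visual-token budget across frames under a chosen policy."""
    n = len(relevance)
    if n == 0:
        return []

    if policy == "coverage":
        # Broad temporal coverage: every frame gets the same share.
        return [min(max_tokens, budget // n)] * n

    if policy != "adaptive":
        raise ValueError(f"unknown policy: {policy!r}")

    # Adaptive preservation: every frame keeps a small floor, and the
    # remaining budget is split in proportion to frame relevance.
    alloc = [min_tokens] * n
    remaining = budget - min_tokens * n
    total = sum(relevance)
    if remaining <= 0 or total <= 0:
        return alloc
    for i, score in enumerate(relevance):
        extra = int(remaining * score / total)  # round down to stay in budget
        alloc[i] = min(max_tokens, min_tokens + extra)
    return alloc


# Hypothetical scores: the third frame carries most of the evidence.
scores = [1, 1, 7, 1]
adaptive = allocate_tokens(scores, budget=400, policy="adaptive")
uniform = allocate_tokens(scores, budget=400, policy="coverage")
print(adaptive, uniform)
```

In this sketch the adaptive policy concentrates tokens on the high-relevance frame while the other frames are compressed toward the floor, which mirrors the abstract's description of aggressive compression on less relevant frames. How the actual routers are trained and how they set per-frame resolution is not specified in the text above.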
