Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference

Kuo Wang; Quanlong Zheng; Junlin Xie; Yanhao Zhang; Jinguo Luo; Haonan Lu; Liang Lin; Fan Zhou; Guanbin Li

arXiv:2508.02134·cs.CV·August 5, 2025


Open Access

TL;DR

Free-MoRef is a training-free method that lets Video-MLLMs understand longer videos by multiplexing their context perception within a single inference pass, improving both performance and efficiency.

Contribution

It introduces a novel, training-free approach that multiplexes context perception in Video-MLLMs, enabling instant processing of longer videos without additional training.

Findings

01. Perceives 2× to 8× more input frames without compression.

02. Surpasses dedicated long-video MLLMs in performance.

03. Runs efficiently on a single GPU with instant responses.

Abstract

Video Multimodal Large Language Models (Video-MLLMs) have achieved remarkable advances in video understanding tasks. However, constrained by the context length limit of the underlying LLMs, existing Video-MLLMs typically perform suboptimally on long videos. To handle extended input frames, common solutions rely on token compression or streaming inference, which sacrifice feature granularity or inference efficiency, respectively. In contrast, to achieve comprehensive understanding of longer frame inputs efficiently, we draw ideas from MoE and propose a training-free approach, Free-MoRef, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers…
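The first step described in the abstract, reconstructing a long vision-token sequence into several short sequences, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `split_into_references`, the parameter `num_refs`, and the choice of contiguous, near-equal chunks are all assumptions made for illustration, since the abstract does not say how the tokens are partitioned. MoRef-attention itself is not sketched here because its description is truncated above.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_references(vision_tokens: Sequence[T], num_refs: int) -> List[List[T]]:
    """Split a token sequence into `num_refs` contiguous, near-equal chunks.

    Hypothetical helper: the contiguous partitioning is an assumption,
    not something the abstract specifies.
    """
    if num_refs < 1:
        raise ValueError("num_refs must be at least 1")
    base, extra = divmod(len(vision_tokens), num_refs)
    chunks: List[List[T]] = []
    start = 0
    for i in range(num_refs):
        # The first `extra` chunks take one additional token each,
        # so chunk lengths differ by at most one.
        size = base + (1 if i < extra else 0)
        chunks.append(list(vision_tokens[start:start + size]))
        start += size
    return chunks


# Usage: ten placeholder token ids split into four short references.
references = split_into_references(list(range(10)), num_refs=4)
```

Under these assumptions, each reference is short enough to fit a context budget that the full sequence would exceed, and concatenating the references in order recovers the original sequence, so no tokens are dropped or compressed.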

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis