MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

Wenhui Tan; Xiaoyi Yu; Jiaze Li; Yijing Chen; Jianzhong Ju; Zhenbo Luo; Ruihua Song; Jian Luan

arXiv:2602.22932·cs.CV·February 27, 2026

TL;DR

MSJoE jointly evolves a multimodal large language model (MLLM) and a lightweight key-frame sampler, improving both the efficiency and the accuracy of long-form video understanding.

Contribution

The paper proposes a novel joint evolution framework for MLLMs and key-frame samplers, optimizing both through reinforcement learning for better long-video comprehension.

Findings

01. Achieves an 8.0% accuracy improvement over the base MLLM

02. Outperforms the strongest baseline by 1.1% accuracy

03. Demonstrates effectiveness on multiple long-video QA datasets

Abstract

Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question to a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly…
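
The pipeline described in the abstract (queries, query-frame similarity matrix, sampling weights, key-frame selection) can be illustrated with a minimal sketch. This is not the authors' implementation. Random vectors stand in for the embeddings a frozen CLIP model would produce, and the learned lightweight sampler is replaced by a hand-written rule (maximum similarity over queries followed by a softmax). The function names, the `temperature` parameter, and the top-k selection are all assumptions made for illustration.

```python
import numpy as np


def similarity_matrix(query_emb, frame_emb):
    """Cosine similarity between every query and every frame, shape (Q, T)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    return q @ f.T


def sampling_weights(sim, temperature=0.1):
    """Hypothetical stand-in for the learned sampler (assumption):
    score each frame by its best-matching query, then softmax over frames."""
    scores = sim.max(axis=0) / temperature
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def select_key_frames(weights, k):
    """Indices of the k highest-weight frames, returned in temporal order."""
    top = np.argsort(weights)[-k:]
    return np.sort(top)


# Placeholder embeddings (assumption): 4 queries and 256 frames, 512-dim each.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=(4, 512))
frame_emb = rng.normal(size=(256, 512))

sim = similarity_matrix(query_emb, frame_emb)  # (4, 256) query-frame matrix
weights = sampling_weights(sim)                # one weight per frame
key_frames = select_key_frames(weights, k=8)   # compact subset passed to the MLLM
```

In the paper's setting the sampler is trained jointly with the MLLM, so the fixed rule in `sampling_weights` only marks where that learned component would sit.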

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis