Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

Tao Chen; Shaobo Ju; Qiong Wu; Chenxin Fang; Kun Zhang; Jun Peng; Hui Li; Yiyi Zhou; Rongrong Ji

arXiv:2512.08410·cs.CV·April 10, 2026

TL;DR

This paper introduces OneClip-RAG, a retrieval-augmented generation method that improves both the efficiency and the accuracy of long video understanding in multimodal large language models by retrieving video clips with a query-guided chunking algorithm.

Contribution

The paper presents OneClip-RAG, a new paradigm that combines clip retrieval with a query-guided chunking algorithm, and introduces the SynLongVideo dataset for training and evaluation.

Findings

01 Boosts Qwen3-VL 8B performance to GPT-5 level on MLVU.

02 Enables LLaVA-Video to understand videos of up to an hour in less than 1.2 minutes.

03 Demonstrates superior efficiency and accuracy in long-video understanding.

Abstract

Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos with a limited number of frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. It is also equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into three recent MLLMs and validated on a set of…
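To make the idea of unifying chunking and retrieval in one step more concrete, here is a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes per-frame embeddings and a query embedding are already available in a shared space, scores every frame against the query once, and grows a single clip outward from the best-matching frame while neighbouring frames stay close to the peak score. The function names `cosine` and `retrieve_one_clip` and the parameters `ratio` and `max_len` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_one_clip(frame_embs, query_emb, ratio=0.8, max_len=16):
    """Return (start, end), the inclusive frame indices of one query-relevant clip.

    Hypothetical illustration: the clip boundary is decided from the same
    frame-query scores that are used for retrieval, so no separate chunking
    pass over the video is needed.
    """
    if not frame_embs:
        raise ValueError("frame_embs must not be empty")

    # One scoring pass: similarity of every frame to the query.
    scores = [cosine(frame, query_emb) for frame in frame_embs]

    # Seed the clip at the best-matching frame (first one on ties).
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = ratio * scores[peak]  # assumed stopping rule
    start = end = peak

    # Grow toward the better neighbour until it falls below the threshold
    # or the clip reaches max_len frames.
    while end - start + 1 < max_len:
        left = scores[start - 1] if start > 0 else -math.inf
        right = scores[end + 1] if end + 1 < len(scores) else -math.inf
        if max(left, right) < threshold:
            break
        if left >= right:
            start -= 1
        else:
            end += 1
    return start, end


# Toy 2-D "embeddings": frames 2 to 4 point in the same direction as the query.
frames = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0, 1], [1, 0]]
query = [0, 1]
clip = retrieve_one_clip(frames, query)
```

In this toy case the selected clip covers frames 2 through 4. A real system would obtain the embeddings from a cross-modal encoder and pass only the frames inside the returned range to the MLLM, which is where the memory saving would come from.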
