MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction

Xiaokun Sun; Zezhong Wu; Zewen Ding; Linli Xu

arXiv:2601.03781·cs.CV·January 8, 2026

TL;DR

This paper introduces Masked Video Prediction (MVP), a novel self-supervised training objective for Video Large Language Models that improves their understanding of temporal dynamics and causal relationships in videos.

Contribution

The paper proposes MVP, a post-training method for VideoLLMs that pairs a scalable data synthesis pipeline with reward-based optimization to strengthen temporal reasoning.

Findings

01 MVP significantly improves video reasoning and temporal understanding.

02 The scalable data pipeline enables efficient training on large video datasets.

03 Models trained with MVP outperform baselines on video reasoning benchmarks.

Abstract

Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models' ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support scalable training, we introduce a scalable data…
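
The task format described in the abstract (mask a continuous segment, then pick the true segment out of a set of distractors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' pipeline: the names `MVPSample`, `build_mvp_sample`, and `choice_reward` are hypothetical, distractors are simply other same-length windows from the same video, and the reward is a plain 0/1 correctness signal. The abstract is truncated before it describes the actual data synthesis and reward design.

```python
import random
from dataclasses import dataclass
from typing import List

MASK = "<MASK>"  # hypothetical placeholder standing in for the hidden segment


@dataclass
class MVPSample:
    context: List[object]           # frames with the masked span collapsed to MASK
    candidates: List[List[object]]  # true segment mixed with distractors
    answer_index: int               # position of the true segment in candidates


def build_mvp_sample(frames, span_len, num_distractors, rng):
    """Mask one contiguous span and build a multiple-choice question.

    Assumption: distractors are other windows of the same length taken
    from the same video; the paper may use a different strategy.
    """
    n = len(frames)
    if not 0 < span_len < n:
        raise ValueError("span_len must be between 1 and len(frames) - 1")

    start = rng.randrange(0, n - span_len + 1)
    target = list(frames[start:start + span_len])
    context = list(frames[:start]) + [MASK] + list(frames[start + span_len:])

    # Collect distinct same-length windows that differ from the target.
    other_starts = [s for s in range(n - span_len + 1) if s != start]
    rng.shuffle(other_starts)
    distractors = []
    for s in other_starts:
        window = list(frames[s:s + span_len])
        if window != target and window not in distractors:
            distractors.append(window)
        if len(distractors) == num_distractors:
            break
    if len(distractors) < num_distractors:
        raise ValueError("video too short for the requested distractors")

    candidates = distractors + [target]
    rng.shuffle(candidates)
    return MVPSample(context, candidates, candidates.index(target))


def choice_reward(sample, predicted_index):
    """Binary reward (an assumption): 1.0 for the true segment, else 0.0."""
    return 1.0 if predicted_index == sample.answer_index else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    frames = list(range(12))  # integers stand in for decoded frames
    sample = build_mvp_sample(frames, span_len=3, num_distractors=3, rng=rng)
    print(sample.context)
    print(sample.candidates, sample.answer_index)
```

In a reward-based post-training loop, the model's selected candidate index would be scored with something like `choice_reward`. Same-video windows are used here only because they share appearance with the target, so a model cannot pick the answer from scene content alone and has to rely on temporal order.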

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning