MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction
Xiaokun Sun, Zezhong Wu, Zewen Ding, Linli Xu

TL;DR
This paper introduces Masked Video Prediction (MVP), a novel self-supervised training objective for Video Large Language Models that improves their understanding of temporal dynamics and causal relationships in videos.
Contribution
The paper proposes MVP, a post-training method that pairs a scalable data synthesis pipeline with reward-based optimization to strengthen temporal reasoning in VideoLLMs.
Findings
MVP significantly improves video reasoning and temporal understanding.
The scalable data pipeline enables efficient training on large video datasets.
Models trained with MVP outperform baselines on video reasoning benchmarks.
Abstract
Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models' ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support scalable training, we introduce a scalable data…
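To make the task format concrete, here is a minimal sketch of how one masked-segment sample could be assembled and scored. It is not the authors' pipeline: the names `MVPSample`, `build_mvp_sample`, and `choice_reward` are hypothetical, the distractor strategy (the reversed target plus same-length windows taken from elsewhere in the video) is an assumption, and the 0/1 reward is only one plausible reading of a reward-based objective.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence

MASK = "<MASK>"  # placeholder standing in for the hidden segment


@dataclass
class MVPSample:
    context: List[object]           # clips with the masked span collapsed to MASK
    candidates: List[List[object]]  # candidate segments; exactly one is the true span
    answer: int                     # index of the true span within candidates


def build_mvp_sample(clips: Sequence[object], span: int,
                     num_distractors: int, rng: random.Random) -> MVPSample:
    """Mask one contiguous span and build a multiple-choice question around it."""
    if span < 1 or len(clips) < span + 2:
        raise ValueError("need at least one clip of context on each side of the span")
    # Pick a start so that context remains both before and after the masked span.
    start = rng.randrange(1, len(clips) - span)
    target = list(clips[start:start + span])
    context = list(clips[:start]) + [MASK] + list(clips[start + span:])

    # Assumed distractor pool: temporally reversed target + windows from other positions.
    pool = [target[::-1]]
    pool += [list(clips[s:s + span])
             for s in range(len(clips) - span + 1) if s != start]
    pool = [c for c in pool if c != target]  # a distractor must differ from the target
    if len(pool) < num_distractors:
        raise ValueError("not enough distinct distractors for this video")

    candidates = rng.sample(pool, num_distractors) + [target]
    rng.shuffle(candidates)
    return MVPSample(context, candidates, candidates.index(target))


def choice_reward(predicted: int, sample: MVPSample) -> float:
    """Assumed binary reward: 1.0 for selecting the true segment, else 0.0."""
    return 1.0 if predicted == sample.answer else 0.0


if __name__ == "__main__":
    video = list(range(10))  # stand-in clip identifiers
    sample = build_mvp_sample(video, span=3, num_distractors=3,
                              rng=random.Random(0))
    print(sample.context, sample.candidates, sample.answer)
```

Because the correct answer is known by construction, a sample like this needs no human annotation, which is what makes such an objective self-supervised.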
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
