VideoSSR: Video Self-Supervised Reinforcement Learning

Zefeng He; Xiaoye Qu; Yafu Li; Siyuan Huang; Daizong Liu; Yu Cheng

arXiv:2511.06281 · cs.CV · November 11, 2025

TL;DR

VideoSSR introduces a self-supervised reinforcement learning framework that leverages intrinsic video information through novel pretext tasks, significantly improving multimodal large language models' understanding across diverse video benchmarks.

Contribution

The paper proposes VideoSSR, a self-supervised reinforcement learning framework built on three pretext tasks (Anomaly Grounding, Object Counting, and Temporal Jigsaw) and the VideoSSR-30K dataset, to enhance video understanding in multimodal models.

Findings

01 VideoSSR improves model performance by over 5% on average across 17 benchmarks.

02 The introduced pretext tasks are effective in harnessing intrinsic video information.

03 State-of-the-art models struggle with the new pretext tasks, indicating their difficulty.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose…
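
The abstract names the pretext tasks but does not say how samples are built or scored. As an illustration only, the sketch below shows one generic way a temporal-jigsaw sample with a verifiable reward could be generated from frame indices alone. The function names (`split_into_segments`, `make_jigsaw_sample`, `jigsaw_reward`), the equal-length segmentation, and the exact-match reward are all assumptions made for this example, not the paper's implementation.

```python
import random
from typing import List, Sequence, Tuple


def split_into_segments(num_frames: int, num_segments: int) -> List[List[int]]:
    """Split frame indices 0..num_frames-1 into contiguous, near-equal segments."""
    if not 2 <= num_segments <= num_frames:
        raise ValueError("need 2 <= num_segments <= num_frames")
    base, extra = divmod(num_frames, num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        length = base + (1 if i < extra else 0)
        segments.append(list(range(start, start + length)))
        start += length
    return segments


def make_jigsaw_sample(
    num_frames: int, num_segments: int, rng: random.Random
) -> Tuple[List[List[int]], List[int]]:
    """Shuffle the segments and return (shuffled_segments, answer).

    answer[k] is the position in the shuffled list that holds original
    segment k, so reading the shuffled segments in `answer` order
    restores the original video. No human label is needed (assumed setup).
    """
    segments = split_into_segments(num_frames, num_segments)
    perm = list(range(num_segments))
    while perm == sorted(perm):  # reject the identity so the task is non-trivial
        rng.shuffle(perm)
    shuffled = [segments[j] for j in perm]
    answer = [perm.index(k) for k in range(num_segments)]
    return shuffled, answer


def jigsaw_reward(predicted: Sequence[int], answer: Sequence[int]) -> float:
    """Binary verifiable reward: 1.0 only for an exact match (an assumption)."""
    return 1.0 if list(predicted) == list(answer) else 0.0


if __name__ == "__main__":
    shuffled, answer = make_jigsaw_sample(12, 4, random.Random(0))
    print("shuffled segments:", shuffled)
    print("ground-truth order:", answer)
    print("reward for a correct prediction:", jigsaw_reward(answer, answer))
```

Because the ground truth comes from the shuffle itself, the reward can be checked mechanically, which is the property that makes such a task usable under RLVR. A softer reward (for example, partial credit per correctly placed segment) would be an equally plausible design choice.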

Code & Models

Models

🤗
yhx12/VideoSSR
model · 7 downloads · ♡ 2

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis