ViSS-R1: Self-Supervised Reinforcement Video Reasoning

Bo Fang; Yuxin Song; Qiangqiang Wu; Haoyuan Sun; Wenhao Wu; Antoni B. Chan

arXiv:2511.13054·cs.CV·November 18, 2025

Open Access

TL;DR

This paper introduces ViSS-R1, a self-supervised reinforcement learning framework that enhances multimodal large language models' ability to perform robust, visual-centric video reasoning by integrating pretext tasks into the training pipeline.

Contribution

It proposes the ViSS-R1 framework and Pretext-GRPO algorithm, which improve video reasoning by emphasizing visual information processing and self-supervised learning within the R1 paradigm.

Findings

01 Outperforms existing methods on six video reasoning benchmarks.

02 Enhances the model's ability to process transformed visual inputs.

03 Demonstrates robustness and superiority in complex video reasoning tasks.

Abstract

Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. In Pretext-GRPO, positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which forces the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates…
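The reward idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the pretext tasks, so rotation prediction is used here purely as a hypothetical example, and the function names `pretext_reward` and `group_relative_advantages` are invented for illustration. The advantage computation follows the usual GRPO convention of normalizing each reward against the mean and standard deviation of its sampled group; the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical pretext label set: the rotation applied to the video frames.
# The abstract does not specify the pretext tasks; this is an assumption.
ROTATIONS = (0, 90, 180, 270)


def pretext_reward(predicted: int, applied: int) -> float:
    """Return a positive reward only when the pretext task is solved."""
    return 1.0 if predicted == applied else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style).

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One transformed input (rotated by 90 degrees) and four sampled responses.
applied = 90
sampled_predictions = [90, 0, 90, 270]

rewards = [pretext_reward(p, applied) for p in sampled_predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that identify the applied transformation receive a positive advantage and the others a negative one, so the policy update favors outputs that depend on the visual content. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.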

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks