TL;DR
SSL-R1 introduces a self-supervised reinforcement learning framework that generates verifiable visual rewards from images, enhancing multimodal large language models' reasoning without human annotations.
Contribution
SSL-R1 reformulates widely used self-supervised visual tasks into verifiable visual puzzles that supply the rewards for RL post-training, improving MLLMs' visual understanding and reasoning capabilities.
Findings
Training on the visual puzzles improves performance on multimodal understanding and reasoning benchmarks.
The framework requires neither human nor external model supervision.
Deriving rewards directly from images makes RL reward design for multimodal models more scalable.
Abstract
Reinforcement learning (RL) with verifiable rewards (RLVR) has shown great potential for enhancing the reasoning abilities of multimodal large language models (MLLMs). However, reliance on language-centric priors and expensive manual annotations hinders MLLMs' intrinsic visual understanding and the design of scalable rewards. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for…
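To make the idea of a verifiable visual puzzle concrete, here is a minimal sketch of one classic SSL pretext task, rotation prediction, recast as a question with a checkable answer. The excerpt above does not list which tasks SSL-R1 uses, so the choice of task, the function names (`rotate90`, `make_rotation_puzzle`, `rotation_reward`), the prompt wording, and the `<answer>` tag format are all assumptions made for illustration, not the authors' implementation. The image is a toy nested list so the example runs without extra dependencies.

```python
import random
import re

# Hypothetical label space for the illustrative rotation puzzle.
ANGLES = (0, 90, 180, 270)


def rotate90(grid):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def make_rotation_puzzle(grid, rng):
    """Build one puzzle from an unlabeled image.

    The transformation is sampled by the program itself, so the
    ground-truth label comes for free, with no human annotation.
    """
    angle = rng.choice(ANGLES)
    rotated = [list(row) for row in grid]
    for _ in range(angle // 90):
        rotated = rotate90(rotated)
    prompt = (
        "By how many degrees clockwise was this image rotated? "
        "Options: 0, 90, 180, 270. Reply as <answer>N</answer>."
    )
    return {"image": rotated, "prompt": prompt, "label": angle}


def rotation_reward(response, label):
    """Binary verifiable reward: 1.0 if the parsed answer matches the label."""
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if int(match.group(1)) == label else 0.0


if __name__ == "__main__":
    image = [[1, 2], [3, 4]]  # stand-in for real pixel data
    puzzle = make_rotation_puzzle(image, random.Random(0))
    # A stand-in for a model response that happens to be correct.
    reply = f"<answer>{puzzle['label']}</answer>"
    print(rotation_reward(reply, puzzle["label"]))
```

In an RL post-training loop, the scalar returned by a function like `rotation_reward` would play the role of the reward for a sampled model response; the policy-optimization step itself is outside the scope of this sketch.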
