Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko

TL;DR
This study investigates rule-based visual reinforcement learning using jigsaw puzzles as a structured framework, revealing how multimodal models learn, generalize, and reason in complex visual tasks, with implications for multimodal AI development.
Contribution
It provides the first comprehensive analysis of rule-based visual RL with jigsaw puzzles, highlighting key insights into model learning, reasoning, and generalization behaviors.
Findings
MLLMs improve from near-random to near-perfect accuracy on jigsaw puzzles with fine-tuning.
Training on jigsaw puzzles aids generalization to other visual tasks.
RL outperforms supervised fine-tuning in generalization.
Abstract
The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: First, we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Second, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task…
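To make the abstract's point about "inherent ground truth" and "adjustable difficulty" concrete, the following is a minimal sketch of how a jigsaw puzzle and a rule-based reward could be constructed. It is an illustration only, not the authors' implementation: the function names (`make_jigsaw`, `exact_match_reward`, `positionwise_accuracy`), the list-of-lists image representation, and the specific reward definitions are all assumptions introduced here. Difficulty is adjusted through the `rows` and `cols` grid size, and the shuffling permutation itself serves as the ground truth.

```python
import random


def make_jigsaw(image, rows, cols, seed=None):
    """Cut a 2D image (list of equal-length rows) into rows*cols patches and shuffle them.

    Returns (shuffled_patches, perm), where perm[i] is the original
    row-major index of the patch now shown at position i.
    Hypothetical helper for illustration; assumes the image dimensions
    are divisible by rows and cols.
    """
    height, width = len(image), len(image[0])
    patch_h, patch_w = height // rows, width // cols

    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [
                row[c * patch_w:(c + 1) * patch_w]
                for row in image[r * patch_h:(r + 1) * patch_h]
            ]
            patches.append(patch)

    rng = random.Random(seed)
    perm = list(range(rows * cols))
    rng.shuffle(perm)
    shuffled = [patches[i] for i in perm]
    return shuffled, perm


def exact_match_reward(predicted, target):
    """Binary rule-based reward: 1.0 only if the whole ordering is correct (assumed design)."""
    return 1.0 if list(predicted) == list(target) else 0.0


def positionwise_accuracy(predicted, target):
    """Fraction of positions whose predicted patch index is correct (assumed metric)."""
    if len(predicted) != len(target):
        return 0.0
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    # A toy 4x4 "image" with distinct pixel values, cut into a 2x2 puzzle.
    toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
    shuffled, perm = make_jigsaw(toy_image, rows=2, cols=2, seed=0)
    print("ground-truth permutation:", perm)
    print("reward for a correct answer:", exact_match_reward(perm, perm))
```

Because the answer can be checked mechanically against `perm`, no learned reward model is needed, which is the property that makes a task of this kind suitable for rule-based RL.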
Taxonomy
Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques · Human Pose and Action Recognition
Methods: Shrink and Fine-Tune · Jigsaw
