Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang

TL;DR
This paper introduces Janus-Pro-R1, a multimodal large language model that integrates visual comprehension and generation through reinforcement learning, enabling iterative, introspective image creation and improved semantic understanding.
Contribution
It presents a novel two-stage training method combining supervised fine-tuning and reinforcement learning to unify visual comprehension and generation in MLLMs.
Findings
Outperforms existing models in text-to-image generation and editing
Functions as a superior image semantic evaluator
Enables iterative, introspective image generation
Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTactile and Sensory Interactions · Visual and Cognitive Learning Processes
