UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Weijia Mao, Zhenheng Yang, Mike Zheng Shou

TL;DR
UniRL introduces a self-improving post-training method for multimodal models that generates its own training data, enhancing both understanding and generation tasks without external data or extensive additional training.
Contribution
It presents a novel self-improving post-training approach that leverages generated data to enhance multimodal models, reducing reliance on external datasets and balancing task performance.
Findings
Achieved a GenEval score of 0.77 on Show-o
Achieved a GenEval score of 0.65 on Janus
Requires only several additional training steps
Abstract
Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy…
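As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several images sampled for the same prompt against that group's own mean and standard deviation. It is a minimal sketch and not the authors' implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example 0/1 rewards (imagined here as "did the understanding branch confirm the prompt attribute") are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per image sampled for a single prompt.
    Hypothetical helper: population std and `eps` are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Samples above the group mean get a positive advantage, samples below
    # it a negative one; eps avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four images for one prompt, reward 1.0 when the
# understanding branch confirms the requested attribute, else 0.0.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed. When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.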
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper identifies a real gap in current u-MLLMs, i.e., the lack of mutual improvement between generation and understanding, and argues convincingly that post-training is an efficient place to address it.
- Having a model bootstrap from its own samples is elegant and practically attractive.
- The text is easy to follow.
- Limited novelty: The overall pipeline is a well-known paradigm in self-training and reinforcement learning.
- Narrow scope: The method is validated only on GenEval-like attributes, which represent low-level visual features. It remains unclear whether the approach generalizes to more complex open-domain tasks (e.g., human actions, reasoning, visual grounding, or captioning).
- The automatic reward relies on keyword or attribute matching. Without semantic or visual similarity measures (e.g., CLI
- **No external training data requirement:** The proposed UniRL uses only model-generated images for post-training, which makes it more practical and scalable and eliminates the need for expensive real-world data collection and annotation.
- **Valuable consistency evaluation metric:** The proposed bidirectional metrics, Accuracy(MMU|T2I) and Accuracy(T2I|MMU), are a useful and well-conceived way to quantify the alignment between generation and understanding tasks.
- **Insightful analysis:** This paper p
- **Synthetic data reliability and quality control:** While self-generated training data for u-MLLMs is data-friendly, the quality and diversity of the generated synthetic data cannot be guaranteed. Without such guarantees, the self-training loop risks reinforcing generation biases or accumulating errors over iterations. How do you ensure that the self-generated images remain sufficiently diverse and informative across iterations? Is there any mechanism to prevent the accumulation of low
1. The proposed self-improving framework is technically sound. By using generated data for iterative training, no external data is required, and the mutual reinforcement of T2I and MMU tasks addresses the generation-understanding imbalance pervasive in unified models.
2. The integration of SFT and GRPO paves the way for further performance improvement in the context of unified models.
3. The overall presentation is clear and easy to follow.
4. The analysis of SFT vs. GRPO provides valuable insigh
1. In Table 2 and Table 4, the Count. score falls below that of the baseline model. However, the prompt construction and evaluation pipeline in Figure 1 and Figure 2 suggest that Count. performance could be improved with the proposed framework. Why is the performance worse? More evaluation details should be included to clarify this.
2. The evaluation only focuses on basic visual attributes (counting, color, position) but neglects complex reasoning (e.g., physics, temporal d
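The bidirectional metrics mentioned in the reviews, Accuracy(MMU|T2I) and Accuracy(T2I|MMU), are not defined in the excerpts above. One plausible reading, assumed here purely for illustration, is a conditional accuracy: the fraction of samples that are correct on one task among those that are correct on the other. The helper name `conditional_accuracy` and the per-sample correctness flags below are hypothetical and do not come from the paper.

```python
def conditional_accuracy(target_correct, condition_correct):
    """Accuracy on the target task, restricted to samples where the
    conditioning task was correct. Assumed interpretation, not the
    paper's stated definition."""
    kept = [t for t, c in zip(target_correct, condition_correct) if c]
    if not kept:
        return None  # undefined when the conditioning task never succeeds
    return sum(kept) / len(kept)


# Hypothetical per-sample correctness flags for four prompts.
mmu_correct = [True, True, False, True]   # understanding (MMU) results
t2i_correct = [True, False, False, True]  # generation (T2I) results

acc_mmu_given_t2i = conditional_accuracy(mmu_correct, t2i_correct)
acc_t2i_given_mmu = conditional_accuracy(t2i_correct, mmu_correct)
print(acc_mmu_given_t2i, acc_t2i_given_mmu)
```

Under this reading the two directions can differ: a model may understand everything it generates correctly while failing to generate everything it understands, and the gap between the two numbers is one way to see the generation-understanding imbalance the reviewers discuss.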
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
