UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

Rui Tian; Mingfei Gao; Haiming Gang; Jiasen Lu; Zhe Gan; Yinfei Yang; Zuxuan Wu; Afshin Dehghan

arXiv:2511.14760·cs.CV·November 19, 2025

TL;DR

UniGen-1.5 is a unified multimodal large language model for image understanding, generation, and editing. It combines a unified reinforcement learning strategy with an Edit Instruction Alignment stage and reports competitive results on understanding and generation benchmarks.

Contribution

The paper introduces a unified RL strategy and a light instruction alignment stage that jointly enhance image editing and generation capabilities in a multimodal model.

Findings

01. Achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit.

02. Outperforms previous models such as BAGEL in image understanding and editing.

03. Performs competitively with proprietary models such as GPT-Image-1.

Abstract

We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation, and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that jointly improves image generation and image editing via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves comprehension of editing instructions, which is essential to the success of RL training. Experimental results show that UniGen-1.5 achieves competitive understanding and generation performance. Specifically, UniGen-1.5 achieves 0.89 and 4.31 overall scores on GenEval and…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship