Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Siqi Luo; Jianghan Shen; Yi Xin; Huayu Zheng; Haoxing Chen; Yan Tai; Yue Li; Junjun He; Yihao Liu; Guangtao Zhai; Yuewen Cao; Xiaohong Liu

arXiv:2605.16842 · cs.AI · May 19, 2026

TL;DR

This paper introduces HT-GRPO, a hierarchical reinforcement learning method with a Sketch-Then-Paint scheme for optimizing diffusion multi-modal large language models, improving image quality and structural fidelity.

Contribution

The paper proposes a hierarchical policy optimization framework that builds the hierarchy of the generation process into training and adds a prompt-conditioned importance ratio estimator.

Findings

01. Significant performance improvements on the GenEval and DPG benchmarks.

02. Enhanced image quality, aesthetics, and human preference scores.

03. Effective reward propagation for key structural tokens.

Abstract

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three…
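
To make the "uniform rewards" criticism concrete, here is a minimal sketch of the baseline behavior the abstract describes, assuming standard GRPO-style group normalization. It is not HT-GRPO itself: the function names, the reward values, and the token count are hypothetical and chosen only for illustration. Each sampled image gets one scalar reward, the reward is normalized within its group, and the resulting advantage is copied unchanged to every token, regardless of whether that token set the global layout or filled in a local detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_token_advantages(rewards, num_tokens):
    """Broadcast each sample-level advantage to every token of that sample.

    This is the uniform assignment the abstract argues against: early
    (layout) tokens and late (detail) tokens receive the same signal.
    """
    return [[a] * num_tokens for a in group_relative_advantages(rewards)]


# Hypothetical group: four images sampled for one prompt, six tokens each.
rewards = [0.2, 0.8, 0.5, 0.5]
token_adv = uniform_token_advantages(rewards, num_tokens=6)

for sample_adv in token_adv:
    print([round(a, 3) for a in sample_adv])
```

Every row printed by the loop contains a single repeated value, which is exactly the lack of per-token credit assignment that a hierarchical scheme would need to replace.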
