HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Hongji Yang; Yucheng Zhou; Wencheng Han; Runzhou Tao; Zhongying Qiu; Jianfei Yang; Jianbing Shen

arXiv:2511.19965 · cs.CV · November 26, 2025

Open Access

TL;DR

HiCoGen introduces a hierarchical, reinforcement learning-based framework for complex text-to-image generation, decomposing prompts into semantic units and iteratively synthesizing images to improve concept coverage and compositional accuracy.

Contribution

The paper presents a novel Hierarchical Compositional Generative framework with a Chain of Synthesis paradigm and a reinforcement learning approach, including a decaying stochasticity schedule, to enhance complex prompt understanding.

Findings

01. Outperforms existing methods in concept coverage

02. Achieves higher compositional accuracy

03. Introduces the HiCoPrompt benchmark for evaluation

Abstract

Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further…
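
The Chain of Synthesis loop described in the abstract can be pictured with a minimal sketch. This is an illustration of the control flow only, not the authors' implementation: the names `chain_of_synthesis`, `toy_decompose`, and `toy_synthesize` are hypothetical, the LLM decomposition is replaced by a comma split, and the diffusion model is replaced by a function that concatenates strings so the example runs without any model.

```python
from typing import Callable, List, Optional

# Stand-in for an image: a plain string describing the scene so far.
# A real system would pass image tensors; this is an assumption for illustration.
Image = str


def chain_of_synthesis(
    prompt: str,
    decompose: Callable[[str], List[str]],
    synthesize: Callable[[str, Optional[Image]], Image],
) -> Optional[Image]:
    """Generate a scene one semantic unit at a time.

    `decompose` plays the role of the LLM that splits the prompt into units;
    `synthesize` plays the role of the generator that takes the next unit and
    the previous image as visual context. Both are hypothetical callables.
    """
    image: Optional[Image] = None
    for unit in decompose(prompt):
        # The image from the previous step is the context for this step.
        image = synthesize(unit, image)
    return image


def toy_decompose(prompt: str) -> List[str]:
    # Toy replacement for LLM decomposition: split on commas.
    return [part.strip() for part in prompt.split(",") if part.strip()]


def toy_synthesize(unit: str, context: Optional[Image]) -> Image:
    # Toy replacement for image generation: append the unit to the scene.
    return unit if context is None else f"{context} + {unit}"


result = chain_of_synthesis(
    "a red cube, a blue sphere on the cube, a wooden table",
    toy_decompose,
    toy_synthesize,
)
print(result)
```

Passing the decomposer and the generator in as callables keeps the loop itself independent of any particular LLM or diffusion model, which is the only structural point the sketch is meant to convey.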

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications