RaDL: Relation-aware Disentangled Learning for Multi-Instance Text-to-Image Generation

Geon Park, Seon Bin Kim, Gunho Jung, and Seong-Whan Lee

arXiv:2507.11947 · cs.CV · July 17, 2025



TL;DR

RaDL introduces a relation-aware disentangled learning framework that improves multi-instance text-to-image generation by better modeling the relationships between instances and the attributes of each instance. It outperforms existing methods on COCO-Position, COCO-MIG, and DrawBench.

Contribution

This paper presents RaDL, a novel framework that enhances multi-instance T2I generation by incorporating relation-aware features and disentangled instance attributes. This design addresses two limitations of previous methods: relationship discrepancy and multiple-attribute leakage.

Findings

1. RaDL achieves higher positional accuracy than existing methods on COCO-Position.
2. RaDL effectively models multiple attributes per instance.
3. RaDL outperforms existing methods on COCO-MIG and DrawBench.

Abstract

With recent advances in text-to-image (T2I) models, effectively generating multiple instances in a single image from one prompt has become a crucial challenge. Existing methods, while successful at placing individual instances in the correct positions, often struggle with relationship discrepancy and multiple-attribute leakage. To address these limitations, this paper proposes the relation-aware disentangled learning (RaDL) framework. RaDL enhances instance-specific attributes through learnable parameters and generates relation-aware image features via Relation Attention, which uses action verbs extracted from the global prompt. Through extensive evaluations on benchmarks such as COCO-Position, COCO-MIG, and DrawBench, we demonstrate that RaDL outperforms existing methods, showing significant improvements in positional accuracy, multiple attributes consideration, and the relationships…
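The abstract describes Relation Attention only at a high level. The following is a minimal sketch that assumes it behaves like standard scaled dot-product cross-attention: flattened image features act as queries, embeddings of the action verbs extracted from the global prompt act as keys and values, and the result is added back through a residual connection. The toy verb lexicon, the random embedding table, the projection matrices, and the function names `extract_action_verbs` and `relation_attention` are all hypothetical illustrations, not the paper's implementation. The learnable instance-attribute parameters are not modeled here.

```python
import numpy as np

# Hypothetical toy lexicon; a realistic extractor would likely use a
# part-of-speech tagger. The paper's actual verb extraction is not shown here.
ACTION_VERBS = {"riding", "holding", "chasing", "eating", "sitting", "throwing"}


def extract_action_verbs(prompt):
    """Return words from `prompt` found in the toy verb lexicon, in order."""
    words = [w.strip(".,;:!?").lower() for w in prompt.split()]
    return [w for w in words if w in ACTION_VERBS]


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relation_attention(image_feats, verb_embs, w_q, w_k, w_v):
    """Assumed form: image tokens (queries) attend to verb embeddings (keys/values).

    image_feats: (N, d) flattened spatial features
    verb_embs:   (M, d) embeddings of the extracted action verbs
    Returns (N, d) relation-aware features via a residual connection.
    """
    if verb_embs.shape[0] == 0:
        return image_feats.copy()  # no verbs: leave features unchanged
    q = image_feats @ w_q                       # (N, d_k)
    k = verb_embs @ w_k                         # (M, d_k)
    v = verb_embs @ w_v                         # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (N, M) scaled similarities
    attn = softmax(scores, axis=-1)             # each image token's weights sum to 1
    return image_feats + attn @ v


# Usage with random toy data.
rng = np.random.default_rng(0)
d = 8
verbs = extract_action_verbs("A man riding a horse while holding a flag.")
vocab = {v: rng.standard_normal(d) for v in sorted(ACTION_VERBS)}  # hypothetical table
verb_embs = np.stack([vocab[v] for v in verbs]) if verbs else np.zeros((0, d))
image_feats = rng.standard_normal((16, d))  # e.g. a 4x4 latent grid, flattened
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = relation_attention(image_feats, verb_embs, w_q, w_k, w_v)
print(verbs, out.shape)  # ['riding', 'holding'] (16, 8)
```

The residual connection is a design choice in this sketch: when no action verb is found, the image features pass through untouched, so relation information only ever adds to, rather than replaces, the spatial features.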

