Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation

Shengyuan Liu; Bo Wang; Ye Ma; Te Yang; Xipeng Cao; Quan Chen; Han Li; Di Dong; Peng Jiang

arXiv:2405.06948 · cs.CV · September 10, 2025

TL;DR

This paper introduces a training-free, subject-driven guidance method for compositional text-to-image generation that improves attribute binding and subject fidelity, especially in zero-shot scenarios.

Contribution

It proposes a novel training-free inference guidance framework that enhances attention maps for better compositional generation without additional training.

Findings

01. Improves subject fidelity and attribute binding in generated images.
02. Achieves strong zero-shot compositional generation performance.
03. Introduces GroundingScore metric for evaluating subject alignment.

Abstract

Existing subject-driven text-to-image generation models require tedious fine-tuning and struggle to maintain both text-image alignment and subject fidelity. When generating compositional subjects, they often suffer from missing objects and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance that intervenes in the generative process at inference time. This approach strengthens the attention map, allowing for precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric, GroundingScore, to evaluate subject…
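The abstract does not spell out how the attention map is strengthened, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a single cross-attention layer where each image patch attends over prompt tokens, and it swaps in a simple technique: multiply the attention weights of the subject tokens by a factor and renormalize each row. The function names, the `scale` parameter, and the toy shapes are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys):
    """Attention of each image patch (row) over prompt tokens (columns)."""
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d), axis=-1)


def strengthen_subject_attention(attn, subject_token_ids, scale=2.0):
    """Hypothetical guidance step: upweight subject-token columns, then
    renormalize so every row is still a probability distribution."""
    boosted = attn.copy()
    boosted[:, subject_token_ids] *= scale
    return boosted / boosted.sum(axis=-1, keepdims=True)


# Toy inputs (assumed shapes): 16 image patches, 5 prompt tokens, dim 8.
rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 8))
keys = rng.normal(size=(5, 8))

subject_ids = [1, 3]  # assumed positions of the subject tokens in the prompt
attn = cross_attention(queries, keys)
guided = strengthen_subject_attention(attn, subject_ids, scale=2.0)
```

With `scale` greater than 1, the share of attention that each patch gives to the subject tokens can only grow, which is the intuition behind countering missing objects. A real training-free guidance method would apply such an intervention inside the diffusion model's denoising loop rather than on random toy tensors.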

Taxonomy

Topics: Multimodal Machine Learning Applications