TL;DR
This paper introduces a coarse-to-fine regional prompt control pipeline for DiT-based image generation, leveraging LLMs to improve controllability and image quality through layered cross-attention manipulation.
Contribution
It proposes a regional prompt injection method for DiT models that uses LLM-generated content and style descriptions to make image generation more controllable.
Findings
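The method is reported to improve image fidelity and diversity.
Layer-specific prompt control improves regional content accuracy: deeper cross-attention layers govern high-level content, while shallow layers govern low-level details.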
Quantitative and qualitative results show performance gains.
Abstract
The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the…
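The abstract is cut off before it describes the injection mechanism, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes that regional control can be approximated by masking cross-attention so each image token attends only to the text tokens of its own region, and that prompts are routed by depth following the abstract's observation (low-level prompts for shallow layers, high-level prompts for deeper ones). The function names, the region-id encoding, and the `split` threshold are all hypothetical, and the learned query/key/value projections of a real DiT block are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive exactly zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def regional_cross_attention(img_tokens, text_tokens, img_region, text_region):
    """Masked cross-attention (illustrative, no learned projections).

    img_tokens:  (N, d) image-token features, used as queries.
    text_tokens: (M, d) prompt-token features, used as keys and values.
    img_region:  (N,) integer region id of each image token (hypothetical encoding).
    text_region: (M,) integer region id of the prompt each text token belongs to.
    """
    d = img_tokens.shape[-1]
    scores = img_tokens @ text_tokens.T / np.sqrt(d)           # (N, M)
    allowed = img_region[:, None] == text_region[None, :]      # (N, M) boolean mask
    if not allowed.any(axis=1).all():
        raise ValueError("every image region needs at least one text token")
    scores = np.where(allowed, scores, -np.inf)                # block other regions
    weights = softmax(scores, axis=-1)
    return weights @ text_tokens, weights


def select_prompt_for_layer(layer_idx, num_layers, low_level, high_level, split=0.5):
    # Assumed routing rule: shallow layers get the low-level prompt,
    # deeper layers get the high-level prompt. `split` is a made-up threshold.
    return low_level if layer_idx < int(split * num_layers) else high_level


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(6, 16))           # 6 image tokens
    txt = rng.normal(size=(4, 16))           # 4 text tokens, 2 per regional prompt
    img_region = np.array([0, 0, 0, 1, 1, 1])
    text_region = np.array([0, 0, 1, 1])
    out, w = regional_cross_attention(img, txt, img_region, text_region)
    print(out.shape, w.round(2))
    for layer in range(4):
        print(layer, select_prompt_for_layer(layer, 4, "style/details", "content/objects"))
```

In this sketch the mask is a hard constraint, which keeps each regional prompt from leaking into other regions; a softer variant (for example, adding a finite penalty instead of negative infinity) would be an equally plausible design and may be closer to what a real pipeline does.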
Taxonomy
Methods: Softmax · Attention Is All You Need · Attentive Walk-Aggregating Graph Neural Network · Diffusion · Focus
