TL;DR
PRISM is a compositional method that enhances long-text-to-image generation by decomposing prompts into components, allowing pre-trained models to better capture intricate details and generalize to longer inputs.
Contribution
PRISM introduces a lightweight module for prompt decomposition and a novel energy-based merging technique, improving long-prompt processing without fine-tuning the underlying T2I model.
Findings
PRISM achieves comparable performance to fine-tuned models on long prompts.
It outperforms baselines by 7.4% on prompts over 500 tokens.
It demonstrates superior generalization across various architectures.
Abstract
While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions that shape their training distributions. Existing methods attempt to bridge this gap either by fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths, or by projecting the oversize inputs into normal-prompt space and compromising fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes independent noise predictions for each component, and their outputs are merged into a single denoising step using…
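The compositional step described in the abstract (one noise prediction per prompt component, merged into a single denoising step) can be illustrated with a minimal Python sketch. The abstract is cut off before it names the merging rule, so the code below does not reproduce PRISM's energy-based merge. As a stand-in it uses the weighted conjunction familiar from composable diffusion: the unconditional prediction plus a weighted offset for each component. The names `merge_noise_predictions`, `composed_denoising_step`, `predict_noise`, and the weights are hypothetical and chosen for illustration only.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def merge_noise_predictions(
    eps_uncond: Sequence[float],
    eps_components: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> Vector:
    """Merge per-component noise predictions into one prediction.

    Stand-in rule (NOT PRISM's energy-based merge):
        eps = eps_uncond + sum_i w_i * (eps_i - eps_uncond)
    """
    if len(eps_components) != len(weights):
        raise ValueError("need exactly one weight per component")
    merged = list(eps_uncond)
    for eps_c, w in zip(eps_components, weights):
        if len(eps_c) != len(eps_uncond):
            raise ValueError("all predictions must have the same length")
        for i, (e_c, e_u) in enumerate(zip(eps_c, eps_uncond)):
            merged[i] += w * (e_c - e_u)
    return merged


def composed_denoising_step(
    predict_noise: Callable[[Vector, int, Optional[str]], Vector],
    x_t: Vector,
    t: int,
    components: Sequence[str],
    weights: Sequence[float],
) -> Vector:
    """Query a (hypothetical) noise predictor once per component and merge.

    `predict_noise(x_t, t, cond)` is an assumed interface for a pre-trained
    T2I denoiser; `cond=None` requests the unconditional prediction.
    """
    eps_uncond = predict_noise(x_t, t, None)
    eps_components = [predict_noise(x_t, t, c) for c in components]
    return merge_noise_predictions(eps_uncond, eps_components, weights)


if __name__ == "__main__":
    # Toy predictor: ignores x_t and t, shifts the output by the prompt length.
    def toy_predictor(x_t: Vector, t: int, cond: Optional[str]) -> Vector:
        shift = 0.0 if cond is None else float(len(cond))
        return [shift for _ in x_t]

    eps = composed_denoising_step(
        toy_predictor, [0.0, 0.0], t=10,
        components=["a red barn", "snowy hills"], weights=[0.5, 0.5],
    )
    print(eps)
```

The design point this sketch shares with the approach described above is that the pre-trained model is only ever queried with short, component-level conditions; the combination happens on its outputs, so no fine-tuning is involved.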