TL;DR
This paper introduces DiT-ST, a split-text conditioning framework that improves text-to-image diffusion transformers by explicitly modeling semantic primitives through hierarchical, incremental input injection, leading to better caption understanding.
Contribution
It proposes a novel split-text conditioning method using LLMs and staged token injection to enhance semantic comprehension in diffusion models.
Findings
DiT-ST outperforms baseline models in caption understanding accuracy.
Hierarchical token injection improves semantic primitive representation.
Experimental results validate the effectiveness of split-text conditioning.
Abstract
Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives…
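The hierarchical, incremental injection described in the abstract can be pictured with a small scheduling sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `Stage` class, the `active_sentences` helper, the three-stage grouping (objects, then relations, then attributes), and the 0.3 and 0.6 boundaries are all hypothetical. The source says only that split-text sentences are injected into different denoising stages.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """A group of simplified sentences that joins the conditioning at a given point.

    start_frac is the fraction of denoising progress (0.0 to 1.0) at which the
    group becomes active. Both the field and its values are assumptions.
    """
    start_frac: float
    sentences: tuple[str, ...]


def active_sentences(stages: list[Stage], step: int, total_steps: int) -> list[str]:
    """Return the sentences conditioning the model at `step`.

    Injection is incremental: once a stage has started, its sentences stay
    active for every later step.
    """
    progress = step / total_steps
    active: list[str] = []
    for stage in sorted(stages, key=lambda s: s.start_frac):
        if progress >= stage.start_frac:
            active.extend(stage.sentences)
    return active


# Hypothetical split-text caption for "A brown dog sits on a wooden bench."
stages = [
    Stage(0.0, ("A dog.", "A bench.")),                       # assumed: objects first
    Stage(0.3, ("The dog sits on the bench.",)),              # assumed: relations next
    Stage(0.6, ("The dog is brown.", "The bench is wooden.")),  # assumed: attributes last
]

for step in (0, 5, 9):
    print(step, active_sentences(stages, step, total_steps=10))
```

In a real pipeline the returned sentences would be encoded by the text encoder and passed to the diffusion transformer as conditioning tokens at that step; how DiT-ST performs that encoding and where it places the stage boundaries is not stated in the excerpt above.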