Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping
Chao Zhou, Tianyi Wei, Yiling Chen, Wenbo Zhou, Nenghai Yu

TL;DR
This paper introduces PKA, a novel attention framework for multi-condition diffusion models that reduces redundancy, accelerates training, and improves resource efficiency in text-to-image generation.
Contribution
It proposes Position-Aligned and Keyword-Scoped Attention (PKA) to eliminate redundant interactions, along with CSAS for faster training, advancing multi-condition diffusion transformer capabilities.
Findings
10.0× inference speedup
5.1× VRAM saving
Enhanced conditional fidelity
Abstract
While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional "concatenate-and-attend" strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven…
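To make the scaling argument in the abstract concrete, the sketch below counts query-key pairs under two schemes. It is an illustration under stated assumptions, not the authors' implementation: the function names, the token counts, and the simplification that every spatial condition contributes as many tokens as the image are all hypothetical. The "position-aligned" variant assumes each image patch attends only to the same-position patch of each condition, which is one plausible reading of "localized patch alignment".

```python
def concat_attend_pairs(n_image, n_text, n_conditions):
    """Query-key pairs when text, image, and all condition tokens
    are concatenated into one sequence and attend to each other."""
    # Assumption: each condition contributes n_image tokens.
    seq_len = n_text + n_image + n_conditions * n_image
    return seq_len * seq_len


def position_aligned_pairs(n_image, n_text, n_conditions):
    """Query-key pairs in a hypothetical position-aligned scheme:
    text and image attend jointly, and each image patch additionally
    attends only to the same-position patch of every condition."""
    base = (n_text + n_image) ** 2       # joint text-image attention
    aligned = n_conditions * n_image     # one extra pair per patch per condition
    return base + aligned


if __name__ == "__main__":
    # Illustrative sizes only: 1024 image patches, 128 text tokens.
    for k in (1, 2, 4):
        full = concat_attend_pairs(1024, 128, k)
        local = position_aligned_pairs(1024, 128, k)
        print(f"conditions={k}  concat={full}  aligned={local}")
```

Under these assumptions the concatenated count grows quadratically in the number of conditions, while the aligned count grows linearly, which matches the direction of the claim in the abstract. The actual speedup and memory figures reported above depend on details this sketch does not model.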
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications
