Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping

Chao Zhou; Tianyi Wei; Yiling Chen; Wenbo Zhou; Nenghai Yu

arXiv:2602.06850·cs.CV·February 9, 2026

Open Access

TL;DR

This paper introduces PKA, an attention framework for multi-condition Diffusion Transformers that removes redundant cross-modal attention, which speeds up training and inference and lowers memory use in text-to-image generation.

Contribution

The paper proposes Position-Aligned and Keyword-Scoped Attention (PKA), which eliminates redundant cross-modal interactions in multi-condition Diffusion Transformers, together with CSAS, which speeds up training.

Findings

01. 10.0× inference speedup

02. 5.1× VRAM saving

03. Enhanced conditional fidelity

Abstract

While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional "concatenate-and-attend" strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven…
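The scaling argument in the abstract can be made concrete with a rough count of query-key pairs. The Python sketch below is only an illustration under stated assumptions: the function names, the token counts, and the rule that each image token attends to exactly one aligned patch per spatial condition are hypothetical and are not taken from the paper's implementation.

```python
def concat_attention_pairs(n_text, n_image, n_conditions, n_cond_tokens):
    """Query-key pairs when all tokens are concatenated and fully attended.

    Every token attends to every other token, so the cost is the square
    of the total sequence length.
    """
    total = n_text + n_image + n_conditions * n_cond_tokens
    return total * total


def position_aligned_pairs(n_text, n_image, n_conditions):
    """Query-key pairs under a hypothetical position-aligned scheme.

    Assumption (not from the paper): text and image tokens still attend
    to each other fully, while each image token attends only to the one
    spatially aligned patch of each condition.
    """
    base = (n_text + n_image) ** 2
    return base + n_image * n_conditions


if __name__ == "__main__":
    # Illustrative sizes only: 512 text tokens, 4096 image tokens,
    # and 4096 tokens per condition image.
    for k in (1, 2, 4):
        full = concat_attention_pairs(512, 4096, k, 4096)
        local = position_aligned_pairs(512, 4096, k)
        print(f"conditions={k}: full={full:,} aligned={local:,}")
```

In this toy count the fully concatenated cost grows with the square of the number of conditions, while the aligned variant adds only a term linear in the number of conditions, which is the sense in which localized alignment "linearizes" spatial control. The counts say nothing about the speedups the paper reports.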

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications