Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task
Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang

TL;DR
The paper introduces PT-DiT, a sparse-attention diffusion transformer that uses proxy tokens to model global visual information efficiently, reducing computation while maintaining competitive performance in image and video generation.
Contribution
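The paper proposes a proxy-token attention mechanism in which the average of the tokens in each spatial-temporal window serves as the proxy token for that region, giving efficient global modeling in diffusion transformers. It also develops the Qihoo-T2X family of models for various visual tasks.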
Findings
Achieves up to 49% reduction in computational complexity compared to DiT.
Maintains competitive performance in image and video generation.
Introduces window and shift window attention to enhance detail modeling.
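To give a rough feel for where the savings in attention come from, the following sketch counts query-key pairs for full self-attention versus a proxy-token scheme. It is a back-of-the-envelope illustration under stated assumptions, not the paper's cost model: the function name `attention_pairs`, the token count, and the window size are all hypothetical, and the local term assumes that window and shift window attention each let every token attend to one window of the same size. It counts attention pairs only, so it is not meant to reproduce the 49% figure above.

```python
def attention_pairs(n_tokens, window_size):
    """Count query-key pairs under three attention patterns.

    Assumptions (illustrative only): one proxy token per window of
    `window_size` tokens, and local attention made of one window pass plus
    one shifted-window pass over windows of that same size.
    """
    n_proxy = n_tokens // window_size

    # Full self-attention: every token attends to every token.
    full = n_tokens * n_tokens

    # Proxy path: proxies attend to each other, then every latent token
    # cross-attends to all proxies.
    proxy_path = n_proxy * n_proxy + n_tokens * n_proxy

    # Local path: window attention plus shifted-window attention.
    local = 2 * n_tokens * window_size

    return full, proxy_path, local


full, proxy_path, local = attention_pairs(n_tokens=4096, window_size=64)
print(full, proxy_path, local)
print((proxy_path + local) / full)  # fraction of full-attention pairs
```

With these made-up sizes the proxy and local paths together touch a small fraction of the pairs that full self-attention does. The rest of a transformer block (projections, feed-forward layers) is unaffected, which is one reason a whole-model reduction is much smaller than the attention-only ratio.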
Abstract
The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the…
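The proxy-token path described in the abstract (average each spatial-temporal window, run self-attention among the proxies, inject the result into all latent tokens via cross-attention) can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that do not come from the paper: attention is single-head with no learned projections, the injection is a plain residual addition, and the names `proxy_token_block`, `grid`, and `window` are hypothetical. Window and shift window attention are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention, no learned projections
    # (an assumption made to keep the sketch short).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def proxy_token_block(latent, grid, window):
    """latent: (T*H*W, C) tokens in row-major (T, H, W) order.
    grid: (T, H, W); window: (wt, wh, ww), each dividing its grid axis."""
    T, H, W = grid
    wt, wh, ww = window
    C = latent.shape[-1]

    # 1. Average the tokens in each spatial-temporal window -> one proxy each.
    x = latent.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    proxies = x.mean(axis=(1, 3, 5)).reshape(-1, C)

    # 2. Self-attention among the (few) proxy tokens captures global semantics.
    proxies = attention(proxies, proxies, proxies)

    # 3. Cross-attention: every latent token queries the proxies.
    injected = attention(latent, proxies, proxies)

    # Residual addition is an assumption of this sketch.
    return latent + injected, proxies


rng = np.random.default_rng(0)
latent = rng.standard_normal((4 * 8 * 8, 16))
out, proxies = proxy_token_block(latent, grid=(4, 8, 8), window=(2, 2, 2))
print(out.shape, proxies.shape)
```

Here 256 latent tokens are summarized by 32 proxies, so the two attention calls operate on a 32×32 and a 256×32 score matrix rather than a 256×256 one.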
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Diffusion · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer
