Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

Jing Wang; Ao Ma; Jiasong Feng; Dawei Leng; Yuhui Yin; Xiaodan Liang

arXiv:2409.04005·cs.CV·October 7, 2024

TL;DR

The paper introduces PT-DiT, a sparse attention diffusion transformer that uses proxy tokens to model global visual information efficiently, reducing computation while maintaining competitive performance in image and video generation tasks.

Contribution

It proposes a novel proxy-tokenized attention mechanism with averaging tokens for efficient global modeling in diffusion transformers, and develops the Qihoo-T2X family for various visual tasks.

Findings

1. Achieves up to 49% reduction in computational complexity compared to DiT.
2. Maintains competitive performance in image and video generation.
3. Introduces window and shift window attention to enhance detail modeling.

Abstract

The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the…
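The proxy-token path described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`attend`, `proxy_tokens`, `global_injection`) are hypothetical, attention is single-head with no learned query/key/value projections, the tokens sit on a 2D grid rather than a spatial-temporal volume, the injected result is added back residually, and the window and shift window attention branch is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def proxy_tokens(grid, window):
    """Average the tokens of each non-overlapping window x window patch."""
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    proxies = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            patch = [grid[r][c]
                     for r in range(top, top + window)
                     for c in range(left, left + window)]
            proxies.append([sum(tok[d] for tok in patch) / len(patch)
                            for d in range(dim)])
    return proxies


def global_injection(grid, window):
    """Proxy self-attention, then cross-attention from every latent token."""
    proxies = proxy_tokens(grid, window)
    # Global semantics are exchanged among the (few) proxy tokens only.
    proxies = attend(proxies, proxies, proxies)
    # Every latent token queries the proxies to receive global context.
    latents = [tok for row in grid for tok in row]
    injected = attend(latents, proxies, proxies)
    # Residual add is an assumption of this sketch.
    return [[x + y for x, y in zip(tok, inj)]
            for tok, inj in zip(latents, injected)]


# Toy input: a 4 x 4 grid of 2-dimensional tokens, 2 x 2 windows.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
proxies = proxy_tokens(grid, 2)
updated = global_injection(grid, 2)
print(len(proxies), "proxy tokens for", len(updated), "latent tokens")
```

In this toy setting, 16 latent tokens are summarized by 4 proxy tokens, so the global step scores 4 × 4 proxy pairs plus 16 × 4 latent-to-proxy pairs instead of the 16 × 16 pairs of full self-attention. These counts describe the sketch only and are not the complexity figures reported in the paper.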

Code & Models

Repositories

360cvgroup/qihoo-t2x (Official)

Models

qihoo360/Qihoo-T2X (Hugging Face model)

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Diffusion · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer