Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations
Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin

TL;DR
This paper investigates the limitations of Diffusion Transformers in visual correspondence due to massive activation concentration and introduces DiTF, a normalization framework that improves their performance significantly.
Contribution
The paper uncovers the link between massive activations and AdaLN in DiTs and proposes DiTF, a training-free method that enhances feature extraction for better correspondence results.
Findings
DiTF outperforms DINO and SD-based models in visual correspondence tasks.
Massive activations are linked to AdaLN in DiTs and can be mitigated.
DiTF achieves state-of-the-art results on SPair-71k and AP-10K-C.S.
Abstract
Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We analyze these dimension-concentrated massive activations and uncover that their concentration is inherently linked to the Adaptive Layer Normalization (AdaLN) in DiTs. Building on these findings, we propose the Diffusion Transformer Feature (DiTF),…
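To make the "few fixed dimensions" observation concrete, the sketch below flags feature channels whose average magnitude across patch tokens dwarfs that of a typical channel. It is only an illustrative diagnostic, not the DiTF method itself: the function name find_massive_dims, the ratio threshold of 50, and the toy feature values are all assumptions introduced for this example.

```python
from statistics import median


def find_massive_dims(features, ratio=50.0):
    """Return indices of channels whose mean |activation| dominates.

    features: list of per-token feature vectors (tokens x channels).
    ratio: hypothetical threshold (an assumption, not from the paper);
           a channel is flagged when its mean magnitude exceeds
           `ratio` times the median channel magnitude.
    """
    num_tokens = len(features)
    num_channels = len(features[0])
    # Mean absolute activation of each channel, averaged over all patch tokens.
    channel_mag = [
        sum(abs(tok[c]) for tok in features) / num_tokens
        for c in range(num_channels)
    ]
    typical = median(channel_mag)
    return [c for c, m in enumerate(channel_mag) if m > ratio * typical]


# Toy features: 4 patch tokens, 6 channels. Channel 2 carries a huge,
# nearly token-independent value, mimicking a massive activation.
toy = [
    [0.3, -0.1, 120.0, 0.2, -0.4, 0.1],
    [-0.2, 0.4, 118.0, -0.3, 0.1, 0.2],
    [0.1, -0.3, 121.0, 0.4, 0.2, -0.1],
    [0.2, 0.2, 119.0, -0.1, -0.3, 0.3],
]
massive = find_massive_dims(toy)
print(massive)  # [2]
```

Because the flagged channel takes almost the same value for every token, it contributes little to telling patches apart, which is consistent with the abstract's point that such dimensions hold little local information.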
Taxonomy
Topics: Visual perception and processing mechanisms · Advanced Optical Imaging Technologies
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Vision Transformer · Softmax · Diffusion · Position-Wise Feed-Forward Layer
