Taming Outlier Tokens in Diffusion Transformers
Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen, Zhe Gan, Chen Wei

TL;DR
This paper studies outlier tokens in Diffusion Transformers (DiTs) for image generation, shows that they arise in both the encoder and the denoiser, and proposes Dual-Stage Registers (DSR) to mitigate artifacts and improve generation quality.
Contribution
The paper introduces Dual-Stage Registers (DSR), a register-based intervention for controlling outlier tokens in DiTs that improves image generation performance.
Findings
Outlier tokens appear in both encoder and denoiser components.
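Simply masking high-norm tokens does not improve performance, which suggests the problem lies in corrupted local patch semantics rather than in a few extreme values.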
DSR interventions reduce artifacts and improve quality across datasets.
Abstract
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based…
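To make the notion of a high-norm token and the masking baseline mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's procedure: the function names, the median-ratio rule, and the default ratio of 5.0 are all hypothetical choices made for illustration, and tokens are represented as plain lists of floats rather than model activations.

```python
import math
from statistics import median


def token_norms(tokens):
    """L2 norm of each token vector (tokens: list of equal-length float lists)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_outlier_tokens(tokens, ratio=5.0):
    """Indices of tokens whose norm exceeds `ratio` times the median norm.

    The median-ratio rule and the default ratio are illustrative assumptions,
    not the criterion used in the paper.
    """
    norms = token_norms(tokens)
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


def mask_tokens(tokens, indices):
    """Return a copy of `tokens` with the selected tokens zeroed out."""
    drop = set(indices)
    return [
        [0.0] * len(tok) if i in drop else list(tok)
        for i, tok in enumerate(tokens)
    ]


# Toy example: three unit-norm tokens and one token with norm 50.
tokens = [[1.0, 0.0], [0.0, 1.0], [30.0, 40.0], [0.6, 0.8]]
outliers = find_outlier_tokens(tokens)
masked = mask_tokens(tokens, outliers)
```

In this toy input only the third token is flagged and zeroed. According to the abstract, this kind of masking alone does not improve performance, which is the motivation for a register-based intervention instead.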
