Registers Matter for Pixel-Space Diffusion Transformers
Nikita Starodubcev, Ilia Sudakov, Ilya Drobyshevskiy, Artem Babenko, Dmitry Baranchuk

TL;DR
This paper investigates the role of register tokens in pixel-space diffusion transformers, demonstrating their benefits for convergence and quality, and proposing a dual-stream architecture for improved performance.
Contribution
The paper shows that register tokens improve convergence and generation quality in pixel-space diffusion transformers, links this effect to cleaner feature maps at high noise levels, and introduces a dual-stream architecture for better generation quality.
Findings
Register tokens improve convergence and generation quality in pixel-space DiTs.
Register tokens produce cleaner feature maps at high noise levels.
Recent DiT architectures implicitly incorporate register-like mechanisms.
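For readers unfamiliar with the mechanism, the following is a minimal sketch of how register tokens are typically used in a transformer: extra tokens are concatenated to the patch sequence, take part in attention, and are dropped before the output. It is not the authors' implementation. The function names, the identity attention projections, the layer count, and the token sizes are all assumptions chosen to keep the example short and dependency-free.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head attention with identity Q/K/V projections (assumption)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out


def forward_with_registers(patch_tokens, register_tokens, num_layers=2):
    """Run toy attention layers over patches plus registers, return patches only."""
    # Registers are appended so that every token can attend to them.
    tokens = [list(t) for t in patch_tokens] + [list(r) for r in register_tokens]
    for _ in range(num_layers):
        attended = self_attention(tokens)
        # Residual connection around the attention block.
        tokens = [[x + y for x, y in zip(t, a)] for t, a in zip(tokens, attended)]
    # Register positions are discarded; only patch positions are returned.
    return tokens[:len(patch_tokens)]


random.seed(0)
patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
# In a real model these would be learned parameters; here they are random.
registers = [[random.gauss(0, 0.02) for _ in range(8)] for _ in range(4)]

out = forward_with_registers(patches, registers)
out_no_registers = forward_with_registers(patches, [])
print(len(out), len(out[0]))  # 16 8
```

The output keeps the patch-sequence shape whether or not registers are present, so registers can be added without changing the layers that consume the output. They influence the result only through attention.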
Abstract
Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly…
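The "high-norm patch-token outliers" mentioned in the abstract can be illustrated with a simple check on per-token L2 norms. This is only a sketch: the helper names, the median-based rule, and the `ratio` threshold are hypothetical and do not come from the paper, which may use a different criterion.

```python
import math
import statistics


def token_norms(tokens):
    """L2 norm of each token vector."""
    return [math.sqrt(sum(x * x for x in t)) for t in tokens]


def find_outliers(tokens, ratio=5.0):
    """Indices of tokens whose norm exceeds `ratio` times the median norm.

    The median-multiple rule is an illustrative assumption.
    """
    norms = token_norms(tokens)
    median_norm = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median_norm]


# Fifteen unit-norm tokens, with one replaced by a token of norm 50.
tokens = [[1.0, 0.0, 0.0, 0.0] for _ in range(15)]
tokens[7] = [30.0, 40.0, 0.0, 0.0]

outliers = find_outliers(tokens)
print(outliers)  # [7]
```

Applied to the patch tokens of an intermediate layer, a check of this kind would return an empty list for a model without such outliers, which is what the abstract reports for DiTs.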
