HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
Yu He, Lichen Ma, Zipeng Guo, Xinyuan Shan, Jingling Fu, Dong Chen, Junshi Huang, Yan Li

TL;DR
HyperDiT introduces a novel hyper-connected transformer framework with cross-scale interactions and semantic guidance, achieving state-of-the-art high-fidelity pixel-space diffusion results on ImageNet.
Contribution
The paper proposes HyperDiT, a unified model that uses cross-attention and a Scale-Aware Rotary Position Embedding (SA-RoPE) to bridge the semantic and pixel scales in diffusion models.
Findings
Achieves state-of-the-art FID of 1.56 on ImageNet 256x256.
Effectively reduces hallucination and artifacts in high-fidelity generation.
Demonstrates superior performance by combining semantic guidance with fine-grained details.
Abstract
Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework that establishes Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifolds. Instead of injecting semantics through AdaLN, HyperDiT uses Cross-Attention, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual…
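The abstract does not spell out how SA-RoPE is computed, so the following is only a rough sketch of the general idea of cross-scale attention with aligned positions, not the authors' formulation. It assumes that tokens of every patch size are positioned by their patch centers in shared pixel coordinates, and that fine-grained tokens query coarse tokens without learned projections. All function names (`patch_centers`, `rope_1d`, `rope_2d`, `cross_scale_attention`) are hypothetical.

```python
import numpy as np

def patch_centers(image_size, patch_size):
    """(y, x) patch centers in pixel units, shape (n*n, 2).

    Assumption: a shared pixel coordinate frame is what aligns scales.
    """
    n = image_size // patch_size
    c = (np.arange(n) + 0.5) * patch_size
    yy, xx = np.meshgrid(c, c, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=-1)

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding: rotate feature pairs of x (N, d) by pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = pos[:, None] * freqs[None, :]         # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, centers):
    """Rotate the first half of the features by y and the second half by x."""
    half = x.shape[-1] // 2                     # feature dim must be divisible by 4
    return np.concatenate(
        [rope_1d(x[:, :half], centers[:, 0]),
         rope_1d(x[:, half:], centers[:, 1])], axis=-1)

def cross_scale_attention(fine, coarse, fine_pos, coarse_pos):
    """Fine tokens (queries) attend to coarse tokens (keys and values)."""
    q = rope_2d(fine, fine_pos)
    k = rope_2d(coarse, coarse_pos)
    scores = q @ k.T / np.sqrt(fine.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ coarse, attn

# Illustrative sizes only: a 32x32 image with 4-pixel and 16-pixel patches.
rng = np.random.default_rng(0)
fine_pos = patch_centers(32, 4)      # 64 fine tokens
coarse_pos = patch_centers(32, 16)   # 4 coarse tokens
fine = rng.standard_normal((64, 8))
coarse = rng.standard_normal((4, 8))
out, attn = cross_scale_attention(fine, coarse, fine_pos, coarse_pos)
```

Because rotary embeddings make the query-key score depend only on the relative offset between positions, placing both scales in one coordinate frame lets a fine token measure its distance to a coarse anchor directly, which is one plausible reading of "geometric alignment among tokens of varying patch sizes".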
