HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

Yu He; Lichen Ma; Zipeng Guo; Xinyuan Shan; Jingling Fu; Dong Chen; Junshi Huang; Yan Li

arXiv:2605.15741 · cs.CV · May 18, 2026


TL;DR

HyperDiT is a hyper-connected transformer for pixel-space diffusion that combines cross-scale interactions with semantic guidance, reaching a state-of-the-art FID of 1.56 on ImageNet 256x256.

Contribution

The paper proposes HyperDiT, a unified framework that uses cross-attention and Scale-Aware Rotary Position Embedding (SA-RoPE) to bridge the semantic and pixel scales in pixel-space diffusion models.

Findings

01. Achieves a state-of-the-art FID of 1.56 on ImageNet 256x256.

02. Reduces hallucination and artifacts in high-fidelity generation.

03. Improves generation quality by combining semantic guidance with fine-grained detail.

Abstract

Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework that establishes Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifolds. Rather than injecting semantics through AdaLN, HyperDiT uses Cross-Attention, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual…
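The abstract does not say how SA-RoPE assigns positions, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each token is indexed by the pixel-space centre of its patch, so that coarse and fine tokens share one coordinate frame, and it applies a standard 1-D rotary embedding at that coordinate (a 2-D version would treat each axis the same way). The names `patch_centers`, `rotate`, and `score` are hypothetical.

```python
import math


def patch_centers(image_size, patch_size):
    """Pixel-space centre of each patch along one axis (assumed position rule)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return [(i + 0.5) * patch_size for i in range(image_size // patch_size)]


def rotate(vec, pos, base=10000.0):
    """Apply a standard 1-D rotary embedding to an even-length vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for k in range(0, dim, 2):
        # Each consecutive pair of features rotates at its own frequency.
        theta = pos * base ** (-k / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(q, q_pos, k, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


# Hypothetical setup: a 16-pixel axis tokenised at patch sizes 8 and 4.
coarse = patch_centers(16, 8)  # positions of coarse (semantic) tokens
fine = patch_centers(16, 4)    # positions of fine (pixel) tokens

q = [1.0, 0.5, -0.3, 2.0]      # toy fine-token query
k = [0.2, -1.0, 0.7, 0.4]      # toy coarse-token key

# Both pairs are 2 pixels apart, so the rotary logits agree.
same_offset_a = score(q, fine[1], k, coarse[0])
same_offset_b = score(q, fine[3], k, coarse[1])
```

Because a rotary logit depends only on the difference between the two positions, putting both scales on shared pixel coordinates makes a fine token and a coarse token that are equally far apart in the image interact in the same way, whatever their patch sizes.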
