Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers
Shaodong Xu, Zhendong Wang, Litong Gong, Zexian Li, Wengang Zhou, Tiezheng Ge, Houqiang Li

TL;DR
This paper introduces sREPA, a structural alignment method that explicitly models the spatial relationships in visual features to enhance diffusion transformer training.
Contribution
sREPA is a novel framework that enforces structural consistency in feature maps, improving convergence speed and sample quality over existing point-wise alignment methods.
Findings
sREPA accelerates training convergence.
sREPA improves generation fidelity.
sREPA outperforms state-of-the-art alignment strategies.
Abstract
Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features, as pioneered by Representation Alignment (REPA), can substantially accelerate training and improve generation fidelity. Subsequent analysis (e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, most existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA,…
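To make the abstract's distinction between point-wise matching and a structural constraint concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the sREPA objective, which the truncated abstract above does not specify. The function names (`pointwise_loss`, `structural_loss`), the use of cosine similarity, and the squared-difference penalty on token-to-token similarity matrices are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def pointwise_loss(h, f):
    """Mean (1 - cosine similarity) between token i of h and token i of f.

    h, f: arrays of shape (num_tokens, dim). Each token is compared only
    with its counterpart, so relations between tokens are never used.
    """
    cos = np.sum(l2_normalize(h) * l2_normalize(f), axis=-1)
    return float(np.mean(1.0 - cos))


def structural_loss(h, f):
    """Mean squared gap between the token-to-token cosine similarity matrices.

    Hypothetical structural penalty: it compares how tokens relate to each
    other within h against how they relate within f.
    """
    hn, fn = l2_normalize(h), l2_normalize(f)
    return float(np.mean((hn @ hn.T - fn @ fn.T) ** 2))


rng = np.random.default_rng(0)
f = rng.normal(size=(16, 8))                   # stand-in for encoder patch features
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
h = f @ q                                      # same relational geometry, new coordinates

pw = pointwise_loss(h, f)    # sensitive to the change of coordinates
st = structural_loss(h, f)   # depends only on pairwise relations
```

In this toy setup `h` is a rotated copy of `f`, so the pairwise similarity structure is identical and the structural penalty vanishes up to floating-point error, while the point-wise loss does not. This only illustrates why the two kinds of objective can disagree; it makes no claim about which one the paper uses or how it is weighted.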
