Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

Shaodong Xu; Zhendong Wang; Litong Gong; Zexian Li; Wengang Zhou; Tiezheng Ge; Houqiang Li

arXiv:2605.16949 · cs.CV · May 19, 2026


TL;DR

This paper introduces sREPA, a structural alignment method that explicitly models the spatial relationships in visual features to enhance diffusion transformer training.

Contribution

sREPA is a novel framework that enforces structural consistency in feature maps, improving convergence speed and sample quality over existing point-wise alignment methods.

Findings

01 sREPA accelerates training convergence.

02 sREPA improves generation fidelity.

03 sREPA outperforms state-of-the-art alignment strategies.

Abstract

Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features, as pioneered by Representation Alignment (REPA), can substantially accelerate training and improve generation fidelity. Subsequent analysis (e.g., iREPA) suggests that these gains arise primarily from transferring the spatial structure contained in pre-trained vision representations. However, most existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA,…
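The abstract is cut off before it states sREPA's objective, so the sketch below does not reproduce the paper's method. It only illustrates the distinction the abstract draws between point-wise matching and a structural (relational) constraint. The function names (`pointwise_loss`, `structural_loss`, `similarity_matrix`) and the choice of cosine similarity and squared error are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pointwise_loss(pred, target):
    """Point-wise matching (illustrative): mean of 1 - cosine between each
    predicted token and the target token at the same spatial position."""
    return sum(1.0 - cosine(p, t) for p, t in zip(pred, target)) / len(pred)

def similarity_matrix(tokens):
    """Token-to-token cosine similarities: one way to describe how the
    spatial positions relate to one another."""
    return [[cosine(a, b) for b in tokens] for a in tokens]

def structural_loss(pred, target):
    """Structural matching (hypothetical): mean squared difference between
    the two token-to-token similarity matrices."""
    sp, st = similarity_matrix(pred), similarity_matrix(target)
    n = len(pred)
    return sum((sp[i][j] - st[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Toy example: three "patch tokens" with two channels each.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The same tokens rotated by 90 degrees: every token now points in a
# different direction, but the angles between tokens are unchanged.
pred = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

print("point-wise loss:", pointwise_loss(pred, target))   # approximately 1.0
print("structural loss:", structural_loss(pred, target))  # approximately 0.0
```

In this toy case the point-wise loss is large while the relational loss is near zero, because the rotation changes every individual token but preserves the geometry between them. A relational loss of this form also compares only token-to-token similarities, so the two feature sets need not share a channel dimension. Whether sREPA uses a similarity-matrix formulation, and how it combines it with other terms, is not stated in the text above.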
