What matters for Representation Alignment: Global Information or Spatial Structure?
Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie

TL;DR
This paper asks whether the global semantic information or the spatial structure of a target representation matters more for generative model training. Through a large-scale empirical analysis and two simple modifications, it finds that spatial structure is the more influential factor.
Contribution
The study reveals that spatial structure, not global semantic performance, primarily influences generation quality, and introduces iREPA, a simple method that enhances spatial information transfer and training convergence.
Findings
Spatial structure drives generation performance more than global semantics.
Replacing MLP with convolution improves convergence speed.
Simple modifications consistently enhance training across models.
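To make the MLP-versus-convolution finding concrete, here is a minimal pure-Python sketch contrasting a per-patch linear projection with a convolutional projection over the patch grid. The 3x3 kernel size, the zero padding, and the helper names `linear_project` and `conv3x3_project` are assumptions made for illustration; this page does not specify the paper's actual layer configuration.

```python
def linear_project(grid, weight):
    """Per-patch projection: each output patch depends only on its own input patch.

    grid:   H x W x D_in nested lists of floats
    weight: D_out x D_in nested lists of floats
    """
    return [
        [[sum(w * v for w, v in zip(w_row, patch)) for w_row in weight] for patch in row]
        for row in grid
    ]


def conv3x3_project(grid, kernel):
    """3x3 convolutional projection with zero padding (assumed kernel size).

    kernel[dy][dx] is a D_out x D_in matrix applied to the neighbour at
    offset (dy - 1, dx - 1), so each output patch mixes its 3x3 neighbourhood.
    """
    height, width = len(grid), len(grid[0])
    d_out = len(kernel[0][0])
    out = [[[0.0] * d_out for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < height and 0 <= xx < width:
                        patch = grid[yy][xx]
                        for o, w_row in enumerate(kernel[dy][dx]):
                            out[y][x][o] += sum(w * v for w, v in zip(w_row, patch))
    return out


# Toy 3x3 patch grid with one feature channel; only the centre patch is active.
grid = [
    [[0.0], [0.0], [0.0]],
    [[0.0], [1.0], [0.0]],
    [[0.0], [0.0], [0.0]],
]
identity = [[1.0]]                                      # D_out = D_in = 1
kernel = [[identity for _ in range(3)] for _ in range(3)]

per_patch = linear_project(grid, identity)   # signal stays at the centre patch
conv = conv3x3_project(grid, kernel)         # signal reaches every neighbour
```

In this toy grid the per-patch projection leaves the corner patches at zero, while the convolutional projection lets the centre patch influence all of its neighbours. That neighbourhood mixing is one plausible reading of why a convolutional projection transfers spatial information better.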
Abstract
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward…
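The abstract characterizes spatial structure as the pairwise cosine similarity between patch tokens. The sketch below computes that similarity matrix for a handful of toy tokens. The token values and the helper names `cosine_similarity` and `patch_similarity_matrix` are hypothetical, and the paper's exact spatial-structure metric may aggregate this matrix differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_similarity_matrix(tokens):
    """Return the T x T matrix of cosine similarities between patch tokens."""
    return [[cosine_similarity(a, b) for b in tokens] for a in tokens]


# Hypothetical patch tokens (4 patches, 3 feature dimensions), toy values only.
tokens = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as patch 0
    [0.0, 1.0, 0.0],   # orthogonal to patch 0
    [-1.0, 0.0, 0.0],  # opposite direction to patch 0
]
sim = patch_similarity_matrix(tokens)
```

The matrix depends only on how patches relate to one another, not on how well a pooled feature classifies the image, which is the distinction the abstract draws between spatial structure and global semantic information.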
Peer Reviews
Decision: ICLR 2026 Poster
- **Fundamental insight that provides stable gains:** The work provides large-scale evidence that spatial structure, rather than semantic quality, determines the usefulness of pretrained visual features for generative alignment, and uses this insight to implement a simple, minimal intervention that consistently improves convergence and generative quality.
- **Generalization and robustness:** The proposed method shows improvements across multiple architectures and encoders (DINOv2, SAM2, CLI
**Effect of removing global semantics:** iREPA intentionally suppresses the global semantic component of pretrained representations to enhance spatial contrast. While this clearly benefits diffusion-based generation, it remains uncertain how much this trade-off might affect tasks that depend on higher-level semantic coherence or multimodal conditioning.
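One way to picture what "suppressing the global component to enhance spatial contrast" could mean is a normalization over the patch dimension: subtract the per-feature mean across patches, then rescale by the per-feature standard deviation. The sketch below is an assumption about the general form of such a step, not the paper's confirmed formulation. The `gamma` and `eps` parameters and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def spatial_normalize(tokens, gamma=1.0, eps=1e-6):
    """Assumed form of spatial normalization over a T x D list of patch tokens.

    For each feature dimension: subtract gamma times the mean over patches
    (the component shared by all patches), then divide by the population
    standard deviation over patches.
    """
    num_tokens, dim = len(tokens), len(tokens[0])
    means = [sum(t[d] for t in tokens) / num_tokens for d in range(dim)]
    centered = [[t[d] - gamma * means[d] for d in range(dim)] for t in tokens]
    stds = []
    for d in range(dim):
        mu = sum(c[d] for c in centered) / num_tokens
        stds.append(math.sqrt(sum((c[d] - mu) ** 2 for c in centered) / num_tokens))
    return [[c[d] / (stds[d] + eps) for d in range(dim)] for c in centered]


# Toy tokens: a large shared offset in dimension 0 plus small per-patch differences.
tokens = [
    [5.0, 1.0, 0.0],
    [5.0, -1.0, 0.0],
    [5.0, 0.0, 1.0],
    [5.0, 0.0, -1.0],
]
normalized = spatial_normalize(tokens)

before_01 = cosine_similarity(tokens[0], tokens[1])          # dominated by the shared offset
after_01 = cosine_similarity(normalized[0], normalized[1])   # offset removed
after_02 = cosine_similarity(normalized[0], normalized[2])
```

Before normalization, the shared offset makes every pair of patches look nearly identical. Afterwards, patches 0 and 1 point in opposite directions and patches 0 and 2 are orthogonal. The same operation discards whatever information lived in the shared component, which is the trade-off the reviewer raises.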
S1. The paper's primary contribution, that spatial structure rather than global semantic accuracy is the key driver of REPA's success, is a significant and non-obvious finding. The authors have conducted a comprehensive set of experiments to validate their claims.
S2. The proposed iREPA method is elegantly simple (noted as <4 lines of code) yet highly effective. The two modifications (convolutional projection and spatial normalization) are well-motivated by the paper's core finding and are easy to impl
W1. The paper primarily focuses on improving REPA and its variants. While this is a valid contribution, the proposed method is not benchmarked against other, orthogonal techniques for improving generative model training.
W2. In the correlation plot between linear probing accuracy and gFID (Fig. 1, left), there are two noticeable outliers (points 25: Mocov3-l, 27: MAE-l) that have both low accuracy and poor (high) gFID. These high-leverage points significantly influence the linear regression.
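The concern in W2 can be illustrated with a small sensitivity check: compute the Pearson correlation once on all points and once with the suspected high-leverage points removed. The numbers below are synthetic values invented for this illustration and are not taken from the paper. The helper `pearson_r` is likewise an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# SYNTHETIC data for illustration only (not the paper's measurements):
# six encoders clustered at high accuracy, plus two low-accuracy, high-gFID points.
accuracy = [80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 60.0, 62.0]
gfid = [20.0, 22.0, 19.0, 23.0, 20.0, 22.0, 40.0, 38.0]
outlier_indices = {6, 7}

r_all = pearson_r(accuracy, gfid)

inlier_acc = [a for i, a in enumerate(accuracy) if i not in outlier_indices]
inlier_gfid = [g for i, g in enumerate(gfid) if i not in outlier_indices]
r_inliers = pearson_r(inlier_acc, inlier_gfid)
```

In this toy data the full-set correlation is strongly negative, while the inlier-only correlation is weak. This shows how two extreme points can manufacture an apparent trend, so reporting both values, or a rank correlation, would address the reviewer's point.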
1. The paper systematically examines the relationship between the spatial structure of representations and metrics such as FID by testing an extensive set of encoders, and further strengthens this conclusion using Pearson correlation coefficients.
2. Building on the above findings, the authors propose two strategies to encourage more effective learning of the spatial structure of representations. Both approaches are visually supported through attention map visualizations and are substantiated b
1. The conclusions of this work rely on the premise that metrics such as FID and IS accurately reflect a model's generation capability. However, several prior studies [1, 2, 3, 4] have argued and demonstrated that FID/IS may not fully capture model performance, which introduces some uncertainty into the findings. The authors are therefore encouraged to provide additional evaluation results based on other metrics, for instance by replacing the feature extractor used for FID computation.
2.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception · Face recognition and analysis
