TL;DR
The paper introduces Spatial Gram Alignment (SGA), a novel method that improves ultra-high-resolution image synthesis by aligning internal self-similarities of generative features with foundation model priors, preserving fidelity and structure.
Contribution
SGA is a non-invasive spatial constraint that adapts large-scale pre-trained latent diffusion models to ultra-high-resolution synthesis; the authors report that it outperforms existing methods.
Findings
Achieves state-of-the-art results in ultra-high-resolution text-to-image synthesis.
Effectively balances global structural coherence with fine-grained visual details.
Seamlessly integrates with existing pre-trained latent diffusion models.
Abstract
Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive…
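The contrast between direct patch-wise distillation and aligning self-similarities (as the TL;DR describes SGA) can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before the method details, so the function names `spatial_gram`, `gram_alignment_loss`, and `patchwise_loss`, the cosine normalization, and the mean-squared-error objective are all assumptions chosen for illustration. Plain Python lists stand in for feature tensors to keep the example dependency-free.

```python
import math

def normalize(v):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def spatial_gram(features):
    """Cosine self-similarity between every pair of patches: N x D features -> N x N matrix."""
    unit = [normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def gram_alignment_loss(gen, ref):
    """Assumed objective: mean squared difference between the two N x N Gram matrices.
    Only the patch count N must match; the feature widths may differ."""
    g, r = spatial_gram(gen), spatial_gram(ref)
    n = len(g)
    return sum((g[i][j] - r[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def patchwise_loss(gen, ref):
    """Direct-alignment baseline: mean (1 - cosine similarity) per patch; needs equal widths."""
    total = 0.0
    for a, b in zip(gen, ref):
        ua, ub = normalize(a), normalize(b)
        total += 1.0 - sum(x * y for x, y in zip(ua, ub))
    return total / len(gen)

# Toy data: 3 patches with 2-D features. The "reference" features are the
# generative features rotated by 90 degrees, so every patch-to-patch
# relationship is preserved while each individual vector changes.
gen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref = [[-y, x] for x, y in gen]

print("gram alignment loss:", gram_alignment_loss(gen, ref))
print("patch-wise loss:    ", patchwise_loss(gen, ref))
```

In this toy case the Gram loss is zero up to floating-point error, while the patch-wise loss is 1.0 because every rotated vector is orthogonal to its original. A loss defined on self-similarities therefore constrains only the relative spatial structure of the features and leaves their absolute values free, which is one plausible reading of "non-invasive" here.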
