TL;DR
This paper introduces RiT, a flow-matching diffusion model that operates on frozen DINOv2 features with a simple architecture, reaching FID 1.45 on ImageNet 256x256 without guidance and 1.14 with classifier-free guidance.
Contribution
It demonstrates that pretrained representation spaces like DINOv2 can be effectively used for flow-matching diffusion models without complex heads or transport methods.
Findings
RiT achieves FID 1.45 on ImageNet 256x256 without guidance.
With classifier-free guidance, RiT reaches FID 1.14.
RiT outperforms larger models while using fewer parameters.
Abstract
Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both ) yet DINOv2 exhibits higher effective rank, better covariance conditioning, lower excess kurtosis, and lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning…
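To make the distinction between regressing the clean data point and regressing the velocity concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the linear path z_t = t*x + (1 - t)*eps with t = 0 as noise and t = 1 as data, the function names, and the clipping constant t_max are all assumptions chosen for illustration. Under that path the velocity is x - eps, so a clean-data prediction can be converted to a velocity by dividing by (1 - t).

```python
import numpy as np

def interpolate(x, eps, t):
    """Linear path z_t = t * x + (1 - t) * eps.
    Assumed convention: t = 0 is pure noise, t = 1 is clean data."""
    return t * x + (1.0 - t) * eps

def velocity_from_x_pred(x_hat, z_t, t, t_max=0.999):
    """Velocity implied by a clean-data prediction x_hat on the linear path.
    t is clipped at t_max (a hypothetical constant) to avoid dividing by zero."""
    t = np.minimum(t, t_max)
    return (x_hat - z_t) / (1.0 - t)

def x_pred_loss(x_hat, x):
    """Mean squared error between the predicted and true clean data point."""
    return float(np.mean((x_hat - x) ** 2))

# Toy check with a perfect predictor (x_hat = x): the implied velocity
# should match the true path velocity x - eps.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # stand-in for clean features
eps = rng.normal(size=(4, 8))    # Gaussian noise
t = rng.uniform(0.1, 0.9, size=(4, 1))

z_t = interpolate(x, eps, t)
v_true = x - eps
v_from_x = velocity_from_x_pred(x, z_t, t)
```

In this sketch the network output would take the place of x_hat; the same conversion lets a sampler integrate a velocity field even though the regression target is the clean data point.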
