RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Le Zhang; Ning Mang; Aishwarya Agrawal

arXiv:2605.21981·cs.CV·May 22, 2026

TL;DR

This paper introduces RiT, a vanilla diffusion transformer trained on frozen DINOv2 features. It reports state-of-the-art image generation quality on ImageNet 256x256 (FID 1.45 without guidance, 1.14 with classifier-free guidance) using a simple architecture.

Contribution

The paper demonstrates that a pretrained representation space such as DINOv2 can serve directly as the space in which a flow-matching diffusion model is trained, without complex heads or transport methods.

Findings

01. RiT achieves FID 1.45 on ImageNet 256x256 without guidance.

02. With classifier-free guidance, RiT reaches FID 1.14.

03. RiT, with fewer parameters, outperforms larger models.

Abstract

Flow matching with $x$-prediction, which regresses the clean data point rather than the ambient velocity, is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d} \approx 33$), yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning…
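The distinction between $x$-prediction and velocity prediction can be made concrete with a small sketch. It is not the paper's implementation: it assumes the common linear path $z_t = t\,x + (1-t)\,\epsilon$ with $t=1$ at the data, and the function names and the clamp value `t_eps` are hypothetical. Under that assumption the velocity is $x - \epsilon$, and a clean-sample prediction converts to a velocity via $(x_{\text{pred}} - z_t)/(1-t)$.

```python
import random

def interpolate(x, eps, t):
    """Assumed linear path: z_t = t * x + (1 - t) * eps, with t = 1 at the data."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def velocity_target(x, eps):
    """Velocity of the assumed linear path: dz_t/dt = x - eps."""
    return [xi - ei for xi, ei in zip(x, eps)]

def velocity_from_x_pred(x_pred, z_t, t, t_eps=1e-3):
    """Turn a clean-sample prediction into a velocity: (x_pred - z_t) / (1 - t)."""
    denom = max(1.0 - t, t_eps)  # hypothetical clamp to avoid dividing by zero near t = 1
    return [(xp - zi) / denom for xp, zi in zip(x_pred, z_t)]

def x_prediction_loss(x_pred, x):
    """Mean squared error against the clean sample (the x-prediction target)."""
    return sum((xp - xi) ** 2 for xp, xi in zip(x_pred, x)) / len(x)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(8)]    # stand-in for a clean feature vector
eps = [random.gauss(0.0, 1.0) for _ in range(8)]  # Gaussian noise sample
t = 0.3
z_t = interpolate(x, eps, t)

# With a perfect x-prediction, the converted velocity matches x - eps.
v_from_x = velocity_from_x_pred(x, z_t, t)
v_true = velocity_target(x, eps)
max_gap = max(abs(a - b) for a, b in zip(v_from_x, v_true))
```

In a real model, `x_pred` would be the network output given `z_t` and `t`. Here the ground-truth `x` stands in for it, only to check that the two parameterizations agree.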

Code & Models

Repositories

lezhang7/RiT (GitHub)

Models

le723z/RiT (Hugging Face)
