RePack then Refine: Efficient Diffusion Transformer with Vision Foundation Model

Guanfang Dong; Luke Schultz; Negar Hassanpour; Chao Gao

arXiv:2512.12083·cs.CV·May 15, 2026

TL;DR

The paper introduces RePack then Refine, a three-stage framework that brings Vision Foundation Model (VFM) features to diffusion transformers in compressed form, improving both training speed and generative quality on ImageNet-1K.

Contribution

The paper proposes a feature compression and refinement strategy that improves the training efficiency and generative performance of diffusion transformers.

Findings

01. The RePack module reduces feature dimensionality while preserving structure.

02. RePack-DiT-XL/1 achieves an FID of 1.82 in 64 epochs.

03. Adding the Refiner lowers the FID to 1.65, surpassing recent LDMs.

Abstract

Semantically rich features from Vision Foundation Models (VFMs) have been leveraged to enhance Latent Diffusion Models (LDMs). However, raw VFM features are typically high-dimensional and redundant, which makes learning harder and reduces training efficiency for Diffusion Transformers (DiTs). In this paper, we propose RePack then Refine, a three-stage framework that brings semantically rich VFM features to DiTs while further improving learning efficiency. Specifically, the RePack module projects the high-dimensional features onto a compact, low-dimensional manifold, which filters out redundancy while preserving essential structural information. A standard DiT is then trained for generative modeling in this highly compressed latent space. Finally, to restore the high-frequency details lost to the compression in RePack, we propose a Latent-Guided Refiner, which is trained…
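To make the compression idea in the abstract concrete, here is a minimal sketch of projecting redundant, high-dimensional token features onto a low-dimensional space and checking how much structure survives. It is not the paper's RePack module: the abstract does not specify its architecture, so plain PCA is swapped in as a stand-in. The function names, the synthetic features, and the sizes (256 tokens, 768 feature dimensions, 32 latent dimensions) are all assumptions chosen for illustration. The DiT training stage and the Latent-Guided Refiner are not shown.

```python
import numpy as np


def fit_linear_projection(features, latent_dim):
    """Fit a PCA projection (a stand-in, NOT the paper's learned RePack module)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:latent_dim].T  # shape: (feature_dim, latent_dim)
    return mean, basis


def compress(features, mean, basis):
    """Project features into the low-dimensional latent space."""
    return (features - mean) @ basis


def reconstruct(latents, mean, basis):
    """Map latents back to the original feature space."""
    return latents @ basis.T + mean


rng = np.random.default_rng(0)
# Hypothetical sizes, not taken from the paper.
num_tokens, feature_dim, latent_dim = 256, 768, 32

# Synthetic "redundant" features: a rank-32 signal plus small noise.
signal = rng.normal(size=(num_tokens, latent_dim)) @ rng.normal(
    size=(latent_dim, feature_dim)
)
features = signal + 0.01 * rng.normal(size=(num_tokens, feature_dim))

mean, basis = fit_linear_projection(features, latent_dim)
latents = compress(features, mean, basis)
recon = reconstruct(latents, mean, basis)

# Relative reconstruction error: small when the features are truly redundant.
rel_error = np.linalg.norm(features - recon) / np.linalg.norm(features)
print("latent shape:", latents.shape)
print("relative reconstruction error:", rel_error)
```

In this toy setting the features are low-rank by construction, so a 24x reduction in dimensionality (768 to 32) loses very little. Real VFM features would not compress this cleanly, which is presumably why the framework adds a refinement stage to recover lost detail.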

Code & Models

Repositories

guanfangdong/RePack-then-Refine (GitHub)
