TL;DR
The paper introduces RePack then Refine, a three-stage framework that compresses Vision Foundation Model (VFM) features into a compact latent space for diffusion transformers, improving both training speed and generative quality on ImageNet-1K.
Contribution
The paper proposes a feature compression and refinement strategy that improves the training efficiency and generative performance of diffusion transformers.
Findings
The RePack module reduces feature dimensionality while preserving structural information.
RePack-DiT-XL/1 achieves an FID of 1.82 in 64 epochs.
Adding the Refiner improves FID to 1.65, surpassing recent LDMs.
Abstract
Semantic-rich features from Vision Foundation Models (VFMs) have been leveraged to enhance Latent Diffusion Models (LDMs). However, raw VFM features are typically high-dimensional and redundant, increasing the difficulty of learning and reducing training efficiency for Diffusion Transformers (DiTs). In this paper, we propose RePack then Refine, a three-stage framework that brings the semantic-rich VFM features to DiT while further improving learning efficiency. Specifically, the RePack module projects the high-dimensional features onto a compact, low-dimensional manifold. This filters out the redundancy while preserving essential structural information. A standard DiT is then trained for generative modeling on this highly compressed latent space. Finally, to restore the high-frequency details lost due to the compression in RePack, we propose a Latent-Guided Refiner, which is trained…
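The compress-then-restore idea in the abstract can be illustrated with a minimal sketch. This is not the paper's RePack module: a plain PCA projection is swapped in as a stand-in, the features are synthetic, and the sizes (`num_tokens`, `feature_dim`, `latent_dim`) are hypothetical values chosen only for illustration. The sketch shows that redundant high-dimensional features can be projected to a much smaller latent, and that whatever the projection discards is the residual a later refinement stage would have to recover.

```python
import numpy as np


def fit_linear_projection(features, latent_dim):
    """Fit a PCA basis (an assumed stand-in for RePack, not the paper's module)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:latent_dim]


def compress(features, mean, basis):
    """Project high-dimensional features onto the low-dimensional basis."""
    return (features - mean) @ basis.T


def decompress(latents, mean, basis):
    """Map latents back to feature space; discarded detail is not restored."""
    return latents @ basis + mean


# Hypothetical sizes, chosen for illustration only.
num_tokens, feature_dim, latent_dim = 256, 768, 32

# Synthetic "redundant" features: low-rank structure plus small noise.
rng = np.random.default_rng(0)
structure = rng.normal(size=(num_tokens, latent_dim)) @ rng.normal(
    size=(latent_dim, feature_dim)
)
features = structure + 0.01 * rng.normal(size=(num_tokens, feature_dim))

mean, basis = fit_linear_projection(features, latent_dim)
latents = compress(features, mean, basis)
reconstruction = decompress(latents, mean, basis)

# The residual is what the compression dropped, i.e. the part a
# refinement stage would need to bring back.
residual = features - reconstruction
relative_error = np.linalg.norm(residual) / np.linalg.norm(features)
```

In this toy setting the latent has 24 times fewer dimensions than the input, and the residual is small because the synthetic features were built to be redundant. Real VFM features are not exactly low-rank, which is presumably why the paper adds a dedicated refiner rather than relying on the projection alone.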
