TL;DR
This paper introduces HASTE, a two-phase training schedule that accelerates diffusion transformer training by combining holistic alignment with stage-wise termination, significantly reducing training time while maintaining performance.
Contribution
HASTE is a training method that speeds up diffusion transformer training by applying teacher alignment early and terminating it once it stops helping, so the model can then focus on generation. It requires no architectural changes.
Findings
HASTE reduces the number of training steps by a factor of 28 relative to the baseline.
It achieves comparable image quality in 50 epochs versus 500 epochs.
HASTE improves text-to-image diffusion models on MS-COCO.
Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher…
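The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the cosine and cross-entropy loss forms, the weights `lambda_feat` and `lambda_attn`, and the fixed `termination_step` trigger are all assumptions made for illustration. The sketch also assumes the student features have already been projected to the teacher's dimension.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def feature_alignment_loss(student_feats, teacher_feats):
    # Assumed form: mean (1 - cosine similarity) over tokens.
    # Each argument is a list of per-token vectors of the same dimension.
    sims = [cosine_similarity(s, t) for s, t in zip(student_feats, teacher_feats)]
    return 1.0 - sum(sims) / len(sims)

def attention_alignment_loss(student_attn, teacher_attn):
    # Assumed form: mean cross-entropy between teacher and student
    # attention rows, where each row is a probability distribution.
    total = 0.0
    for s_row, t_row in zip(student_attn, teacher_attn):
        total += -sum(t * math.log(s + 1e-8) for s, t in zip(s_row, t_row))
    return total / len(student_attn)

def alignment_weight(step, termination_step):
    # Hypothetical stage-wise termination: alignment is fully on in
    # Phase I and switched off once step reaches termination_step.
    return 1.0 if step < termination_step else 0.0

def total_loss(denoise_loss, student_feats, teacher_feats,
               student_attn, teacher_attn, step, termination_step,
               lambda_feat=0.5, lambda_attn=0.5):
    # Phase I: denoising loss plus both alignment terms.
    # Phase II: denoising loss only.
    w = alignment_weight(step, termination_step)
    if w == 0.0:
        return denoise_loss
    align = (lambda_feat * feature_alignment_loss(student_feats, teacher_feats)
             + lambda_attn * attention_alignment_loss(student_attn, teacher_attn))
    return denoise_loss + w * align
```

In this sketch, calling `total_loss` with `step` below `termination_step` adds both alignment terms to the denoising loss, while any later step returns the denoising loss unchanged. A hard switch is used here only for simplicity.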
Taxonomy
Methods: Softmax · Attention Is All You Need · Focus · Diffusion
