AHPA: Adaptive Hierarchical Prior Alignment for Diffusion Transformers
Ruibin Min, Yexin Liu, Aimin Pan, Changsheng Lu, Jiafei Wu, Kelu Yao, Xiaogang Xu, Harry Yang

TL;DR
AHPA is a hierarchical, adaptive alignment method for diffusion transformers that adjusts supervision granularity during training, improving convergence and output quality at no extra inference cost.
Contribution
The paper proposes AHPA, a framework that adaptively aligns hierarchical VAE features at different levels to match the changing needs of the denoising process.
Findings
AHPA improves training convergence of diffusion transformers.
AHPA enhances generation quality compared to baseline methods.
AHPA adds no inference cost and requires no external supervision during training.
Abstract
Representation alignment has recently emerged as an effective paradigm for accelerating Diffusion Transformer training. Despite their success, existing alignment methods typically impose a fixed supervision target or a fixed alignment granularity throughout the entire denoising trajectory, whether the guidance is provided by external vision encoders, internal self-representations, or VAE-derived features. We argue that such timestep-agnostic alignment is suboptimal because the useful granularity of representation supervision changes systematically with the signal-to-noise ratio. In high-noise regimes, diffusion models benefit more from coarse semantic and layout-level anchoring, whereas in low-noise regimes, the training signal should emphasize spatially detailed and structurally faithful refinement. This non-stationary alignment behavior creates a representational mismatch for static…
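The timestep-dependent granularity described above can be illustrated with a small sketch. This is not the paper's implementation: the abstract is truncated here and does not give AHPA's actual weighting rule, so the Gaussian-shaped softmax over levels, the `sharpness` parameter, the convention that `t = 1` is pure noise, and the function names `level_weights` and `alignment_loss` are all assumptions made for illustration. The idea shown is only that the weight on coarse targets should dominate at high noise and shift toward fine targets as noise decreases.

```python
import math


def level_weights(t, num_levels, sharpness=4.0):
    """Softmax weights over hierarchy levels for a noise level t in [0, 1].

    Assumed convention: level 0 is the coarsest target, the last level is the
    finest; t = 1 is pure noise and t = 0 is clean data.
    """
    if num_levels == 1:
        return [1.0]
    # Position in the hierarchy to emphasize: coarse when noisy, fine when clean.
    target = (1.0 - t) * (num_levels - 1)
    logits = [-sharpness * (k - target) ** 2 for k in range(num_levels)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def alignment_loss(hidden, targets, t):
    """Weighted alignment loss between one hidden vector and per-level targets.

    `targets` is ordered coarse to fine (hypothetical stand-ins for
    hierarchical VAE features).
    """
    weights = level_weights(t, len(targets))
    return sum(w * cosine_distance(hidden, z) for w, z in zip(weights, targets))
```

With three levels, `level_weights(1.0, 3)` puts most of its mass on level 0 and `level_weights(0.0, 3)` on level 2, so the same hidden state is pulled toward different targets depending on the noise level. A fixed-granularity baseline corresponds to using the same weights at every `t`.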
