Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness
Hangyeol Lee, Hyojeong Lee, Joo-Young Kim

TL;DR
This paper introduces DiTo, a token reduction method for diffusion transformers that focuses on output token similarity, leading to improved image generation quality and efficiency.
Contribution
DiTo shifts token reduction from input similarity to output similarity, using output-aware matching and scheduling to enhance diffusion model performance.
Findings
DiTo achieves 1.6 to 3.9 dB higher PSNR than existing token reduction methods.
DiTo outperforms previous token reduction techniques across various benchmarks.
DiTo maintains high image quality with comparable computational speed.
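For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). The function name `psnr`, the flat-list image representation, and the default peak value of 255 are illustrative assumptions; the summary does not state how the authors computed PSNR or which reference images they compared against.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, and for a single image pair, a PSNR gain of 1.6 dB corresponds to roughly 31% lower mean squared error and a gain of 3.9 dB to roughly 59% lower, since the MSE scales by 10^(-gain/10).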
Abstract
Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple…
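The abstract is truncated above, so the following is only a rough sketch of the part it does describe: using the previous timestep's output tokens as a proxy to decide which tokens correspond, then reducing and recovering tokens with that correspondence. Everything concrete here is an assumption made for illustration and is not confirmed as DiTo's actual procedure: the even/odd bipartite split, cosine similarity as the measure, copy-based recovery, and all function names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_by_prior_outputs(prev_outputs, num_reduce):
    """Pair tokens using similarity of the PREVIOUS step's output tokens.

    Assumed split: even indices may be dropped (src), odd indices are kept (dst).
    Returns {src_index: dst_index} for the num_reduce most similar pairs.
    """
    n = len(prev_outputs)
    dst = list(range(1, n, 2))
    scored = []
    for i in range(0, n, 2):
        j = max(dst, key=lambda d: cosine(prev_outputs[i], prev_outputs[d]))
        scored.append((cosine(prev_outputs[i], prev_outputs[j]), i, j))
    scored.sort(reverse=True)  # most similar pairs first
    return {i: j for _, i, j in scored[:num_reduce]}


def reduce_tokens(tokens, matching):
    """Drop every matched src token; return kept indices and kept tokens."""
    kept = [i for i in range(len(tokens)) if i not in matching]
    return kept, [tokens[i] for i in kept]


def recover_tokens(kept, reduced_outputs, matching, n):
    """Rebuild a full-length output; dropped tokens copy their partner's output."""
    out = [None] * n
    for pos, i in enumerate(kept):
        out[i] = reduced_outputs[pos]
    for i, j in matching.items():
        out[i] = out[j]
    return out


# Toy example: four 2-D output tokens from the previous timestep.
prev_outputs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
matching = match_by_prior_outputs(prev_outputs, num_reduce=1)
kept, reduced = reduce_tokens(["t0", "t1", "t2", "t3"], matching)
# A real model would process `reduced` here; placeholder outputs stand in for it.
full = recover_tokens(kept, ["a", "b", "c"], matching, n=4)
```

In the toy example, token 0 is matched to token 1 because their previous-step outputs are nearly parallel, so only three tokens are processed and token 0's output is copied from token 1 afterwards. The `matching` dictionary could be computed once and passed to `reduce_tokens` and `recover_tokens` at several later timesteps, which is one plausible reading of the reuse the abstract mentions.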
