DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking
Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum

TL;DR
DC-DiT introduces a learned, adaptive tokenization mechanism for diffusion transformers, enabling efficient, flexible image generation with dynamic compute allocation and improved quality-compute tradeoffs.
Contribution
The paper presents a novel adaptive chunking approach that replaces static patchification, allowing for importance-based token compression and elastic inference in diffusion models.
Findings
Reduces inference FLOPs by up to 36.8% on ImageNet.
Improves FID by up to 37.8% over baseline models (lower FID is better).
Enables flexible inference with a smooth quality-compute tradeoff.
Abstract
Diffusion Transformers rely on static patchify tokenization, assigning the same token budget to smooth backgrounds, detailed object regions, noisy early timesteps, and late-stage refinements. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which replaces fixed patchification with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence through a chunking mechanism learned end-to-end with diffusion training. DC-DiT allocates fewer tokens to predictable regions and noisy timesteps, and more tokens to detailed regions and later refinement stages, yielding meaningful spatial segmentations and timestep-adaptive compression schedules without supervision. Furthermore, the router provides an importance ordering over retained tokens, enabling elastic inference: a single checkpoint can be evaluated at flexible compute…
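The elastic-inference idea in the abstract, one checkpoint run at several token budgets using the router's importance ordering, can be illustrated with a toy selection routine. This is a minimal sketch, not the paper's implementation: the function name `select_tokens`, the `keep_ratio` parameter, and the hand-written scores are all hypothetical, and in DC-DiT the scores would come from the learned router rather than a fixed list.

```python
def select_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving original order.

    `scores` stands in for router importance values (assumed here:
    higher means more important). Hypothetical helper for illustration.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n = len(tokens)
    k = max(1, round(n * keep_ratio))  # always keep at least one token

    # Rank indices by descending importance; sorted() is stable, so ties
    # keep their original relative order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)

    # Restore the original (spatial) order among the retained tokens.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept], kept


# Toy example: 8 tokens with made-up importance scores.
tokens = [f"tok{i}" for i in range(8)]
scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.4]

for ratio in (1.0, 0.5, 0.25):
    _, kept = select_tokens(tokens, scores, ratio)
    print(ratio, kept)
```

Because every budget cuts the same ranking at a different depth, the tokens kept at a smaller budget are always a subset of those kept at a larger one. That nesting is what would let a single set of weights serve several compute levels in a scheme like this.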
