TL;DR
This paper applies expert-choice (EC) routing to diffusion language models, giving deterministic load balancing, higher throughput, and faster convergence than token-choice (TC) routing, and adds timestep-dependent expert capacity that allocates more compute to low-mask-ratio denoising steps.
Contribution
It demonstrates that expert-choice routing outperforms token-choice routing in diffusion language models, and that existing token-choice models can be retrofitted to expert-choice routing for better performance.
Findings
EC routing provides deterministic load balancing by design and higher throughput than TC routing.
Allocating more capacity to low-mask-ratio steps consistently gives the best performance under matched FLOPs.
Retrofitting TC models to EC yields faster convergence and better accuracy.
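The load-balancing difference between the two routing schemes can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the router scores are random, and the function names, the shapes, and the choice of capacity = tokens * k / experts (so both schemes do the same total number of token-expert assignments) are assumptions made for illustration.

```python
import numpy as np

def token_choice_load(scores, k):
    """Token-choice: each token picks its top-k experts.

    Returns the number of tokens routed to each expert, which depends on
    the scores and is generally uneven.
    """
    n_tokens, n_experts = scores.shape
    top = np.argsort(-scores, axis=1)[:, :k]          # (n_tokens, k) expert ids
    return np.bincount(top.ravel(), minlength=n_experts)

def expert_choice_load(scores, capacity):
    """Expert-choice: each expert picks its top-`capacity` tokens.

    Returns the per-expert load and the chosen token ids. The load equals
    `capacity` for every expert regardless of the scores.
    """
    n_tokens, n_experts = scores.shape
    top = np.argsort(-scores, axis=0)[:capacity, :]   # (capacity, n_experts) token ids
    load = np.full(n_experts, top.shape[0])
    return load, top

# Hypothetical sizes: 64 tokens, 8 experts, k = 2 experts per token for TC.
rng = np.random.default_rng(0)
scores = rng.normal(size=(64, 8))
k = 2
capacity = 64 * k // 8  # same total assignments as TC

tc_load = token_choice_load(scores, k)
ec_load, ec_assign = expert_choice_load(scores, capacity)

print("TC tokens per expert:", tc_load)
print("EC tokens per expert:", ec_load)
```

Under token-choice the per-expert counts vary with the scores, whereas under expert-choice every expert processes exactly `capacity` tokens. The trade-off is that under expert-choice an individual token may be selected by several experts or by none.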
Abstract
Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute…
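Because EC capacity is set from outside the router, it can be made a function of the denoising step. The abstract does not give the schedule used, so the sketch below assumes a hypothetical linear one: capacity is scaled up when the mask ratio is low and down when it is high. The function name, the `alpha` parameter, and the assumption that mask ratios are spread uniformly over [0, 1] (which makes the average capacity equal to the baseline, a rough stand-in for matched FLOPs) are all illustrative, not taken from the paper.

```python
def capacity_for_mask_ratio(mask_ratio, base_capacity, alpha=0.5):
    """Hypothetical linear capacity schedule.

    mask_ratio:    fraction of tokens still masked at this step, in [0, 1].
    base_capacity: per-expert capacity of a constant-capacity baseline.
    alpha:         strength of the tilt toward low-mask-ratio steps, in [0, 1].

    Returns base_capacity * (1 + alpha) at mask_ratio 0 and
    base_capacity * (1 - alpha) at mask_ratio 1. In practice the result
    would be rounded to an integer number of tokens.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return base_capacity * (1.0 + alpha * (1.0 - 2.0 * mask_ratio))

# Evenly spaced mask ratios from 0.0 to 1.0 (illustrative only).
ratios = [i / 10 for i in range(11)]
capacities = [capacity_for_mask_ratio(r, base_capacity=16) for r in ratios]
mean_capacity = sum(capacities) / len(capacities)

for r, c in zip(ratios, capacities):
    print(f"mask_ratio={r:.1f} capacity={c:.1f}")
```

The returned value could be passed as the `capacity` argument of an expert-choice router at each training or sampling step. Any monotone schedule with the same average would fit the description in the abstract equally well.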
