TL;DR
Nucleus-Image is a sparse mixture-of-experts (MoE) diffusion transformer for text-to-image generation that matches or exceeds leading models while activating only about 2B of its 17B parameters per forward pass.
Contribution
It presents a novel sparse MoE architecture with Expert-Choice Routing, optimized training strategies, and a large-scale dataset, advancing high-quality, efficient image generation.
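To make the routing idea concrete, here is a minimal NumPy sketch of expert-choice routing in its generic form: each expert selects its own top-scoring tokens, so every expert processes a fixed number of tokens. The names `expert_choice_route`, `moe_layer`, and `capacity`, and the toy `tanh` expert, are illustrative assumptions and are not taken from the Nucleus-Image code. The paper's decoupled, timestep-aware routing is not modelled here.

```python
import numpy as np

def expert_choice_route(x, w_gate, capacity):
    """Generic expert-choice routing (illustrative, not the paper's code).

    x: (n_tokens, d) token features; w_gate: (d, n_experts) router weights.
    Returns token indices and gate weights, each of shape (n_experts, capacity).
    """
    logits = x @ w_gate                                   # (n_tokens, n_experts)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)             # softmax over experts, per token
    scores = probs.T                                      # (n_experts, n_tokens)
    # Each expert picks its `capacity` highest-scoring tokens.
    idx = np.argsort(-scores, axis=1)[:, :capacity]
    gate = np.take_along_axis(scores, idx, axis=1)
    return idx, gate

def moe_layer(x, w_gate, expert_weights, capacity):
    """Apply toy experts to the tokens each expert chose and sum the gated outputs."""
    idx, gate = expert_choice_route(x, w_gate, capacity)
    out = np.zeros_like(x)
    for e, w in enumerate(expert_weights):
        y = np.tanh(x[idx[e]] @ w)                        # hypothetical one-matrix expert
        np.add.at(out, idx[e], gate[e][:, None] * y)      # scatter-add back to token slots
    return out, idx
```

Because experts choose tokens rather than tokens choosing experts, the load per expert is balanced by construction, but a given token may be picked by several experts or by none.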
Findings
Matches or exceeds leading models on GenEval, DPG-Bench, and OneIG-Bench.
Activates only approximately 2B of 17B total parameters per forward pass.
Achieves high-quality image generation without post-training optimization.
Abstract
We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a…
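The text KV sharing described in the abstract can be illustrated with a small sketch. It assumes that, because text tokens are not updated by the backbone, the text keys and values of a layer do not depend on the denoising timestep and can therefore be computed once and reused at every step. The class `JointAttention`, its single-head layout, and the counter `text_kv_computations` are hypothetical and only illustrate the caching pattern, not the authors' implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

class JointAttention:
    """Single-head joint attention: image queries attend over image and text keys."""

    def __init__(self, d, rng):
        scale = 1.0 / np.sqrt(d)
        self.wq = rng.standard_normal((d, d)) * scale
        self.wk = rng.standard_normal((d, d)) * scale
        self.wv = rng.standard_normal((d, d)) * scale
        self.wk_txt = rng.standard_normal((d, d)) * scale
        self.wv_txt = rng.standard_normal((d, d)) * scale
        self.text_kv = None
        self.text_kv_computations = 0   # counts how often text K/V are projected

    def cache_text(self, text):
        # Assumption: text K/V are timestep-independent, so project them once.
        self.text_kv = (text @ self.wk_txt, text @ self.wv_txt)
        self.text_kv_computations += 1

    def __call__(self, img):
        k_txt, v_txt = self.text_kv
        q = img @ self.wq
        k = np.concatenate([img @ self.wk, k_txt], axis=0)
        v = np.concatenate([img @ self.wv, v_txt], axis=0)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return attn @ v                  # only image tokens are updated
```

In a sampling loop, `cache_text` would be called once per prompt and the layer then applied at every denoising step, so the text projection cost is paid once rather than once per step.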
