MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts
Alexandros Christoforos, Chadbourne Davis

TL;DR
MoE-DiffuSeq is a scalable diffusion framework for long-document text generation. It combines sparse attention, a Mixture-of-Experts (MoE) architecture, and a soft absorbing state in the diffusion process to speed up training and sampling without sacrificing generation quality.
Contribution
It presents a novel diffusion-based architecture integrating sparse attention and MoE, with a soft absorbing state to enhance long-form text generation efficiency and coherence.
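To illustrate why sparse attention lowers cost on long sequences, the minimal Python sketch below compares how many token pairs are attended under dense attention and under a sliding-window pattern. The sliding-window pattern, the function names, and the toy sizes are assumptions made for illustration only; this summary does not specify which sparsity pattern MoE-DiffuSeq actually uses.

```python
def sliding_window_mask(seq_len, window):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumption: a symmetric local window of `window // 2` tokens on each side.
    This is one common sparse pattern, not necessarily the one used in the paper.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def count_attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Toy sizes chosen for illustration only.
seq_len, window = 16, 4
mask = sliding_window_mask(seq_len, window)

dense_pairs = seq_len * seq_len            # every token attends to every token
sparse_pairs = count_attended_pairs(mask)  # only pairs inside the local window

print(f"dense pairs:  {dense_pairs}")
print(f"sparse pairs: {sparse_pairs}")
```

Under these assumptions, the sparse pair count grows roughly linearly with sequence length for a fixed window, while the dense count grows quadratically.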
Findings
Outperforms prior models in training efficiency and inference speed.
Maintains high generation quality on long-document benchmarks.
Effective for scientific, code, and dialogue generation tasks.
Abstract
We propose MoE-DiffuSeq, a diffusion-based framework for efficient long-form text generation that integrates sparse attention with a Mixture-of-Experts (MoE) architecture. Existing sequence diffusion models suffer from prohibitive computational and memory costs when scaling to long documents, largely due to dense attention and slow iterative reconstruction. MoE-DiffuSeq addresses these limitations by combining expert routing with a tailored sparse attention mechanism, substantially reducing attention complexity while preserving global coherence and textual fidelity. In addition, we introduce a soft absorbing state within the diffusion process that reshapes attention dynamics during denoising, enabling faster sequence reconstruction and more precise token refinement. This design accelerates both training and sampling without sacrificing generation quality. Extensive…
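The expert routing mentioned in the abstract can be sketched in a few lines of Python. The example below shows generic top-k gating, in which a router scores each expert, only the k best experts run, and their outputs are mixed by renormalized gate weights. The router logits, the toy experts, the choice of k = 2, and all function names are hypothetical; the abstract does not describe the routing scheme in this detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts on x and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for idx, gate in gates.items():
        expert_out = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, gates


# Hypothetical toy experts standing in for feed-forward sub-networks.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
    lambda v: [0.0 for _ in v],
]

# Hypothetical router scores for a single token representation.
output, gates = moe_forward([1.0, 2.0], experts, router_logits=[2.0, 1.0, 0.1, -1.0], k=2)
print("selected experts and weights:", gates)
print("mixed output:", output)
```

Because only k of the experts are evaluated per token, compute per token stays roughly constant as more experts are added, which is the usual motivation for sparse MoE layers.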
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Advanced Text Analysis Techniques
