Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
Aihua Li

TL;DR
This paper introduces MoE-FM, a mixture-of-experts flow matching framework, and uses it for fast, high-quality non-autoregressive language generation, reporting large speedups over autoregressive and diffusion-based language models.
Contribution
The paper proposes MoE-FM to better model complex latent distributions, and builds on it to develop YAN, a non-autoregressive language model instantiated with both Transformer and Mamba architectures for fast inference.
Findings
YAN matches autoregressive and diffusion model quality
YAN requires as few as three sampling steps
YAN yields up to 40x speedup over AR models and 1000x over diffusion models
Abstract
Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring…
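To make the idea of "decomposing global transport into locally specialized vector fields" concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `MoEVectorField`, the linear experts, the softmax gate, the straight-line interpolation path, and the Euler sampler are all assumptions chosen for illustration. It shows a gate-weighted mixture of expert velocity fields, the standard conditional flow matching regression target for a linear path, and a three-step sampler in the spirit of the few-step sampling the summary reports.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MoEVectorField:
    """Hypothetical mixture of K linear expert velocity fields with a softmax gate."""

    def __init__(self, dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Gate and experts both see the latent x concatenated with time t.
        self.W_gate = rng.normal(scale=0.1, size=(dim + 1, num_experts))
        self.W_exp = rng.normal(scale=0.1, size=(num_experts, dim + 1, dim))

    def _inputs(self, x, t):
        # Append the scalar time t as an extra feature: (B, D) -> (B, D + 1).
        return np.concatenate([x, np.full((x.shape[0], 1), t)], axis=1)

    def gate(self, x, t):
        # Per-sample mixture weights over experts, shape (B, K).
        return softmax(self._inputs(x, t) @ self.W_gate)

    def __call__(self, x, t):
        xt = self._inputs(x, t)
        g = softmax(xt @ self.W_gate)                   # (B, K)
        v = np.einsum("bi,kid->bkd", xt, self.W_exp)    # (B, K, D) expert velocities
        return np.einsum("bk,bkd->bd", g, v)            # gate-weighted sum, (B, D)


def cfm_loss(field, x0, x1, t):
    # Conditional flow matching loss for a straight-line path (an assumption here):
    # x_t = (1 - t) * x0 + t * x1, with target velocity x1 - x0.
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((field(xt, t) - target) ** 2)


def euler_sample(field, x0, num_steps=3):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with a few Euler steps.
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * field(x, i * dt)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    field = MoEVectorField(dim=4, num_experts=3)
    noise = rng.normal(size=(5, 4))          # source samples
    data = rng.normal(size=(5, 4)) + 2.0     # stand-in "latent" targets
    print("loss:", cfm_loss(field, noise, data, t=0.5))
    print("sample shape:", euler_sample(field, noise, num_steps=3).shape)
```

The field here is untrained, so the samples are meaningless; the point is only the structure. In a real model the experts and gate would be neural networks, and how MoE-FM routes, trains, and decodes latents into tokens is specified in the paper rather than in this sketch.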
