Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

Aihua Li

arXiv:2604.15009·cs.AI·April 17, 2026

TL;DR

This paper introduces MoE-FM, a mixture-of-experts flow matching framework, and uses it to build YAN, a non-autoregressive language model that reports quality on par with autoregressive and diffusion baselines at up to 40x and 1000x faster inference, respectively.

Contribution

The paper proposes MoE-FM, which models complex latent distributions by decomposing the global transport into locally specialized vector fields. Building on it, the paper develops YAN, a non-autoregressive language model instantiated with both Transformer and Mamba architectures.

Findings

01

YAN achieves generation quality on par with autoregressive and diffusion-based non-autoregressive language models

02

YAN requires as few as three sampling steps

03

YAN yields up to a 40x speedup over autoregressive models and up to 1000x over diffusion language models

Abstract

Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring…
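The idea of decomposing a global transport into locally specialized vector fields, integrated in a handful of steps, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the distance-based softmax gate, the closed-form straight-line experts that each pull toward a fixed center, the plain Euler integrator, and all function names. The paper's experts and gate are learned networks whose details are not given in the text above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(center):
    """Hypothetical expert: straight-line velocity toward a fixed center.

    For the linear path x_t = (1 - t) * x_0 + t * x_1, the velocity at x_t
    is (x_1 - x_t) / (1 - t); here x_1 is replaced by the expert's center.
    """
    def velocity(x, t):
        scale = 1.0 / max(1.0 - t, 1e-6)  # guard against division by zero at t = 1
        return [(c - xi) * scale for c, xi in zip(center, x)]
    return velocity


def distance_gate(centers, temperature=1.0):
    """Hypothetical gate: softmax over negative squared distances to the centers."""
    def gate(x, t):
        logits = [
            -sum((xi - c) ** 2 for xi, c in zip(x, center)) / temperature
            for center in centers
        ]
        return softmax(logits)
    return gate


def moe_velocity(x, t, experts, gate):
    """Mixture vector field: gate-weighted sum of the expert velocities."""
    weights = gate(x, t)
    out = [0.0] * len(x)
    for w, expert in zip(weights, experts):
        v = expert(x, t)
        out = [o + w * vi for o, vi in zip(out, v)]
    return out


def euler_sample(x0, experts, gate, steps=3):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    h = 1.0 / steps
    for i in range(steps):
        t = i * h
        v = moe_velocity(x, t, experts, gate)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


centers = [[-2.0, 0.0], [2.0, 0.0]]
experts = [make_expert(c) for c in centers]
gate = distance_gate(centers)
sample = euler_sample([0.5, 0.1], experts, gate, steps=3)
```

In this toy setting, a starting point closer to the second center is routed almost entirely to the second expert, so three Euler steps carry it near that center. A single unimodal field would instead have to represent both modes with one function, which is the kind of limitation the abstract attributes to plain flow matching.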