Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

Muhammad Shahir Abdurrahman; Chun Deng; Azalia Mirhoseini; Philip Levis

arXiv:2605.06206·cs.LG·May 8, 2026

TL;DR

The paper introduces Federation of Experts (FoE), an architecture that restructures Mixture-of-Experts (MoE) layers to reduce the communication bottleneck in distributed LLM inference, cutting end-to-end latency by up to 5.2x while keeping generation quality comparable to traditional MoE models.

Contribution

FoE restructures MoE layers to minimize communication overhead, enabling efficient inference in distributed LLMs with up to 5.2x latency reduction.

Findings

01. FoE reduces end-to-end latency by up to 5.2x.

02. FoE maintains generation quality comparable to that of traditional MoE models.

03. FoE significantly improves throughput and latency in both single-node and multi-node settings.

Abstract

Mixture of Experts (MoE) has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads, and expert parallelism is applied among the experts within that cluster. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication, as all experts within a cluster are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication…
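
The data flow described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper names (`make_cluster`, `foe_block`), the top-1 routing, the plain linear map standing in for one KV head's attention, and the final sum of cluster MoE outputs are all assumptions made for illustration. It shows only that the single cross-cluster operation is a sum over partial post-attention outputs, while routing and expert computation stay local to each cluster.

```python
import numpy as np

def make_cluster(rng, d_model, n_experts):
    """Hypothetical parameters for one cluster (all names are illustrative)."""
    return {
        # Stand-in for the attention computed by this cluster's single KV head.
        "attn_w": rng.standard_normal((d_model, d_model)) * 0.1,
        # Router that scores this cluster's local experts.
        "router_w": rng.standard_normal((d_model, n_experts)),
        # One weight matrix per local expert (assumed to share a device).
        "experts": rng.standard_normal((n_experts, d_model, d_model)) * 0.1,
    }

def foe_block(x, clusters):
    """Sketch of one FoE-style block over tokens x of shape (tokens, d_model)."""
    # Each cluster computes its partial post-attention output locally.
    partials = [x @ c["attn_w"] for c in clusters]
    # A sum across clusters synchronizes the post-attention residual.
    # In a real deployment this would be a collective such as an all-reduce.
    residual = x + np.sum(partials, axis=0)

    # Every cluster routes the shared residual to its own experts only,
    # so no token leaves the cluster (assumption: top-1 routing).
    moe_out = np.zeros_like(residual)
    for c in clusters:
        choice = np.argmax(residual @ c["router_w"], axis=-1)  # (tokens,)
        weights = c["experts"][choice]                         # (tokens, d, d)
        moe_out += np.einsum("td,tde->te", residual, weights)
    # Assumption: cluster MoE outputs are combined by summation.
    return residual + moe_out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))
clusters = [make_cluster(rng, d_model=8, n_experts=3) for _ in range(2)]
output = foe_block(tokens, clusters)
```

In this sketch the only step that crosses cluster boundaries is the `np.sum` over `partials`; the per-cluster loop touches only that cluster's router and experts, which mirrors the abstract's claim that dispatch does not require all-to-all traffic between clusters.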
