FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
Yonatan Dukler, Guihong Li, Deval Shah, Vikram Appia, Emad Barsoum

TL;DR
FarSkip-Collective is an architectural modification that lets large Mixture of Experts models overlap communication with computation, accelerating training and inference while keeping accuracy on par with the original models.
Contribution
The paper introduces FarSkip-Collective, a method that modifies a model's architecture so that communication can overlap with computation, and shows that converted models from 16B to 109B parameters reach accuracy on par with their original open-source releases.
Findings
Converted models reach accuracy on par with their original open-source releases, even when all model layers are modified.
Enables overlapping communication with computation, accelerating training and inference.
Applied to state-of-the-art models from 16B to 109B parameters, including Llama 4 Scout (109B), which is converted via self-distillation.
Abstract
Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model, and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and when all of the model layers are modified. We answer this question in the affirmative and fully convert a series of state-of-the-art models, ranging from 16B to 109B parameters, to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction tuned…
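The abstract does not spell out the exact rewiring, so the following is only a toy sketch of the general idea of hiding a collective behind compute, not the authors' implementation. It assumes a simplified form of the modification in which each block reads the residual stream without the previous block's output, which is still "in flight". All names here (`fake_collective`, `fake_block`, `blocking_forward`, `overlapped_forward`, the delay constants) are hypothetical, and `time.sleep` stands in for real communication and compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor

COMM_DELAY = 0.02     # hypothetical latency of one collective (e.g. all-to-all)
COMPUTE_DELAY = 0.02  # hypothetical latency of one block's local compute


def fake_collective(values):
    """Stand-in for a collective: returns its input unchanged after a delay."""
    time.sleep(COMM_DELAY)
    return list(values)


def fake_block(values, scale):
    """Stand-in for a block's local computation (attention or expert MLP)."""
    time.sleep(COMPUTE_DELAY)
    return [scale * v for v in values]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def blocking_forward(x, scales):
    """Standard residual stream: every block waits for the previous collective."""
    for s in scales:
        x = add(x, fake_collective(fake_block(x, s)))
    return x


def overlapped_forward(x, scales, pool):
    """Assumed modified skip connection: each block computes on the residual
    stream that is already available, while the previous output is in flight."""
    pending = None  # future holding the previous block's communicated output
    for s in scales:
        out = fake_block(x, s)            # runs concurrently with `pending`
        if pending is not None:
            x = add(x, pending.result())  # fold in the late contribution
        pending = pool.submit(fake_collective, out)
    return x if pending is None else add(x, pending.result())


if __name__ == "__main__":
    x0, scales = [1.0, 2.0], [0.5, 0.5]
    t = time.perf_counter()
    ref = blocking_forward(x0, scales)
    t_block = time.perf_counter() - t
    with ThreadPoolExecutor(max_workers=1) as pool:
        t = time.perf_counter()
        mod = overlapped_forward(x0, scales, pool)
        t_over = time.perf_counter() - t
    print(f"blocking:   {ref} in {t_block:.3f}s")
    print(f"overlapped: {mod} in {t_over:.3f}s")
```

In this toy, the overlapped loop hides every collective except the last one behind the next block's compute, but it also computes a different function: for the input above it returns `[2.0, 4.0]` where the blocking loop returns `[2.25, 4.5]`. That gap mirrors the question the abstract raises, namely whether the modified architecture can remain as capable, and is the kind of mismatch that the self-distillation step mentioned there would have to close.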
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
