Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model

Mu-Chi Chen; Po-Hsuan Huang; Xiangrui Ke; Chia-Heng Tu; Chun Jason Xue; Shih-Hao Hung

arXiv:2506.23635 · cs.DC · July 1, 2025

TL;DR

This paper demonstrates that multi-node expert parallelism on Apple Silicon can effectively reduce inference time and cost for private large language models, with optimized memory management improving efficiency.

Contribution

The paper introduces a cost-efficient Mac Studio cluster setup for hosting MoE-based LLMs and develops optimization schemes that reduce memory management overhead.

Findings

01

Parallel expert execution reduces inference time significantly.

02

Network latency matters more than bandwidth when exchanging expert outputs between nodes.

03

Optimizations improve cost-efficiency, making Mac clusters competitive with high-end supercomputers.

Abstract

Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small-group services, the use case targeted by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant…
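The abstract's point about latency versus bandwidth can be illustrated with a simple first-order cost model, in which the time to send one message is a fixed latency plus the message size divided by the bandwidth. The sketch below is not taken from the paper: the function names, the hidden size, the latency, and the bandwidth are all hypothetical values chosen only to show why small per-token expert outputs can make latency the dominant term.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: fixed per-message latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_share(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the total transfer time spent on fixed latency."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical numbers for illustration only (not measured in the paper).
HIDDEN_DIM = 4096          # assumed size of one expert output vector
BYTES_PER_VALUE = 2        # assumed 16-bit values
MESSAGE_BYTES = HIDDEN_DIM * BYTES_PER_VALUE
LATENCY_S = 100e-6         # assumed 100 microseconds per message
BANDWIDTH = 1.25e9         # assumed 10 Gb/s link, in bytes per second

if __name__ == "__main__":
    base = transfer_time(MESSAGE_BYTES, LATENCY_S, BANDWIDTH)
    faster_link = transfer_time(MESSAGE_BYTES, LATENCY_S, 2 * BANDWIDTH)
    lower_latency = transfer_time(MESSAGE_BYTES, LATENCY_S / 2, BANDWIDTH)
    share = latency_share(MESSAGE_BYTES, LATENCY_S, BANDWIDTH)
    print(f"baseline:        {base * 1e6:.1f} us (latency share {share:.0%})")
    print(f"2x bandwidth:    {faster_link * 1e6:.1f} us")
    print(f"half latency:    {lower_latency * 1e6:.1f} us")
```

Under these assumed values, halving the latency shortens the transfer far more than doubling the bandwidth, which is the qualitative behavior the abstract describes. Real figures depend on the interconnect and on how many tokens are batched per message.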
