Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Jungwoo Kim, Rubens Lacouture, Genghan Zhang, Gina Sohn, Qizheng Zhang, Swapnil Gandhi, Christos Kozyrakis, Kunle Olukotun

TL;DR
This paper introduces Sieve, a runtime framework whose scheduler dynamically partitions expert execution between GPU and PIM for evolving mixture-of-experts models, improving throughput and interactivity by up to 1.6x over state-of-the-art PIM systems.
Contribution
It presents a novel runtime framework and scheduler that adaptively partition expert execution between GPU and PIM based on runtime token-to-expert distributions, addressing load imbalance and inter-GPU communication overhead.
Findings
Sieve improves throughput and interactivity by up to 1.6x over state-of-the-art PIM systems.
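Modern MoE models exhibit increasingly bimodal token-to-expert distributions: a small set of experts receives many tokens while a long tail receives only one or a few. The resulting disparity in arithmetic intensity across experts affects PIM efficiency.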
The proposed scheduler effectively balances load and reduces communication overhead.
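To make the bimodal distribution concrete, the sketch below estimates the arithmetic intensity of each expert from its token count and splits experts between GPU and PIM with a fixed token threshold. This is an illustration only, not Sieve's scheduler: the fixed threshold, the two-matrix expert model, and the names `expert_arithmetic_intensity` and `partition_experts` are all assumptions, and the sketch ignores the load imbalance and inter-GPU communication that the paper's scheduler is described as handling.

```python
from typing import Dict, List, Tuple


def expert_arithmetic_intensity(
    tokens: int, d_model: int, d_ff: int, bytes_per_param: int = 2
) -> float:
    """Approximate FLOPs per byte of weight traffic for one expert.

    Assumption: the expert is two dense matrices (d_model x d_ff and
    d_ff x d_model), and only weight bytes are counted (activations ignored).
    """
    flops = 2 * tokens * (2 * d_model * d_ff)  # 2 FLOPs per multiply-add
    weight_bytes = 2 * d_model * d_ff * bytes_per_param
    return flops / weight_bytes


def partition_experts(
    token_counts: Dict[int, int], threshold: int
) -> Tuple[List[int], List[int]]:
    """Hypothetical static rule: hot experts go to GPU, the long tail to PIM.

    Experts that received no tokens are skipped entirely.
    """
    gpu = sorted(e for e, n in token_counts.items() if n >= threshold)
    pim = sorted(e for e, n in token_counts.items() if 0 < n < threshold)
    return gpu, pim


# Toy bimodal routing result: two hot experts and a tail of 1-2 token experts.
counts = {0: 120, 1: 1, 2: 2, 3: 95, 4: 0, 5: 1}
gpu_experts, pim_experts = partition_experts(counts, threshold=16)
intensities = {
    e: expert_arithmetic_intensity(n, d_model=4096, d_ff=14336)
    for e, n in counts.items()
    if n > 0
}
```

Under these assumptions the intensity grows linearly with token count, so a one-token expert does about one FLOP per weight byte while a 120-token expert does about 120. That gap is why a single static offload rule fits neither group well.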
Abstract
Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models (LLMs). However, the execution characteristics of MoE inference are changing rapidly and increasingly mismatch the assumptions underlying existing Processing-in-Memory (PIM) systems. Prior PIM systems for LLMs rely on static rules to offload memory-bound operations to PIM, without accounting for the combined effects of load imbalance and inter-GPU communication. Meanwhile, modern MoE models activate fewer experts out of increasingly many, creating a bimodal expert distribution: a small set of experts receives many tokens, while a long tail of experts receives only one or a few. We identify a trend in modern MoE models toward increasingly bimodal token-to-expert distributions, quantify the resulting disparity in arithmetic intensity across experts, and show that this disparity dramatically…
