Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching
Hanyuan Gao, Xiaoxuan Yang

TL;DR
This paper introduces an area-efficient in-memory computing architecture for Mixture-of-Experts transformers, leveraging multiplexing, expert grouping, and caching to significantly improve area, performance, and energy efficiency on process-in-memory hardware.
Contribution
It proposes a novel multiplexing and caching strategy that reduces area overhead and enhances efficiency for MoE transformers on PIM architectures.
Findings
Area efficiency improved by up to 2.2x over state-of-the-art.
Performance and energy efficiency during generation increased by 4.2x and 10.1x, respectively.
Total performance density reached 15.6 GOPS/W/mm².
Abstract
Mixture-of-Experts (MoE) layers activate a subset of model weights, dubbed experts, to improve model performance. MoE is particularly promising for deployment on process-in-memory (PIM) architectures, because PIM can naturally fit experts separately and provide great benefits for energy efficiency. However, PIM chips often suffer from large area overhead, especially in the peripheral circuits. In this paper, we propose an area-efficient in-memory computing architecture for MoE transformers. First, to reduce area, we propose a crossbar-level multiplexing strategy that exploits MoE sparsity: experts are deployed on crossbars and multiple crossbars share the same peripheral circuits. Second, we propose expert grouping and group-wise scheduling methods to alleviate the load imbalance and contention overhead caused by sharing. In addition, to address the problem that the expert choice router…
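To make the load-imbalance concern concrete, the following is a minimal sketch of one way experts could be grouped so that crossbars sharing a peripheral circuit see similar total traffic. It is not the paper's grouping or scheduling method: it swaps in a generic greedy longest-processing-time heuristic, and the names `group_experts`, `group_loads`, and the per-expert token counts are all hypothetical. It also does not constrain how many experts land in each group, which a real design with a fixed number of crossbars per peripheral would likely need.

```python
import heapq


def group_experts(loads, num_groups):
    """Assign experts to groups with a greedy LPT heuristic (illustrative only).

    loads[e] is an assumed token count routed to expert e. Each group stands
    for a set of crossbars sharing one peripheral circuit (hypothetical model).
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    # Min-heap of (total_load, group_index); ties go to the lower group index.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    # Visit experts from heaviest to lightest and place each one in the
    # group that currently has the smallest total load.
    for expert in sorted(range(len(loads)), key=lambda e: -loads[e]):
        total, g = heapq.heappop(heap)
        groups[g].append(expert)
        heapq.heappush(heap, (total + loads[expert], g))
    return groups


def group_loads(loads, groups):
    """Return the summed load of each group."""
    return [sum(loads[e] for e in members) for members in groups]


if __name__ == "__main__":
    # Made-up token counts for 8 experts, grouped onto 4 shared peripherals.
    token_counts = [90, 10, 40, 60, 30, 70, 20, 80]
    grouping = group_experts(token_counts, num_groups=4)
    print(grouping)
    print(group_loads(token_counts, grouping))
```

Pairing heavily used experts with lightly used ones keeps any single shared peripheral from becoming a bottleneck, which is the intuition behind grouping under sharing; the actual policy in the paper may differ.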
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques
