Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching

Hanyuan Gao; Xiaoxuan Yang

arXiv:2602.10254 · cs.AR · February 12, 2026

TL;DR

This paper introduces an area-efficient in-memory computing architecture for Mixture-of-Experts transformers, leveraging multiplexing, expert grouping, and caching to significantly improve area, performance, and energy efficiency on process-in-memory hardware.

Contribution

It proposes a novel multiplexing and caching strategy that reduces area overhead and enhances efficiency for MoE transformers on PIM architectures.

Findings

01. Area efficiency improved by up to 2.2x over the state of the art.

02. Performance and energy efficiency during generation increased by 4.2x and 10.1x, respectively.

03. Total performance density reached 15.6 GOPS/W/mm².

Abstract

Mixture-of-Experts (MoE) layers activate a subset of model weights, dubbed experts, to improve model performance. MoE is particularly promising for deployment on process-in-memory (PIM) architectures, because PIM can naturally fit experts separately and provide great benefits for energy efficiency. However, PIM chips often suffer from large area overhead, especially in the peripheral circuits. In this paper, we propose an area-efficient in-memory computing architecture for MoE transformers. First, to reduce area, we propose a crossbar-level multiplexing strategy that exploits MoE sparsity: experts are deployed on crossbars and multiple crossbars share the same peripheral circuits. Second, we propose expert grouping and group-wise scheduling methods to alleviate the load imbalance and contention overhead caused by sharing. In addition, to address the problem that the expert choice router…
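
The grouping idea in the abstract can be pictured with a small sketch. This is not the paper's algorithm: the activation frequencies, the greedy balancing heuristic, the equal-capacity groups, and the function names `group_experts` and `serialized_rounds` are all assumptions made for illustration. The sketch assumes each group of crossbars shares one set of peripheral circuits, so experts in the same group must run one after another, and it tries to spread frequently used experts across different groups.

```python
from collections import Counter


def group_experts(freqs, num_groups):
    """Greedy balancing (hypothetical heuristic): place the most frequently
    activated experts first, each into the least-loaded group that still has room."""
    capacity = -(-len(freqs) // num_groups)  # ceiling division: experts per group
    loads = [0.0] * num_groups
    members = [[] for _ in range(num_groups)]
    for expert in sorted(range(len(freqs)), key=lambda e: freqs[e], reverse=True):
        open_groups = [g for g in range(num_groups) if len(members[g]) < capacity]
        target = min(open_groups, key=lambda g: loads[g])
        members[target].append(expert)
        loads[target] += freqs[expert]
    return members


def serialized_rounds(selected, members):
    """Experts that share peripherals cannot run at the same time, so the number
    of sequential rounds is the largest count of selected experts in one group."""
    group_of = {e: g for g, group in enumerate(members) for e in group}
    counts = Counter(group_of[e] for e in selected)
    return max(counts.values()) if counts else 0


# Assumed activation frequencies for 8 experts (illustrative numbers only).
freqs = [0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02]
members = group_experts(freqs, num_groups=4)
print(members)                             # [[0, 7], [1, 6], [2, 5], [3, 4]]
print(serialized_rounds([0, 1], members))  # 1: the two experts sit in different groups
print(serialized_rounds([0, 7], members))  # 2: both experts share one peripheral
```

Under these assumptions, pairing a hot expert with a cold one makes it less likely that a token's selected experts collide on the same shared peripheral, which is the kind of contention the abstract says grouping and group-wise scheduling are meant to alleviate.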

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques