Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads
Boyu Li, Zongwei Zhu, Yi Xiong, Qianyue Cao, Jiawei Geng, Xiaonan Zhang, Xi Li

TL;DR
This paper introduces Compass, a framework that searches for efficient mappings of LLM inference serving workloads onto multi-chiplet accelerators. It handles dynamic request behavior (mixed request types and variable sequence lengths) and reduces the energy-delay product.
Contribution
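The paper presents a computation execution graph-based mapping encoding scheme that decouples micro-batches and layers, together with Compass, a search framework that pairs an evaluation engine with a genetic algorithm-based mapping generation engine for LLM inference serving workloads.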
Findings
Achieves an average 63.12% reduction in energy-delay product compared with state-of-the-art works.
Supports dynamic mixed request types and variable sequence lengths.
Enables fine-grained execution control on heterogeneous chiplets.
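
For readers unfamiliar with the metric, energy-delay product is conventionally the product of energy consumed and execution time, so a lower value is better. The following is a minimal sketch of how a percentage reduction in it could be computed. The function names and the numbers in the usage lines are illustrative assumptions, not values or code from the paper.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Conventional EDP: energy multiplied by delay (lower is better)."""
    return energy_joules * delay_seconds


def edp_reduction_percent(baseline_edp: float, new_edp: float) -> float:
    """Relative EDP reduction of a new mapping versus a baseline, in percent."""
    if baseline_edp <= 0:
        raise ValueError("baseline EDP must be positive")
    return 100.0 * (baseline_edp - new_edp) / baseline_edp


# Made-up numbers purely for illustration (not measurements from the paper).
baseline = energy_delay_product(energy_joules=10.0, delay_seconds=2.0)
improved = energy_delay_product(energy_joules=5.0, delay_seconds=1.5)
reduction = edp_reduction_percent(baseline, improved)
```

Because the metric multiplies the two quantities, a mapping can lower it by saving energy, by shortening delay, or by both at once.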
Abstract
Large Language Models (LLMs) impose massive computational demands, driving the need for scalable multi-chiplet accelerators. However, existing mapping space exploration efforts for such accelerators primarily focus on traditional CNN/Transformer workloads and fail to adequately support the dynamic behaviors of mixed request types and variable sequence lengths in real-world LLM inference serving. To bridge this gap, we first propose a computation execution graph-based mapping encoding scheme that decouples micro-batches and layers, enabling fine-grained execution control on heterogeneous chiplets and flexibly representing various parallelism strategies. Second, building upon this scheme, we develop the Compass framework, which integrates an evaluation engine and a genetic algorithm-based mapping generation engine to achieve efficient mapping search. Compared to state-of-the-art works,…
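
To make the idea of a genetic algorithm searching over a per-micro-batch, per-layer mapping more concrete, here is a toy sketch. It is not the Compass implementation: the encoding (a grid assigning each micro-batch/layer node to a chiplet), the cost model, the chiplet parameters, and every name below are hypothetical assumptions chosen for illustration. The cost model also ignores graph dependencies and inter-chiplet communication, which a real evaluation engine would need to account for.

```python
import random

# Hypothetical heterogeneous chiplets: (time per unit of work, energy per unit of work).
CHIPLETS = [(1.0, 3.0), (2.0, 1.0), (1.5, 1.5)]
# Hypothetical work per (micro-batch, layer) node: 2 micro-batches x 3 layers.
WORK = [[4, 2, 3], [1, 5, 2]]


def edp(mapping):
    """Toy cost: delay is the busiest chiplet's total time, energy is summed over nodes."""
    busy = [0.0] * len(CHIPLETS)
    energy = 0.0
    for mb, row in enumerate(WORK):
        for layer, work in enumerate(row):
            chip = mapping[mb][layer]
            time_per_unit, energy_per_unit = CHIPLETS[chip]
            busy[chip] += work * time_per_unit
            energy += work * energy_per_unit
    return energy * max(busy)


def random_mapping(rng):
    """Assign every (micro-batch, layer) node to a random chiplet."""
    return [[rng.randrange(len(CHIPLETS)) for _ in row] for row in WORK]


def crossover(a, b, rng):
    """Uniform crossover: each gene is taken from one of the two parents."""
    return [[rng.choice(pair) for pair in zip(ra, rb)] for ra, rb in zip(a, b)]


def mutate(mapping, rng, rate=0.1):
    """Reassign each node to a random chiplet with probability `rate`."""
    return [
        [rng.randrange(len(CHIPLETS)) if rng.random() < rate else gene for gene in row]
        for row in mapping
    ]


def search(generations=50, pop_size=20, seed=0):
    """Elitist genetic search that returns the lowest-EDP mapping found."""
    rng = random.Random(seed)
    pop = [random_mapping(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=edp)
        elite = pop[: pop_size // 4]  # keep the best quarter unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        pop = elite + children
    return min(pop, key=edp)


best = search()
```

Keeping the elite unchanged each generation means the best cost in the population never gets worse, which makes the toy search easy to reason about. Separating micro-batches and layers into independent genes is what lets one layer of one micro-batch move to a different chiplet without touching the rest of the mapping.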
