LaMoSys3.5D: Enabling 3.5D-IC-Based Large Language Model Inference Serving Systems via Hardware/Software Co-Design
Qipan Wang, Zhe Zhang, Shuangchen Li, Hongzhong Zheng, Zheng Liang, Yibo Lin, Runsheng Wang, Ru Huang

TL;DR
LaMoSys3.5D introduces a scalable 3.5D-IC architecture combining heterogeneous chiplets for efficient large language model inference, optimizing dataflow, parallel mapping, and thermal management for high throughput and energy efficiency.
Contribution
This work presents the first scalable 3.5D-IC architecture specifically designed for LLM inference serving, integrating hardware/software co-design for end-to-end efficiency.
Findings
62% improvement in throughput per watt over DGX-A100
4.87x better end-to-end latency compared to prior 3D designs
Effective design guidelines for 3.5D-IC architectures
Abstract
The success of large language models (LLMs) amplifies the need for high-throughput, energy-efficient inference at scale. 3D-DRAM-based accelerators provide high memory bandwidth and therefore an opportunity to accelerate the bandwidth-bound decode phase. However, how to adequately balance compute density for prefill with bandwidth/capacity for decode remains open. Moreover, most prior designs do not target end-to-end serving, leaving the co-design of dataflow, parallel mapping, and scheduling underexplored. To bridge the gap, we present LaMoSys3.5D, to our knowledge the first scalable 3.5D-IC architecture for LLM serving. LaMoSys3.5D composes heterogeneous 3D-DRAM chiplets on a 2.5D interposer: compute-rich chiplets for prefill and bandwidth-/capacity-rich chiplets for decode. To realize efficient serving, we adopt a hardware/software co-design spanning dataflow, parallel mapping, and introduce a…
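The abstract's distinction between compute-hungry prefill and bandwidth-bound decode can be illustrated with a rough roofline-style estimate. The sketch below is not taken from the paper: the layer size, the fp16 element width, and the peak compute and bandwidth figures are hypothetical values chosen only to show the arithmetic, and the function names are illustrative.

```python
def arithmetic_intensity(n_tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one dense layer y = x @ W.

    Assumes weights, inputs, and outputs each cross the memory
    interface once (a simplification that ignores caching and KV traffic).
    """
    flops = 2 * n_tokens * d_in * d_out  # one multiply and one add per weight per token
    bytes_moved = bytes_per_elem * (d_in * d_out + n_tokens * d_in + n_tokens * d_out)
    return flops / bytes_moved


def is_bandwidth_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """A kernel is bandwidth-bound when its intensity is below the machine balance."""
    return intensity < peak_flops / mem_bandwidth


# Hypothetical accelerator: 300 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
MEM_BANDWIDTH = 2e12

prefill = arithmetic_intensity(n_tokens=2048, d_in=4096, d_out=4096)  # many prompt tokens at once
decode = arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)      # one new token per step

print(f"prefill: {prefill:.1f} FLOP/byte, bandwidth-bound: "
      f"{is_bandwidth_bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.2f} FLOP/byte, bandwidth-bound: "
      f"{is_bandwidth_bound(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumed numbers, prefill reuses each weight across thousands of tokens and sits well above the machine balance, while single-token decode performs roughly one FLOP per byte and is limited by memory bandwidth, which is the imbalance that motivates pairing compute-rich and bandwidth-rich chiplets.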
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Machine Learning in Materials Science · Natural Language Processing Techniques
