Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM
Zhongkai Yu, Shengwen Liang, Tianyun Ma, Yunke Cai, Ziyuan Nan, Di Huang, Xinkai Song, Yifan Hao, Jie Zhang, Tian Zhi, Yongwei Zhao, Zidong Du, Xing Hu, Qi Guo, Tianshi Chen

TL;DR
Cambricon-LLM introduces a chiplet-based hybrid architecture that pairs an NPU with a dedicated NAND flash chip to run 70B LLMs efficiently on edge devices, overcoming their memory-capacity and bandwidth limits.
Contribution
This work presents a new hybrid hardware architecture with hardware-tiling, in-flash computing, and on-die ECC techniques for efficient large language model inference on edge devices.
Findings
Achieves 3.44 tokens/s for on-device inference of 70B LLMs.
Delivers a 22× to 45× speedup over existing flash-offloading methods.
Demonstrates feasibility of deploying large LLMs on resource-constrained devices.
Abstract
Deploying advanced large language models on edge devices, such as smartphones and robotics, is a growing trend that enhances user data privacy and network connectivity resilience while preserving intelligent capabilities. However, such a task exhibits single-batch computing with incredibly low arithmetic intensity, which poses the significant challenges of huge memory footprint and bandwidth demands on limited edge resources. To address these issues, we introduce Cambricon-LLM, a chiplet-based hybrid architecture with NPU and a dedicated NAND flash chip to enable efficient on-device inference of 70B LLMs. Such a hybrid architecture utilizes both the high computing capability of NPU and the data capacity of the NAND flash chip, with the proposed hardware-tiling strategy that minimizes the data movement overhead between NPU and NAND flash chip. Specifically, the NAND flash chip, enhanced…
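The footprint and bandwidth pressure described in the abstract can be made concrete with a back-of-the-envelope estimate. The sketch below is not taken from the paper: it assumes 1 byte per weight (INT8 quantization), assumes every weight is read exactly once per generated token, and ignores the KV cache and activations. The function names are hypothetical and only illustrate the arithmetic.

```python
# Rough estimate for single-batch LLM decoding.
# Assumptions (not from the paper): 1 byte per weight, every weight read
# once per generated token, KV cache and activations ignored.

def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def required_bandwidth_gb_per_s(params_billion: float,
                                bytes_per_param: float,
                                tokens_per_s: float) -> float:
    """Weight-read bandwidth needed to sustain a given decoding speed."""
    return weight_footprint_gb(params_billion, bytes_per_param) * tokens_per_s


def arithmetic_intensity(bytes_per_param: float, batch_size: int = 1) -> float:
    """FLOPs per weight byte for a matrix-vector product.

    Each weight contributes one multiply and one add (2 FLOPs) per
    sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


if __name__ == "__main__":
    footprint = weight_footprint_gb(70, 1.0)
    bandwidth = required_bandwidth_gb_per_s(70, 1.0, 3.44)
    intensity = arithmetic_intensity(1.0, batch_size=1)
    print(f"weights: {footprint:.0f} GB")
    print(f"bandwidth at 3.44 tokens/s: {bandwidth:.1f} GB/s")
    print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")
```

Under these assumptions, a 70B model needs about 70 GB for weights alone, and sustaining the reported 3.44 tokens/s implies reading weights at roughly 240 GB/s while performing only about 2 FLOPs per byte read.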
Taxonomy
Topics: VLSI and Analog Circuit Testing · Advancements in Photolithography Techniques · Integrated Circuits and Semiconductor Failure Analysis
