Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models
Huwan Peng, Scott Davidson, Richard Shi, Shuaiwen Leon Song, Michael Taylor

TL;DR
This paper introduces Chiplet Cloud, a scalable, cost-efficient ASIC architecture for large language model serving, achieving significant TCO improvements over traditional GPU and TPU cloud solutions.
Contribution
It proposes a novel chiplet-based architecture with a specialized memory system and a co-design methodology to optimize LLM serving performance and cost.
Findings
97x TCO reduction compared to rented GPU clouds
18x TCO reduction compared to rented TPU clouds
Supports 1.7x larger models with 60% sparsity
Abstract
Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated unprecedented capabilities of autoregressive AI models across multiple tasks, triggering disruptive technology innovations around the world. However, as models continue to grow, the cost to serve them also continues to grow, threatening the democratization of LLMs. To address this issue, we propose Chiplet Cloud, a chiplet-based ASIC LLM-supercomputer architecture whose goal is to optimize the total cost of ownership (TCO) per generated token. This architecture is a highly parameterizable ASIC and server-level architecture leveraging thousands of replicated accelerator modules collaborating to scale up the performance of LLMs at cloud scale. To determine specific parameterizations of the Chiplet Cloud architecture, we implemented a two-phase hardware-software co-design methodology that can…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Natural Language Processing Techniques
