Tasa: Thermal-aware 3D-Stacked Architecture Design with Bandwidth Sharing for LLM Inference
Siyuan He, Peiran Yan, Yandong He, Youwei Zhuo, Tianyu Jia

TL;DR
Tasa introduces a thermal-aware 3D-stacked architecture with bandwidth sharing for LLM inference, significantly improving scalability, thermal management, and performance over existing solutions.
Contribution
The paper presents Tasa, a heterogeneous 3D-stacked architecture with thermal optimization and bandwidth sharing, enhancing LLM inference efficiency and scalability.
Findings
Up to 5.55°C peak temperature reduction in 48-core configurations.
Achieved 2.85x and 2.21x speedup over GPU baselines for Llama-65B and GPT-3 66B.
Demonstrated improved thermal scalability and inference performance.
Abstract
Autoregressive decoding is the major inference bottleneck in LLMs because of its memory-intensive operations and the limited hardware bandwidth. A 3D-stacked architecture, which vertically stacks multiple DRAM dies on top of a logic die, is a promising solution that significantly improves memory bandwidth. However, our experiments also show that the 3D-stacked architecture faces more severe thermal issues than a 2D architecture in terms of temperature, thermal gradient, and scalability. To better exploit the potential of the 3D-stacked architecture, we present Tasa, a heterogeneous architecture with cross-stack thermal optimizations that balance the temperature distribution and maximize performance under thermal constraints. A high-performance core is designed for compute-intensive operations, while a high-efficiency core is used for memory-intensive operators, e.g., attention layers. Furthermore, we…
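The abstract's distinction between compute-intensive and memory-intensive operators can be illustrated with a simple roofline estimate. The sketch below is not taken from the paper: the `Op` type, the `gemm` helper, and the `PEAK_FLOPS` and `BANDWIDTH` values are hypothetical and chosen only for illustration. It compares an operator's arithmetic intensity (FLOPs per byte moved) with the hardware ridge point to decide whether the operator is limited by memory bandwidth.

```python
from dataclasses import dataclass

# Hypothetical hardware numbers, for illustration only (not from the paper).
PEAK_FLOPS = 100e12   # peak compute throughput, FLOP/s
BANDWIDTH = 1e12      # memory bandwidth, bytes/s


@dataclass
class Op:
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to memory


def gemm(name, m, k, n, bytes_per_elem=2):
    """Cost of an (m x k) @ (k x n) matrix multiply with 2-byte elements."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return Op(name, flops, bytes_moved)


def arithmetic_intensity(op):
    """FLOPs performed per byte moved."""
    return op.flops / op.bytes_moved


def is_memory_bound(op, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """An operator below the ridge point is limited by memory bandwidth."""
    return arithmetic_intensity(op) < peak_flops / bandwidth


def roofline_time(op, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Lower bound on execution time: the slower of compute and memory."""
    return max(op.flops / peak_flops, op.bytes_moved / bandwidth)


# Decode processes one token per step (m = 1); prefill processes a whole
# prompt at once (m = 2048 here). The layer width 8192 is an assumption.
decode = gemm("decode_linear", m=1, k=8192, n=8192)
prefill = gemm("prefill_linear", m=2048, k=8192, n=8192)

for op in (decode, prefill):
    kind = "memory-bound" if is_memory_bound(op) else "compute-bound"
    print(f"{op.name}: intensity={arithmetic_intensity(op):.2f} FLOP/B, {kind}")
```

Under these assumed numbers the single-token decode layer has an intensity of roughly 1 FLOP per byte and is memory-bound, while the prefill layer is compute-bound. That is the general reason extra memory bandwidth helps decoding in particular; how Tasa actually assigns operators to its two core types is described in the paper itself, not by this sketch.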
Taxonomy
Topics: Advancements in Photolithography Techniques · Semiconductor materials and devices · Parallel Computing and Optimization Techniques
