Tasa: Thermal-aware 3D-Stacked Architecture Design with Bandwidth Sharing for LLM Inference

Siyuan He; Peiran Yan; Yandong He; Youwei Zhuo; Tianyu Jia

arXiv:2508.07252 · cs.AR · November 20, 2025

TL;DR

Tasa introduces a thermal-aware 3D-stacked architecture with bandwidth sharing for LLM inference, significantly improving scalability, thermal management, and performance over existing solutions.

Contribution

The paper presents Tasa, a heterogeneous 3D-stacked architecture with thermal optimization and bandwidth sharing, enhancing LLM inference efficiency and scalability.

Findings

01. Up to 5.55°C peak temperature reduction in 48-core configurations.

02. Achieved 2.85x and 2.21x speedup over GPU baselines for Llama-65B and GPT-3 66B.

03. Demonstrated improved thermal scalability and inference performance.

Abstract

Autoregressive decoding is the major inference bottleneck in LLMs because its operations are memory-intensive and hardware bandwidth is limited. 3D-stacked architecture, which vertically stacks multiple DRAM dies on top of a logic die, is a promising solution with significantly higher memory bandwidth. However, our experiments show that the 3D-stacked architecture faces more severe thermal issues than 2D architecture in terms of temperature, thermal gradient, and scalability. To better exploit the potential of 3D-stacked architecture, we present Tasa, a heterogeneous architecture with cross-stack thermal optimizations that balance the temperature distribution and maximize performance under thermal constraints. A high-performance core is designed for compute-intensive operations, while a high-efficiency core is used for memory-intensive operators, e.g., attention layers. Furthermore, we…
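The abstract's distinction between compute-intensive and memory-intensive operators can be illustrated with a rough roofline-style estimate. The sketch below is not taken from the paper: the peak throughput, bandwidth, hidden size, and the helper names `arithmetic_intensity` and `bound` are all hypothetical, chosen only to show why a single-token decode step tends to be limited by memory bandwidth while a long-prompt matrix multiply is not.

```python
# Illustrative roofline-style estimate; all hardware numbers are assumptions,
# not values reported for Tasa or any specific device.

def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Counts one read of each input matrix and one write of the output,
    with 2-byte (e.g. FP16) elements by default.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def bound(intensity, peak_flops, peak_bw):
    """Classify an operator against the machine balance (FLOPs per byte)."""
    machine_balance = peak_flops / peak_bw
    return "compute-bound" if intensity >= machine_balance else "memory-bound"


PEAK_FLOPS = 100e12  # hypothetical peak compute, FLOP/s
PEAK_BW = 1e12       # hypothetical memory bandwidth, bytes/s
HIDDEN = 8192        # hypothetical hidden dimension

# Decode: one token's activation vector multiplied by a weight matrix.
decode = arithmetic_intensity(1, HIDDEN, HIDDEN)
# Prefill: a 2048-token prompt multiplied by the same weight matrix.
prefill = arithmetic_intensity(2048, HIDDEN, HIDDEN)

print(f"decode:  {decode:.2f} FLOP/byte, {bound(decode, PEAK_FLOPS, PEAK_BW)}")
print(f"prefill: {prefill:.2f} FLOP/byte, {bound(prefill, PEAK_FLOPS, PEAK_BW)}")
```

Under these assumed numbers the decode step performs about one FLOP per byte moved, far below the machine balance of 100 FLOP/byte, so its runtime is set by memory bandwidth rather than compute.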

Taxonomy

Topics: Advancements in Photolithography Techniques · Semiconductor materials and devices · Parallel Computing and Optimization Techniques