From Buffers to Registers: Unlocking Fine-Grained FlashAttention with Hybrid-Bonded 3D NPU Co-Design

Jinxin Yu; Yudong Pan; Mengdi Wang; Huawei Li; Yinhe Han; Xiaowei Li; Ying Wang

arXiv:2602.11016·cs.AR·February 12, 2026


Open Access

TL;DR

This paper introduces a 3D-stacked accelerator architecture and a fine-grained attention scheduling method that reduce energy consumption and improve speed in Transformer workloads by relieving the on-chip SRAM access bottleneck.

Contribution

It presents 3D-Flow, a hybrid-bonded 3D accelerator, and 3D-FlashAttention, a scheduling technique, to enhance energy efficiency and performance in Transformer model acceleration.

Findings

01

Reduces energy consumption by 46-93% in Transformer workloads.

02

Achieves 1.4x-7.6x speedup over existing designs.

03

Enables bubble-free vertical dataflow with minimal overhead.

Abstract

Transformer-based models dominate modern AI workloads but exacerbate memory bottlenecks due to their quadratic attention complexity and ever-growing model sizes. Existing accelerators, such as Groq and Cerebras, mitigate off-chip traffic with large on-chip caches, while algorithmic innovations such as FlashAttention fuse operators to avoid materializing large attention matrices. However, as off-chip traffic decreases, our measurements show that on-chip SRAM accesses account for over 60% of energy in long-sequence workloads, making cache access the new bottleneck. We propose 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator that enables register-to-register communication across vertically partitioned PE tiers. Unlike 2D multi-array architectures limited by NoC-based router-to-router transfers, 3D-Flow leverages sub-10 µm vertical TSVs to sustain cycle-level operator pipelining…
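
The operator fusion that the abstract attributes to FlashAttention can be illustrated with a small sketch. The code below is a minimal pure-Python version of tiled attention with an online softmax for a single query vector. It is not the paper's 3D-FlashAttention schedule or its hardware mapping; the function names and the `tile` parameter are hypothetical and chosen only for illustration. The point it shows is that the full row of attention scores never needs to be stored: a running maximum, a running normalizer, and a running output accumulator are enough.

```python
import math


def naive_attention(q, K, V):
    """Reference version: materializes the full score row for one query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(d_v)]


def tiled_attention(q, K, V, tile=2):
    """Online-softmax version: visits K/V one tile at a time.

    Only three pieces of state survive between tiles (assumed names):
    m   - running maximum of the scores seen so far
    l   - running softmax normalizer, expressed relative to m
    acc - running unnormalized output, expressed relative to m
    """
    scale = 1.0 / math.sqrt(len(q))
    d_v = len(V[0])
    m = float("-inf")
    l = 0.0
    acc = [0.0] * d_v
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile m is -inf, so the correction is exp(-inf) = 0.0.
        correction = math.exp(m - m_new)
        p = [math.exp(s - m_new) for s in scores]
        l = l * correction + sum(p)
        acc = [
            acc[j] * correction + sum(p_i * v[j] for p_i, v in zip(p, v_tile))
            for j in range(d_v)
        ]
        m = m_new
    return [a / l for a in acc]
```

Both functions should agree up to floating-point rounding for any tile size; the tiled version simply trades the stored score row for a per-tile rescaling step. How such tiles are mapped onto PE tiers and registers is specific to the paper and is not modeled here.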


Taxonomy

Topics: Interconnection Networks and Systems · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques