Attention in SRAM on Tenstorrent Grayskull

Moritz Thüning

arXiv:2407.13885·cs.LG·July 22, 2024·2 cites

TL;DR

This paper shows that keeping Transformer attention operations in the SRAM of the Tenstorrent Grayskull architecture speeds up computation, most notably for Softmax (up to 10x over a CPU baseline), and compares Grayskull's price and SRAM capacity with those of an Nvidia H100.

Contribution

It introduces a fused kernel that performs matrix multiplication, attention score scaling and Softmax exclusively in SRAM, along with a dedicated Softmax kernel that also uses the SRAM and a CPU implementation that serves as a baseline.

Findings

01

Softmax kernel speedup up to 10x over CPU

02

Softmax inside the fused kernel is approximately 1.8x faster than the dedicated Softmax kernel

03

Grayskull is 30x cheaper than Nvidia H100 and has 1.5x more SRAM

Abstract

When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10\times$, and the Softmax implementation inside the fused kernel is approximately $1.8\times$ faster than the dedicated Softmax kernel. The time and…
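
As a rough illustration of the three operations the fused kernel combines (matrix multiplication, attention score scaling and Softmax), here is a minimal, dependency-free Python sketch of computing attention weights from queries and keys. It is not the paper's Grayskull kernel or its CPU baseline. The function name `attention_weights`, the list-of-lists layout and the row-maximum shift for numerical stability are assumptions made for this example.

```python
import math

def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) row by row (illustrative sketch only)."""
    d = len(queries[0])            # shared dimension of query and key vectors
    scale = 1.0 / math.sqrt(d)     # attention score scaling factor
    weights = []
    for q in queries:
        # Matrix multiplication plus scaling: one row of scaled scores.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Softmax over the row, shifted by the row maximum (assumed here
        # for numerical stability; it does not change the result).
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Tiny hypothetical input: two queries and two keys of dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(queries, keys)
```

Each row of `weights` sums to 1, and each query gives the largest weight to the key it matches. All three steps run per row without materializing separate intermediate matrices, which is loosely the idea behind fusing them into one kernel.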

Code & Models

Repositories

moritztng/grayskull-attention
none · Official

Taxonomy

Topics: Computer Science and Engineering · Low-power high-performance VLSI design · VLSI and FPGA Design Techniques

Methods: Attention Is All You Need · Softmax