SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs

Zhenyu Bai; Pranav Dangi; Huize Li; Tulika Mitra

arXiv:2405.17025·cs.AR·May 28, 2024

TL;DR

This paper introduces SWAT, an FPGA-based accelerator that handles long-context Transformer models using window attention, achieving up to 22× lower latency and up to 5.7× better energy efficiency than a baseline FPGA accelerator.

Contribution

The paper presents a novel dataflow-aware FPGA microarchitecture optimized for sparse window attention, enabling scalable and efficient long-input processing.

Findings

01. Up to 22× latency reduction compared to the baseline FPGA accelerator.

02. Up to 5.7× energy efficiency improvement over the baseline FPGA accelerator.

03. 15× energy efficiency gain over GPU-based solutions.

Abstract

Efficiently supporting long context length is crucial for Transformer models. The quadratic complexity of the self-attention computation plagues traditional Transformers. Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens, reducing the theoretical complexity from quadratic to linear. Although the sparsity induced by window attention is highly structured, it does not align perfectly with the microarchitecture of the conventional accelerators, leading to suboptimal implementation. In response, we propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input. The proposed microarchitecture is based on a design that maximizes data reuse by using a combination of row-wise dataflow, kernel fusion optimization, and an input-stationary design…
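
The complexity claim in the abstract can be illustrated with a small software sketch. It is not the SWAT hardware design: it only shows how restricting each token to a fixed window makes the number of query-key scores grow linearly with the input length. The function name `window_attention`, the symmetric window, and the half-width parameter `w` are assumptions made for this illustration; the abstract does not specify the exact window shape.

```python
import math


def window_attention(q, k, v, w):
    """Sliding-window attention over lists of vectors.

    Token i attends only to tokens j with |i - j| <= w (assumed symmetric
    window). Returns the output rows and the number of query-key scores
    that were computed.
    """
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    scored_pairs = 0
    for i in range(n):
        # Window bounds for row i, clipped at the sequence edges.
        lo, hi = max(0, i - w), min(n, i + w + 1)
        # Scaled dot-product scores for this row, inside the window only.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in range(lo, hi)]
        scored_pairs += len(scores)
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # Weighted sum of the value vectors inside the window.
        row = [sum(e * v[lo + t][c] for t, e in enumerate(exps)) / z
               for c in range(len(v[0]))]
        out.append(row)
    return out, scored_pairs


if __name__ == "__main__":
    n, w = 8, 1
    q = [[float(i), 1.0] for i in range(n)]
    k = [[float(i), 1.0] for i in range(n)]
    v = [[1.0, 2.0] for _ in range(n)]
    out, pairs = window_attention(q, k, v, w)
    print("scores computed:", pairs, "vs. full attention:", n * n)
```

Each row computes at most 2w + 1 scores, so the total is bounded by n(2w + 1) rather than n². Setting `w` to at least n - 1 recovers full attention, which gives a simple way to check the sketch against a dense reference.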

Taxonomy

Topics: Low-power high-performance VLSI design · VLSI and Analog Circuit Testing · Analog and Mixed-Signal Circuit Design

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections