# Systolic Array-based Architecture for Low-Bit Integerized Vision Transformers

**Authors:** Ching-Yi Lin and Sahil Shah

arXiv: 2508.20334 · 2025-08-29

## TL;DR

This paper presents a low-bit, systolic array-based hardware accelerator for vision transformers. Validated on a 16nm FPGA, it delivers 1.50x the throughput and 4.47x the power efficiency of a same-technology GPU (GTX 1080), and 20% better power efficiency than an RTX 5090 despite 87% lower throughput.

## Contribution

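The work introduces a low-bit, model-specialized accelerator for integerized vision transformers. It combines multiple deeply pipelined systolic arrays with array-compatible units for the essential operations of the multi-head self-attention (MSA) module, and pipelines each self-attention (SA) head within a single accelerator to increase data reuse and reduce bandwidth.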
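To make the data-reuse idea concrete, the following is a minimal cycle-level sketch of a systolic array multiplying two small integer matrices. It assumes an output-stationary dataflow, and the function name `systolic_matmul` is hypothetical; this summary does not state which dataflow or array dimensions the paper uses, so this illustrates the general technique rather than the authors' design.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    a is M x K and b is K x N (lists of lists of ints). Processing element
    (i, j) keeps the running sum for c[i][j]. Row i of `a` enters from the
    left delayed by i cycles and column j of `b` enters from the top delayed
    by j cycles, so matching operands meet in PE (i, j).
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]    # stationary partial sums
    a_reg = [[0] * n for _ in range(m)]  # operands flowing left to right
    b_reg = [[0] * n for _ in range(m)]  # operands flowing top to bottom
    cycles = m + n + k - 2               # last operand pair meets in the far corner

    for t in range(cycles):
        # Visit PEs from the far corner so each value moves exactly one hop
        # per cycle (a neighbour's register is read before it is overwritten).
        for i in reversed(range(m)):
            for j in reversed(range(n)):
                if j == 0:
                    s = t - i
                    a_in = a[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    s = t - j
                    b_in = b[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # one multiply-accumulate per PE per cycle
    return acc, cycles


# Operands restricted to the signed 3-bit range [-4, 3] for illustration.
a = [[-4, 3], [2, -1]]
b = [[1, -2], [3, 0]]
c, cycles = systolic_matmul(a, b)
print(c, cycles)
```

Each operand is fetched from memory once and then passed between neighbouring processing elements, which is the reuse property that makes systolic arrays attractive when bandwidth is the bottleneck.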

## Key findings

- Achieves 96.83% accuracy on CIFAR-10 and 77.81% top-1 accuracy on ImageNet with a 3-bit integerized model
- Delivers 13,568 GOPs/s and 219.4 GOPs/s/W on a 16nm FPGA (Alveo U250)
- Offers 1.50x higher throughput and 4.47x better power efficiency than a same-technology GPU (GTX 1080)
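A 3-bit model stores each value as one of 2^3 = 8 integer levels. The sketch below shows what that means using generic symmetric uniform quantization; this is an assumption for illustration only, since this summary does not describe the paper's actual quantization scheme, and the names `quantize` and `dequantize` are hypothetical.

```python
def quantize(values, bits=3):
    """Map floats to signed `bits`-bit integer codes with one shared scale.

    Codes lie in [-2**(bits-1), 2**(bits-1) - 1], i.e. [-4, 3] for 3 bits.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level, then clamp into the representable range.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [code * scale for code in codes]


weights = [-1.0, -0.5, 0.0, 0.2, 1.5]
codes, scale = quantize(weights)
approx = dequantize(codes, scale)
print(codes, scale, approx)
```

With integer codes, matrix products reduce to narrow integer multiply-accumulates, and the shared scale is applied once to the accumulated result.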

## Abstract

Transformer-based models are becoming more and more intelligent and are revolutionizing a wide range of human tasks. To support their deployment, AI labs offer inference services that consume hundreds of GWh of energy annually and charge users based on the number of tokens processed. Under this cost model, minimizing power consumption and maximizing throughput have become key design goals for the inference hardware. While graphics processing units (GPUs) are commonly used, their flexibility comes at the cost of low operational intensity and limited efficiency, especially under the high query-per-model ratios of modern inference services.

In this work, we address these challenges by proposing a low-bit, model-specialized accelerator that strategically selects tasks with high operation (OP) reuse and minimal communication overhead for offloading. Our design incorporates multiple systolic arrays with deep, fine-grained pipelines and array-compatible units that support essential operations in the multi-head self-attention (MSA) module. At the accelerator level, each self-attention (SA) head is pipelined within a single accelerator to increase data reuse and further minimize bandwidth.

Our 3-bit integerized model achieves 96.83% accuracy on CIFAR-10 and 77.81% top-1 accuracy on ImageNet. We validate the hardware design on a 16nm FPGA (Alveo U250), where it delivers 13,568 GigaOps/second (GOPs/s) and 219.4 GOPs/s/W. Compared to a same-technology GPU (GTX 1080), our design offers 1.50x higher throughput and 4.47x better power efficiency. Even against a state-of-the-art GPU (RTX 5090), we still achieve 20% better power efficiency despite having 87% lower throughput.
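The reported throughput, efficiency, and ratios imply a few figures that the abstract does not state directly. The short calculation below derives them. These are back-of-the-envelope values, not numbers reported by the authors, and they assume that "87% lower throughput" means the FPGA reaches 13% of the RTX 5090's throughput and that "20% better power efficiency" means 1.2x the RTX 5090's GOPs/s/W.

```python
# Figures quoted in the abstract.
fpga_gops = 13_568.0      # GOPs/s on the Alveo U250
fpga_gops_per_w = 219.4   # GOPs/s/W on the Alveo U250

# Power implied by throughput divided by efficiency (derived, not reported).
implied_fpga_power_w = fpga_gops / fpga_gops_per_w

# GTX 1080 baseline implied by the 1.50x and 4.47x ratios (derived).
gtx_gops = fpga_gops / 1.50
gtx_gops_per_w = fpga_gops_per_w / 4.47

# RTX 5090 baseline under the interpretation assumed above (derived).
rtx_gops = fpga_gops / (1.0 - 0.87)
rtx_gops_per_w = fpga_gops_per_w / 1.20

print(f"Implied FPGA power: {implied_fpga_power_w:.1f} W")
print(f"Implied GTX 1080:   {gtx_gops:.0f} GOPs/s, {gtx_gops_per_w:.1f} GOPs/s/W")
print(f"Implied RTX 5090:   {rtx_gops:.0f} GOPs/s, {rtx_gops_per_w:.1f} GOPs/s/W")
```

The implied FPGA power comes out at roughly 61.8 W, which helps explain how the design can win on efficiency against the RTX 5090 while trailing it on raw throughput.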

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20334/full.md

## Figures

19 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20334/full.md

## References

43 references; full list in the complete paper: https://tomesphere.com/paper/2508.20334/full.md

---
Source: https://tomesphere.com/paper/2508.20334