SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling

Huizheng Wang; Jiahao Fang; Xinru Tang; Zhiheng Yue; Jinxi Li; Yubin Qin; Sihan Guan; Qize Yang; Yang Wang; Chao Li; Yang Hu; Shouyi Yin

arXiv:2407.10416 · cs.AR · July 16, 2024

TL;DR

SOFA is an algorithm-hardware co-designed accelerator for dynamic sparse Transformer inference in large language models. By coordinating across processing stages to cut memory access and redundant computation, it reports a 9.5x speedup and 71.5x higher energy efficiency over an Nvidia A100 GPU.

Contribution

The paper introduces SOFA, a compute-memory optimized accelerator with a cross-stage tiling principle and new algorithms for efficient sparse Transformer inference.

Findings

01 Achieves a 9.5x speedup over the Nvidia A100 GPU

02 Provides 71.5x higher energy efficiency than the Nvidia A100

03 Outperforms 8 state-of-the-art accelerators in efficiency and speed

Abstract

Benefiting from the self-attention mechanism, Transformer models have attained impressive contextual comprehension capabilities for lengthy texts. As large language models (LLMs) become increasingly prevalent, demand for high-throughput inference grows, which calls for large-scale token parallel processing (LTPP). However, existing dynamic sparse accelerators struggle to handle LTPP effectively, because they optimize each stage separately and confine most of their effort to computational enhancements. By re-examining the end-to-end flow of dynamic sparse acceleration, we pinpoint a long-overlooked opportunity: LTPP can exploit the intrinsic coordination among stages to avoid excessive memory access and redundant computation. Motivated by this observation, we present SOFA, a cross-stage compute-memory efficient algorithm-hardware co-design, which is tailored to…

Taxonomy

Topics: Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing · Cellular Automata and Applications