Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

Qian Chen; Xiaofeng Yang; Shengli Lu

arXiv:2406.10511 · cs.DC · March 19, 2025

TL;DR

This paper introduces a medium granularity dataflow hardware accelerator for SpTRSV that balances parallelism and data locality, achieving significant performance and energy efficiency improvements over CPUs, GPUs, and prior accelerators.

Contribution

It proposes a novel hardware-software co-designed medium granularity dataflow for SpTRSV, with caching and reordering techniques to optimize performance and data reuse.

Findings

01. Achieves up to 27.8× speedup over CPUs.

02. Achieves up to 98.8× speedup over GPUs.

03. Outperforms state-of-the-art DPU-v2 by 2.5× on average.

Abstract

Sparse triangular solve (SpTRSV) is widely used in various domains. Numerous studies have been conducted using CPUs, GPUs, and dedicated hardware accelerators, whose dataflows can be categorized as coarse or fine granularity. Coarse dataflows offer good spatial locality but suffer from low parallelism, while fine dataflows provide high parallelism but disrupt the spatial structure, leading to more nodes and poor data reuse. This paper proposes a novel hardware accelerator for SpTRSV and SpTRSV-like DAGs. The accelerator implements a medium granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. Additionally, a partial sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm for intra-node edge computation is developed to enhance data reuse.…
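
For readers unfamiliar with the kernel, the following is a minimal software sketch of SpTRSV and of the dependency DAG the abstract refers to. It is not the paper's accelerator or dataflow. It assumes a lower-triangular matrix stored in CSR form with an explicit nonzero diagonal, and the function names `sptrsv_lower_csr` and `dag_levels` are hypothetical. Treating each row as one DAG node is also an assumption made here for illustration; the abstract does not define node granularity at this level of detail.

```python
def sptrsv_lower_csr(indptr, indices, data, b):
    """Solve L x = b by forward substitution.

    Assumption: L is lower triangular in CSR form and every row stores
    a nonzero diagonal entry.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]
            else:
                # Each off-diagonal nonzero is an edge j -> i in the DAG:
                # x[i] cannot be finalized until x[j] is known.
                acc -= data[k] * x[j]
        if diag is None or diag == 0:
            raise ZeroDivisionError(f"row {i} has no nonzero diagonal")
        x[i] = acc / diag
    return x


def dag_levels(indptr, indices):
    """Return the dependency level of each row (one node per row).

    Rows on the same level have no dependencies on each other, so they
    could in principle be solved in parallel.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level


# Illustrative 4x4 system (made-up values):
# L = [[1, 0, 0, 0],
#      [0, 2, 0, 0],
#      [1, 1, 1, 0],
#      [0, 0, 2, 4]]
indptr = [0, 1, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 2, 3]
data = [1.0, 2.0, 1.0, 1.0, 1.0, 2.0, 4.0]
b = [1.0, 4.0, 5.0, 12.0]

x = sptrsv_lower_csr(indptr, indices, data, b)
levels = dag_levels(indptr, indices)
print(x)       # [1.0, 2.0, 2.0, 2.0]
print(levels)  # [0, 0, 1, 2]
```

In this toy system only rows 0 and 1 share a level, which hints at why a row-per-node schedule can expose little parallelism on matrices with long dependency chains.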


Taxonomy

Topics: Superconducting Materials and Applications · Distributed and Parallel Computing Systems · Particle Detector Development and Performance