Parallel Prefix Sum with SIMD
Wangda Zhang, Yanbin Wang, Kenneth A. Ross

TL;DR
This paper studies SIMD and multithreaded methods for prefix sum computation and proposes a partitioning technique that runs up to 3x faster than existing implementations by reducing the memory bandwidth bottleneck.
Contribution
It introduces a novel partitioning approach for prefix sums that enhances data locality and performance in SIMD and multithreaded environments.
Findings
Partitioning data into cache-sized blocks improves data locality and reduces bandwidth demands from RAM, which improves speed.
The proposed method is up to 3x faster than existing implementations.
Different computation organizations offer trade-offs in performance and usability.
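To make the partitioning idea concrete, here is a minimal C++ sketch. It is not the authors' implementation. The function name and the parameters partition_size and num_workers are hypothetical, and the worker loops run sequentially here, though every iteration within a pass is independent and could be given to its own thread. Each cache-sized partition is read twice: once to sum each worker's slice, and once to write the prefix sums. The second read is intended to hit cache rather than RAM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive prefix sum, processed one partition at a time.
// partition_size and num_workers are hypothetical tuning parameters;
// partition_size would be chosen so that a partition fits in cache.
void partitioned_prefix_sum(std::vector<std::uint32_t>& data,
                            std::size_t partition_size,
                            std::size_t num_workers) {
    assert(partition_size > 0 && num_workers > 0);
    std::uint32_t carry = 0;  // sum of all elements before this partition
    std::vector<std::uint32_t> slice_sum(num_workers);
    std::vector<std::uint32_t> offset(num_workers);

    for (std::size_t begin = 0; begin < data.size(); begin += partition_size) {
        const std::size_t end = std::min(begin + partition_size, data.size());
        const std::size_t slice = (end - begin + num_workers - 1) / num_workers;

        // Pass 1: each worker sums its own slice (iterations are independent).
        for (std::size_t w = 0; w < num_workers; ++w) {
            const std::size_t lo = std::min(begin + w * slice, end);
            const std::size_t hi = std::min(lo + slice, end);
            std::uint32_t s = 0;
            for (std::size_t i = lo; i < hi; ++i) s += data[i];
            slice_sum[w] = s;
        }

        // Short serial step: turn slice sums into per-worker start offsets.
        for (std::size_t w = 0; w < num_workers; ++w) {
            offset[w] = carry;
            carry += slice_sum[w];
        }

        // Pass 2: each worker rescans its slice starting from its offset.
        // The partition was just read in pass 1, so it should still be cached.
        for (std::size_t w = 0; w < num_workers; ++w) {
            const std::size_t lo = std::min(begin + w * slice, end);
            const std::size_t hi = std::min(lo + slice, end);
            std::uint32_t running = offset[w];
            for (std::size_t i = lo; i < hi; ++i) {
                running += data[i];
                data[i] = running;
            }
        }
    }
}
```

Unsigned 32-bit elements are used so that overflow wraps instead of being undefined. A real implementation would replace the two worker loops with threads and a barrier between the passes.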
Abstract
The prefix sum operation is a useful primitive with a broad range of applications. For database systems, it is a building block of many important operators including join, sort and filter queries. In this paper, we study different methods of computing prefix sums with SIMD instructions and multiple threads. For SIMD, we implement and compare horizontal and vertical computations, as well as a theoretically work-efficient balanced tree version using gather/scatter instructions. With multithreading, the memory bandwidth can become the bottleneck of prefix sum computations. We propose a new method that partitions data into cache-sized smaller partitions to achieve better data locality and reduce bandwidth demands from RAM. We also investigate four different ways of organizing the computation sub-procedures, which have different performance and usability characteristics. In the experiments…
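As an illustration of the horizontal computation mentioned in the abstract, the sketch below models a 4-lane SIMD register with a plain array and computes the prefix sum inside it using log2(4) = 2 shift-and-add steps. This is a scalar emulation written for portability, not the paper's code. The names Vec, kLanes, shift_add, and horizontal_prefix_sum are hypothetical, and on real hardware shift_add would map to a lane shift followed by a vector add.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 4;  // assumed width: four 32-bit lanes
using Vec = std::array<std::uint32_t, kLanes>;

// Models "shift the register up by k lanes, filling with zeros, then add".
Vec shift_add(const Vec& v, std::size_t k) {
    assert(k > 0 && k < kLanes);
    Vec out = v;
    for (std::size_t i = k; i < kLanes; ++i) out[i] += v[i - k];
    return out;
}

// Inclusive prefix sum, one emulated register (kLanes elements) at a time.
void horizontal_prefix_sum(std::vector<std::uint32_t>& data) {
    std::uint32_t carry = 0;  // running total of all previous registers
    std::size_t i = 0;
    for (; i + kLanes <= data.size(); i += kLanes) {
        Vec v;
        for (std::size_t l = 0; l < kLanes; ++l) v[l] = data[i + l];

        // Shifts of 1 and 2 lanes: [a,b,c,d] -> [a,a+b,a+b+c,a+b+c+d].
        for (std::size_t k = 1; k < kLanes; k *= 2) v = shift_add(v, k);

        // Add the carry from earlier registers and store the result.
        for (std::size_t l = 0; l < kLanes; ++l) {
            v[l] += carry;
            data[i + l] = v[l];
        }
        carry = v[kLanes - 1];
    }
    // Scalar tail for the elements that do not fill a whole register.
    for (; i < data.size(); ++i) {
        carry += data[i];
        data[i] = carry;
    }
}
```

The horizontal form performs more additions per element than a scalar loop, but each shift-and-add covers all lanes at once. The vertical and tree-based variants named in the abstract organize the work differently and are not shown here.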
Taxonomy
Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · Advanced Database Systems and Queries
