# Fast, parallel, and cache-friendly suffix array construction

**Authors:** Jamshed Khan, Tobias Rubel, Erin Molloy, Laxman Dhulipala, Rob Patro

PMC · DOI: 10.1186/s13015-024-00263-5 · 2024-04-28

## TL;DR

This paper introduces CaPS-SA, a simple, scalable parallel algorithm for building suffix arrays and LCP arrays, which are core string indexes in bioinformatics, and releases the implementation publicly.

## Contribution

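The novel contribution is CaPS-SA, a scalable parallel algorithm for suffix array and LCP-array construction that is inspired by samplesort and built on an LCP-informed mergesort. Its memory locality reduces cache misses, and it extends to a bounded-context suffix array for further speedups.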

## Key findings

- CaPS-SA outperforms existing state-of-the-art parallel suffix array construction algorithms on modern hardware.
- The algorithm achieves strong performance due to excellent memory locality and fewer cache misses.

## Abstract

String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize.
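To make the two objects concrete, here is a minimal sketch of their textbook definitions in Python. It is a naive reference construction for illustration only, not the paper's algorithm; the function names `suffix_array` and `lcp_array` are chosen here and do not come from the CaPS-SA code.

```python
def suffix_array(text):
    """Naive SA: start positions of all suffixes, sorted by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] is the longest-common-prefix length of suffixes sa[i-1] and sa[i]; lcp[0] is 0."""
    n = len(text)

    def lcp(a, b):
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]


sa = suffix_array("banana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

The naive sort copies and compares whole suffixes, so it can do quadratic work or worse on repetitive inputs; it is useful mainly as a correctness oracle for faster constructions.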

In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, CaPS-SA has excellent memory locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies.
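The abstract does not spell out the LCP-informed mergesort, so the following is a sequential sketch of the general LCP-merge technique rather than the authors' implementation. It assumes each sorted run carries its own LCP array, and it omits the samplesort-style partitioning and all parallelism. The names `lcp_merge` and `sa_lcp_mergesort` are hypothetical.

```python
def _lcp_from(text, a, b, start):
    """LCP of suffixes a and b, given that their first `start` characters already match."""
    n, k = len(text), start
    while a + k < n and b + k < n and text[a + k] == text[b + k]:
        k += 1
    return k


def lcp_merge(text, xs, lx, ys, ly):
    """Merge two sorted runs of suffix positions, reusing known LCPs to skip characters."""
    n = len(text)
    out, lcp_out = [], []
    i = j = 0
    hx = hy = 0  # LCP of each run's current head with the last suffix written to `out`
    while i < len(xs) and j < len(ys):
        m = None
        if hx != hy:
            # The head sharing the longer prefix with the last output is the smaller one.
            take_x = hx > hy
        else:
            # Tie: compare characters, but only from offset hx onward.
            m = _lcp_from(text, xs[i], ys[j], hx)
            a, b = xs[i] + m, ys[j] + m
            take_x = a == n or (b < n and text[a] < text[b])
        if take_x:
            out.append(xs[i]); lcp_out.append(hx)
            if m is not None:
                hy = m  # the losing head now shares m characters with the new last output
            i += 1
            hx = lx[i] if i < len(xs) else 0
        else:
            out.append(ys[j]); lcp_out.append(hy)
            if m is not None:
                hx = m
            j += 1
            hy = ly[j] if j < len(ys) else 0
    if i < len(xs):
        out.extend(xs[i:]); lcp_out.append(hx); lcp_out.extend(lx[i + 1:])
    elif j < len(ys):
        out.extend(ys[j:]); lcp_out.append(hy); lcp_out.extend(ly[j + 1:])
    return out, lcp_out


def sa_lcp_mergesort(text, idx=None):
    """Return (suffix array, LCP array) for `text` via top-down LCP mergesort."""
    if idx is None:
        idx = list(range(len(text)))
    if len(idx) <= 1:
        return list(idx), [0] * len(idx)
    mid = len(idx) // 2
    xs, lx = sa_lcp_mergesort(text, idx[:mid])
    ys, ly = sa_lcp_mergesort(text, idx[mid:])
    return lcp_merge(text, xs, lx, ys, ly)


print(sa_lcp_mergesort("banana"))  # ([5, 3, 1, 0, 4, 2], [0, 1, 3, 0, 0, 2])
```

The design point is that the merge produces the LCP array as a by-product and only compares characters when both heads share the same prefix length with the last output, which keeps character accesses short and sequential.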

We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPS-SA.
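The abstract does not define the bounded-context SA precisely. The sketch below assumes one natural reading: with a context length `k`, suffixes are ordered only by their first `k` characters, which suffices for exact lookups of patterns no longer than `k`. Both function names and the tie-breaking by text position are assumptions made for illustration; the paper gives the authoritative definition.

```python
from bisect import bisect_left, bisect_right


def bounded_context_sa(text, k):
    """Order suffix positions by their first k characters only (ties keep text order)."""
    return sorted(range(len(text)), key=lambda i: text[i:i + k])


def find_occurrences(text, sa, k, pattern):
    """Find all positions of `pattern` (length at most k) by binary search over `sa`."""
    if len(pattern) > k:
        raise ValueError("pattern is longer than the context length k")
    p = len(pattern)
    # Truncating the sorted k-character keys to length p keeps them sorted.
    # The keys are materialized here for clarity, not efficiency.
    keys = [text[i:i + p] for i in sa]
    lo, hi = bisect_left(keys, pattern), bisect_right(keys, pattern)
    return sorted(sa[lo:hi])


sa2 = bounded_context_sa("banana", 2)
print(sa2)                                        # [5, 1, 3, 0, 2, 4]
print(find_occurrences("banana", sa2, 2, "an"))   # [1, 3]
```

Because comparisons never look past `k` characters, each suffix comparison has bounded cost, which is the kind of structure the abstract says can be exploited for further speedups.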

## Full-text entities

- **Genes:** ACSM3 (acyl-CoA synthetase medium chain family member 3) [NCBI Gene 6296] {aka SA, SAH}
- **Diseases:** RP (MESH:D012174)
- **Chemicals:** CdBG (-), SA (MESH:D000077145)
- **Species:** Carcharodon carcharias (great white shark, species) [taxon 13397], Homo sapiens (human, species) [taxon 9606], Ambystoma mexicanum (axolotl, species) [taxon 8296], Salmonella enterica (species) [taxon 28901], Sepiidae (cuttlefishes, family) [taxon 6608]
- **Mutations:** T2T
- **Cell lines:** CHM13 — Homo sapiens (Human), Hydatidiform mole, Telomerase immortalized cell line (CVCL_VU12)

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11056320/full.md

---
Source: https://tomesphere.com/paper/PMC11056320