# Direct construction of sparse suffix arrays with Libsais

**Authors:** Simon Van de Vyver, Tibo Vande Moortele, Peter Dawyndt, Bart Mesuere, Pieter Verschaffelt

PMC12535041 · DOI: 10.1186/s12859-025-06277-z · BMC Bioinformatics · 2025-10-17

## TL;DR

This paper introduces a method that builds sparse suffix arrays directly: a text encoding groups characters so that Libsais can run on a shorter input, which cuts memory usage and construction time by 50 to 75% for sparseness factors 3 or 4 on nucleotide and amino acid datasets.

## Contribution

The novel contribution is a direct construction method for sparse suffix arrays using a text transformation that avoids building a full suffix array first.

## Key findings

- The method reduces memory usage and construction time by 50 to 75% for sparseness factors 3 or 4.
- Depending on the alphabet size, similar gains are achievable for sparseness factors up to 8; for higher factors, the SSA can be built with a suitable divisor of the desired factor and then subsampled.
- The approach is especially effective for datasets with small alphabets like nucleotides or amino acids.
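
The dependence on alphabet size can be made concrete with a little arithmetic. The sketch below assumes that each character is stored in a fixed-width bit field with one extra value reserved for padding, and that k such fields are concatenated into one symbol. That layout and the function name `packed_symbol_bits` are assumptions for illustration; this summary does not state the exact encoding used in the paper.

```python
def packed_symbol_bits(alphabet_size: int, k: int) -> int:
    """Bits needed for one symbol that packs k characters.

    Assumption: each character uses a fixed-width field that can hold the
    values 0..alphabet_size, where 0 is reserved for padding.
    """
    bits_per_char = alphabet_size.bit_length()  # enough for values 0..alphabet_size
    return bits_per_char * k


# 4 nucleotides and the 20 standard amino acids, for a few sparseness factors.
for name, sigma in [("nucleotides", 4), ("amino acids", 20)]:
    for k in (3, 4, 8):
        bits = packed_symbol_bits(sigma, k)
        print(f"{name}: k={k} -> {bits} bits per packed symbol")
```

Under these assumptions the symbol width grows linearly with k and faster for larger alphabets, which is one plausible reason why the largest practical sparseness factor depends on the alphabet.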

## Abstract

Pattern matching is a fundamental challenge in bioinformatics, especially in the fields of genomics, transcriptomics and proteomics. Efficient indexing structures, such as suffix arrays, are critical for searching large datasets. A sparse suffix array (SSA) retains only suffixes at every k-th position in the text, where k is the sparseness factor. While sparse suffix arrays offer significant memory savings compared to full suffix arrays, they typically still require the construction of a full suffix array prior to a sampling step, resulting in substantial memory overhead during the construction phase.
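
For reference, the conventional two-step route described above can be sketched in a few lines of Python. This is a naive illustration only: the function names are hypothetical, and the simple sort stands in for a real suffix array construction algorithm.

```python
def full_suffix_array(text: str) -> list[int]:
    """Naive full suffix array: sort every start position by its suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sample_sparse_suffix_array(text: str, k: int) -> list[int]:
    """Build the full suffix array, then keep positions that are multiples of k."""
    return [pos for pos in full_suffix_array(text) if pos % k == 0]


text = "ACGTACGTAC"
full = full_suffix_array(text)            # one entry per character of the text
ssa = sample_sparse_suffix_array(text, 3)
print(len(full), ssa)
```

The full array holds one entry per character even though only every k-th entry survives the sampling step, which is the construction overhead the paper targets.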

We present an alternative method to directly construct the sparse suffix array using a simple, yet powerful text encoding. This encoding reduces the input text length by grouping characters, thereby enabling direct SSA construction by extending the widely used Libsais library. This approach bypasses the need to construct a full suffix array, reducing memory usage and construction time by 50 to 75% when building a sparse suffix array with sparseness factor 3 or 4 for various nucleotide and amino acid datasets. Depending on the alphabet size, similar gains can be achieved for sparseness factors up to 8. For higher sparseness factors, comparable performance improvements can be obtained by constructing the SSA using a suitable divisor of the desired sparseness factor, followed by a subsampling step. The method is particularly effective for applications with small alphabets, such as a nucleotide or amino acid alphabet. An open-source implementation of this method is available on GitHub, enabling easy adoption for large-scale bioinformatics applications.
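
The idea of grouping characters can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the paper: characters are ranked 1..σ, rank 0 pads the final block, each block of k ranks is packed as a base-(σ+1) number, and Python's `sorted` stands in for the suffix sorting that Libsais would perform. No Libsais API is called, and the names `encode_text` and `direct_sparse_suffix_array` are hypothetical.

```python
def encode_text(text: str, k: int, alphabet: str) -> list[int]:
    """Pack each block of k characters into one integer symbol.

    Ranks start at 1 so that 0 can pad the final block; a padded block then
    sorts before any block that continues with a real character.
    """
    rank = {ch: r for r, ch in enumerate(sorted(set(alphabet)), start=1)}
    base = len(rank) + 1
    encoded = []
    for start in range(0, len(text), k):
        block = text[start:start + k]
        value = 0
        for offset in range(k):
            digit = rank[block[offset]] if offset < len(block) else 0
            value = value * base + digit  # most significant character first
        encoded.append(value)
    return encoded


def direct_sparse_suffix_array(text: str, k: int, alphabet: str) -> list[int]:
    """Sort suffixes of the encoded text and map them back to text positions."""
    encoded = encode_text(text, k, alphabet)
    # Stand-in for a real suffix array construction on the shorter input.
    order = sorted(range(len(encoded)), key=lambda i: encoded[i:])
    return [i * k for i in order]


print(direct_sparse_suffix_array("ACGTACGTAC", 3, "ACGT"))
```

Because the packing puts the first character of each block in the most significant digit, comparing packed symbols matches comparing the underlying k-character blocks, so the order of the encoded suffixes equals the order of the original suffixes starting at multiples of k.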

We introduce an efficient method for the construction of sparse suffix arrays for large datasets. Central to this approach is the introduction of a simple text transformation, which then serves as input to Libsais. This method reduces the length of both the input text and the resulting suffix array by a factor of k, which improves execution time and memory usage significantly.
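
The abstract's suggestion for higher sparseness factors, building with a divisor of the desired factor and then subsampling, can be sketched as follows. The function names are hypothetical, and the naive `sparse_suffix_array` helper is only a stand-in for whichever construction method produces the intermediate array.

```python
def sparse_suffix_array(text: str, d: int) -> list[int]:
    """Stand-in for any SSA construction with sparseness factor d."""
    return sorted(range(0, len(text), d), key=lambda i: text[i:])


def subsample(ssa: list[int], k: int) -> list[int]:
    """Keep positions that are multiples of k; their relative order is unchanged."""
    return [pos for pos in ssa if pos % k == 0]


text = "ACGTACGTACGTTGCA" * 2
ssa_4 = sparse_suffix_array(text, 4)   # built with divisor d = 4
ssa_12 = subsample(ssa_4, 12)          # desired sparseness factor k = 12
print(ssa_12)
```

This works because every multiple of k is also a multiple of any divisor d of k, so the intermediate array already contains all required positions in sorted order.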

The online version contains supplementary material available at 10.1186/s12859-025-06277-z.

## Full-text entities

- **Chemicals:** amino acid (MESH:D000596)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12535041/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12535041/full.md

## References

8 references — full list in the complete paper: https://tomesphere.com/paper/PMC12535041/full.md

---
Source: https://tomesphere.com/paper/PMC12535041