Indexing arbitrary-length $k$-mers in sequencing reads

Tomasz Kowalski; Szymon Grabowski; Sebastian Deorowicz

arXiv:1502.01861·cs.DS·March 3, 2017

TL;DR

This paper introduces PgSA, a lightweight in-memory data structure for efficient indexing and querying of arbitrary-length k-mers in sequencing reads, supporting key bioinformatics applications.

Contribution

The paper presents PgSA, a pseudogenome suffix array built by finding overlapping reads. It indexes and queries NGS reads in main memory and is competitive with existing methods in space use, query time, or both.

Findings

01 PgSA is competitive with existing algorithms in space use, query times, or both.

02 Supports counting and locating $k$-mers in sequencing reads.

03 Applicable to variant calling, error correction, and analysis of RNA-seq reads.

Abstract

We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating $k$-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with the existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction, and analysis of reads from RNA-seq experiments.
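
To make the counting and locating queries concrete, here is a minimal Python sketch. It is not the authors' PgSA: it assumes a plain suffix array over the reads joined with a `$` separator and does not merge overlapping reads into a pseudogenome. The function names `build_index`, `locate`, and `count` are hypothetical, chosen only for this illustration.

```python
from bisect import bisect_right


def build_index(reads):
    """Build a toy index: concatenated text, read start offsets, suffix array."""
    # Assumption: '$' never occurs in a read, so no k-mer can span two reads.
    text = "$".join(reads) + "$"
    starts = []
    pos = 0
    for read in reads:
        starts.append(pos)
        pos += len(read) + 1  # +1 for the separator
    # Naive suffix array construction; adequate only for small examples.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, sa


def locate(index, kmer):
    """Return sorted (read_id, offset) pairs for every occurrence of kmer."""
    text, starts, sa = index
    k = len(kmer)

    # Lower bound: first suffix whose length-k prefix is >= kmer.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < kmer:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-k prefix is > kmer.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= kmer:
            lo = mid + 1
        else:
            hi = mid

    hits = []
    for p in sa[first:lo]:
        read_id = bisect_right(starts, p) - 1  # read containing position p
        hits.append((read_id, p - starts[read_id]))
    return sorted(hits)


def count(index, kmer):
    """Return the number of occurrences of kmer across all reads."""
    return len(locate(index, kmer))


if __name__ == "__main__":
    idx = build_index(["ACGTAC", "GTACGT", "TTACG"])
    print(count(idx, "ACG"))   # 3
    print(locate(idx, "GTA"))  # [(0, 2), (1, 0)]
```

Because the query length is a parameter of the binary search rather than of the index, the same structure answers queries for any $k$, which is the property the title refers to. The space savings reported for PgSA come from the overlap-based pseudogenome, which this sketch omits.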
