BEETL-fastq: a searchable compressed archive for DNA reads
Lilian Janin, Ole Schulz-Trieglaff, Anthony J. Cox

TL;DR
BEETL-fastq is a tool that compresses DNA sequencing FASTQ files more efficiently than gzip and enables rapid k-mer searches within the compressed archive, facilitating quick data retrieval for genomic analysis.
Contribution
It introduces a novel compression and search method for FASTQ files that allows direct querying without decompression, improving efficiency in genomic data handling.
Findings
Compresses 6.6 TB of human FASTQ data into a 1.7 TB indexed archive, roughly a quarter of the original size
Searches for 30-mers in seconds to minutes, depending on the number of queries
Enables applications like structural variant genotyping and targeted read extraction
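
To make "searching without decompressing" concrete, the sketch below counts k-mer occurrences by backward search over a Burrows-Wheeler transform, the transform from which BEETL takes its name. It is a toy illustration under stated assumptions, not the tool's implementation: the function names `bwt_index` and `count_kmer` are hypothetical, the index is built naively in memory over a single short string, and a real archive would index many reads and also store quality scores and read identifiers.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy BWT index of `text` (hypothetical helper, not BEETL's API)."""
    text += "$"  # unique terminator that sorts before A, C, G, T
    # Naive suffix array: sort suffix start positions by suffix content.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def count_kmer(index, kmer):
    """Count occurrences of `kmer` using backward search on the index."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffixes matching the query so far
    for c in reversed(kmer):  # extend the match one character to the left
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt_index("GATTACAGATTACA")
print(count_kmer(index, "ATTA"))  # 2
print(count_kmer(index, "TTT"))   # 0
```

Each query character costs two table lookups, so the search time depends on the length of the k-mer rather than on the size of the indexed text. Mapping a match back to a full FASTQ record would need additional bookkeeping that this sketch leaves out.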
Abstract
Motivation: FASTQ is a standard file format for DNA sequencing data which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Text compression tools such as gzip are often employed to reduce the storage burden, but have the disadvantage that the data must be decompressed before it can be used. Here we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip, but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · RNA and protein synthesis mechanisms
