Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

Nicholas Tucci; Jacek Cala; Jannetta Steyn; Paolo Missier

arXiv:1806.00788 · cs.DC · June 5, 2018 · 1 citation

TL;DR

This paper presents the design and evaluation of a genomics variant analysis pipeline built from GATK Spark tools. It describes a Docker-based deployment over a cluster, gives a preliminary performance analysis, and compares processing times and cost with those of Microsoft Genomics Services.

Contribution

The paper implements a standard variant discovery pipeline with GATK 4.0 Spark tools and reports on its Docker-based cluster deployment and preliminary performance.

Findings

01. Comparable processing times to Microsoft Genomics Services

02. Cost-effective deployment on clusters using Docker

03. Preliminary performance analysis of GATK Spark pipeline

Abstract

Scalable and efficient processing of genome sequence data, e.g. for variant discovery, is key to the mainstream adoption of high-throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires a significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing the processing times and cost to those of the new Microsoft Genomics Services.
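
The kind of cost comparison the abstract mentions can be illustrated with a minimal sketch. It assumes a flat per-node hourly price for a self-managed cluster and a flat per-sample fee for a managed service. The function names and every number below are hypothetical placeholders, not figures from the paper or from any provider's price list.

```python
def cluster_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Cost of running `nodes` machines for `hours` at a flat hourly rate (assumed pricing model)."""
    return nodes * hours * price_per_node_hour


def cheaper_option(cluster_cost_per_sample: float, service_fee_per_sample: float) -> str:
    """Name the cheaper option for one sample; ties go to the cluster."""
    if cluster_cost_per_sample <= service_fee_per_sample:
        return "cluster"
    return "service"


if __name__ == "__main__":
    # Hypothetical inputs: 4 nodes, 2.5 hours per sample, 0.5 currency units per node-hour.
    per_sample = cluster_cost(nodes=4, hours=2.5, price_per_node_hour=0.5)
    # Hypothetical managed-service fee of 6.0 currency units per sample.
    print(per_sample, cheaper_option(per_sample, service_fee_per_sample=6.0))
```

A real comparison would also need to account for processing time, storage, and data transfer, which this sketch leaves out.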

Taxonomy

Topics: Scientific Computing and Data Management · Genomics and Phylogenetic Studies · Advanced Data Storage Technologies