SparkGOR: A unified framework for genomic data analysis

Sigmar K. Stefánsson; Hákon Guðbjartsson

arXiv:2009.00061 · cs.DB · September 2, 2020 · 1 citation

Open Access

TL;DR

SparkGOR unifies Spark and GOR into a single framework, enabling efficient large-scale genomic data analysis through a relational query engine that integrates SQL and GORpipe functionalities.

Contribution

It introduces a unified query engine combining SparkSQL and GORpipe, supporting embedded SQL, virtual relations, and compatibility with preferred file formats for genomic data analysis.

Findings

1. Created a relational query engine uniting SparkSQL and GORpipe.
2. Enabled embedding of SQL in GOR and support for virtual relations.
3. Provided APIs for GOR with Spark dataframes.

Abstract

Motivation: Our goal was to combine the capabilities of Spark and GOR into a single computing framework for the analysis of large-scale genome data.

Results: We have created a relational query engine that unites SparkSQL and GORpipe into a single declarative query framework. This has been achieved by allowing SQL expressions to be embedded in GOR's high-level relational statement syntax and by supporting virtual relations and nested GORpipe expressions within SQL. Furthermore, we have built drivers that let Spark and GOR each use their preferred file formats, Parquet and GORZ respectively, and introduced APIs that allow GOR to be used with Spark dataframes.

Availability: The SparkGOR version of the GORpipe software is open-source and freely available at https://gorpipe-website.now.sh and https://github.com/gorpipe.
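
The two directions of embedding described in the Results can be illustrated with a toy sketch. This is not SparkGOR's API or GOR syntax: it assumes Python's built-in `sqlite3` as a stand-in for SparkSQL and plain generators as a stand-in for a streaming pipe, and every name in it (`sql_source`, `where`, `as_virtual_relation`, the `variants` table and its rows) is hypothetical.

```python
import sqlite3
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); made-up schema


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding a tiny, invented variants table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.executemany(
        "INSERT INTO variants VALUES (?, ?, ?, ?)",
        [
            ("chr1", 100, "A", "G"),
            ("chr1", 250, "C", "T"),
            ("chr2", 300, "G", "A"),
            ("chr1", 900, "T", "C"),
        ],
    )
    return conn


def sql_source(conn: sqlite3.Connection, query: str) -> Iterator[Row]:
    """Pipe source backed by a SQL query (SQL embedded in a pipe)."""
    yield from conn.execute(query)


def where(rows: Iterable[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Streaming filter stage: passes through rows that satisfy pred."""
    return (row for row in rows if pred(row))


def as_virtual_relation(
    conn: sqlite3.Connection, name: str, rows: Iterable[Row]
) -> None:
    """Expose a pipe's output to SQL as a temporary table (pipe nested in SQL).

    `name` is interpolated into the statement, so it must be a trusted
    identifier in this sketch.
    """
    materialized = list(rows)  # drain the pipe before writing
    conn.execute(
        f"CREATE TEMP TABLE {name} (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?, ?)", materialized)


def demo() -> List[Tuple[str, int]]:
    conn = make_db()
    # A SQL query feeds a streaming pipe stage, ordered by genomic position.
    pipe = where(
        sql_source(
            conn,
            "SELECT chrom, pos, ref, alt FROM variants ORDER BY chrom, pos",
        ),
        lambda row: row[1] >= 200,
    )
    # The pipe's result is then queried from SQL as if it were a table.
    as_virtual_relation(conn, "filtered", pipe)
    return conn.execute(
        "SELECT chrom, COUNT(*) FROM filtered GROUP BY chrom ORDER BY chrom"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

The temporary table here is materialized eagerly for simplicity; a real engine would more plausibly evaluate such a relation lazily as part of the query plan.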

Taxonomy

Topics: Data Mining Algorithms and Applications · Genomics and Phylogenetic Studies · Algorithms and Data Compression