SparkGOR: A unified framework for genomic data analysis
Sigmar K. Stefánsson, Hákon Guðbjartsson

TL;DR
SparkGOR unifies Spark and GOR into a single framework, enabling efficient large-scale genomic data analysis through a relational query engine that integrates SQL and GORpipe functionalities.
Contribution
It introduces a unified query engine combining SparkSQL and GORpipe, supporting embedded SQL, virtual relations, and compatibility with preferred file formats for genomic data analysis.
Findings
Created a relational query engine uniting SparkSQL and GORpipe.
Enabled embedding of SQL in GOR and support for virtual relations.
Provided APIs for GOR with Spark dataframes.
Abstract
Motivation: Our goal was to combine the capabilities of Spark and GOR into a single computing framework for the analysis of large-scale genome data.
Results: We have created a relational query engine that unites SparkSQL and GORpipe into a single declarative query framework. This has been achieved by allowing SQL expressions to be embedded in the high-level relational statement syntax of GOR and by supporting virtual relations and nested GORpipe expressions within SQL. Furthermore, we have built drivers that let Spark and GOR each use their preferred file formats, Parquet and GORZ respectively, and introduced APIs that allow GOR to be used with Spark dataframes.
Availability: The SparkGOR version of the GORpipe software is open-source and freely available at https://gorpipe-website.now.sh and https://github.com/gorpipe.
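The idea of a virtual relation, a name in a SQL query whose rows are produced by another engine, can be illustrated with a small toy sketch. This is not the SparkGOR API or GOR syntax: it uses Python's built-in sqlite3 module in place of SparkSQL, and the names toy_pipe_source, query_with_virtual_relations, and the (chrom, pos, ref) schema are all hypothetical. For simplicity it materializes each relation eagerly, which a real engine would likely avoid.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List, Tuple

# Toy schema (an assumption for illustration): chromosome, position, reference allele.
Row = Tuple[str, int, str]


def toy_pipe_source() -> Iterable[Row]:
    """Hypothetical stand-in for a nested pipe expression that streams rows."""
    yield ("chr1", 100, "A")
    yield ("chr1", 250, "G")
    yield ("chr2", 40, "T")


def query_with_virtual_relations(
    sql: str, relations: Dict[str, Callable[[], Iterable[Row]]]
) -> List[tuple]:
    """Run `sql` after exposing each producer as a table of the given name."""
    conn = sqlite3.connect(":memory:")
    try:
        for name, producer in relations.items():
            # Materialize the producer's rows into a temporary table so that
            # the SQL engine can refer to them by name.
            conn.execute(
                f'CREATE TEMP TABLE "{name}" (chrom TEXT, pos INTEGER, ref TEXT)'
            )
            conn.executemany(f'INSERT INTO "{name}" VALUES (?, ?, ?)', producer())
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# The SQL text only sees a relation called "variants"; its rows come from the
# external producer rather than from a stored table.
rows = query_with_virtual_relations(
    "SELECT chrom, COUNT(*) FROM variants GROUP BY chrom ORDER BY chrom",
    {"variants": toy_pipe_source},
)
print(rows)
```

The design point is that the SQL text stays declarative while the rows behind a relation name can come from any producer.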
Taxonomy
Topics: Data Mining Algorithms and Applications · Genomics and Phylogenetic Studies · Algorithms and Data Compression
