Benchmarking database performance for genomic data

Matloob Khushi

arXiv:2008.06835·cs.DB·August 18, 2020

TL;DR

This paper introduces RegMap, an SQL-based algorithm for genomic region operations, benchmarks database performance, and reveals biological insights into transcription factor co-localization.

Contribution

Develops RegMap, a novel SQL-based region-mapping algorithm for genomic region operations, and uses it to benchmark PostgreSQL against MySQL, yielding both a new tool and practical guidance on managing genomic data in a database.

Findings

01 PostgreSQL outperforms MySQL in overlapping region extraction

02 PostgreSQL shows better data insertion and upload performance

03 HNF4G significantly co-locates with cohesin subunit STAG1 (SA1)

Abstract

Genomic regions represent features such as gene annotations, transcription factor binding sites and epigenetic modifications. Genomic operations such as identifying overlapping or non-overlapping regions, or finding the nearest gene annotations, are common research needs. The data can be saved in a database system for easy management; however, there is at present no comprehensive built-in database algorithm to identify overlapping regions. Therefore I have developed a region-mapping (RegMap) SQL-based algorithm to perform genomic operations and have benchmarked the performance of different databases. Benchmarking identified that PostgreSQL extracts overlapping regions much faster than MySQL. Insertion and data uploads in PostgreSQL were also better, although the general searching capability of both databases was almost equivalent. In addition, using the algorithm pair-wise, overlaps of…
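
To make the idea of SQL-based overlap detection concrete, the following is a minimal sketch of a generic interval-overlap join. It is not the RegMap algorithm itself, whose details are not given in the abstract. The table names (regions_a, regions_b), the column names, the helper find_overlaps, and the use of closed intervals are all assumptions made for illustration, and SQLite (via Python's standard library) stands in for PostgreSQL or MySQL so the example runs without a database server.

```python
import sqlite3


def find_overlaps(regions_a, regions_b):
    """Return overlapping region pairs from two lists of (chrom, start, end).

    Hypothetical helper for illustration only; intervals are treated as
    closed, so regions that share a single boundary position overlap.
    """
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per genomic region.
    for table in ("regions_a", "regions_b"):
        conn.execute(
            f"CREATE TABLE {table} "
            "(chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
        )
    conn.executemany("INSERT INTO regions_a VALUES (?, ?, ?)", regions_a)
    conn.executemany("INSERT INTO regions_b VALUES (?, ?, ?)", regions_b)
    # An index on the join columns lets the engine narrow the search
    # by chromosome and start position.
    conn.execute(
        "CREATE INDEX idx_regions_b ON regions_b (chrom, start_pos, end_pos)"
    )
    # Two regions on the same chromosome overlap when each one starts
    # no later than the other one ends.
    rows = conn.execute(
        """
        SELECT a.chrom, a.start_pos, a.end_pos, b.start_pos, b.end_pos
        FROM regions_a AS a
        JOIN regions_b AS b
          ON a.chrom = b.chrom
         AND a.start_pos <= b.end_pos
         AND b.start_pos <= a.end_pos
        ORDER BY a.chrom, a.start_pos, b.start_pos
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    set_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    set_b = [("chr1", 150, 250), ("chr1", 700, 800), ("chr2", 300, 400)]
    for row in find_overlaps(set_a, set_b):
        print(row)
```

The same join condition should carry over to PostgreSQL or MySQL largely unchanged. How each engine plans and indexes such a range join is the kind of difference a benchmark like the one described above would expose.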
