Benchmarking database performance for genomic data
Matloob Khushi

TL;DR
This paper introduces RegMap, an SQL-based algorithm for genomic region operations, benchmarks database performance, and reveals biological insights into transcription factor co-localization.
Contribution
Developed a novel SQL-based algorithm for genomic region analysis and benchmarked database performance, providing a new tool and insights into genomic data management.
Findings
PostgreSQL outperforms MySQL in overlapping region extraction
PostgreSQL shows better data insertion and upload performance
HNF4G significantly co-locates with cohesin subunit STAG1 (SA1)
Abstract
Genomic regions represent features such as gene annotations, transcription factor binding sites and epigenetic modifications. Genomic operations such as identifying overlapping or non-overlapping regions, or finding the nearest gene annotations, are a common research need. The data can be saved in a database system for easy management; however, at present no database offers a comprehensive built-in algorithm to identify overlapping regions. Therefore I have developed a region-mapping (RegMap) SQL-based algorithm to perform genomic operations and have benchmarked the performance of different databases. Benchmarking showed that PostgreSQL extracts overlapping regions much faster than MySQL. Insertion and data uploads were also faster in PostgreSQL, although the general searching capability of the two databases was almost equivalent. In addition, using the algorithm pair-wise, overlaps of…
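To make the overlap operation concrete, the sketch below shows the standard interval-intersection predicate written as an SQL join. It is not the RegMap algorithm, whose details are not given here. The table names (`regions_a`, `regions_b`), column names, coordinates and the closed-interval convention are all assumptions made for illustration. It uses Python's built-in `sqlite3` module only so that the example is self-contained; the paper benchmarks PostgreSQL and MySQL.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical region tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE regions_a (chrom TEXT, start_pos INTEGER, end_pos INTEGER);
        CREATE TABLE regions_b (chrom TEXT, start_pos INTEGER, end_pos INTEGER);
        """
    )
    # Illustrative coordinates only; not data from the paper.
    conn.executemany(
        "INSERT INTO regions_a VALUES (?, ?, ?)",
        [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)],
    )
    conn.executemany(
        "INSERT INTO regions_b VALUES (?, ?, ?)",
        [("chr1", 150, 250), ("chr1", 700, 800), ("chr2", 300, 400)],
    )
    return conn


def find_overlaps(conn):
    """Return pairs of regions on the same chromosome that share a position.

    Two closed intervals [s1, e1] and [s2, e2] overlap exactly when
    s1 <= e2 and s2 <= e1.
    """
    query = """
        SELECT a.chrom, a.start_pos, a.end_pos, b.start_pos, b.end_pos
        FROM regions_a AS a
        JOIN regions_b AS b
          ON a.chrom = b.chrom
         AND a.start_pos <= b.end_pos
         AND b.start_pos <= a.end_pos
        ORDER BY a.chrom, a.start_pos, b.start_pos
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in find_overlaps(conn):
        print(row)
```

With the sample rows above, only the `chr1` regions 100-200 and 150-250 satisfy the predicate. A naive join like this compares every pair of regions on the same chromosome, which is the kind of workload where database engine and indexing choices would be expected to matter.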
